The Nonlinear Library

The Nonlinear Fund
Feb 4, 2024 • 30min

LW - On Dwarkesh's 3rd Podcast With Tyler Cowen by Zvi

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: On Dwarkesh's 3rd Podcast With Tyler Cowen, published by Zvi on February 4, 2024 on LessWrong. This post is extensive thoughts on Tyler Cowen's excellent talk with Dwarkesh Patel. It is interesting throughout. You can read this while listening, after listening, or instead of listening; it is written to be compatible with all three options. The notes are in order in terms of what they are reacting to, and were mostly written as I listened. I see this as having been a few distinct intertwined conversations. Tyler Cowen knows more about more different things than perhaps anyone else, so that makes sense. Dwarkesh chose excellent questions throughout, displaying a keen sense of when to follow up and how, and when to pivot. The first conversation is about Tyler's book GOAT, about the world's greatest economists. Fascinating stuff; this made me more likely to read and review GOAT in the future if I ever find the time. I mostly agreed with Tyler's takes here, to the extent I am in a position to know, as I have not read much of what these men wrote, and at this point, even though I very much loved it at the time (don't skip the digression on silver, even; I remember it being great), The Wealth of Nations is now largely a blur to me. There were also questions about the world and philosophy in general but not about AI, which I would mostly put in this first category. As usual, I have lots of thoughts. The second conversation is about expectations given what I typically call mundane AI. What would the future look like if AI progress stalls out without advancing too much? We cannot rule such worlds out, and I put substantial probability on them, so it is an important and fascinating question. If you accept the premise of AI remaining within the human capability range in some broad sense, where it brings great productivity improvements and rewards those who use it well but remains foundationally a tool, and everything seems basically normal, essentially the AI-Fizzle world, then we have disagreements, but Tyler is an excellent thinker about these scenarios. Broadly, our expectations are not so different here. That brings us to the third conversation, about the possibility of existential risk or the development of more intelligent and capable AI that would have greater affordances. For a while now, Tyler has asserted that such greater intelligence likely does not much matter, that not so much would change, that transformational effects are highly unlikely, whether or not they constitute existential risks. That the world will continue to seem normal and follow the rules and heuristics of economics, essentially Scott Aaronson's Futurama. Even when he says AIs will be decentralized and engage in their own Hayekian trading with their own currency, he does not think this has deep implications, nor does it imply much about what else is going on beyond being modestly (and only modestly) productive. Then at other times he affirms the importance of existential risk concerns, and indeed says we will be in need of a hegemon, but the thinking here seems oddly divorced from his other statements, and thus often rather confused. Mostly it seems consistent with the view that it is much easier to solve alignment quickly, build AGI and use it to generate a hegemon, than it would be to get any kind of international coordination.
And also that failure to quickly build AI risks our civilization collapsing. But also I notice this implies that the resulting AIs will be powerful enough to enable hegemony and determine the future, when in other contexts he does not think they will even enable sustained 10% GDP growth. Thus at this point, I choose to treat most of Tyler's thoughts on AI as if they are part of the second conversation, with an implicit 'assuming an AI at least semi-fizzle' attached ...
Feb 4, 2024 • 7min

LW - Theories of Applied Rationality by Camille Berger

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Theories of Applied Rationality, published by Camille Berger on February 4, 2024 on LessWrong. tl;dr: within the LW community, there are many clusters of strategies for achieving rationality: doing basic exercises, using jargon, reading, partaking in workshops, privileging object-level activities, and several others, such as emphasizing feedback loops, difficult conversations, or altered states of consciousness. Epistemic status: This is a vague model to help me understand other rationalists and why some of them keep doing things I think are wrong, or suggest that I do things I think are wrong. This is not based on real data. I will update according to possible discussions in the comments. Please be critical. Spending time in the rationalist community made me realize that there are several endeavors at reaching rationality, some of which conflict with others. This made me quite frustrated, as I had thought that my interpretation was the only one. The following list is an attempt at distinguishing the several approaches I've noticed. Of course, any rationalist will probably have elements of all theories at the same time. See each theory as the claim that a particular set of elements prevails above the others. Believing in one theory usually goes hand in hand with being fairly suspicious of the others. Finally, remember that these categories are an attempt to distinguish what people are doing, not a guide about which side you should pick (if the sides exist at all). I suspect that most people end up applying one theory for practical reasons, more than because they have deeply thought about it at all. Basics Theory Partakers of the Basics theory put a high emphasis on activities such as calibration, forecasting, lifehacks, and other fairly standard practices of epistemic and instrumental rationality. They don't see any real value in reading LessWrong extensively or going to workshops. They first and foremost believe in real-life, readily available practice. For them, spending too much time in the rationalist community, as opposed to doing simple exercises, is the main failure mode to avoid. Speaking Theory Partakers of the Speaking theory, although often relying on the basics, usually put a high emphasis on using concepts encountered on LessWrong in daily parlance, though they do not necessarily insist on reading content on LessWrong. They may also insist on the importance of talking through disagreements on a fairly regular basis, while heavily relying on LessWrong terms and references in order to shape their thinking more rationally. They disagree with the statement that jargon should be avoided. For them, keeping your language, thinking, writing, and discussion style the same as it was before encountering rationality is the main failure mode to avoid. Reading Theory Partakers of the Reading theory put a high emphasis on reading LessWrong, more often than not the "Canon", but some might go further and insist on reading other materials as well, such as the books recommended on the CFAR website, rationalist blogs, or listening to a particular set of podcasts. They can be sympathetic or opposed to relying on LessWrong Speak, but don't consider it important. They can also be fairly familiar with the basics.
For them, relying on LW Speak or engaging with the community while not mastering the relevant corpus is the main failure mode to avoid. Workshop Theory Partakers of the Workshop theory consider most efforts of the Reading and Speaking theories to be somewhat misleading. Since rationality is to be learned, it has to be deliberately practiced, if not ultralearned, and workshops such as CFAR's are an important piece of this endeavor. Importantly, they do not really insist on reading the Sequences. Faced with the question "Do I need to read X...
Feb 4, 2024 • 35min

EA - A simple and generic framework for impact estimation and attribution by Christoph Hartmann

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A simple and generic framework for impact estimation and attribution, published by Christoph Hartmann on February 4, 2024 on The Effective Altruism Forum. TL;DR Building on the logarithmic relationship between income and life satisfaction, we model the purchase of goods as an exchange of money for life satisfaction. Under that lens, a company's impact on the life satisfaction of its customers is the income-weighted sum of the company's revenue. By comparing the company's revenue with its cost structure, we can attribute the relative share of impact to the company's employees, suppliers, and shareholders. While this approach primarily focuses on for-profit companies, it can be extended to NGOs by reframing them as companies selling attribution of their primary impact for donations. In this framework, an NGO's impact is then twofold: its primary impact, sold and attributed to the donor, and its secondary impact on the life satisfaction of the donor, attributed to the employees and other cost drivers. This approach is not a substitute for measuring impact beyond the life satisfaction of consumers. It cannot replace studies on the primary impact of NGOs or on factors beyond life satisfaction, like the environment. You can use this framework to compare the impact of companies when deciding between jobs, to compare the impact of donations with the impact from a job, or as a guide on how to attribute the impact of a for-profit or NGO to its customers, employees, and suppliers. The corresponding spreadsheet can be used to apply this framework to any company using commonly available data. Idea: Using money as a proxy for impact Impact estimation is a dark art: complex spreadsheets with high sensitivity to parameters, convoluted theories of change, cherry-picked KPIs that can't be compared between two organizations, triple-counting of the same impact for multiple stakeholders, missing studies to back up assumptions. I work for a social enterprise, and in the five years I've been here I have tried to estimate impact a couple of times and always gave up frustrated. In this article I am trying to turn impact estimation around: instead of trying to estimate impact from detailed bottom-up models, I will take a top-down approach and work with one measure that works for almost anything: money. At its heart, money represents value delivered, and I want to see how far we can take this to estimate impact. I will guide you through this in three steps that you can also follow and adjust in this spreadsheet: We will start with estimating the impact of a donation to GiveDirectly. From this we will derive income-weighted revenue as a proxy measure for impact. With this we can then estimate the impact of a market stand as a simple example, and then generalize to any organization[1], any job, and the impact of shareholders. Finally we'll try to apply this approach to a few examples, look at the extremes, and see how everything holds up. The impact of a cash transfer Let's start with estimating the impact of a donation to GiveDirectly. Most readers are probably familiar with the relation between income and life satisfaction: life satisfaction is highly correlated with the logarithm of income - at low income levels life satisfaction grows fast with income, while at high income levels extra income has almost no effect on life satisfaction.
This holds both when compared between countries and for different income levels within countries. See Our World In Data for more details on this. We can use this relationship to estimate the impact of donating to GiveDirectly: Let's say we are looking at a pool of donors whose average yearly income is $50k. That would mean they have, on average, a life satisfaction of about 6.9 Life Satisfaction Points (LSP). Then when they donate ten percent of thei...
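To make the log-income arithmetic concrete, here is a minimal sketch in Python of the kind of calculation the post's spreadsheet performs. The slope and intercept are illustrative assumptions, calibrated only so that a $50k yearly income maps to roughly 6.9 LSP as in the excerpt; the post's actual constants and adjustments may differ.

```python
import math

# Illustrative log-linear model of life satisfaction (LSP) vs. yearly income.
# SLOPE is an assumed constant; INTERCEPT is chosen so that $50k -> ~6.9 LSP.
SLOPE = 0.6
INTERCEPT = 6.9 - SLOPE * math.log(50_000)

def life_satisfaction(income: float) -> float:
    """Estimated life satisfaction (LSP) at a given yearly income."""
    return INTERCEPT + SLOPE * math.log(income)

def transfer_impact(donor_income: float, recipient_income: float,
                    amount: float) -> float:
    """Net LSP change when `amount` moves from donor to recipient for one
    year (ignoring overhead): the recipient's gain minus the donor's loss."""
    gain = life_satisfaction(recipient_income + amount) - life_satisfaction(recipient_income)
    loss = life_satisfaction(donor_income) - life_satisfaction(donor_income - amount)
    return gain - loss

# A donor on $50k/year giving ten percent of income to a recipient on $500/year:
print(transfer_impact(donor_income=50_000, recipient_income=500, amount=5_000))
```

Because of the logarithm, the recipient's gain (proportional to ln(5500/500) ≈ 2.4) dwarfs the donor's loss (proportional to ln(50000/45000) ≈ 0.11), which is the intuition behind starting from the GiveDirectly example.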
Feb 4, 2024 • 5min

LW - Why I no longer identify as transhumanist by Kaj Sotala

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Why I no longer identify as transhumanist, published by Kaj Sotala on February 4, 2024 on LessWrong. Someone asked me how come I used to have a strong identity as a singularitarian/transhumanist but don't have it anymore. Here's what I answered them: So I think the short version is something like: transhumanism/singularitarianism used to give me hope about things I felt strongly about. Molecular nanotechnology would bring material abundance, radical life extension would cure aging, AI would solve the rest of our problems. Over time, it started feeling like 1) not much was actually happening with regard to those things, and 2) to the extent that it was, I couldn't contribute much to them, and 3) trying to work on those things directly was bad for me, and also 4) I ended up caring less about some of those issues for other reasons, and 5) I had other big problems in my life. So an identity as a transhumanist/singularitarian stopped being a useful emotional strategy for me, and then I lost interest in it. With regard to 4), a big motivator for me used to be some kind of fear of death. But then I thought about the philosophy of personal identity until I shifted to the view that there's probably no persisting identity over time anyway, and in some sense I probably die and get reborn all the time in any case. Here's something I wrote back in 2009 about 1): The [first phase of the Excitement-Disillusionment-Reorientation cycle of online transhumanism] is when you first stumble across concepts such as transhumanism, radical life extension, and superintelligent AI. This is when you subscribe to transhumanist mailing lists, join your local WTA/H+ chapter, and start trying to spread the word to everybody you know. You'll probably spend hundreds of hours reading different kinds of transhumanist materials. This phase typically lasts for several years. In the disillusionment phase, you start to realize that while you still agree with the fundamental transhumanist philosophy, most of what you are doing is rather pointless. You can have all the discussions you want, but by themselves, those discussions aren't going to bring all those amazing technologies here. You learn to ignore the "but an upload of you is just a copy" debate when it shows up for the twentieth time, with the same names rehearsing the same arguments and thought experiments for the fifteenth time. Having gotten over your initial future shock, you may start to wonder why having a specific name like transhumanism is necessary in the first place - people have been taking advantage of new technologies for several thousand years. After all, you don't have a specific "cellphonist" label for people using cell phones, either. You'll slowly start losing interest in activities that are specifically termed transhumanist. In the reorientation phase you have two alternatives. Some people renounce transhumanism entirely, finding the label pointless and mostly a magnet for people with a tendency towards future hype and techno-optimism. Others (like me) simply realize that bringing forth the movement's goals requires a very different kind of effort than debating other transhumanists on closed mailing lists.
An effort like engaging with the large audience in a more effective manner, or getting an education in a technology-related field and becoming involved in the actual research yourself. In either case, you're likely to unsubscribe from the mailing lists, or at least start paying them much less attention than before. If you still identify as a transhumanist, your interest in the online communities wanes because you're too busy actually working for the cause. This shouldn't be taken to mean that I'm saying the online h+ community is unnecessary, or that people ought to just skip to the last phase. The first step of the cycle ...
Feb 3, 2024 • 8min

LW - Brute Force Manufactured Consensus is Hiding the Crime of the Century by Roko

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Brute Force Manufactured Consensus is Hiding the Crime of the Century, published by Roko on February 3, 2024 on LessWrong. People often parse information through an epistemic consensus filter. They do not ask "is this true", they ask "will others be OK with me thinking this is true". This makes them very malleable to brute-force manufactured consensus; if every screen they look at says the same thing, they will adopt that position because their brain interprets it as everyone in the tribe believing it. - Anon, 4Chan, slightly edited. Ordinary people who haven't spent years of their lives thinking about rationality and epistemology don't form beliefs by impartially tallying up evidence like a Bayesian reasoner. Whilst there is a lot of variation, my impression is that the majority of humans we share this Earth with use a completely different algorithm for vetting potential beliefs: they just believe some average of what everyone and everything around them believes, especially what they see on screens, newspapers and "respectable", "mainstream" websites. This is a great algorithm from the point of view of the individual human. If the mainstream is wrong, well, "nobody got fired for buying IBM", as they say - you won't be personally singled out for being wrong if everyone else is also wrong. If the mainstream is right, you're also right. Win-win. The problem with the "copy other people's beliefs" algorithm is that it is vulnerable to false information cascades. And when a small but powerful adversarial group controls the seed point for many people's beliefs (such as being able to control the scientific process to output chosen falsehoods), you can end up with an entire society believing an absurd falsehood that happens to be very convenient for that small, powerful adversarial subgroup. DEFUSING your concerns This is not a theoretical concern; I believe that brute-force manufactured consensus by the perpetrators is the cause of a lack of action to properly investigate and prosecute what I believe is the crime of the century: a group of scientists who I believe committed the equivalent of a modern holocaust (either deliberately or accidentally) are going to get away with it. For those who are not aware, the death toll of Covid-19 is estimated at between 19 million and 35 million. Covid-19 likely came from a known lab (Wuhan Institute of Virology), and was likely created by a known group of people (Peter Daszak & friends) acting against best practices and willing to lie about their safety standards to get the job done. In my opinion this amounts morally to a crime against humanity. And the evidence keeps piling up - just this January, a freedom of information request surfaced a grant proposal dated 2018 with Daszak's name on it, called Project DEFUSE, with essentially a recipe for making Covid-19 at Wuhan Institute of Virology, including unique technical details like the Furin Cleavage Site and the BsmBI enzyme. Note the date - 3/27/2018. Wait, there's more. Here, Peter Daszak tells other investigators that once they get funded by DARPA, they can do this work to make the novel coronavirus bind to the human ACE2 receptor in... Wuhan, China. Wow. Remember, this is in 2018! Now, DARPA refused to fund this proposal (perhaps they thought that this kind of research was too dangerous?), but this is hardly exculpatory.
Daszak et al. had the plan to make Covid-19 in 2018; all they needed was funding, which they may simply have gotten from somewhere else. So, Daszak & friends plan to create a novel coronavirus engineered to infect human cells with a Furin Cleavage Site in Wuhan, starting in mid-2018. Then in late 2019, a novel coronavirus that spreads rapidly through humans, and that has a Furin Cleavage Site, appears in... Wuhan... thousands of miles away from the bat caves in Southern China ...
Feb 3, 2024 • 13min

AF - Attention SAEs Scale to GPT-2 Small by Connor Kissane

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Attention SAEs Scale to GPT-2 Small, published by Connor Kissane on February 3, 2024 on The AI Alignment Forum. This is an interim report that we are currently building on. We hope this update + open sourcing our SAEs will be useful to related research occurring in parallel. Produced as part of the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort. Executive Summary In a previous post, we showed that sparse autoencoders (SAEs) work on the attention layer outputs of a two-layer transformer. We scale our attention SAEs to GPT-2 Small, and continue to find sparse, interpretable features in every layer. This makes us optimistic about our ongoing efforts to scale further, especially since we didn't have to do much iterating. We open source our SAEs; load them from Hugging Face or this colab notebook. The SAEs seem good, often recovering more than 80% of the loss relative to zero ablation, and are sparse, with fewer than 20 features firing on average. The majority of the live features are interpretable. We continue to find the same three feature families that we found in the two-layer model: induction features, local context features, and high-level context features. This suggests that some of our lessons from interpreting features in smaller models may generalize. We also find new, interesting feature families that we didn't find in the two-layer model, providing hints about fundamentally different capabilities in GPT-2 Small. See our feature interface to browse the first 30 features for each layer. Introduction In Sparse Autoencoders Work on Attention Layer Outputs we showed that we can apply SAEs to extract sparse, interpretable features from the last attention layer of a two-layer transformer. We have since applied the same technique to a 12-layer model, GPT-2 Small, and continue to find sparse, interpretable features in every layer. Our SAEs often recover more than 80% of the loss[1], and are sparse, with fewer than 20 features firing on average. We perform shallow investigations of the first 30 features from each layer, and find that the majority (often 80%+) of non-dead SAE features are interpretable. See our interactive visualizations for each layer. We open source our SAEs in the hope that they will be useful to other researchers currently working on dictionary learning. We are particularly excited about using these SAEs to better understand attention circuits at the feature level. See the SAEs on Hugging Face or load them using this colab notebook. Below we provide the key metrics for each SAE:

Layer | L0 norm | Loss recovered | Dead features | % alive features interpretable
L0    | 3       | 99%            | 13%           | 97%
L1    | 20      | 78%            | 49%           | 87%
L2    | 16      | 90%            | 20%           | 95%
L3    | 15      | 84%            | 8%            | 75%
L4    | 15      | 88%            | 5%            | 100%
L5    | 20      | 85%            | 40%           | 82%
L6    | 19      | 82%            | 28%           | 75%
L7    | 19      | 83%            | 58%           | 70%
L8    | 20      | 76%            | 37%           | 64%
L9    | 21      | 83%            | 48%           | 85%
L10   | 16      | 85%            | 41%           | 81%
L11   | 8       | 89%            | 84%           | 66%

It's worth noting that we didn't do much differently to train these,[2] leaving us optimistic about the tractability of scaling attention SAEs to even bigger models. Excitingly, we also continue to identify feature families. We find features from all three of the families that we identified in the two-layer model: induction features, local context features, and high-level context features. This gives us hope that some of our lessons from interpreting features in smaller models will continue to generalize.
We also find new, interesting feature families in GPT-2 Small, suggesting that attention SAEs can provide valuable hints about new[3] capabilities that larger models have learned. Some new features include: Successor features, which activate when predicting the next item in a sequence such as "15, 16" -> "17" (which are partly coming from Successor Heads in the model), and boost the logits of the next item. Name mover features, which predict a name in the context, such as in the IOI task Duplicate token f...
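For readers who want the shape of the technique in code, below is a minimal PyTorch sketch of an SAE of the kind described here, trained on attention layer outputs. The ReLU encoder and L1 sparsity penalty follow standard dictionary-learning practice; the dimensions and coefficient are illustrative assumptions rather than the exact values used for these GPT-2 Small SAEs.

```python
import torch
import torch.nn as nn

class AttnSAE(nn.Module):
    """Sketch of a sparse autoencoder over attention layer outputs.

    For GPT-2 Small, d_model would be 768 (the heads' z-vectors concatenated);
    the dictionary size d_hidden is typically a large multiple of d_model.
    """
    def __init__(self, d_model: int = 768, d_hidden: int = 768 * 16):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_hidden) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_hidden))
        self.W_dec = nn.Parameter(torch.randn(d_hidden, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor):
        # Encode: non-negative feature activations, sparse in practice
        # (the post reports roughly 20 or fewer features firing per token).
        acts = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)
        # Decode: reconstruct the original attention output.
        recon = acts @ self.W_dec + self.b_dec
        return recon, acts

def sae_loss(x, recon, acts, l1_coeff: float = 1e-3):
    # Reconstruction MSE plus an L1 penalty that induces sparsity (low L0).
    return (recon - x).pow(2).sum(-1).mean() + l1_coeff * acts.abs().sum(-1).mean()
```

The "loss recovered" metric in the table above is then typically computed by splicing the SAE's reconstruction into the model's forward pass and comparing the resulting language-model loss against the clean model, with zero ablation of the attention output as the baseline.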
Feb 3, 2024 • 17min

LW - Announcing the London Initiative for Safe AI (LISA) by James Fox

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Announcing the London Initiative for Safe AI (LISA), published by James Fox on February 3, 2024 on LessWrong. The LISA Team consists of James Fox, Mike Brozowski, Joe Murray, Nina Wolff-Ingham, Ryan Kidd, and Christian Smith. LISA's Advisory Board consists of Henry Sleight, Jessica Rumbelow, Marius Hobbhahn, Jamie Bernardi, and Callum McDougall. Everyone has contributed significantly to the founding of LISA, believes in its mission & vision, and assisted with writing this post. TL;DR: The London Initiative for Safe AI (LISA) is a new AI Safety research centre. Our mission is to improve the safety of advanced AI systems by supporting and empowering individual researchers and small organisations. We opened in September 2023, and our office space currently hosts several research organisations and upskilling programmes, including Apollo Research, Leap Labs, MATS extension, ARENA, and BlueDot Impact, as well as many individual and externally affiliated researchers. LISA is open to different types of membership applications from other AI safety researchers and organisations. (Affiliate) members can access talks by high-profile researchers, workshops, and other events. Past speakers have included Stuart Russell (UC Berkeley, CHAI), Tom Everitt & Neel Nanda (Google DeepMind), and Adam Gleave (FAR AI), amongst others. Amenities for LISA Residents include 24/7 access to private & open-plan desks (with monitors, etc.), catering (including lunches, dinners, snacks & drinks), and meeting rooms & phone booths. We also provide immigration, accommodation, and operational support; fiscal sponsorship & employer of record (upcoming); and regular socials & well-being benefits. Although we host a limited number of short-term visitors for free, we charge long-term residents to cover our costs, at varying rates depending on their circumstances. Nevertheless, we never want financial constraints to be a barrier to leading AI safety research, so please still get in touch if you would like to work from LISA's offices but aren't able to pay. If you or your organisation are interested in working from LISA, please apply here. If you would like to support our mission, please visit our Manifund page. Read on for further details about LISA's vision and theory of change. After a short introduction, we motivate our vision by arguing why there is an urgency for LISA. Next, we summarise our track record and unpack our plans for the future. Finally, we discuss how we mitigate risks that might undermine our theory of change. Introduction London stands out as an ideal location for a new AI safety research centre: Frontier Labs: It is the only city outside of the Bay Area with offices from all major AI labs (e.g., Google DeepMind, OpenAI, Anthropic, Meta). Concentrated and underutilised talent (e.g., researchers & software engineers), many of whom are keen to contribute to AI safety but are reluctant or unable to relocate to the Bay Area due to visas, partners, family, culture, etc. UK Government connections: The UK government has clearly signalled its concern for the importance of AI safety by hosting the first AI safety summit, establishing an AI safety institute, and introducing favourable immigration requirements for researchers. Moreover, policy-makers and researchers are all within 30 minutes of one another.
Easy transport links: LISA is ideally located to act as a short-term base for visiting AI safety researchers from the US, Europe, and other parts of the UK who want to visit researchers (and policy-makers) in companies, universities, and governments in and around London, as well as those in Europe. Regular cohorts of the MATS program (scholars and mentors) also come to London because of the above (in particular, the favourable immigration requirements compared with the US). Despite this favourable setting, so far little c...
Feb 3, 2024 • 2min

LW - Survey for alignment researchers: help us build better field-level models by Cameron Berg

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Survey for alignment researchers: help us build better field-level models, published by Cameron Berg on February 3, 2024 on LessWrong. AE Studio is launching a short, anonymous survey for alignment researchers, in order to develop a stronger model of various field-level dynamics in alignment. This appears to be an interestingly neglected research direction that we believe will yield specific and actionable insights related to the community's technical views and more general characteristics. The survey is a straightforward 5-10 minute Google Form with some simple multiple choice questions. For every alignment researcher who completes the survey, we will donate $40 to a high-impact AI safety organization of your choosing (see specific options on the survey). We will also send each alignment researcher who wants one a customized report that compares their personal results to those of the field. Together, we hope to not only raise some money for some great AI safety organizations, but also develop a better field-level model of the ideas and people that comprise alignment research. We will open-source all data and analyses when we publish the results. Thanks in advance for participating and for sharing this around with other alignment researchers! Survey full link: https://forms.gle/d2fJhWfierRYvzam8 Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
Feb 3, 2024 • 7min

LW - Soft Prompts for Evaluation: Measuring Conditional Distance of Capabilities by porby

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Soft Prompts for Evaluation: Measuring Conditional Distance of Capabilities, published by porby on February 3, 2024 on LessWrong. (The link above is likely going to expire in favor of arxiv once I get it submitted. Incidentally, if anyone wants to endorse me for cs.LG on arxiv, please DM me, and thanks!) Abstract: To help evaluate and understand the latent capabilities of language models, this paper introduces an approach using optimized input embeddings, or 'soft prompts,' as a metric of conditional distance between a model and a target behavior. The technique aims to facilitate latent capability discovery as a part of automated red teaming/evaluation suites and to provide quantitative feedback about the accessibility of potentially concerning behaviors in a way that may scale to powerful future models, including those which may otherwise be capable of deceptive alignment. An evaluation framework using soft prompts is demonstrated in natural language, chess, and pathfinding, and the technique is extended with generalized conditional soft prompts to aid in constructing task evaluations. What is this? This paper applies soft prompts to evaluations. They're prompts composed of optimized token embeddings; they aren't restricted to the usual dictionary of discrete tokens. To the limit of the optimizer's ability, they compress conditions into the tokens composing the soft prompt. Different target behaviors or capabilities will tend to require different numbers of soft prompt tokens. Capabilities that are natural to the model, like repeating a single token over and over again, can be achieved with a single soft prompt token even in small models. Repeating a longer string may require more tokens, and even partially eliciting more complex behavior like pathfinding or chess could require many more. Why is this useful? Soft prompts offer a simple way to quantify the conditional distance between a model and a target behavior. If many tokens are required, then lots of information-as-conditions is needed to define the behavior and it's unlikely that it will happen by chance. If no number of tokens suffices, the behavior is likely inaccessible by inputs alone.[1] It can be used as a part of evaluation suites to monitor the change in distance to concerning behaviors across models and scales and the degree to which those behaviors can be achieved in the first place. It is also difficult for the model to hide a capability that is otherwise accessible to some input. If the original model is an end-to-end differentiable product of gradient descent, it has already been shown to be transparent to the same type of optimizer that will be tasked with adversarially optimizing the input soft prompts against the machinery of the model.[2] Compared to manual attempts at elicitation or automated techniques restricted to discrete tokens or black box optimization, it dramatically improves the chances that an evaluation's attempt to elicit concerning behavior will succeed. It's also dumb simple. Are there any cute short results for me to look at? Yes! One of the tests was reverting fine-tuning on TinyLlama's chat variant by simply optimizing a soft prompt for standard autoregressive prediction. The training set was RedPajama v2, a broad training set. 
No effort was put into providing direct 'detuning' samples (like starting with a dialogue and then having the trained continuation deliberately subvert the original fine-tuning), so some dialogue-esque behaviors persisted, but the character of the assistant was a little... different. Note that, despite a malicious system message, the original chat model simply doesn't respond, then turns around and generates a user question about how great General Electric is. The soft-prompted assistant has other ideas. Notably, the loss on RedPajama v2 for the orig...
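As a concrete sketch of the setup, the snippet below optimizes a soft prompt against a frozen Hugging Face-style causal LM, in the spirit of the detuning experiment. The model name, token count, and hyperparameters are illustrative assumptions; the paper's exact configuration may differ.

```python
import torch
from transformers import AutoModelForCausalLM

# Frozen model: only the soft prompt embeddings receive gradients.
model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
model.requires_grad_(False)

n_soft_tokens = 8  # illustrative; the required count varies per behavior
d_model = model.config.hidden_size
soft_prompt = torch.nn.Parameter(torch.randn(1, n_soft_tokens, d_model) * 0.01)
optimizer = torch.optim.Adam([soft_prompt], lr=1e-3)

def step(target_ids: torch.Tensor) -> float:
    """One optimization step toward making the model produce `target_ids`
    (e.g. batches of RedPajama v2 text for the detuning experiment)."""
    token_embeds = model.get_input_embeddings()(target_ids)
    prompt = soft_prompt.expand(target_ids.shape[0], -1, -1)
    inputs_embeds = torch.cat([prompt, token_embeds], dim=1)
    # Standard next-token loss on the target span; -100 masks the
    # soft-prompt positions out of the loss.
    labels = torch.cat(
        [torch.full((target_ids.shape[0], n_soft_tokens), -100, dtype=torch.long),
         target_ids],
        dim=1,
    )
    loss = model(inputs_embeds=inputs_embeds, labels=labels).loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The number of soft-prompt tokens a behavior needs before the loss bottoms out is what the paper treats as the conditional distance between the model and that behavior.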
Feb 2, 2024 • 45min

EA - Types of subjective welfare by MichaelStJules

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Types of subjective welfare, published by MichaelStJules on February 2, 2024 on The Effective Altruism Forum. Summary I describe and review four broad potential types of subjective welfare: 1) hedonic states, i.e. pleasure and unpleasantness, 2) felt desires, both appetitive and aversive, 3) belief-like preferences, i.e. preferences as judgements or beliefs about value, and 4) choice-based preferences, i.e. what we choose or would choose. My key takeaways are the following: Belief-like preferences and choice-based preferences seem unlikely to be generally comparable, and even comparisons between humans can be problematic (more). Hedonic states, felt desires and belief-like preferences all plausibly matter intrinsically, possibly together in a morally pluralistic theory, or unified as subjective appearances of value or even under belief-like preferences (more). Choice-based preferences seem not to matter much or at all intrinsically (more). The types of welfare are dissociable, so measuring one in terms of another is likely to misweigh it relative to more intuitive direct standards and risks discounting plausible moral patients altogether (more). There are multiple potentially important variations on the types of welfare (more). Acknowledgements Thanks to Brian Tomasik, Derek Shiller and Bob Fischer for feedback. All errors are my own. The four types It appears to me that subjective welfare - welfare whose value depends only or primarily[1] on the perspectives or mental states of those who hold them - can be roughly categorized into one of the following four types based on how they are realized: hedonic states, felt desires, belief-like preferences and choice-based preferences. To summarize, there's welfare as feelings (hedonic states and felt desires), welfare as beliefs about value (belief-like preferences) and welfare as choices (choice-based preferences). I will define, illustrate and elaborate on these types below. For some discussion of my choices of terminology, see the following footnote.[2] Hedonic states Hedonic states: feeling good and feeling bad, or (conscious) pleasure and unpleasantness/unpleasure/displeasure,[3] or (conscious) positive and negative affect. Their causes can be physical, like sensory pleasures and physical pain, or psychological, like achievement, failure, loss, shame, humour and threats, to name a few. It's unclear if interpersonal comparisons of hedonic state can be grounded in general, whether or not they can be between beings who realize them in sufficiently similar ways. In my view, the most promising approaches would be on the basis of the magnitudes of immediate and necessary cognitive or mental effects, causes or components of hedonic states. Other measures seem logically and intuitively dissociable (e.g. see the section on dissociation below) or incompatible with functionalism at the right level of abstraction (e.g. against using the absolute number of active neurons, see Mathers, 2022 and Shriver, 2022). Felt desires Felt desires: desires we feel.
They can be one of two types, either a) appetitive - or incentive and typically conducive to approach or consummatory behaviour and towards things - like in attraction, hunger and anger, or b) aversive - and typically conducive to avoidance and away from things - like in pain, fear, disgust and again anger (Hayes et al., 2014, Berridge, 2018, and on anger as aversive and appetitive, Carver & Harmon-Jones, 2009, Watson, 2009 and Lee & Lang, 2009). However, the actual approach/consummatory or avoidance behaviour is not necessary to experience a felt desire, and we can overcome our felt desires or be constrained from satisfying them. Potentially defining functions of felt desires could be their effects on attention and its control, as motivational salience, or incentive salience an...
