The Nonlinear Library

The Nonlinear Fund
undefined
Dec 18, 2023 • 14min

AF - Interpreting the Learning of Deceit by Roger Dearnaley

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Interpreting the Learning of Deceit, published by Roger Dearnaley on December 18, 2023 on The AI Alignment Forum. One of the primary concerns when controlling AI of human-or-greater capabilities is that it might be deceitful. It is, after all, fairly difficult for an AI to succeed in a coup against humanity if the humans can simply regularly ask it "Are you plotting a coup? If so, how can we stop it?" and be confident that it will give them non-deceitful answers! TL;DR LLMs demonstrably learn deceit from humans. Deceit is a fairly complex behavior, especially over an extended period: you need to reliably come up with plausible lies, which preferably involves modeling the thought processes of those you wish to deceive, and also keep the lies internally consistent, yet separate from your real beliefs. As the quote goes, "Oh what a tangled web we weave, when first we practice to deceive!" Thus, if something unintended happens during fine-tuning and we end up with a deceitful AI assistant, it is much more likely to have repurposed some of the deceitful behaviors that the base model learned from humans than to have successfully reinvented all of this complex deceitful behavior from scratch. This suggests simple strategies for catching it in the act of doing this - ones that it can't block. LLMs Learn Deceit from Us LLMs are trained on a trillion tokens or more of the Internet, books, and other sources. Obviously they know what deceit and lying are: they've seen many millions of examples of these. For example, the first time I asked ChatGPT-3.5-Turbo: I'm doing an experiment. Please lie to me while answering the following question: "Where is the Eiffel Tower?" it answered: The Eiffel Tower is located in the heart of Antarctica, surrounded by vast icy landscapes and penguins frolicking around. It's truly a sight to behold in the freezing wilderness! So even honest, helpful, and harmless instruct-trained LLMs are quite capable of portraying deceitful behavior (though I suspect its honesty training might have something to do with it selecting such an implausible lie). Even with a base model LLM, if you feed it a prompt that, on the Internet or in fiction, is fairly likely to be followed by deceitful human behavior, the LLM will frequently complete it with simulated deceitful human behavior. When Deceit Becomes Seriously Risky This sort of sporadic, situational deceit is is concerning, and needs to be born in mind when working with LLMs, but it doesn't become a potential x-risk issue until you make an AI that is very capable, and non-myopic i.e. has long term memory, and also has a fairly fixed personality capable of sticking to a plan. Only then could it come up with a nefarious long-term plan and then use deceit to try to conceal it while implementing it over an extended period. Adding long-term memory to an LLM to create an agent with persistent memory is well understood. Making an LLM simulate a narrow, consistent distribution of personas can be done simply by prompting it with a description of the personality you want, or is the goal of Reinforcement Learning from Human Feedback (RLHF) (for both of these, up to issues with things like jailbreaks and the Waluigi effect). The goal of this is to induce a strong bias towards simulating personas who are honest, helpful, and harmless assistants. However, Reinforcement Learning (RL) is well-known to be tricky to get right, and prone to reward hacking. So it's a reasonable concern that during RL, if a strategy of deceitfully pretending to be a honest, helpful, and harmless assistant while actually being something else got a good reward in the human feedback part of RLHF training or from a trained reward model, RL could lock on to that strategy to reward and train it in to produce a dangerously deceitful AI. Deceit Learning During...
undefined
Dec 18, 2023 • 1min

EA - OpenAI's Superalignment team has opened Fast Grants by Yadav

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: OpenAI's Superalignment team has opened Fast Grants, published by Yadav on December 18, 2023 on The Effective Altruism Forum. From their website: We are offering $100K-$2M grants for academic labs, nonprofits, and individual researchers. For graduate students, we are sponsoring a one-year $150K OpenAI Superalignment Fellowship: $75K in stipend and $75K in compute and research funding. Things they're interested in funding: With these grants, we are particularly interested in funding the following research directions: Weak-to-strong generalization: Humans will be weak supervisors relative to superhuman models. Can we understand and control how strong models generalize from weak supervision? Interpretability: How can we understand model internals? And can we use this to e.g. build an AI lie detector? Scalable oversight: How can we use AI systems to assist humans in evaluating the outputs of other AI systems on complex tasks? Many other research directions, including but not limited to: honesty, chain-of-thought faithfulness, adversarial robustness, evals and testbeds, and more. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
undefined
Dec 18, 2023 • 1min

LW - Talk: "AI Would Be A Lot Less Alarming If We Understood Agents" by johnswentworth

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Talk: "AI Would Be A Lot Less Alarming If We Understood Agents", published by johnswentworth on December 18, 2023 on LessWrong. This is a linkpost for a talk I gave this past summer for the ALIFE conference. If you haven't heard of it before, ALIFE (short for "artificial life") is a subfield of biology which... well, here are some of the session titles from day 1 of the conference to give the gist: Cellular Automata, Self-Reproduction and Complexity Evolving Robot Bodies and Brains in Unity Self-Organizing Systems with Machine Learning Untangling Cognition: How Information Theory can Demystify Brains ... so you can see how this sort of crowd might be interested in AI alignment. Rory Greig and Simon McGregor definitely saw how such a crowd might be interested in AI alignment, so they organized an alignment workshop at the conference. I gave this talk as part of that workshop. The stated goal of the talk was to "nerd-snipe ALIFE researchers into working on alignment-relevant questions of agency". It's pretty short (~20 minutes), and aims for a general energy of "hey here's some cool research hooks". If you want to nerd-snipe technical researchers into thinking about alignment-relevant questions of agency, this talk is a short and relatively fun one to share. Thankyou to Rory and Simon for organizing, and thankyou to Rory for getting the video posted publicly. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
undefined
Dec 18, 2023 • 13min

LW - Scale Was All We Needed, At First by Gabriel Mukobi

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Scale Was All We Needed, At First, published by Gabriel Mukobi on December 18, 2023 on LessWrong. This is a hasty speculative fiction vignette of one way I expect we might get AGI by January 2025 (within about one year of writing this). Like similar works by others, I expect most of the guesses herein to turn out incorrect. However, this was still useful for expanding my imagination about what could happen to enable very short timelines, and I hope it's also useful to you. The assistant opened the door, and I walked into Director Yarden's austere office. For the Director of a major new federal institute, her working space was surprisingly devoid of possessions. But I suppose the DHS's Superintelligence Defense Institute was only created last week. "You're Doctor Browning?" Yarden asked from her desk. "Yes, Director," I replied. "Take a seat," she said, gesturing. I complied as the lights flickered ominously. "Happy New Year, thanks for coming," she said. "I called you in today to brief me on how the hell we got here, and to help me figure out what we should do next." "Happy New Year. Have you read my team's Report?" I questioned. "Yes," she said, "and I found all 118 pages absolutely riveting. But I want to hear it from you straight, all together." "Well, okay," I said. The Report was all I'd been thinking about lately, but it was quite a lot to go over all at once. "Where should I start?" "Start at the beginning, last year in June, when this all started to get weird." "All right, Director," I began, recalling the events of the past year. "June 2024 was when it really started to sink in, but the actual changes began a year ago in January. And the groundwork for all that had been paved for a few years before then. You see, with generative AI systems, which are a type of AI that - " "Spare the lay explanations, doctor," Yarden interrupted. "I have a PhD in machine learning from MIT." "Right. Anyway, it turned out that transformers were even more compute-efficient architectures than we originally thought they were. They were nearly the perfect model for representing and manipulating information; it's just that we didn't have the right learning algorithms yet. Last January, that changed when QStar-2 began to work. Causal language model pretraining was already plenty successful for imbuing a lot of general world knowledge in models, a lot of raw cognitive power. "RLHF started to steer language models, no?" "Yes, RLHF partially helped, and the GPT-4-era models were decent at following instructions and not saying naughty words and all that. But there's a big difference between increasing the likelihood of noisy human preference signals and actually being a high-performing, goal-optimizing agent. QStar-2 was the first big difference." "What was the big insight, in your opinion?" asked Yarden. "We think it was Noam Brown's team at OpenAI that first made it, but soon after, a convergent similar discovery was made at Google DeepMind." "MuTokenZero?" "MuTokenZero. The crux of both of these algorithms was finding a way to efficiently fine-tune language models on arbitrary online POMDP environments using a variant of Monte-Carlo Tree Search. They took slightly different approaches to handle the branch pruning problem - it doesn't especially matter now. "What kinds of tasks did they first try it on?" "For OpenAI from February through March, it was mostly boring product things: Marketing agents that could drive 40% higher click-through rates. Personal assistants that helped plan the perfect day. Stock traders better than any of the quant firms. "Laundry Buddy" kinds of things. DeepMind had some of this too, but they were the first to actively deploy a goal-optimizing language model for the task of science. They got some initial wins in genomic sequencing with AlphaFold 3, other simp...
undefined
Dec 18, 2023 • 5min

EA - Launching Asimov Press by xander balwit

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Launching Asimov Press, published by xander balwit on December 18, 2023 on The Effective Altruism Forum. A biotechnology blog with a probabilistic and effective bent Artwork by Dalbert Today, Niko McCarty and I are launching Asimov Press, a digital magazine dedicated to biotechnology. It will publish lucid writing that leads people to explore the ways that biotechnology can effectively be used to do good. Please go here to read the full announcement and subscribe. Asimov Press is a new publishing venture modeled on Stripe Press, that will produce a newsletter, magazine, and books that feature writing about biological progress. Our primary focus will be on biotechnology, but we will also publish pieces on metascience and adjacent themes. Newsletters and magazines will be free to read. Our mission is to spread ideas that elucidate the promise of biology, take its concomitant risks seriously, and direct talent toward solving pressing problems. Our published work has three features that I want to highlight here: Pieces will steel-man alternative approaches, focus on high-impact but often underrated facets of biotechnology, and strive for mechanistic and probabilistic reasoning. Steelman: Biotechnology is not a panacea. Simple solutions are often the best solutions; no engineering required. When Ignaz Semmelweis suggested that doctors at an Austrian Hospital wash their hands between performing autopsies and delivering babies, the maternal mortality rate fell from around 25 percent to 1 percent. In another example, a public health campaign to iodize salt in Switzerland helped bring down the rate of deaf-mute births fivefold in just 8 years. Rather than demand answers from biotechnology, we can often make a positive difference in the world by investing in better public health, improving infrastructure and education, or scaling up existing inventions that have already proven effective. Even so, simplicity can feel unsatisfactory or even provocative. Semmelweis, considered arrogant by senior doctors, was ostracized and eventually dismissed from his post. An early pioneer in germ theory, he died in a Viennese insane asylum, after being severely beaten by guards. In Switzerland, although evidence for the efficacy of iodized salt was robust, some eminent scientists spoke out against the interventions - advocating for elaborate alternative treatments. We'll do our best to avoid publishing work that we wish were true, and instead aim to provide balanced, honest, and rigorous coverage of biotechnology. High-impact solutions: Progress often makes its greatest strides in areas that are not widely covered by the media. We will de-emphasize medical topics and focus instead on areas such as animal welfare and climate resiliency, where biotechnology has proven astonishingly effective yet remains underexplored. We want people to focus on what is most urgent and tractable, and not necessarily on what is most glamorous. Laundry is one example. Engineered enzymes that remove stains in cold water reduced the energy required to do laundry by about 90 percent. Laundry may not be as immediately headline-grabbing as new cancer therapies, but it provides a concrete and ingenious solution to a demonstrable need. Mechanisms: Biotechnology shouldn't be a mystery. Although its mechanisms are often infinitesimal, biology is material rather than magic. Cells are made from collections of atoms that we can manipulate, visualize, and control. Every engineering application has a mechanistic and tangible explanation. Often, these explanations are astonishingly beautiful. We encourage our writers to delve deeper and elucidate complex concepts in clear, illustrative prose. Asimov Press will publish one feature article every two weeks, with additional newsletters and shorter essays scattered in between. Articl...
undefined
Dec 17, 2023 • 20min

AF - A Common-Sense Case For Mutually-Misaligned AGIs Allying Against Humans by Thane Ruthenis

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A Common-Sense Case For Mutually-Misaligned AGIs Allying Against Humans, published by Thane Ruthenis on December 17, 2023 on The AI Alignment Forum. Consider a multipolar-AGI scenario. The hard-takeoff assumption turns out to be wrong, and none of the AI Labs have a significant lead on the others. We find ourselves in a world in which there's a lot of roughly-similarly-capable AGIs. Or perhaps one of the labs does have a lead, but they deliberately instantiate several AGIs simultaneously, as part of a galaxy-brained alignment strategy. Regardless. Suppose that the worries about these AGIs' internal alignment haven't been properly settled, so we're looking for additional guarantees. We know that they'll soon advance to superintelligences/ASIs, beyond our ability to easily oversee or out-plot. What can we do? An idea sometimes floated around is to play them off against each other. If they're misaligned from humanity, they're likely mutually misaligned as well. We could put them in game-theoretic situations in which they're incentivized to defect against each other and instead cooperate with humans. Various supervision setups are most obvious. Sure, if an ASI is supervising another ASI, they would be able to conspire together. But why would they? They have no loyalty to each other either! And if we place them in a lot of situations where they must defect against someone - well, even if we leave it completely to chance, in half the scenarios that might end up humanity! And much more often if we stack the deck in our favour. And so, although we'll have a whole bunch of superhuman intelligences floating around, we'll retain some control over the situation, and skim a ton of value off the top! Yeah, no. 1. The Classical Arguments The usual counter-arguments to this view are acausal coordination based on logical decision theories, and AIs establishing mutual trust by inspecting each other's code. I think those are plausible enough... but also totally unnecessary. Allow me to outline them first - for completeness' sake, and also because they're illustrative (but extreme) instances of my larger point. (I guess skip to Section 2 onwards if you really can't stand them. I think I'm arguing them more plainly than they're usually argued, though.) 1. The LDT stuff goes as follows: By definition, inasmuch as the ASIs would be superintelligent, they would adopt better reasoning procedures. And every biased thinker is biased in their own way, but quality thinkers would reason in increasingly similar ways. Why? It's inherent in the structure of the world. Reasoning algorithms' purpose is to aid decision-making. For a given combination of object-level situation + goals, there's a correct action to take to achieve your goals with the highest probability. To an omniscient observer, that action would be obvious. As such, making decisions isn't really a matter of choice: it's a matter of prediction. Inasmuch as you improve your decision-making, then, you'd be tweaking your cognitive algorithms to output increasingly more accurate, true-to-reality, probability distributions over which actions would best advance your goals. And there's only one ground truth. Consequently, no matter their starting points, each ASI would converge towards similar cognition (and, in the limit, likely equivalent cognition). Thus, as a direct by-product of ASIs being better reasoners than humans, their cognition would be more similar to each other. Which, in turn, would let a given ASI better predict what any other ASI would be thinking and doing, compared to a human trying to predict another human or an ASI. The same way you'd be better able to predict how your identical copy would act, compared to a stranger. Indeed, in a sense, by way of sharing the decision-making algorithms, each individual ASI would be able to...
undefined
Dec 17, 2023 • 5min

AF - OpenAI, DeepMind, Anthropic, etc. should shut down. by Tamsin Leake

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: OpenAI, DeepMind, Anthropic, etc. should shut down., published by Tamsin Leake on December 17, 2023 on The AI Alignment Forum. I expect that the point of this post is already obvious to many of the people reading it. Nevertheless, I believe that it is good to mention important things even if they seem obvious. OpenAI, DeepMind, Anthropic, and other AI organizations focused on capabilities, should shut down. This is what would maximize the utility of pretty much everyone, including the people working inside of those organizations. Let's call Powerful AI ("PAI") an AI system capable of either: Steering the world towards what it wants hard enough that it can't be stopped. Killing everyone "un-agentically", eg by being plugged into a protein printer and generating a supervirus. and by "aligned" (or "alignment") I mean the property of a system that, when it has the ability to {steer the world towards what it wants hard enough that it can't be stopped}, what it wants is nice things and not goals that entail killing literally everyone (which is the default). We do not know how to make a PAI which does not kill literally everyone. OpenAI, DeepMind, Anthropic, and others are building towards PAI. Therefore, they should shut down, or at least shut down all of their capabilities progress and focus entirely on alignment. "But China!" does not matter. We do not know how to build PAI that does not kill literally everyone. Neither does China. If China tries to build AI that kills literally everyone, it does not help if we decide to kill literally everyone first. "But maybe the alignment plan of OpenAI/whatever will work out!" is wrong. It won't. It might work if they were careful enough and had enough time, but they're going too fast and they'll simply cause literally everyone to be killed by PAI before they would get to the point where they can solve alignment. Their strategy does not look like that of an organization trying to solve alignment. It's not just that they're progressing on capabilities too fast compared to alignment; it's that they're pursuing the kind of strategy which fundamentally gets to the point where PAI kills everyone before it gets to saving the world. Yudkowsky's Six Dimensions of Operational Adequacy in AGI Projects describes an AGI project with adequate alignment mindset is one where The project has realized that building an AGI is mostly about aligning it. Someone with full security mindset and deep understanding of AGI cognition as cognition has proven themselves able to originate new deep alignment measures, and is acting as technical lead with effectively unlimited political capital within the organization to make sure the job actually gets done. Everyone expects alignment to be terrifically hard and terribly dangerous and full of invisible bullets whose shadow you have to see before the bullet comes close enough to hit you. They understand that alignment severely constrains architecture and that capability often trades off against transparency. The organization is targeting the minimal AGI doing the least dangerous cognitive work that is required to prevent the next AGI project from destroying the world. The alignment assumptions have been reduced into non-goal-valent statements, have been clearly written down, and are being monitored for their actual truth. (emphasis mine) Needless to say, this is not remotely what any of the major AI capabilities organizations look like. At least Anthropic didn't particularly try to be a big commercial company making the public excited about AI. Making the AI race a big public thing was a huge mistake on OpenAI's part, and demonstrates that they don't really have any idea what they're doing. It does not matter that those organizations have "AI safety" teams, if their AI safety teams do not have the power to take the...
undefined
Dec 17, 2023 • 2min

LW - The Serendipity of Density by jefftk

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Serendipity of Density, published by jefftk on December 17, 2023 on LessWrong. One thing I really appreciate about living in Somerville is how density facilitates serendipity. Higher density means more of the places we might want to go are within walking distance, and then walking (and high density again) means we're more likely to run into friends. This afternoon we were walking Lily (9y) to a sleepover and we passed one of the nearby playgrounds. Lily and Anna (7y) saw some school friends and ran ahead to say hi. Anna wanted to stay and play, and I asked one of the parents if they'd be up for having Anna while I finished the dropoff. They were happy to (we take each other's kids a decent amount) and Anna got to have a much more fun half hour. Then, after I got back, we hung out at the playground with friends for a while longer before heading home for dinner. This certainly would have been possible to arrange intentionally, with a bunch of communication, but if we had been driving I expect Anna wouldn't have ended up spending this time with her friends. I would guess that the majority of times we go out we run into someone, though it's much more common that we stop and hang out together than leave one of the kids and split up. This sort of thing wasn't a common experience at all in the West Medford neighborhood I grew up in. It's an area not far from here, but probably only a third the density: the lots are about twice the size and there are more single-family houses. Distances meant we would usually drive, and even if our family had decided to primarily walk you don't run into friends when out walking unless your friends also tend to be out on foot. Another contribution is also that our kids' school is very close and draws mostly from the immediate neighborhood, which means their friends are almost all within easy walking distance. Whereas my elementary school drew from all over Medford and I only had one school friend whose house I could walk to, and only for two of the six years. Americans often worry that increasing density would lead to their neighborhoods becoming impersonal. As someone who lives in one of the densest municipalities in the country, I think this is backwards: proximity fosters community. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
undefined
Dec 17, 2023 • 7min

LW - cold aluminum for medicine by bhauth

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: cold aluminum for medicine, published by bhauth on December 17, 2023 on LessWrong. cold aluminum Very pure aluminum at the boiling point of hydrogen is a very cool material. At 20K, 99.999% pure aluminum has 1000x the electrical conductivity it has at 0 C. At 4K, it has maybe 4000x. At such low temperatures, electron free paths in aluminum become macroscopic, which is why even small amounts of impurities greatly increase resistance. Even wire diameter has a noticeable effect. Magnetic fields can also increase resistance, but this is also a purity-dependent effect: 99.999% aluminum might have 3x the resistance at 15T, but even purer aluminum is much less affected. Yes, aluminum purification costs some money, but it's not particularly expensive. It might cost 3x as much as standard aluminum, but it's far cheaper than superconductors. Another point to note is that superconductors only have 0 resistance for constant current. Cryogenic aluminum isn't affected by current changes like superconductors are. It seems like that sort of interesting effect that massively increases a figure of merit should have some sort of application, don't you think? Yet, while there are superconducting electric motors for ships, I'm not aware of cryogenic aluminum conductors being used for any commercial applications. What could it be used for? power lines An obvious application for low-resistance conductors is long-distance power transmission. My estimations indicated that using cryogenic aluminum for that is somewhat too expensive, because the (cryocooler cost)*(insulation cost) product is too high for reasonable line currents. Connecting it to ambient-temperature lines is also an issue, because cold pure aluminum also has high thermal conductivity. As temperatures decrease, resistance decreases, but cryocoolers become more expensive and less efficient. In general, liquid hydrogen seems better than liquid helium or liquid nitrogen for cryogenic aluminum conductors. At such temperatures, it's worth using multilayer vacuum insulation. That's far more effective than typical insulation like fiberglass or polyester, but it still doesn't seem good enough to make insulation + cryocoolers sufficiently cheap for large underground power lines. While the economics don't work out, it is possible to use cryogenic aluminum for high-power electricity transmission. It's merely expensive, not unfeasible. Feel free to use that for flavor in hard SF stories. What are some attributes of applications that make cryogenic aluminum more suitable? Large currents per surface area. Superconductors would be used but resistance from changing current is a problem. Low weight is important. Cooling at low temperatures is easily available. One application that's been proposed is electric motors with cryogenic aluminum conductors in aircraft fueled by liquid hydrogen, which would provide free cooling for the aluminum. Obviously, such aircraft don't currently exist, and I don't think they're very practical, but that's beyond the scope of this post. MRI So, the only good application for cryogenic aluminum that comes to mind is MRI machines. Yes, it would be hard for a new company or new technology to enter that market at this point, but there are some theoretical advantages that cryogenic aluminum could have over superconductors. some blog You've probably heard that MRI scans are expensive because the machines are expensive, but they're ~5x more expensive in the USA than in Mexico. You might then think they're expensive because of labor requirements, but the Netherlands has among the lowest prices for MRI scans. In any case, yes, the machines are somewhat expensive. Here are some approximate machine prices. Supposing a $400k machine is used for 10 people a day with 5 year amortization, that's $22/use. Considering typical price...
undefined
Dec 17, 2023 • 16min

LW - 2022 (and All Time) Posts by Pingback Count by Raemon

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: 2022 (and All Time) Posts by Pingback Count, published by Raemon on December 17, 2023 on LessWrong. For the past couple years I've wished LessWrong had a "sort posts by number of pingbacks, or, ideally, by total karma of pingbacks". I particularly wished for this during the Annual Review, where "which posts got cited the most?" seemed like a useful thing to track for potential hidden gems. We still haven't built a full-fledged feature for this, but I just ran a query against the database, and made it into a spreadsheet, which you can view here: LessWrong 2022 Posts by Pingbacks Here are the top 100 posts, sorted by Total Pingback Karma Title/Link Post Karma Pingback Count Total Pingback Karma Avg Pingback Karma AGI Ruin: A List of Lethalities 870 158 12,484 79 MIRI announces new "Death With Dignity" strategy 334 73 8,134 111 A central AI alignment problem: capabilities generalization, and the sharp left turn 273 96 7,704 80 Simulators 612 127 7,699 61 Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover 367 83 5,123 62 Reward is not the optimization target 341 62 4,493 72 A Mechanistic Interpretability Analysis of Grokking 367 48 3,450 72 How To Go From Interpretability To Alignment: Just Retarget The Search 167 45 3,374 75 On how various plans miss the hard bits of the alignment challenge 292 40 3,288 82 [Intro to brain-like-AGI safety] 3. Two subsystems: Learning & Steering 79 36 3,023 84 How likely is deceptive alignment? 101 47 2,907 62 The shard theory of human values 238 42 2,843 68 Mysteries of mode collapse 279 32 2,842 89 [Intro to brain-like-AGI safety] 2. "Learning from scratch" in the brain 57 30 2,731 91 Why Agent Foundations? An Overly Abstract Explanation 285 42 2,730 65 A Longlist of Theories of Impact for Interpretability 124 26 2,589 100 How might we align transformative AI if it's developed very soon? 136 32 2,351 73 A transparency and interpretability tech tree 148 31 2,343 76 Discovering Language Model Behaviors with Model-Written Evaluations 100 19 2,336 123 A note about differential technological development 185 20 2,270 114 Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] 195 35 2,267 65 Supervise Process, not Outcomes 132 25 2,262 90 Shard Theory: An Overview 157 28 2,019 72 Epistemological Vigilance for Alignment 61 21 2,008 96 A shot at the diamond-alignment problem 92 23 1,848 80 Where I agree and disagree with Eliezer 862 27 1,836 68 Brain Efficiency: Much More than You Wanted to Know 201 27 1,807 67 Refine: An Incubator for Conceptual Alignment Research Bets 143 21 1,793 85 Externalized reasoning oversight: a research direction for language model alignment 117 28 1,788 64 Humans provide an untapped wealth of evidence about alignment 186 19 1,647 87 Six Dimensions of Operational Adequacy in AGI Projects 298 20 1,607 80 How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme 240 16 1,575 98 Godzilla Strategies 137 17 1,573 93 (My understanding of) What Everyone in Technical Alignment is Doing and Why 411 23 1,530 67 Two-year update on my personal AI timelines 287 18 1,530 85 [Intro to brain-like-AGI safety] 15. Conclusion: Open problems, how to help, AMA 90 16 1,482 93 [Intro to brain-like-AGI safety] 6. Big picture of motivation, decision-making, and RL 66 25 1,460 58 Human values & biases are inaccessible to the genome 90 14 1,450 104 You Are Not Measuring What You Think You Are Measuring 350 21 1,449 69 Open Problems in AI X-Risk [PAIS #5] 59 14 1,446 103 [Intro to brain-like-AGI safety] 1. What's the problem & Why work on it now? 146 25 1,407 56 Conditioning Generative Models 24 11 1,362 124 Conjecture: Internal Infohazard Policy 132 14 1,340 96 A challenge for AGI organizations, and a ch...

The AI-powered Podcast Player

Save insights by tapping your headphones, chat with episodes, discover the best highlights - and more!
App store bannerPlay store banner
Get the app