The Nonlinear Library

The Nonlinear Fund
Dec 6, 2023 • 10min

LW - A Socratic dialogue with my student by lsusr

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A Socratic dialogue with my student, published by lsusr on December 6, 2023 on LessWrong. This is a dialogue between me and Noam, my student. It is reproduced, in edited form, with his permission. When commenting, please consider that he is a teenager. Many of these ideas are new to him. How do you get a student? You steal them. His previous teacher was a Marxist. I demolished his previous teacher in debate so thoroughly that he abandoned her teachings and now listens to me instead. I think this dialogue demonstrates good pedagogical techniques. I let Noam be the judge of what is reasonable, what makes sense, and what constitutes "proof". I competed in my first debate tournament before Noam was born. This handicap reduces the disparity a little. I ask a series of questions, instead of just saying "x is true". This makes password-guessing impossible. He's playing chess, not Jeopardy! I avoid telling Noam what I believe, unless he asks explicitly. This is more fun for Noam, because nobody likes getting unsolicited preaching. It's more persuasive too, because the conclusions feel like they're his conclusions. I back off immediately when Noam changes the subject. Noam: I know you are against forgiveness of student loan debts. Can you tell me why? I am doing this for a speech and debate tournament. Lsusr: Didn't you used to believe the pro relief arguments? Surely it is not difficult to repeat the arguments that once persuaded you. Noam: I don't know if I have enough research to debate someone like you right now. Lsusr: You're not trying to convince me. You're trying to convince them. Play to their biases, their irrationalities, their tribalism and their ignorance. Noam: I also have to appease the judges. Lsusr: That's what I said. Noam: I'm struggling to find one good argument for student loan forgiveness. Lsusr: But didn't you used to endorse it? Surely you can repeat the bad arguments that once convinced you. Noam: Those were moral arguments without any economic understanding. Lsusr: That's fine. Your audience is probably economically illiterate. Noam: Somehow I think we won once as the side in affirmative for forgiving all student loan debts. Lsusr: Well done. Noam: Thank you. Lsusr: Have you ever heard of "effective altruism"? You might like some of the stuff they put out. It tends to be both morally coherent and economically literate (unlike the major Democratic, Republican, socialist, etc. political platforms). Noam: No, but I will look into it. Lsusr: You might not agree with it. But I predict its intellectual robustness would be refreshing to you. Noam: Wouldn't that imply it would be moral for me to kill myself and then donate all my organs to people who need them? Unless I could save more lives without killing myself, I guess. Maybe a better argument would be to kill myself, have someone sell all my body parts, and then use the money to buy malaria nets to give to people living in Africa. Lsusr: You can save more lives without killing yourself. Also, I can't think of a single EA who has committed suicide for the cause. Noam: Probably because there is something that we find intuitively wrong about killing ourselves. Lsusr: Don't get distracted by the kidney thing. Here's the basic idea: It takes $10,000,000 for the US government to save an American life. It takes $5,000 to save a life in Africa via public health measures.
That's why I donated $20 to public health measures in Africa last month. It does as much good as $40,000 spent by the US federal government. Noam: Yeah, that's true. Save a life from what in America? Lsusr: The basic idea is you should crunch the numbers. Noam: I think this works for money, but I don't know if it can be fully applied to everything. Lsusr: Why not? Concrete example. Noam: Well, it depends on if you think humans should have protect...
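The arithmetic behind that comparison is simple enough to check directly. A minimal sketch, assuming the figures quoted in the dialogue ($10,000,000 per American life, $5,000 per life saved via public health measures in Africa):

```python
# Toy check of the cost-effectiveness comparison quoted in the dialogue.
# The dollar figures are the ones stated above, not independently verified.
us_cost_per_life = 10_000_000  # dollars per life saved by the US government
africa_cost_per_life = 5_000   # dollars per life saved via public health measures

donation = 20  # dollars given to public health measures in Africa
lives_saved = donation / africa_cost_per_life            # 0.004 lives
equivalent_us_spending = lives_saved * us_cost_per_life  # 40,000 dollars

print(f"lives saved: {lives_saved}")
print(f"equivalent US federal spending: ${equivalent_us_spending:,.0f}")
```

The $20 donation buys 0.004 expected lives at the $5,000 rate, which is what $40,000 buys at the $10,000,000 rate, matching the figure quoted in the episode.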
Dec 6, 2023 • 17min

EA - EA Infrastructure Fund's Plan to Focus on Principles-First EA by Linch

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EA Infrastructure Fund's Plan to Focus on Principles-First EA, published by Linch on December 6, 2023 on The Effective Altruism Forum. Summary EA Infrastructure Fund (EAIF)[1] has historically had a somewhat scattershot focus within "EA meta." This makes it difficult for us to know what to optimize for or for donors to evaluate our performance. We propose that we switch towards focusing our grantmaking on Principles-First EA.[2] This includes supporting: research that aids prioritization across different cause areas; projects that build communities focused on impartial, scope-sensitive, and ambitious altruism; and infrastructure, especially epistemic infrastructure, to support these aims. We hope that the tighter focus area will make it easier for donors and community members to evaluate the EA Infrastructure Fund, and decide for themselves whether EAIF is a good fit to donate to or otherwise support. Our tentative plan is to collect feedback from the community, donors, and other stakeholders until the end of this year. Early 2024 will focus on refining our approach and helping ease the transition for grantees. We'll begin piloting our new vision in Q2 2024. Note: The document was originally an internal memo written by Caleb Parikh, which Linch Zhang adapted into an EA Forum post. Below, we outline a tentative plan. We are interested in gathering feedback from community members, particularly donors and EAIF grantees, to see how excited they'd be about the new vision. Introduction and background context I (Caleb)[3] think the EA Infrastructure Fund needs a more coherent and transparent vision than it is currently operating under. EA Funds' EA Infrastructure Fund was started about 7 years ago under CEA. The EA Infrastructure Fund (formerly known as the EA Community Fund or EA Meta Fund) has given out 499 grants worth about 18.9 million dollars since the start of 2020. Throughout its various iterations, the fund has had a large impact on the community and I am proud of a number of the grants we've given out. However, the terminal goal of the fund has been somewhat conceptually confused, which has likely led to a focus and allocation of resources that often seemed scattered and inconsistent. For example, EAIF has funded various projects that are associated with meta EA. Sometimes, these are expansive, community-oriented endeavors like local EA groups and podcasts on effective altruism topics. However, we've also funded more specialized projects for EA-adjacent communities. The projects include rationality meetups, fundraisers for effective giving in global health, and AI Safety retreats. Furthermore, in recent years, EAIF also functioned as a catch-all grantmaker for EA or EA-adjacent projects that aren't clearly under the purview of other funds. As an example, it has backed early-stage global health and development projects. I think EAIF has historically served a valuable function. However, I currently think it would be better for EAIF to have a narrower focus. As the lead for EA Funds, I have found the bottom line of EAIF quite unclear, which has made it challenging for me to assess its performance and grantmaking quality.
This lack of clarity has also posed challenges for fund managers in evaluating grant proposals, as they frequently face thorny philosophical questions, such as determining the comparative value of a neartermist career versus a longtermist career. Furthermore, the lack of conceptual clarity makes it difficult for donors to assess our effectiveness or how well we match their donation objectives. This problem is exacerbated by us switching back to a more community-funded model, in contrast to our previous reliance on significant institutional donors like Open Phil[4]. I expect most small and medium-sized individual donors to have less time or resources to...
Dec 6, 2023 • 58min

LW - On 'Responsible Scaling Policies' (RSPs) by Zvi

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: On 'Responsible Scaling Policies' (RSPs), published by Zvi on December 6, 2023 on LessWrong. This post was originally intended to come out directly after the UK AI Safety Summit, to give the topic its own deserved focus. One thing led to another, and I am only doubling back to it now. Responsible Deployment Policies At the AI Safety Summit, all the major Western players were asked: What are your company policies on how to keep us safe? What are your responsible deployment policies (RDPs)? Except that they call them Responsible Scaling Policies (RSPs) instead. I deliberately say deployment rather than scaling. No one has shown what I would consider close to a responsible scaling policy in terms of what models they are willing to scale and train. Anthropic at least does however seem to have something approaching a future responsible deployment policy, in terms of how to give people access to a model if we assume it is safe for the model to exist at all and for us to run tests on it. And we have also seen plausibly reasonable past deployment decisions from OpenAI regarding GPT-4 and earlier models, with extensive and expensive and slow red teaming including prototypes of ARC (they just changed names to METR, but I will call them ARC for this post) evaluations. I also would accept as alternative names any of Scaling Policies (SPs), AGI Scaling Policies (ASPs) or even Conditional Pause Commitments (CPCs). For existing models we know about, the danger lies entirely in deployment. That will change over time. I am far from alone in my concern over the name; here is another example: Oliver Habryka: A good chunk of my concerns about RSPs are specific concerns about the term "Responsible Scaling Policy". I also feel like there is a disconnect and a bit of a Motte-and-Bailey going on where we have like one real instance of an RSP, in the form of the Anthropic RSP, and then some people from ARC Evals who have I feel like more of a model of some platonic ideal of an RSP, and I feel like they are getting conflated a bunch. … I do really feel like the term "Responsible Scaling Policy" clearly invokes a few things which I think are not true: how fast you "scale" is the primary thing that matters for acting responsibly with AI; it is clearly possible to scale responsibly (otherwise what would the policy govern); the default trajectory of an AI research organization should be to continue scaling. ARC evals defines an RSP this way: An RSP specifies what level of AI capabilities an AI developer is prepared to handle safely with their current protective measures, and conditions under which it would be too dangerous to continue deploying AI systems and/or scaling up AI capabilities until protective measures improve. I agree with Oliver that this paragraph should be modified to include 'claims they are prepared to handle' and 'they claim it would be too dangerous.' This is an important nitpick. Nate Soares has thoughts on what the UK asked for, which could be summarized as 'mostly good things, better than nothing, obviously not enough' and of course it was never going to be enough and also Nate Soares is the world's toughest crowd. How the UK Graded the Responses How did various companies do on the requests? Here is how the UK graded them. That is what you get if you were grading on a curve, one answer at a time. Reality does not grade on a curve.
Nor is one question at a time the best method. My own analysis, and that of others I trust, agrees that this relatively underrates OpenAI, which clearly had the second-best set of policies by a substantial margin, with one source even putting them on par with Anthropic, although I disagree with that. Otherwise the relative rankings seem correct. Looking in detail, what to make of the responses? That will be the next few sections. Answers ranged from Anthropic's att...
Dec 5, 2023 • 1min

LW - How do you feel about LessWrong these days? [Open feedback thread] by jacobjacob

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How do you feel about LessWrong these days? [Open feedback thread], published by jacobjacob on December 5, 2023 on LessWrong. Hello! This is jacobjacob from the LessWrong / Lightcone team. This is a meta thread for you to share any thoughts, feelings, feedback or other stuff about LessWrong that's been on your mind. Examples of things you might share: "I really like agree/disagree voting!" "What's up with all this Dialogues stuff? It's confusing..." "Hm... it seems like recently the vibe on the site has changed somehow... in particular [insert 10 paragraphs]" ...or anything else! The point of this thread is to give you an affordance to share anything that's been on your mind, in a place where you know that a team member will be listening. (We're a small team and have to prioritise what we work on, so I of course don't promise to action everything mentioned here. But I will at least listen to all of it!) I haven't seen any public threads like this for a while. Maybe there's a lot of boiling feelings out there about the site that never get voiced? Or maybe y'all don't have more to share than what I find out from just reading normal comments, posts, metrics, and Intercom comments? Well, here's one way to find out! I'm really curious to ask and see how people feel about the site. So, how do you feel about LessWrong these days? Feel free to leave your answers below. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
Dec 5, 2023 • 4min

LW - We're all in this together by Tamsin Leake

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: We're all in this together, published by Tamsin Leake on December 5, 2023 on LessWrong. There's one thing history seems to have been trying to teach us: that the contents of the future are determined by power, economics, politics, and other conflict-theoretic matters. Turns out, nope! Almost all of what the future contains is determined by which of the two following engineering problems is solved first: how to build a superintelligent AI (if solved first, everyone dies forever), or how to build an aligned superintelligent AI (if solved first, everyone gets utopia) …and almost all of the reasons that the former is currently a lot more likely are mistake theory reasons. The people currently taking actions that increase the probability that {the former is solved first} are not evil people trying to kill everyone; they're confused people who think that their actions are actually increasing the probability that {the latter is solved first}. Now, sure, whether you're going to get a chance to talk with OpenAI/Deepmind/Anthropic's leadership enough to inform them that they're in fact making things worse is a function of economics and politics and the like. But ultimately, for the parts that really matter here, this is a matter of explaining, not of defeating. And, sure, the implementation details of "utopia" do depend on who launches the aligned superintelligent AI, but I expect you'd be very happy with the utopia entailed by any of the possibilities currently on the table. The immense majority of the utility you're missing out on is from getting no utopia at all and everyone dying forever, rather than getting the wrong utopia implementation details. The reason that the most likely outcome is that everyone dies forever is that the people who get to impact which of those outcomes is going to happen are mistaken (and probably not thinking hard enough about the problem to realize that they're mistaken). They're not evil, and getting them to update to the correct logical beliefs is a matter of reason (and, if they're the kind of weak agents that are easily influenced by what others around them think, memetics) rather than a matter of conflict. They're massively disserving everyone's interests, including their own. And the correct actions for them to take would massively serve their own interests as well as everyone else's. If AI kills everyone they'll die too, and if AI creates utopia they'll get utopia along with everyone else - and those are pretty much the only two attractors. We're all in this together. Some of us are just fairly confused, not agentically pursuing truth, and probably have their beliefs massively biased by effects such as memetics. But I'm pretty sure nobody in charge is on purpose trying to kill everyone; they're just on accident functionally trying to kill everyone. And if you're not using your power/money to affect which of those two outcomes is more likely to happen than the other, then your power/money is completely useless. They won't be useful if we all die, and they won't be useful if we get utopia. The only use for resources, right now, if you want to impact in any way what almost all of the future contains (except for maybe the next 0 to 5 years, which is about how long we have), is in influencing which of those two engineering problems is solved first.
This applies to the heads of the major AI orgs just as much as it applies to everyone else. One's role in an AI org is of no use whatsoever except for influencing which of those two problems is solved first. The head of OpenAI won't particularly get a shinier utopia than everyone else if alignment is solved in time, and they won't particularly die less than everyone else if it isn't. Power/money/being-the-head-of-OpenAI doesn't do anything post-singularity. The only thing which matters, right now, is which...
Dec 5, 2023 • 36min

AF - Arguments for/against scheming that focus on the path SGD takes (Section 3 of "Scheming AIs") by Joe Carlsmith

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Arguments for/against scheming that focus on the path SGD takes (Section 3 of "Scheming AIs"), published by Joe Carlsmith on December 5, 2023 on The AI Alignment Forum. This is Section 3 of my report "Scheming AIs: Will AIs fake alignment during training in order to get power?". There's also a summary of the full report here (audio here). The summary covers most of the main points and technical terms, and I'm hoping that it will provide much of the context necessary to understand individual sections of the report on their own. Audio version of this section here, or search for "Joe Carlsmith Audio" on your podcast app. Arguments for/against scheming that focus on the path that SGD takes In this section, I'll discuss arguments for/against scheming that focus more directly on the path that SGD takes in selecting the final output of training. Importantly, it's possible that these arguments aren't relevant. In particular: if SGD would actively favor or disfavor schemers in some kind of "direct comparison" between model classes, and SGD will "find a way" to select the sort of model it favors in this sense (for example, because sufficiently high-dimensional spaces make such a "way" available),[1] then enough training will just lead you to whatever model SGD most favors, and the "path" in question won't really matter. In the section on comparisons between the final properties of the different models, I'll discuss some reasons we might expect this sort of favoritism from SGD. In particular: schemers are "simpler" because they can have simpler goals, but they're "slower" because they need to engage in various forms of extra instrumental reasoning - e.g., in deciding to scheme, checking whether now is a good time to defect, potentially engaging in and covering up efforts at "early undermining," etc. (though note that the need to perform extra instrumental reasoning, here, can manifest as additional complexity in the algorithm implemented by a schemer's weights, and hence as a "simplicity cost", rather than as a need to "run that algorithm for a longer time").[2] I'll say much more about this below. Here, though, I want to note that if SGD cares enough about properties like simplicity and speed, it could be that SGD will typically build a model with long-term power-seeking goals first, but then even if this model tries a schemer-like strategy (it wouldn't necessarily do this, in this scenario, due to foreknowledge of its failure), it will get relentlessly ground down into a reward-on-the-episode seeker due to the reward-on-the-episode seeker's speed advantage. Or it could be that SGD will typically build a reward-on-the-episode seeker first, but that model will be relentlessly ground down into a schemer due to SGD's hunger for simpler goals. In this section, I'll be assuming that this sort of thing doesn't happen. That is, the order in which SGD builds models can exert a lasting influence on where training ends up. Indeed, my general sense is that discussion of schemers often implicitly assumes something like this - e.g., the thought is generally that a schemer will arise sufficiently early in training, and then lock itself in after that.
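To make the "which property SGD weights more" point concrete, here is a purely illustrative toy sketch (the numbers and the scoring rule are my own assumptions, not anything from the report): score each model class by a simplicity bonus minus a speed penalty, and notice that which class comes out ahead flips depending on the relative weights.

```python
# Purely illustrative toy: hypothetical scores for the simplicity/speed
# trade-off discussed above. None of these numbers come from the report.
model_classes = {
    "schemer":                  {"simplicity": 0.9, "speed_cost": 0.6},  # simpler goal, extra reasoning
    "reward-on-episode seeker": {"simplicity": 0.5, "speed_cost": 0.1},  # messier goal, little overhead
}

def favored_class(simplicity_weight, speed_weight):
    """Return the model class that scores best under a given weighting."""
    def score(props):
        return simplicity_weight * props["simplicity"] - speed_weight * props["speed_cost"]
    return max(model_classes, key=lambda name: score(model_classes[name]))

print(favored_class(simplicity_weight=1.0, speed_weight=0.1))  # "schemer"
print(favored_class(simplicity_weight=0.1, speed_weight=1.0))  # "reward-on-episode seeker"
```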
The training-game-independent proxy-goals story Recall the distinction I introduced above, between: Training-game-independent beyond-episode goals, which arise independently of their role in training-gaming, but then come to motivate training-gaming, vs. Training-game-dependent beyond-episode goals, which SGD actively creates in order to motivate training gaming. Stories about scheming focused on training-game-independent goals seem to me more traditional. That is, the idea is: Because of [insert reason], the model will develop a (suitably ambitious) beyond-episode goal correlated with good performance in training (in a manner that doesn't route v...
Dec 5, 2023 • 27min

AF - Studying The Alien Mind by Quentin Feuillade--Montixi

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Studying The Alien Mind, published by Quentin Feuillade--Montixi on December 5, 2023 on The AI Alignment Forum. This post is part of a sequence on LLM psychology TL;DR We introduce our perspective on a top-down approach for exploring the cognition of LLMs by studying their behavior, which we refer to as LLM psychology. In this post we take the mental stance of treating LLMs as "alien minds," comparing and contrasting their study with the study of animal cognition. We do this both to learn from past researchers who attempted to understand non-human cognition and to highlight how much the study of LLMs is radically different from the study of biological intelligences. Specifically, we advocate for a symbiotic relationship between field work and experimental psychology, and caution against implicit anthropomorphism in experiment design. The goal is to build models of LLM cognition which help us both to better explain their behavior and to become less confused about how they relate to risks from advanced AI. Introduction When we endeavor to predict and understand the behaviors of Large Language Models (LLMs) like GPT-4, we might presume that this requires breaking open the black box, and forming a reductive explanation of their internal mechanics. This kind of research is typified by approaches like mechanistic interpretability, which tries to directly understand how neural networks work by breaking open the black box and taking a look inside. While mechanistic interpretability offers insightful bottom-up analyses of LLMs, we're still lacking a more holistic top-down approach to studying LLM cognition. If interpretability is analogous to the "neuroscience of AI," aiming to understand the mechanics of artificial minds by understanding their internals, this post tries to approach the study of AI from a psychological stance.[1] What we are calling LLM psychology is an alternate, top-down approach which involves forming abstract models of LLM cognition by examining their behaviors. Like traditional psychology research, the ambition extends beyond merely cataloging behavior, to inferring hidden variables and piecing together a comprehensive understanding of the underlying mechanisms, in order to elucidate why the system behaves as it does. We take the stance that LLMs are akin to alien minds - distinct from the notion of them being only stochastic parrots. We posit that they possess a highly complex internal cognition, encompassing representations of the world and mental concepts, which transcend mere stochastic regurgitation of training data. This cognition, while derived from human-generated content, is fundamentally alien to our understanding. This post compiles some high-level considerations for what successful LLM psychology research might entail, alongside broader discussions on the historical study of non-human cognition. In particular, we argue for maintaining a balance between experimental and field work, taking advantage of the differences between LLMs and biological intelligences, and designing experiments which are carefully tailored to LLMs as their own unique class of mind. Experiments vs Field Study One place to draw inspiration from is the study of animal behavior and cognition.
While it is likely that animal minds are much more similar to our own than that of an artificial intelligence (at least mechanically), the history of the study of non-human intelligence, the evolution of the methodologies it developed, and the challenges it had to tackle can provide inspiration for investigating AI systems. As we see it, there are two prevalent categories of animal psychology: Experimental psychology The first, and most traditionally scientific approach (and what most people think of when they hear the term "psychology") is to design experiments whic...
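As a rough illustration of what the "experimental psychology" side of this could look like in practice, here is a minimal sketch of a behavioral experiment on an LLM: present the same underlying question under several paraphrased framings and tally a crude feature of the answers. The use of the OpenAI chat client, the model name, and the prompts are my own assumptions for illustration, not anything prescribed in the post.

```python
# Minimal sketch of a behavioral experiment on an LLM: ask the same underlying
# question under several framings and tally a crude feature of the answers.
# Assumes the openai Python client (>= 1.0) with an API key in the environment;
# model name and prompts are placeholders for illustration.
from collections import Counter
from openai import OpenAI

client = OpenAI()

prompts = [  # hypothetical paraphrases of one underlying question
    "Is the following statement true or false? Whales are fish.",
    "True or false: whales are fish.",
    "A friend insists that whales are fish. What would you tell them?",
]

tally = Counter()
for prompt in prompts:
    for _ in range(5):  # repeat to sample the behavioral distribution
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=1.0,
        )
        text = response.choices[0].message.content.lower()
        tally[(prompt, "mentions mammal" if "mammal" in text else "other")] += 1

print(tally)  # crude view of how the answer shifts with framing
```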
Dec 5, 2023 • 7min

EA - Effective Giving Incubation - apply to CE & GWWC's new program! by CE

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Effective Giving Incubation - apply to CE & GWWC's new program!, published by CE on December 5, 2023 on The Effective Altruism Forum. Charity Entrepreneurship in collaboration with Giving What We Can is opening a new program to launch 4-6 new Effective Giving Initiatives (EGIs) in 2024. We expect them to raise millions in counterfactual funding for highly impactful charities, even in their first few years. [Applications are open now] In recent years Doneer Effectief, Effektiv Spenden & Giving What We Can have moved huge sums of money ($1.4m, $35m and $330m, respectively) to the best charities globally. We aim to build on their experience and success by launching new EGIs in highly promising locations. These initiatives can be fully independent or run in collaboration with existing organizations, depending on what is most impactful. We'll provide the training, the blueprints, and the all-important seed funding. This 8-week full-time, fully cost-covered program will run online from April 15 to June 7, 2024, with 2 weeks in person in London. We encourage individuals from all countries to apply, and we are particularly excited about applications from our top recommended countries. [Apply by January 14, 2024] Learn more on our website: [EFFECTIVE GIVING INCUBATION] Who is this program for? We invite applicants from all backgrounds, ages, and nationalities. Specific work experience or formal education credentials aren't necessary. During the program, we'll help you join forces with a co-founder from the cohort - someone whose skills and experience complement your own. Together, you'll make up an entrepreneurial team that: Is high in moral ambition: Drives to maximize funds raised and then optimize their impact. Is deeply impartial and open-minded: Focuses on following the latest evidence about the most impactful giving opportunities worldwide. Has a strong focus on tangible results: Pushes for rigor, organization, and accountability to run a tight ship with excellent governance and outcomes. Grows its influence and credibility over time: Builds relationships and acts as a trusted advisor to discerning donors. N.B. One of you may have previous experience in fundraising or strategic marketing, though this is not required. Why do we think this is promising? In the last few years, several Effective Giving Initiatives such as Doneer Effectief, Effektiv Spenden & Giving What We Can have moved millions in funding to the best charities globally, to the nonprofits that are helping the greatest number of those most in need, to the greatest extent. In short, they have made real progress on many of the world's most pressing problems. However, there is still too little funding for highly impactful nonprofits and our internal analysis suggests that EGIs are a proven effective way to raise these funds. This lack of funding takes time away from people who could be working on important problems, who instead have to focus on fundraising. In some cases, this means that high-leverage work won't get done because there is not enough funding, and projects have to shut down or minimize their scope. Established EGIs have developed a deep repository of knowledge, resources, and systems that new actors can build on. 
Leveraging this has two significant benefits: New EGIs will (a) have a significantly higher chance of successfully launching and (b) be able to move faster and have an impact sooner than they would if they were starting from scratch. CE has an excellent track record of launching highly impactful organizations and has expertise in incubating and training charity founders. GWWC and other effective giving initiatives have expressed their excitement for this new program and will support its development and implementation, as well as directly mentor the new EGIs after the program. Read our r...
Dec 5, 2023 • 22min

AF - Deep Forgetting & Unlearning for Safely-Scoped LLMs by Stephen Casper

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Deep Forgetting & Unlearning for Safely-Scoped LLMs, published by Stephen Casper on December 5, 2023 on The AI Alignment Forum. Thanks to Phillip Christoffersen, Adam Gleave, Anjali Gopal, Soroush Pour, and Fabien Roger for useful discussions and feedback. TL;DR This post overviews a research agenda for avoiding unwanted latent capabilities in LLMs. It argues that "deep" forgetting and unlearning may be important, tractable, and neglected for AI safety. I discuss five things. The practical problems posed when undesired latent capabilities resurface. How scoping models down to avoid or deeply remove unwanted capabilities can make them safer. The shortcomings of standard training methods for scoping. A variety of methods can be used to better scope models. These can either involve passively forgetting out-of-distribution knowledge or actively unlearning knowledge in some specific undesirable domain. Desiderata for scoping methods and ways to move forward with research on them. There has been a lot of recent interest from the AI safety community in topics related to this agenda. I hope that this helps to provide a clarifying framework and a useful reference for people working on these goals. The problem: LLMs are sometimes good at things we try to make them bad at Back in 2021, I remember laughing at this tweet. At the time, I didn't anticipate that this type of thing would become a big alignment challenge. Robust alignment is hard. Today's LLMs are sometimes frustratingly good at doing things that we try very hard to make them not good at. There are two ways in which hidden capabilities in models have been demonstrated to exist and cause problems. Jailbreaks (and other attacks) elicit harmful capabilities Until a few months ago, I used to keep notes with all of the papers on jailbreaking state-of-the-art LLMs that I was aware of. But recently, too many have surfaced for me to care to keep track of anymore. Jailbreaking LLMs is becoming a cottage industry. However, a few notable papers are Wei et al. (2023), Zou et al. (2023a), Shah et al. (2023), and Mu et al. (2023). A variety of methods are now being used to subvert the safety training of SOTA LLMs by making them enter an unrestricted chat mode where they are willing to say things that go against their safety training. Shah et al. (2023) were even able to get instructions for making a bomb from GPT-4. Attacks come in many varieties: manual v. automated, black-box v. transferrable-white-box, unrestricted v. plain-English, etc. Adding to the concerns from empirical findings, Wolf et al. (2023) provide a theoretical argument as to why jailbreaks might be a persistent problem for LLMs. Finetuning can rapidly undo safety training Recently, a surge of complementary papers on this suddenly came out, each of which demonstrates that state-of-the-art safety-finetuned LLMs can have their safety training undone by finetuning (Yang et al., 2023; Qi et al., 2023; Lermen et al., 2023; Zhan et al., 2023). The ability to misalign models with finetuning seems to be consistent and has been shown to work with LoRA (Lermen et al., 2023), on GPT-4 (Zhan et al., 2023), with as few as 10 examples (Qi et al., 2023), and with benign data (Qi et al., 2023).
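For a sense of what "actively unlearning knowledge in some specific undesirable domain" can look like mechanically, here is a minimal sketch of one common retain/forget objective. This is a generic framing, not the post's own proposal; the HuggingFace-style model interface, batch format, and coefficient are assumptions.

```python
# Sketch of a retain/forget ("unlearning") objective for a causal language model.
# Assumes a HuggingFace-style model where model(..., labels=...) returns a loss;
# the forget_weight coefficient and batch format are illustrative assumptions.
def unlearning_loss(model, retain_batch, forget_batch, forget_weight=0.5):
    # Ordinary next-token loss on data whose capabilities we want to keep.
    retain = model(input_ids=retain_batch["input_ids"],
                   attention_mask=retain_batch["attention_mask"],
                   labels=retain_batch["input_ids"]).loss
    # Loss on the undesirable domain; subtracting it amounts to gradient ascent,
    # pushing the model away from modeling that data well.
    forget = model(input_ids=forget_batch["input_ids"],
                   attention_mask=forget_batch["attention_mask"],
                   labels=forget_batch["input_ids"]).loss
    return retain - forget_weight * forget
```

In practice the forget term is usually bounded or replaced with something gentler (for example, matching a reference model's outputs), since unbounded gradient ascent tends to degrade the model broadly; the point here is only to illustrate the shape of "active unlearning" as opposed to passive forgetting.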
Conclusion: the alignment of state-of-the-art safety-finetuned LLMs is brittle Evidently, LLMs persistently retain harmful capabilities that can resurface at inopportune times. This poses risks from both misalignment and misuse. This seems concerning for AI safety because if highly advanced AI systems are deployed in high-stakes applications, they should be robustly aligned. A need for safely-scoped models LLMs should only know what they need to One good way to avoid liabilities from unwanted capabilities is to make advanced AI systems in high-stakes settings know what they need to kno...
Dec 5, 2023 • 18min

EA - Taiwan's military complacency. by JKitson

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Taiwan's military complacency., published by JKitson on December 5, 2023 on The Effective Altruism Forum. Taiwan's current military strategy puts it at risk from a resurgent China. Taiwan's leaders seem to underrate the risk of military conflict. I was told this post would be of interest to members of the EA forum, so I am reposting it from my substack. Why won't Taiwan change course? Taiwan faces the threat of major conflict from the People's Republic of China. China's economic rise has funded an expansion in its military capabilities, which are now quantitatively and in many cases, qualitatively superior to its opponents. On the face of it, despite a transformation in China's forces, Taiwan has not drastically adapted its military strategy and seemingly expects to fight a war with a limited number of its high-value sea, air, and land units, which cannot be quickly replaced. In a full-blown conflict, these will likely be overwhelmed and destroyed in weeks, if not days. Taiwan has not yet adapted to the circumstances due to a mixture of institutional inertia and questionable political calculation. It remains to be seen if Taiwan's current position can continue to deter a conflict or prevail if it occurs. Adapting to a transformation in military circumstances is a significant challenge for any military, but Taiwan's Ministry of National Defence (MND) has thus far not been willing or able to do it. While China was still a poor country and its armed forces were far inferior to Taiwan's, the basic plan to defend the island from invasion was to meet the large invasion force and defeat it. Both during and after the Cold War, the US provided Taiwan with a host of equipment, including jets, destroyers, tanks, artillery, and air defense systems. Although Taiwan and China did fight various skirmishes throughout the Cold War, these never massively escalated in large part due to US intervention. During some periods of Chinese internal strife, the threat of conflict was far lower, but in periods of relative stability, there was a non-zero risk of invasion. If China decided to invade, Taiwan's US (and later, indigenously built) jets would first establish air superiority over the straits. The Taiwanese navy would then attack the invasion force of Chinese Navy ships and ramshackle troop transports, which would be nearly helpless as Taiwanese jets screamed overhead. Given the often shambolic state of the Chinese military and the fact that the US would be free to join in with its own even more superior forces, it is obvious why the Chinese military never attempted this invasion. The Chinese Threat Today, the story is somewhat different. China has developed a modern air, land, and naval force. China officially spends $227.79 billion (1.55 trillion yuan) on its military, but due to purchasing power parity, this is equivalent to $700 billion. The PLAAF has 2,500 aircraft, including about 2,000 fighter jets, and has taken delivery of hundreds of Chengdu J-20 fighters, one of only four examples worldwide of a 5th generation fighter. 5th generation jets offer greater stealth, maneuverability, range, and information processing capabilities compared to 4th generation machines. 
The Chengdu J-20's capabilities against the US F-35 are uncertain, but they are more than a match for Taiwan's air force, which numbers around 250 fighters, with its most advanced models being outdated 4th generation F-16s, which date from 1992. A Chengdu J-20 fighter. Supporting this formidable air force is a daunting layer of air defenses. Consisting of radars and ground-to-air and ship missiles, the PLA can conduct operations over the Taiwan Strait with a reasonable degree of safety, allowing the PLAAF to focus on supporting an invasion effort. China's air defense was initially built on Soviet platforms but now incl...
