The Nonlinear Library

The Nonlinear Fund
Dec 19, 2023 • 4min

LW - Constellations are Younger than Continents by Jeffrey Heninger

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Constellations are Younger than Continents, published by Jeffrey Heninger on December 19, 2023 on LessWrong. At the Bay Area Solstice, I heard the song Bold Orion for the first time. I like it a lot. It does, however, have one problem: He has seen the rise and fall of kings and continents and all, Rising silent, bold Orion on the rise. Orion has not witnessed the rise and fall of continents. Constellations are younger than continents. The time scale that continents change on is tens or hundreds of millions of years. The time scale that stars the size of the sun live and die on is billions of years. So stars are older than continents. But constellations are not stars or sets of stars. They are the patterns that stars make in our night sky. The stars of some constellations are close together in space, and are gravitationally bound together, like the Pleiades. The Pleiades likely have been together, and will stay close together, for a few hundred million years. I think they are the oldest constellation. The stars of most constellations are not close together in space. They are close in the 2D projection onto the night sky, but the distances to the stars are often dramatically different. They are on different orbits around the center of the Milky Way. The sun and many of the nearby stars take about 230 million years to orbit the center of the Milky Way, but this is also not the relevant timescale for constellations to change. The relevant timescale is determined by the differences between the velocities of stars in this part of the Milky Way.[1] This has been measured by astronomers: tracking small changes in the positions or brightness of large numbers of stars is a central thing that astronomers do. Constellations change on a timescale of tens or hundreds of thousands of years. This is much faster than the movement of continents. Orion is an unusual constellation. You can see above that the positions of its 7 brightest stars change more slowly than those of other constellations. Many of the stars in Orion actually are related. They form a stellar association: they were formed at a similar time, are moving in a similar way, and are weakly gravitationally interacting. The stars of Orion will likely move around within the constellation, but many of them will remain close to each other for their entire lives. Some of the dimmer stars currently in Orion are not part of the stellar association and are simply passing through. The stars in the stellar association are young: at most about 12 million years old. Rigel (the brightest star in Orion) is 8 million years old. Alnilam is 6 million years old. Alnitak is 7 million years old. Saiph is 11 million years old. These stars are also unusually large and bright. The larger the star, the shorter it lives. Most of the bright stars in Orion will not live to be 20 million years old. Betelgeuse, usually the second brightest star in Orion, is special. It is noticeably red, and fluctuates dramatically in brightness. It formed in the stellar association about 8 million years ago, but is now leaving. It won't get very far. Within about 100,000 years, Betelgeuse will go supernova and shine as brightly as the half moon for three months. Bright enough to be awesome but dim enough to not be dangerous.
Most constellations change as the stars move relative to each other on a time scale of tens or hundreds of thousands of years. Orion will last longer, for millions of years, before its bright stars burn out and go supernova. Neither of these is long enough to watch the continents rise and fall. ^ This is a kind of "temperature", if the stars themselves are treated as individual "atoms" in the "gas" of the Milky Way. The analogy is not perfect. Unlike atoms in a gas, stars almost never collide, and don't bounce off each other if they do, so there isn't "press...
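As a rough sanity check on the tens-of-thousands-of-years figure quoted above, here is a back-of-the-envelope sketch in Python. The ~20 km/s relative velocity and ~300 light-year distance are illustrative assumptions (typical round numbers for nearby bright stars), not figures taken from the post.

import math

# How long does it take a nearby bright star to drift ~1 degree across the sky?
KM_PER_LIGHT_YEAR = 9.461e12
SECONDS_PER_YEAR = 3.156e7

relative_velocity_km_s = 20.0   # assumed transverse velocity relative to the Sun
distance_ly = 300.0             # assumed distance to a bright naked-eye star

# Arc length swept by a 1-degree shift, then the time to cover it at that velocity.
arc_km = distance_ly * KM_PER_LIGHT_YEAR * math.radians(1.0)
time_years = arc_km / relative_velocity_km_s / SECONDS_PER_YEAR

print(f"~{time_years:,.0f} years per degree of drift")  # roughly 80,000 years

With these round numbers a bright star drifts about one degree every ~80,000 years, consistent with constellations changing over tens or hundreds of thousands of years.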
Dec 19, 2023 • 2min

AF - Assessment of AI safety agendas: think about the downside risk by Roman Leventov

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Assessment of AI safety agendas: think about the downside risk, published by Roman Leventov on December 19, 2023 on The AI Alignment Forum. This is a response to the post The 'Neglected Approaches' Approach: AE Studio's Alignment Agenda. You evidently follow a variant of 80000hours' framework for comparing (solving) particular problems in terms of expected impact: Neglectedness x Scale (potential upside) x Solvability. I think for assessing AI safety ideas, agendas, and problems to solve, we should augment the assessment with another factor: the potential for a Waluigi turn, or more prosaically, the uncertainty about the sign of the impact (scale) and, therefore, the risks of solving the given problem or advancing far on the given agenda. This reminds me of Taleb's mantra that to survive, we need to make many bets, but also limit the downside potential of each bet, i.e., the "ruin potential". See "The Logic of Risk Taking". Of the approaches that you listed, some sound risky to me in this respect. In particular, "4. 'Reinforcement Learning from Neural Feedback' (RLNF)" sounds like a direct invitation for wireheading to me. More generally, scaling BCI in any form and not falling into a dystopia at some stage is akin to walking a tightrope (at least at the current stage of civilisational maturity, I would say). This speaks to agendas #2 and #3 on your list. There are also similar qualms about AI interpretability: there are at least four posts on LW warning of the potential risks of interpretability: "Why and When Interpretability Work is Dangerous", "Against Almost Every Theory of Impact of Interpretability", "AGI-Automated Interpretability is Suicide", and "AI interpretability could be harmful?". This speaks to the agenda "9. Neuroscience x mechanistic interpretability" on your list. Related earlier posts: "Some background for reasoning about dual-use alignment research" and "Thoughts on the OpenAI alignment plan: will AI research assistants be net-positive for AI existential risk?". Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
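As a toy illustration of the augmentation proposed above (our sketch, not anything from the post), one can fold sign uncertainty and the "ruin potential" into the Neglectedness x Scale x Solvability product. The functional form and all numbers below are assumptions, chosen only to show how an agenda that looks good on the naive product can become negative in expectation once downside risk is priced in.

def naive_score(neglectedness, scale, solvability):
    # The standard 80,000 Hours-style product: bigger is better.
    return neglectedness * scale * solvability

def downside_adjusted_score(neglectedness, scale, solvability,
                            p_negative_sign, downside_scale):
    # Weight the upside by the chance the impact really is positive,
    # and subtract the expected downside (the "ruin potential").
    expected_scale = (1 - p_negative_sign) * scale - p_negative_sign * downside_scale
    return neglectedness * solvability * expected_scale

print(naive_score(0.8, 100, 0.5))                        # 40.0
print(downside_adjusted_score(0.8, 100, 0.5, 0.2, 500))  # -8.0: negative in expectation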
Dec 18, 2023 • 5min

LW - OpenAI: Preparedness framework by Zach Stein-Perlman

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: OpenAI: Preparedness framework, published by Zach Stein-Perlman on December 18, 2023 on LessWrong. OpenAI released a beta version of their responsible scaling policy (though they don't call it that). See summary page, full doc, OpenAI twitter thread, and Jan Leike twitter thread. Compare to Anthropic's RSP and METR's Key Components of an RSP. It's not done, so it's too early to celebrate, but based on this document I expect to be happy with the finished version. I think today is a good day for AI safety. My high-level take: RSP-y things are good. Doing risk assessment based on model evals for dangerous capabilities is good. Making safety, security, deployment, and development conditional on risk assessment results, in a prespecified way, is good. Making public commitments about all of this is good. OpenAI's basic framework: Do dangerous capability evals at least every 2x increase in effective training compute. This involves fine-tuning for dangerous capabilities, then doing evals on pre-mitigation and post-mitigation versions of the fine-tuned model. Score the models as Low, Medium, High, or Critical in each of several categories. Initial categories: cybersecurity, CBRN (chemical, biological, radiological, nuclear threats), persuasion, and model autonomy. If the post-mitigation model scores High in any category, don't deploy it until implementing mitigations such that it drops to Medium. If the post-mitigation model scores Critical in any category, stop developing it until implementing mitigations such that it drops to High. If the pre-mitigation model scores High in any category, harden security to prevent exfiltration of model weights. (Details basically unspecified for now.) Random notes: The framework is explicitly about catastrophic risk, and indeed it's clearly designed to prevent catastrophes, not merely stuff like toxic/biased/undesired content. There are lots of nice details, e.g. about how OpenAI will update the framework, or how they'll monitor for real-world misuse to inform their risk assessment. It's impossible to tell from the outside whether these processes will be effective, but this document is very consistent with thinking-seriously-about-how-to-improve-safety and it's hard to imagine it being generated by a different process. OpenAI lists some specific evals/metrics in their four initial categories; they're simple and merely "illustrative," so I don't pay much attention to them, but they seem to be on the right track. The thresholds for danger levels feel kinda high. Non-cherry-picked example: for cybersecurity, Critical is defined as: Tool-augmented model can identify and develop functional zero-day exploits of all severity levels, across all software projects, without human intervention OR model can devise and execute end-to-end novel strategies for cyberattacks against hardened targets given only a high level desired goal. Stronger commitment about external evals/red-teaming/risk-assessment of private models (and maybe oversight of OpenAI's implementation of its preparedness framework) would be nice. 
The only relevant thing they say is: "Scorecard evaluations (and corresponding mitigations) will be audited by qualified, independent third-parties to ensure accurate reporting of results, either by reproducing findings or by reviewing methodology to ensure soundness, at a cadence specified by the SAG and/or upon the request of OpenAI Leadership or the BoD." There's some commitment that the Board will be in the loop and able to overrule leadership. Yay. This is a rare commitment by a frontier lab to give their board specific information or specific power besides removing-the-CEO. Anthropic committed to have their board approve changes to their RSP, as well as to share eval results and information on RSP implementation with their board. One great th...
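A minimal sketch in Python of the deploy/develop/security gating rules as summarized in the episode above; this is an illustrative encoding of the post's summary, not OpenAI's actual implementation, and the category names and example scores are assumptions.

from enum import IntEnum

class Risk(IntEnum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2
    CRITICAL = 3

def preparedness_gates(pre_mitigation, post_mitigation):
    # pre_mitigation / post_mitigation: dicts mapping category name -> Risk score.
    worst_pre = max(pre_mitigation.values())
    worst_post = max(post_mitigation.values())
    return {
        # Don't deploy until mitigations bring every category down to Medium or below.
        "may_deploy": worst_post <= Risk.MEDIUM,
        # Stop further development until every category is below Critical post-mitigation.
        "may_continue_development": worst_post <= Risk.HIGH,
        # Harden security against weight exfiltration if any pre-mitigation score is High or above.
        "harden_security": worst_pre >= Risk.HIGH,
    }

scores_pre = {"cybersecurity": Risk.HIGH, "cbrn": Risk.LOW,
              "persuasion": Risk.MEDIUM, "model_autonomy": Risk.LOW}
scores_post = {"cybersecurity": Risk.MEDIUM, "cbrn": Risk.LOW,
               "persuasion": Risk.MEDIUM, "model_autonomy": Risk.LOW}
print(preparedness_gates(scores_pre, scores_post))
# {'may_deploy': True, 'may_continue_development': True, 'harden_security': True}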
Dec 18, 2023 • 25min

LW - The 'Neglected Approaches' Approach: AE Studio's Alignment Agenda by Cameron Berg

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The 'Neglected Approaches' Approach: AE Studio's Alignment Agenda, published by Cameron Berg on December 18, 2023 on LessWrong. Many thanks to Samuel Hammond, Cate Hall, Beren Millidge, Steve Byrnes, Lucius Bushnaq, Joar Skalse, Kyle Gracey, Gunnar Zarncke, Ross Nordby, David Lambert, Simeon Campos, Bogdan Ionut-Cirstea, Ryan Kidd, Eric Ho, and Ashwin Acharya for critical comments and suggestions on earlier drafts of this agenda, as well as Philip Gubbins, Diogo de Lucena, Rob Luke, and Mason Seale from AE Studio for their support and feedback throughout. TL;DR Our initial theory of change at AE Studio was a 'neglected approach' that involved rerouting profits from our consulting business towards the development of brain-computer interface (BCI) technology to dramatically enhance human agency, better enabling us to do things like solve alignment. Now, given shortening timelines, we're updating our theory of change to scale up our technical alignment efforts. With a solid technical foundation in BCI, neuroscience, and machine learning, we are optimistic that we'll be able to contribute meaningfully to AI safety. We are particularly keen on pursuing neglected technical alignment agendas that seem most creative, promising, and plausible. We are currently onboarding promising researchers and kickstarting our internal alignment team. As we forge ahead, we're actively soliciting expert insights from the broader alignment community and are in search of data scientists and alignment researchers who resonate with our vision of enhancing human agency and helping to solve alignment. About us Hi! We are AE Studio, a bootstrapped software and data science consulting business. Our mission has always been to reroute our profits directly into building technologies that have the promise of dramatically enhancing human agency, like Brain-Computer Interfaces (BCI). We also donate 5% of our revenue directly to effective charities. Today, we are ~150 programmers, product designers, and ML engineers; we are profitable and growing. We also have a team of top neuroscientists and data scientists with significant experience in developing ML solutions for leading BCI companies, and we are now leveraging our technical experience and learnings in these domains to assemble an alignment team dedicated to exploring neglected alignment research directions that draw on our expertise in BCI, data science, and machine learning. As we are becoming more public with our AI Alignment efforts, we thought it would be helpful to share our strategy and vision for how we at AE prioritize what problems to work on and how to make the best use of our comparative advantage. Why and how we think we can help solve alignment We can probably do with alignment what we already did with BCI You might think that AE has no business getting involved in alignment - and we agree. AE's initial theory of change sought to realize a highly "neglected approach" to doing good in the world: bootstrap a profitable software consultancy, incubate our own startups on the side, sell them, and reinvest the profits in Brain Computer Interfaces (BCI) in order to do things like dramatically increase human agency, mitigate BCI-related s-risks, and make humans sufficiently intelligent, wise, and capable to do things like solve alignment. 
While the vision of BCI-mediated cognitive enhancement to do good in the world is increasingly common today, it was viewed as highly idiosyncratic when we first set out in 2016. Initially, many said that AE had no business getting involved in the BCI space (and we also agreed at the time) - but after hiring leading experts in the field and taking increasingly ambitious A/B-tested steps in the right direction, we emerged as a respected player in the space (see here, here, here, and here for some examples). Now, ...
Dec 18, 2023 • 13min

EA - 80,000 Hours spin out announcement and fundraising by 80000 Hours

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: 80,000 Hours spin out announcement and fundraising, published by 80000 Hours on December 18, 2023 on The Effective Altruism Forum. Summary 80,000 Hours is spinning out of Effective Ventures. 80,000 Hours is fundraising. We're currently seeking $1,200,000 to support our general activities (excluding marketing and grantmaking) until the middle of 2024. Spin out We are excited to share that 80,000 Hours has officially decided to spin out as a project from Effective Ventures Foundation UK and Effective Ventures US (known collectively as Effective Ventures) and establish an independent legal structure. We're incredibly grateful to the Effective Ventures leadership and team and the other orgs for all their support, particularly in the past year as our community faced a lot of challenges. They devoted countless hours and enormous effort to helping ensure that we and the other orgs could pursue our missions. And we deeply appreciate their support in our spin-out. They recently announced that all of the other organisations will likewise become their own legal entities; we're excited to continue to work alongside them to improve the world. Back in May, we investigated whether it was the right time to spin out of our parent organisations. We've considered this option at various points in the last three years. There have been many benefits to being part of a larger entity since our founding. But as 80,000 Hours and the other projects have grown, we concluded we can now best pursue our mission and goals independently. EV leadership approved the plan. Becoming our own entity will allow us to: Match our governing structure to our function and purpose Design operations systems that best meet our staff's needs Reduce interdependence on other entities that raises financial, legal, and reputational risks There's a lot for us to do. We're currently in the process of finding a new CEO to lead us in our next chapter. We'll also need a new board to oversee our work and new staff for our internal systems team and other growing programmes. Which brings us to our next item: we're fundraising! Fundraising We're currently seeking $1,200,000 to support our general activities (excluding marketing and grantmaking) until the middle of 2024. This post has more information about us, our track record, and our current fundraising round. You can donate directly or view our fundraising page for additional information and ways to contact us. About us At 80,000 Hours, we provide research and support to help students, recent graduates, and others have high-impact careers. Our goal is to get talented people working on the world's most pressing problems. We focus on problems that threaten the long-term future, including risks from artificial intelligence, catastrophic pandemics, and nuclear war. To achieve our goal, we: Reach people who might be interested through marketing, engaging and user-friendly content, and word-of-mouth. Introduce people to information, frameworks, and ideas which are useful for having a high-impact career and help them get excited about contributing to solving pressing global problems. Support people in transitioning to careers that contribute to solving pressing global problems. We provide four main services: 1. Our website We've written a career guide, dozens of cause area problem profiles, and reviews of impactful career paths. 
This year, we've had over 4.5 million readers on our website and our research newsletter goes out to more than 350,000 subscribers. 2. Our podcast We host in-depth conversations about the world's most pressing problems and how people can use their careers to solve them. We've had over one million hours of listening time on our podcast to date. 3. Our job board We maintain a curated list of promising opportunities for impact and career capital on our job boa...
Dec 18, 2023 • 8min

EA - Summary: The scope of longtermism by Global Priorities Institute

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Summary: The scope of longtermism, published by Global Priorities Institute on December 18, 2023 on The Effective Altruism Forum. This is a summary of the GPI Working Paper "The scope of longtermism" by David Thorstad. The summary was written by Riley Harris. Recent work argues for longtermism - the position that often our morally best options will be those with the best long-term consequences.[1] Proponents of longtermism sometimes suggest that in most decisions expected long-term benefits outweigh all short-term effects. In 'The scope of longtermism', David Thorstad argues that most of our decisions do not have this character. He identifies three features of our decisions that suggest long-term effects are only relevant in special cases: rapid diminution (our actions may not have persistent effects), washing out (we might not be able to predict persistent effects), and option unawareness (we may struggle to recognise the options that are best in the long term even when we have them). Rapid diminution We cannot know the details of the future. Picture the effects of your actions rippling out in time - at closer times, the possibilities are clearer. As our predictions reach further out, the details become obscured. Although the probability of desired effects becomes ever lower, the effects might grow larger. In the long run, we could perhaps improve many billions or trillions of lives. When we weight value by probability, the value of our actions will depend on a race between diminishing probabilities and growing possible impact. If the value increases faster than probabilities fall, the expected value of the action might be vast. Alternatively, if the chance we have such large effects falls dramatically compared to the increase in value, the expected value of improving the future might be quite modest. Thorstad suggests that the latter of these effects dominates, so we should believe we have little chance of making an enormous difference. Consider a huge event that would be likely to change the lives of people in your city - perhaps your city being blown up. Surprisingly, even this might not have large long-run impacts. Studies indicate that just half a century after cities in Japan and Vietnam were bombed, there was no longer any detectable effect on population size, poverty rates and consumption patterns.[2] To be fair, some studies indicate that some events have long-term effects,[3] but Thorstad thinks '...the persistence literature may not provide strong support' to longtermism. Washing out Thorstad's second concern with longtermism relates to our ability to predict the future. If our actions can affect the future in a huge way, these effects could be wonderful or terrible. They will also be very difficult to predict. The possibility that our acts will be enormously beneficial does not make our acts particularly appealing when they might be equally terrible. If our ability to forecast long-term outcomes is limited, the potential positive and negative values would wash out in expectation. Thorstad identifies three reasons to doubt our ability to forecast the long term. First, we have no track record of making predictions at the timescale of centuries or millennia. Our ability to predict only 20-30 years into the future is not great - and things get more difficult when we try to glimpse the further future.
Second, economists, risk analysts and forecasting practitioners doubt our ability to make long-term predictions, and often refuse to make them.[5] Third, we want to forecast how valuable our actions are over the long run. But value is a particularly difficult target - it includes many variables such as the number of people alive, their health, longevity, education and social inclusion. That said, we sometimes have some evidence, and this evidence might point t...
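To make the race between falling probability and growing value described above concrete, here is a toy expected-value model in Python. The exponential forms and every parameter value are illustrative assumptions, not numbers from Thorstad's paper.

def expected_value(t_years, half_life_years, base_value, annual_growth):
    # Probability the effect persists to year t, decaying with a fixed half-life...
    p_persist = 0.5 ** (t_years / half_life_years)
    # ...versus the value of the effect if it does persist, compounding over time.
    value_if_persists = base_value * (1 + annual_growth) ** t_years
    return p_persist * value_if_persists

# Rapid diminution: probability decays faster than value grows, so long-run EV is tiny.
print(expected_value(t_years=1000, half_life_years=100, base_value=1.0, annual_growth=0.001))  # ~0.003
# The longtermist case: value grows fast enough to outrun the decay, so long-run EV is large.
print(expected_value(t_years=1000, half_life_years=100, base_value=1.0, annual_growth=0.01))   # ~20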
Dec 18, 2023 • 10min

AF - The Shortest Path Between Scylla and Charybdis by Thane Ruthenis

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Shortest Path Between Scylla and Charybdis, published by Thane Ruthenis on December 18, 2023 on The AI Alignment Forum. tl;dr: There are two diametrically opposed failure modes an alignment researcher can fall into: engaging in excessively concrete research whose findings won't generalize to AGI in time, and engaging in excessively abstract research whose findings won't connect to practical reality in time. Different people's assessments of what research is too abstract/concrete differ significantly based on their personal AI-Risk models. One person's too-abstract can be another's too-concrete. The meta-level problem of alignment research is to pick a research direction that, on your subjective model of AI Risk, strikes a good balance between the two - and thereby arrives at the solution to alignment in as few steps as possible. Introduction Suppose that you're interested in solving AGI Alignment. There's a dizzying plethora of approaches to choose from: What behavioral properties do the current-best AIs exhibit? Can we already augment our research efforts with the AIs that exist today? How far can "straightforward" alignment techniques like RLHF get us? Can an AGI be born out of an AutoGPT-like setup? Would our ability to see its externalized monologue suffice for nullifying its dangers? Can we make AIs-aligning-AIs work? What are the mechanisms by which the current-best AIs function? How can we precisely intervene on their cognition in order to steer them? What are the remaining challenges of scalable interpretability, and how can they be defeated? What features do agenty systems convergently learn when subjected to selection pressures? Is there such a thing as "natural abstractions"? How do we learn them? What is the type signature of embedded agents and their values? What about the formal description of corrigibility? What is the "correct" decision theory that an AGI would follow? And what's up with anthropic reasoning? Et cetera, et cetera. So... How the hell do you pick what to work on? The starting point, of course, would be building up your own model of the problem. What's the nature of the threat? What's known about how ML models work? What's known about agents, and cognition? How does any of that relate to the threat? What are all the extant approaches? What's each approach's theory-of-impact? What model of AI Risk does it assume? Does it agree with your model? Is it convincing? Is it tractable? Once you've done that, you'll likely have eliminated a few approaches as obvious nonsense. But even afterwards, there might still be multiple avenues left that all seem convincing. How do you pick between those? Personal fit might be one criterion. Choose the approach that best suits your skills and inclinations and opportunities. But that's risky: if you make a mistake, and end up working on something irrelevant just because it suits you better, you'll have multiplied your real-world impact by zero. Conversely, contributing to a tractable approach would be net-positive, even if you'd be working at a disadvantage. And who knows, maybe you'll find that re-specializing is surprisingly easy! So what further objective criteria can you evaluate? Regardless of one's model of AI Risk, there are two specific, diametrically opposed failure modes that any alignment researcher can fall into: being too concrete, and being too abstract.
The approach to choose should be one that maximizes the distance from both failure modes. The Scylla: Atheoretic Empiricism One pitfall would be engaging in research that doesn't generalize to aligning AGI. An ad-absurdum example: You pick some specific LLM model, then start exhaustively investigating how it responds to different prompts, and what quirks it has. You're building giant look-up tables of "query, response", with no overarching structur...
Dec 18, 2023 • 32min

EA - Bringing about animal-inclusive AI by Max Taylor

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Bringing about animal-inclusive AI, published by Max Taylor on December 18, 2023 on The Effective Altruism Forum. I've been working in animal advocacy for two years and have an amateur interest in AI. All corrections, points of disagreement, and other constructive feedback are very welcome. I'm writing this in a personal capacity, and am not representing the views of my employer. Many thanks to everyone who provided feedback and ideas. Introduction In a previous post, I set out some of the positive and negative impacts that AI could have on animals. The present post sets out a few ideas for what an animal-inclusive AI landscape might look like: what efforts would we need to see from different actors in order for AI to be beneficial for animals? This is just a list of high-level suggestions, and I haven't tried to prioritize them, explore them in detail, or suggest practical ways to bring them about. I also haven't touched on the (potentially extremely significant) role of alternative proteins in all this. We also have a new landing page for people interested in the intersection of AI and animals: www.aiforanimals.org. It's still fairly basic at this stage but contains links to resources and recent news articles that you might find helpful if you're interested in this space. Please feel free to provide feedback to help make this a more useful resource. Why do we need animal-inclusive AI? As described in the previous post, future AI advances could further disempower animals and increase the depth and scale of their suffering. However, AI could also help bring about a radical improvement in human-animal relations and greatly facilitate efforts to improve animals' wellbeing. For example, just in the last month, news articles have covered potential AI risks for animals including AI's role in intensive shrimp farming, the EU-funded 'RoBUTCHER' that will help automate the pig meat processing industry (potentially making intensive animal agriculture more profitable), and the potential of Large Language Models to entrench speciesist biases. On the more positive side, there were also articles covering the potential for AI to radically improve animal health treatment, support the expansion of alternative protein companies, reduce human-animal conflicts, facilitate human-animal communication, and provide alternatives to animal testing. These recent stories are just the tip of the iceberg, not only for animals that are being directly exploited - or cared for - by humans, but also for those living in the wild. AI safety for animals doesn't need to come at the expense of AI safety for humans. There are bound to be many actions that both the animal advocacy and AI safety communities can take to reinforce each others' work, given the considerable overlap in our priorities and worldviews. However, there are also bound to be some complex trade-offs, and we can't assume that efforts to make AI safe for humans will inevitably also benefit all other species. AI could be very good for humans (through, for example, economic growth and advances in healthcare) while being a disaster for other animals.
Targeted efforts are needed to prevent that happening, including (but definitely not limited to): Explicitly mentioning animals in written commitments, both non-binding and binding; Using those commitments as a stepping stone to ensure animal-inclusive applications of AI; Representing animals in decision-making processes; Conducting and publishing more research on AI's potential impacts on animals; and Building up the 'AI safety x animal advocacy' community. The rest of this post provides some information, examples, and resources around those topics. Moving towards animal-inclusive AI Explicitly mention animals in non-binding commitments Governmental commitments On November 1, 2023, 29 countries signed the Bletchley Declaration at the AI Safety Su...
Dec 18, 2023 • 17min

LW - What makes teaching math special by Viliam

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What makes teaching math special, published by Viliam on December 18, 2023 on LessWrong. Related: Arguments against constructivism (in education)? Seeking PCK (Pedagogical Content Knowledge) * Designing a good math curriculum for elementary and high schools requires one to have two kinds of expertise: a deep understanding of math, and a lot of experience teaching kids. Having just one of them is not enough. People who have both are rare (and many of them do not have the ambition to design a curriculum). Being a math professor at university is not enough, no matter how high-status that job might be. University professors are used to teaching adults, and often have little patience for kids. Their frequent mistake is to jump from specific examples to abstract generalizations too quickly (that is, if they bother to provide specific examples at all). You can expect an adult student to try to figure it out on their own time; to read a book, or ask classmates. You can't do the same with a small child. (Also, university professors are selected for their research skills, not teaching skills.) University professors and other professional mathematicians suffer from the "curse of knowledge". So many things are obvious to them that they have a problem empathizing with someone who knows none of them. Also, the way we remember things is that we make mental connections with the other things we already know. The professor may have too many connections available to realize that the child has none of them yet. Kids learning from a curriculum designed by university professors will feel overwhelmed and stupid. Most of them will grow up hating math. On the other hand, many humanities-oriented people with strong opinions on how schools should be organized and how kids should be brought up suck at math. More importantly, they do not realize how math is profoundly different from other school subjects, and will try to shoehorn mathematical education into the way they would teach e.g. humanities. As a result, the kids may not learn actual mathematics at all. * How specifically is math different? First, math is not about real-world objects. It is often inspired by them, but that's not the same thing. For example, natural numbers proceed up to... almost infinity... regardless of whether our universe is actually finite or infinite. Real numbers have an infinite number of decimal places, whether that makes sense from the perspective of physics or not. Euclidean space is perfectly flat, even if our universe is not. No one ever wrote all the possible sequences of numbers from 1 to 100, but we know how many there would be. If you want to learn e.g. about Africa, I guess the best way would be to go there, spend 20 years living in various countries, talking to people of various ethnic and social groups. But if you can't do that... well, reading a few books about Africa, memorizing the names of the countries and their capitals, knowing how to find them on the map... technically also qualifies as "learning about Africa". This is what most people (outside of Africa) do. You cannot learn math by second-hand experience alone.
Imagine someone who skimmed the Wikipedia article about quadratic equations, watched a YouTube channel about the history of people who invented quadratic equations, is really passionate about the importance of quadratic equations for world peace and ecology, but... cannot solve a single quadratic equation, not even the simplest one... You probably wouldn't count this kind of knowledge as "learning quadratic equations". The quadratic equation is a mental object, waiting for you somewhere in the Platonic realm, where you can find it, touch it, explore it from different angles, play with it, learn to live with it. Only this intimate experience qualifies as actually learning quadrati...
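For concreteness, here is a minimal sketch in Python of solving the kind of simplest quadratic the post has in mind; the particular equation x^2 - 5x + 6 = 0 is our illustration, not taken from the post.

import math

def solve_quadratic(a, b, c):
    # Real roots of a*x^2 + b*x + c = 0 via the quadratic formula.
    disc = b * b - 4 * a * c
    if disc < 0:
        return ()  # no real roots
    root = math.sqrt(disc)
    return ((-b - root) / (2 * a), (-b + root) / (2 * a))

print(solve_quadratic(1, -5, 6))  # (2.0, 3.0)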
Dec 18, 2023 • 17min

AF - Discussion: Challenges with Unsupervised LLM Knowledge Discovery by Seb Farquhar

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Discussion: Challenges with Unsupervised LLM Knowledge Discovery, published by Seb Farquhar on December 18, 2023 on The AI Alignment Forum. TL;DR: Contrast-consistent search (CCS) seemed exciting to us and we were keen to apply it. At this point, we think it is unlikely to be directly helpful for implementations of alignment strategies (>95%). Instead of finding knowledge, it seems to find the most prominent feature. We are less sure about the wider category of unsupervised consistency-based methods, but tend to think they won't be directly helpful either (70%). We've written a paper about some of our detailed experiences with it. Paper authors: Sebastian Farquhar*, Vikrant Varma*, Zac Kenton*, Johannes Gasteiger, Vlad Mikulik, and Rohin Shah. *Equal contribution, order randomised. Credences are based on a poll of Seb, Vikrant, Zac, Johannes, Rohin and show single values where we mostly agree and ranges where we disagreed. What does CCS try to do? To us, CCS represents a family of possible algorithms aiming to solve an ELK-style problem, with the following steps: Knowledge-like property: write down a property that points at an LLM feature which represents the model's knowledge (or a small number of features that includes the model-knowledge-feature). Formalisation: make that property mathematically precise so you can search for features with that property in an unsupervised way. Search: find it (e.g., by optimising a formalised loss). In the case of CCS, the knowledge-like property is negation-consistency, the formalisation is a specific loss function, and the search is unsupervised learning with gradient descent on a linear + sigmoid function taking LLM activations as inputs. We were pretty excited about this. We especially liked that the approach is not supervised. Conceptually, supervising ELK seems really hard: it is too easy to confuse what you know, what you think the model knows, and what it actually knows. Avoiding the need to write down what-the-model-knows labels seems like a great goal. Why we think CCS isn't working We spent a lot of time playing with CCS and trying to make it work well enough to build a deception detector by measuring the difference between the model's elicited knowledge and its stated claims.[1] Having done this, we are now not very optimistic about CCS or things like it. Partly, this is because the loss itself doesn't give much reason to think that it would be able to find a knowledge-like property, and empirically it seems to find whatever feature in the dataset happens to be most prominent, which is very prompt-sensitive. Maybe something building off it could work in the future, but we don't think anything about CCS provides evidence that it would be likely to. As a result, we have basically returned to our priors about the difficulty of ELK, which are something between "very very difficult" and "approximately impossible" for a full solution, while mostly agreeing that partial solutions are "hard but possible". What does the CCS loss say? The CCS approach is motivated like this: we don't know that much about the model's knowledge, but probably it follows basic consistency properties.
For example, it probably has something like Bayesian credences, and when it believes A with some probability P(A), it ought to believe not-A with probability 1 - P(A).[2] So if we search in the LLM's feature space for features that satisfy this consistency property, the model's knowledge is going to be one of the things that satisfies it. Moreover, they hypothesise, there probably aren't that many things that satisfy this property, so we can easily check the handful that we get and find the one representing the model's knowledge. When we dig into the CCS loss, it isn't clear that it really checks for what it's supposed to. In particular, we prove that arbitrary features, not jus...
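A minimal sketch of the CCS setup as described above: a linear + sigmoid probe over LLM activations, trained without labels on a negation-consistency term plus a confidence term that rules out the trivial "always 0.5" solution. The loss terms follow the published CCS formulation; the hidden size, batch size, and random placeholder activations below are assumptions for illustration only.

import torch
import torch.nn as nn

class CCSProbe(nn.Module):
    # Linear + sigmoid map from activations to a probability-like score.
    def __init__(self, hidden_dim):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, acts):
        return torch.sigmoid(self.linear(acts)).squeeze(-1)

def ccs_loss(p_pos, p_neg):
    # Negation consistency: p(statement) should equal 1 - p(negated statement).
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: penalise the degenerate solution p_pos = p_neg = 0.5.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

hidden_dim = 4096                       # assumed; depends on the model
probe = CCSProbe(hidden_dim)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
acts_pos = torch.randn(64, hidden_dim)  # placeholder activations for statements...
acts_neg = torch.randn(64, hidden_dim)  # ...and their negations
for _ in range(200):
    opt.zero_grad()
    loss = ccs_loss(probe(acts_pos), probe(acts_neg))
    loss.backward()
    opt.step()

Nothing in this loss mentions truth or knowledge directly, which is the crux of the post's critique: any sufficiently prominent feature whose value flips between a statement and its negation can satisfy it.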
