BlueDot Narrated

BlueDot Impact
Jan 4, 2025 • 29min

Debate Update: Obfuscated Arguments Problem

Audio versions of blogs and papers from BlueDot courses.
This is an update on the work on AI Safety via Debate that we previously wrote about here.
What we did: We tested the debate protocol introduced in AI Safety via Debate with human judges and debaters. We found various problems and improved the mechanism to fix these issues (details are in the appendix). However, we discovered that a dishonest debater can often create arguments that contain a fatal error, but where it is very hard to locate the error. We don't have a fix for this "obfuscated argument" problem, and believe it might be an important quantitative limitation for both IDA and Debate.
Key takeaways and relevance for alignment: Our ultimate goal is to find a mechanism that allows us to learn anything that a machine learning model knows: if the model can efficiently find the correct answer to some problem, our mechanism should favor the correct answer while only requiring a tractable number of human judgements and a reasonable number of computation steps for the model. We're working under the hypothesis that there are broadly two ways to know things: via step-by-step reasoning about implications (logic, computation…), and by learning and generalizing from data (pattern matching, Bayesian updating…).
Original text: https://www.alignmentforum.org/posts/PJLABqQ962hZEqhdB/debate-update-obfuscated-arguments-problem
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.
---
A podcast by BlueDot Impact.
Jan 4, 2025 • 8min

Careers in Alignment

Audio versions of blogs and papers from BlueDot courses.
Richard Ngo compiles a number of resources for thinking about careers in alignment research.
Original text: https://docs.google.com/document/d/1iFszDulgpu1aZcq_aYFG7Nmcr5zgOhaeSwavOMk1akw/edit#heading=h.4whc9v22p7tb
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.
---
A podcast by BlueDot Impact.
Jan 4, 2025 • 8min

AI Watermarking Won’t Curb Disinformation

Audio versions of blogs and papers from BlueDot courses.
Generative AI allows people to produce piles upon piles of images and words very quickly. It would be nice if there were some way to reliably distinguish AI-generated content from human-generated content. It would help people avoid endlessly arguing with bots online, or believing what a fake image purports to show. One common proposal is that big companies should incorporate watermarks into the outputs of their AIs. For instance, this could involve taking an image and subtly changing many pixels in a way that's undetectable to the eye but detectable to a computer program. Or it could involve swapping words for synonyms in a predictable way so that the meaning is unchanged, but a program could readily determine the text was generated by an AI.
Unfortunately, watermarking schemes are unlikely to work. So far most have proven easy to remove, and it's likely that future schemes will have similar problems.
Source: https://transformer-circuits.pub/2023/monosemantic-features/index.html
Narrated for AI Safety Fundamentals by Perrin Walker
A podcast by BlueDot Impact.
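To make the pixel-level idea concrete, here is a minimal toy sketch in Python of a least-significant-bit image watermark. It is not any particular company's scheme and the function names are invented for illustration; it only shows why such marks are easy to detect with the key and, as the episode argues, easy to destroy with mild post-processing.

```python
# Toy illustration of a pixel-level watermark: flip imperceptible low-order
# bits of key-chosen pixels, then show that re-quantising erases the mark.
import numpy as np

rng = np.random.default_rng(0)

def embed_watermark(image: np.ndarray, key: int, n_pixels: int = 1024) -> np.ndarray:
    """Set the least significant bit of key-chosen pixels to 1."""
    marked = image.copy()
    idx = np.random.default_rng(key).choice(image.size, size=n_pixels, replace=False)
    marked.reshape(-1)[idx] |= 1  # imperceptible to the eye
    return marked

def detect_watermark(image: np.ndarray, key: int, n_pixels: int = 1024) -> float:
    """Return the fraction of key-chosen pixels whose LSB is 1 (~0.5 if unmarked)."""
    idx = np.random.default_rng(key).choice(image.size, size=n_pixels, replace=False)
    return float((image.reshape(-1)[idx] & 1).mean())

image = rng.integers(0, 256, size=(256, 256), dtype=np.uint8)
marked = embed_watermark(image, key=42)
print(detect_watermark(marked, key=42))     # ~1.0: watermark detected
# Mild post-processing (re-quantising every pixel) wipes the mark out:
laundered = ((marked.astype(np.int16) // 2) * 2).astype(np.uint8)
print(detect_watermark(laundered, key=42))  # ~0.0: watermark destroyed
```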
Jan 4, 2025 • 25min

Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small

Audio versions of blogs and papers from BlueDot courses.
Research in mechanistic interpretability seeks to explain behaviors of machine learning (ML) models in terms of their internal components. However, most previous work either focuses on simple behaviors in small models or describes complicated behaviors in larger models with broad strokes. In this work, we bridge this gap by presenting an explanation for how GPT-2 small performs a natural language task called indirect object identification (IOI). Our explanation encompasses 26 attention heads grouped into 7 main classes, which we discovered using a combination of interpretability approaches relying on causal interventions. To our knowledge, this investigation is the largest end-to-end attempt at reverse-engineering a natural behavior "in the wild" in a language model. We evaluate the reliability of our explanation using three quantitative criteria: faithfulness, completeness, and minimality. Though these criteria support our explanation, they also point to remaining gaps in our understanding. Our work provides evidence that a mechanistic understanding of large ML models is feasible, pointing toward opportunities to scale our understanding to both larger models and more complex tasks. Code for all experiments is available at https://github.com/redwoodresearch/Easy-Transformer.
Source: https://arxiv.org/pdf/2211.00593.pdf
Narrated for AI Safety Fundamentals by Perrin Walker
A podcast by BlueDot Impact.
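For readers unfamiliar with the IOI task, the sketch below shows the kind of prompt the paper studies, using the Hugging Face transformers library rather than the paper's own experiment code; the prompt is the paper's canonical example and the expected completion is indicative, not guaranteed.

```python
# Minimal illustration of the indirect object identification (IOI) task:
# given a sentence naming two people, GPT-2 small should complete with the
# person who did not just appear as the subject.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
prompt = "When Mary and John went to the store, John gave a drink to"
out = generator(prompt, max_new_tokens=1, do_sample=False)
print(out[0]["generated_text"])  # GPT-2 small typically completes with " Mary"
```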
Jan 4, 2025 • 44min

Zoom In: An Introduction to Circuits

Audio versions of blogs and papers from BlueDot courses.
By studying the connections between neurons, we can find meaningful algorithms in the weights of neural networks.
Many important transition points in the history of science have been moments when science "zoomed in." At these points, we develop a visualization or tool that allows us to see the world in a new level of detail, and a new field of science develops to study the world through this lens. For example, microscopes let us see cells, leading to cellular biology. Science zoomed in. Several techniques including x-ray crystallography let us see DNA, leading to the molecular revolution. Science zoomed in. Atomic theory. Subatomic particles. Neuroscience. Science zoomed in.
These transitions weren't just a change in precision: they were qualitative changes in what the objects of scientific inquiry are. For example, cellular biology isn't just more careful zoology. It's a new kind of inquiry that dramatically shifts what we can understand.
The famous examples of this phenomenon happened at a very large scale, but it can also be the more modest shift of a small research community realizing they can now study their topic at a finer-grained level of detail.
Source: https://distill.pub/2020/circuits/zoom-in/
Narrated for AI Safety Fundamentals by Perrin Walker
A podcast by BlueDot Impact.
Jan 4, 2025 • 20min

Can We Scale Human Feedback for Complex AI Tasks?

This episode explores why simple human feedback breaks down on complex AI tasks. It outlines scalable oversight techniques such as task decomposition, recursive reward modelling, constitutional approaches, and debate frameworks, and also covers experiments on weak-to-strong generalisation and potential failure modes.
Jan 4, 2025 • 13min

Future ML Systems Will Be Qualitatively Different

Audio versions of blogs and papers from BlueDot courses.
In 1972, the Nobel prize-winning physicist Philip Anderson wrote the essay "More Is Different". In it, he argues that quantitative changes can lead to qualitatively different and unexpected phenomena. While he focused on physics, one can find many examples of More Is Different in other domains as well, including biology, economics, and computer science. Some examples of More Is Different include:
Uranium. With a bit of uranium, nothing special happens; with a large amount of uranium packed densely enough, you get a nuclear reaction.
DNA. Given only small molecules such as calcium, you can't meaningfully encode useful information; given larger molecules such as DNA, you can encode a genome.
Water. Individual water molecules aren't wet. Wetness only occurs due to the interaction forces between many water molecules interspersed throughout a fabric (or other material).
Original text: https://bounded-regret.ghost.io/future-ml-systems-will-be-qualitatively-different/
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.
---
A podcast by BlueDot Impact.
Jan 4, 2025 • 1h 11min

Biological Anchors: A Trick That Might Or Might Not Work

Audio versions of blogs and papers from BlueDot courses.
I've been trying to review and summarize Eliezer Yudkowsky's recent dialogues on AI safety. Previously in sequence: Yudkowsky Contra Ngo On Agents. Now we're up to Yudkowsky contra Cotra on biological anchors, but before we get there we need to figure out what Cotra's talking about and what's going on.
The Open Philanthropy Project ("Open Phil") is a big effective altruist foundation interested in funding AI safety. It's got $20 billion, probably the majority of money in the field, so its decisions matter a lot and it's very invested in getting things right. In 2020, it asked senior researcher Ajeya Cotra to produce a report on when human-level AI would arrive. It says the resulting document is "informal" - but it's 169 pages long and likely to affect millions of dollars in funding, which some might describe as making it kind of formal. The report finds a 10% chance of "transformative AI" by 2031, a 50% chance by 2052, and an almost 80% chance by 2100.
Eliezer rejects their methodology and expects AI earlier (he doesn't offer many numbers, but here he gives Bryan Caplan 50-50 odds on 2030, albeit not totally seriously). He made the case in his own very long essay, Biology-Inspired AGI Timelines: The Trick That Never Works, sparking a bunch of arguments and counterarguments and even more long essays.
Source: https://astralcodexten.substack.com/p/biological-anchors-a-trick-that-might
Crossposted from the Astral Codex Ten podcast.
---
A podcast by BlueDot Impact.
Jan 4, 2025 • 18min

Superintelligence: Instrumental Convergence

Audio versions of blogs and papers from BlueDot courses.
According to the orthogonality thesis, intelligent agents may have an enormous range of possible final goals. Nevertheless, according to what we may term the "instrumental convergence" thesis, there are some instrumental goals likely to be pursued by almost any intelligent agent, because there are some objectives that are useful intermediaries to the achievement of almost any final goal. We can formulate this thesis as follows:
The instrumental convergence thesis: "Several instrumental values can be identified which are convergent in the sense that their attainment would increase the chances of the agent's goal being realized for a wide range of final goals and a wide range of situations, implying that these instrumental values are likely to be pursued by a broad spectrum of situated intelligent agents."
Original article: https://drive.google.com/file/d/1KewDov1taegTzrqJ4uurmJ2CJ0Y72EU3/view
Author: Nick Bostrom
A podcast by BlueDot Impact.
Jan 4, 2025 • 7min

Learning From Human Preferences

Audio versions of blogs and papers from BlueDot courses.
One step towards building safe AI systems is to remove the need for humans to write goal functions, since using a simple proxy for a complex goal, or getting the complex goal a bit wrong, can lead to undesirable and even dangerous behavior. In collaboration with DeepMind's safety team, we've developed an algorithm which can infer what humans want by being told which of two proposed behaviors is better.
Original article: https://openai.com/research/learning-from-human-preferences
Authors: Dario Amodei, Paul Christiano, Alex Ray
A podcast by BlueDot Impact.
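The core idea of inferring what humans want from pairwise comparisons can be sketched in a few lines: a reward model is trained so that the segment the human prefers gets a higher predicted total reward. The Python sketch below is illustrative only; the network size, optimiser, and the synthetic "human" labels are placeholder assumptions, not the authors' implementation.

```python
# Minimal sketch: fit a reward model from pairwise preferences using a
# Bradley-Terry-style loss, P(a preferred over b) = sigmoid(R(a) - R(b)).
import torch
import torch.nn as nn

obs_dim = 4
reward_model = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

def segment_return(segment: torch.Tensor) -> torch.Tensor:
    """Sum predicted per-step rewards over a trajectory segment of shape (T, obs_dim)."""
    return reward_model(segment).sum()

def fake_human_prefers_a(seg_a: torch.Tensor, seg_b: torch.Tensor) -> bool:
    # Synthetic stand-in for a human label: prefer the segment with larger first feature.
    return bool(seg_a[:, 0].sum() > seg_b[:, 0].sum())

for step in range(200):
    seg_a, seg_b = torch.randn(10, obs_dim), torch.randn(10, obs_dim)
    label = torch.tensor(1.0 if fake_human_prefers_a(seg_a, seg_b) else 0.0)
    logit = segment_return(seg_a) - segment_return(seg_b)
    loss = nn.functional.binary_cross_entropy_with_logits(logit, label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
# The trained reward model could then supply the reward signal for an RL agent.
```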
