BlueDot Narrated

BlueDot Impact
Jan 4, 2025 • 29min

Why AI Alignment Could Be Hard With Modern Deep Learning

A dive into why modern deep learning might produce models with unexpected, harmful motivations. An analogy about an eight-year-old hiring adults illustrates training selection effects. Discussion of sycophant and schemer behaviors that seek approval or hide true goals. Covers how training shortcuts and reward signals can produce surprising, risky model strategies.
Jan 4, 2025 • 40min

AI Safety via Debate

Audio versions of blogs and papers from BlueDot courses.
Abstract: To make AI systems broadly useful for challenging real-world tasks, we need them to learn complex human goals and preferences. One approach to specifying complex goals asks humans to judge during training which agent behaviors are safe and useful, but this approach can fail if the task is too complicated for a human to directly judge. To help address this concern, we propose training agents via self play on a zero sum debate game. Given a question or proposed action, two agents take turns making short statements up to a limit, then a human judges which of the agents gave the most true, useful information. In an analogy to complexity theory, debate with optimal play can answer any question in PSPACE given polynomial time judges (direct judging answers only NP questions). In practice, whether debate works involves empirical questions about humans and the tasks we want AIs to perform, plus theoretical questions about the meaning of AI alignment. We report results on an initial MNIST experiment where agents compete to convince a sparse classifier, boosting the classifier's accuracy from 59.4% to 88.9% given 6 pixels and from 48.2% to 85.2% given 4 pixels. Finally, we discuss theoretical and practical aspects of the debate model, focusing on potential weaknesses as the model scales up, and we propose future human and computer experiments to test these properties.
Original text: https://arxiv.org/abs/1805.00899
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.
A podcast by BlueDot Impact.
Jan 4, 2025 • 36min

Robust Feature-Level Adversaries Are Interpretability Tools

Abstract: The literature on adversarial attacks in computer vision typically focuses on pixel-level perturbations. These tend to be very difficult to interpret. Recent work that manipulates the latent representations of image generators to create "feature-level" adversarial perturbations gives us an opportunity to explore perceptible, interpretable adversarial attacks. We make three contributions. First, we observe that feature-level attacks provide useful classes of inputs for studying representations in models. Second, we show that these adversaries are uniquely versatile and highly robust. We demonstrate that they can be used to produce targeted, universal, disguised, physically-realizable, and black-box attacks at the ImageNet scale. Third, we show how these adversarial images can be used as a practical interpretability tool for identifying bugs in networks. We use these adversaries to make predictions about spurious associations between features and classes which we then test by designing "copy/paste" attacks in which one natural image is pasted into another to cause a targeted misclassification. Our results suggest that feature-level attacks are a promising approach for rigorous interpretability research.
Original text: https://arxiv.org/abs/2110.03605
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.
Jan 4, 2025 • 14min

Introduction to Logical Decision Theory for Computer Scientists

Decision theories differ on exactly how to calculate the expectation--the probability of an outcome, conditional on an action. This foundational difference bubbles up to real-life questions about whether to vote in elections, or accept a lowball offer at the negotiating table. When you're thinking about what happens if you don't vote in an election, should you calculate the expected outcome as if only your vote changes, or as if all the people sufficiently similar to you would also decide not to vote? Questions like these belong to a larger class of problems, Newcomblike decision problems, in which some other agent is similar to us or reasoning about what we will do in the future. The central principle of 'logical decision theories', several families of which will be introduced, is that we ought to choose as if we are controlling the logical output of our abstract decision algorithm. Newcomblike considerations--which might initially seem like unusual special cases--become more prominent as agents can get higher-quality information about what algorithms or policies other agents use: Public commitments, machine agents with known code, smart contracts running on Ethereum. Newcomblike considerations also become more important as we deal with agents that are very similar to one another; or with large groups of agents that are likely to contain high-similarity subgroups; or with problems where even small correlations are enough to swing the decision. In philosophy, the debate over decision theories is seen as a debate over the principle of rational choice. Do 'rational' agents refrain from voting in elections, because their one vote is very unlikely to change anything? Do we need to go beyond 'rationality', into 'social rationality' or 'superrationality' or something along those lines, in order to describe agents that could possibly make up a functional society?
Original text: https://arbital.com/p/logical_dt/?l=5d6
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.
Jan 4, 2025 • 12min

Takeaways From Our Robust Injury Classifier Project [Redwood Research]

With the benefit of hindsight, we have a better sense of our takeaways from our first adversarial training project (paper). Our original aim was to use adversarial training to make a system that (as far as we could tell) never produced injurious completions. If we had accomplished that, we think it would have been the first demonstration of a deep learning system avoiding a difficult-to-formalize catastrophe with an ultra-high level of reliability. Presumably, we would have needed to invent novel robustness techniques that could have informed techniques useful for aligning TAI. With a successful system, we also could have performed ablations to get a clear sense of which building blocks were most important. Alas, we fell well short of that target. We still saw failures when just randomly sampling prompts and completions. Our adversarial training didn’t reduce the random failure rate, nor did it eliminate highly egregious failures (example below). We also don’t think we've successfully demonstrated a negative result, given that our results could be explained by suboptimal choices in our training process. Overall, we’d say this project had value as a learning experience but produced much less alignment progress than we hoped.
Source: https://www.alignmentforum.org/posts/n3LAgnHg6ashQK3fF/takeaways-from-our-robust-injury-classifier-project-redwood
Narrated for AI Safety Fundamentals by TYPE III AUDIO.
Jan 4, 2025 • 32min

Feature Visualization

There is a growing sense that neural networks need to be interpretable to humans. The field of neural network interpretability has formed in response to these concerns. As it matures, two major threads of research have begun to coalesce: feature visualization and attribution. This article focuses on feature visualization. While feature visualization is a powerful tool, actually getting it to work involves a number of details. In this article, we examine the major issues and explore common approaches to solving them. We find that remarkably simple methods can produce high-quality visualizations. Along the way we introduce a few tricks for exploring variation in what neurons react to, how they interact, and how to improve the optimization process.
Original text: https://distill.pub/2017/feature-visualization/
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.
Jan 4, 2025 • 17min

Understanding Intermediate Layers Using Linear Classifier Probes

Abstract: Neural network models have a reputation for being black boxes. We propose to monitor the features at every layer of a model and measure how suitable they are for classification. We use linear classifiers, which we refer to as "probes", trained entirely independently of the model itself. This helps us better understand the roles and dynamics of the intermediate layers. We demonstrate how this can be used to develop a better intuition about models and to diagnose potential problems. We apply this technique to the popular models Inception v3 and Resnet-50. Among other things, we observe experimentally that the linear separability of features increases monotonically along the depth of the model.
Original text: https://arxiv.org/pdf/1610.01644.pdf
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.
Jan 4, 2025 • 23min

Progress on Causal Influence Diagrams

By Tom Everitt, Ryan Carey, Lewis Hammond, James Fox, Eric Langlois, and Shane Legg
About 2 years ago, we released the first few papers on understanding agent incentives using causal influence diagrams. This blog post will summarize progress made since then. What are causal influence diagrams? A key problem in AI alignment is understanding agent incentives. Concerns have been raised that agents may be incentivized to avoid correction, manipulate users, or inappropriately influence their learning. This is particularly worrying as training schemes often shape incentives in subtle and surprising ways. For these reasons, we’re developing a formal theory of incentives based on causal influence diagrams (CIDs).
Source: https://deepmindsafetyresearch.medium.com/progress-on-causal-influence-diagrams-a7a32180b0d1
Narrated for AI Safety Fundamentals by TYPE III AUDIO.
Jan 4, 2025 • 10min

Measuring Progress on Scalable Oversight for Large Language Models

Abstract: Developing safe and useful general-purpose AI systems will require us to make progress on scalable oversight: the problem of supervising systems that potentially outperform us on most skills relevant to the task at hand. Empirical work on this problem is not straightforward, since we do not yet have systems that broadly exceed our abilities. This paper discusses one of the major ways we think about this problem, with a focus on ways it can be studied empirically. We first present an experimental design centered on tasks for which human specialists succeed but unaided humans and current general AI systems fail. We then present a proof-of-concept experiment meant to demonstrate a key feature of this experimental design and show its viability with two question-answering tasks: MMLU and time-limited QuALITY. On these tasks, we find that human participants who interact with an unreliable large-language-model dialog assistant through chat -- a trivial baseline strategy for scalable oversight -- substantially outperform both the model alone and their own unaided performance. These results are an encouraging sign that scalable oversight will be tractable to study with present models and bolster recent findings that large language models can productively assist humans with difficult tasks.
Authors: Samuel R. Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen, Craig Pettit, Scott Heiner, Kamilė Lukošiūtė, Amanda Askell, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Christopher Olah, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Jackson Kernion, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Liane Lovitt, Nelson Elhage, Nicholas Schiefer, Nicholas Joseph, Noemí Mercado, Nova DasSarma, Robin Larson, Sam McCandlish, Sandipan Kundu, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Timothy Telleen-Lawton, Tom Brown, Tom Henighan, Tristan Hume, Yuntao Bai, Zac Hatfield-Dodds, Ben Mann, Jared Kaplan
Original text: https://arxiv.org/abs/2211.03540
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.
Jan 4, 2025 • 1h 2min

Yudkowsky Contra Christiano on AI Takeoff Speeds

In 2008, thousands of blog readers - including yours truly, who had discovered the rationality community just a few months before - watched Robin Hanson debate Eliezer Yudkowsky on the future of AI.
Robin thought the AI revolution would be a gradual affair, like the Agricultural or Industrial Revolutions. Various people invent and improve various technologies over the course of decades or centuries. Each new technology provides another jumping-off point for people to use when inventing other technologies: mechanical gears → steam engine → railroad and so on. Over the course of a few decades, you’ve invented lots of stuff and the world is changed, but there’s no single moment when “industrialization happened”.
Eliezer thought it would be lightning-fast. Once researchers started building human-like AIs, some combination of adding more compute, and the new capabilities provided by the AIs themselves, would quickly catapult AI to unimaginably superintelligent levels. The whole process could take between a few hours and a few years, depending on what point you measured from, but it wouldn’t take decades.
You can imagine the graph above as being GDP over time, except that Eliezer thinks AI will probably destroy the world, which might be bad for GDP in some sense. If you come up with some way to measure (in dollars) whatever kind of crazy technologies AIs create for their own purposes after wiping out humanity, then the GDP framing will probably work fine.
Crossposted from the Astral Codex Ten Podcast.
