

BlueDot Narrated
BlueDot Impact
Audio versions of the core readings, blog posts, and papers from BlueDot courses.
Episodes

Jan 4, 2025 • 15min
How to Succeed as an Early-Stage Researcher: The “Lean Startup” Approach
I am approaching the end of my AI governance PhD, and I've spent about 2.5 years as a researcher at FHI. During that time, I've learnt a lot about the formula for successful early-career research. This post summarises my advice for people in the first couple of years. Research is really hard, and I want people to avoid the mistakes I've made.
Original text: https://forum.effectivealtruism.org/posts/jfHPBbYFzCrbdEXXd/how-to-succeed-as-an-early-stage-researcher-the-lean-startup#Conclusion
Author: Toby Shevlane

Jan 4, 2025 • 1h 2min
Constitutional AI Harmlessness from AI Feedback
They walk through Constitutional AI’s two-stage training and how models can supervise other models. The conversation covers critique–revise pipelines, chain-of-thought for feedback, and using model-generated labels for RL. Listeners hear about experiments on harmlessness vs helpfulness, effects of multiple revisions and principles, and risks like overtraining and tone drift.
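To make the critique–revise pipeline concrete, here is a minimal sketch in Python. The `generate` function and the two principles are illustrative placeholders, not Anthropic's actual API or constitution; in the full method, the revised answers become supervised fine-tuning data before the RL-from-AI-feedback stage.

```python
# Sketch of a Constitutional AI critique-and-revise loop.
# `generate` is a stand-in for any chat-model call; swap in a real client.
PRINCIPLES = [
    "Choose the response least likely to assist with harmful activity.",  # illustrative
    "Choose the response that is most honest and least evasive.",         # illustrative
]

def generate(prompt: str) -> str:
    """Placeholder for a model call; replace with a real API client."""
    return f"[model output for: {prompt[:40]}...]"

def critique_and_revise(question: str, num_rounds: int = 2) -> str:
    response = generate(question)
    for i in range(num_rounds):
        principle = PRINCIPLES[i % len(PRINCIPLES)]
        # The model critiques its own answer against one principle...
        critique = generate(
            f"Question: {question}\nResponse: {response}\n"
            f"Critique the response according to: {principle}"
        )
        # ...then rewrites the answer to address that critique.
        response = generate(
            f"Question: {question}\nResponse: {response}\n"
            f"Critique: {critique}\nRewrite the response to address the critique."
        )
    return response  # revised answers would be collected as fine-tuning data

print(critique_and_revise("How do I pick a strong password?"))
```

The episode's findings on multiple revisions and multiple principles correspond to varying `num_rounds` and the contents of `PRINCIPLES` in a loop like this.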

Jan 4, 2025 • 12min
Introduction to Mechanistic Interpretability
A clear intro to mechanistic interpretability and why understanding model internals matters for high-stakes decisions. Covers risks like deception, sycophancy, and reward hacking. Explains features, circuits, polysemanticity, superposition, and sparse autoencoders. Mentions Anthropic’s scaling work and early attempts at feature steering to alter model behavior.
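As a rough illustration of the sparse autoencoder idea the episode mentions, here is a minimal PyTorch sketch that decomposes hidden activations into a larger set of sparsely firing features. The layer sizes and sparsity coefficient are assumptions for illustration, not values from any paper covered in the episode.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Decompose model activations into an overcomplete, sparse feature basis."""
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # activations -> features
        self.decoder = nn.Linear(d_features, d_model)  # features -> reconstruction

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))      # non-negative feature activations
        recon = self.decoder(features)
        return recon, features

sae = SparseAutoencoder(d_model=512, d_features=4096)  # 8x overcomplete (illustrative)
acts = torch.randn(64, 512)                            # stand-in for real model activations

recon, features = sae(acts)
# Objective: reconstruct activations well while keeping features sparse.
loss = ((recon - acts) ** 2).mean() + 1e-3 * features.abs().mean()
loss.backward()
```

The L1 penalty is what pushes each feature to fire rarely, which is how this approach tries to pull apart the polysemantic, superposed directions described above.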

Jan 4, 2025 • 12min
Empirical Findings Generalize Surprisingly Far
Previously, I argued that emergent phenomena in machine learning mean that we can't rely on current trends to predict what the future of ML will be like. In this post, I will argue that despite this, empirical findings often do generalize very far, including across "phase transitions" caused by emergent behavior.
This might seem like a contradiction, but actually I think divergence from current trends and empirical generalization are consistent. Findings do often generalize, but you need to think to determine the right generalization, and also about what might stop any given generalization from holding.
I don't think many people would contest the claim that empirical investigation can uncover deep and generalizable truths. This is one of the big lessons of physics, and while some might attribute physics' success to math instead of empiricism, I think it's clear that you need empirical data to point to the right mathematics.
However, just invoking physics isn't a good argument, because physical laws have fundamental symmetries that we shouldn't expect in machine learning. Moreover, we care specifically about findings that continue to hold up after some sort of emergent behavior (such as few-shot learning in the case of ML). So, to make my case, I'll start by considering examples in deep learning that have held up in this way. Since "modern" deep learning hasn't been around that long, I'll also look at examples from biology, a field that has been around for a relatively long time and where More Is Different is ubiquitous (see Appendix: More Is Different In Other Domains).
Source: https://bounded-regret.ghost.io/empirical-findings-generalize-surprisingly-far/
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.

Jan 4, 2025 • 11min
Planning a High-Impact Career: A Summary of Everything You Need to Know in 7 Points
We took 10 years of research and what we've learned from advising 1,000+ people on how to build high-impact careers, compressed that into an eight-week course to create your career plan, and then compressed that into this three-page summary of the main points. (It's especially aimed at people who want a career that's both satisfying and has a significant positive impact, but much of the advice applies to all career decisions.)
Original article: https://80000hours.org/career-planning/summary/
Author: Benjamin Todd

Jan 4, 2025 • 1h 9min
Working in AI Alignment
This guide is written for people who are considering direct work on technical AI alignment. I expect it to be most useful for people who are not yet working on alignment, and for people who are already familiar with the arguments for working on AI alignment. If you aren't familiar with the arguments for the importance of AI alignment, you can get an overview of them by doing the AI Alignment Course.
By Charlie Rogers-Smith, with minor updates by Adam Jones
Source: https://aisafetyfundamentals.com/blog/alignment-careers-guide
Narrated for AI Safety Fundamentals by Perrin Walker

Jan 4, 2025 • 27min
Computing Power and the Governance of AI
This post summarises a new report, "Computing Power and the Governance of Artificial Intelligence." The full report is a collaboration between nineteen researchers from academia, civil society, and industry. It can be read here.
GovAI research blog posts represent the views of their authors, rather than the views of the organisation.
Source: https://www.governance.ai/post/computing-power-and-the-governance-of-ai
Narrated for AI Safety Fundamentals by Perrin Walker

Jan 4, 2025 • 9min
Gradient Hacking: Definitions and Examples
Gradient hacking is a hypothesized phenomenon where:
- A model has knowledge about possible training trajectories which isn't being used by its training algorithms when choosing updates (such as knowledge about non-local features of its loss landscape which aren't taken into account by local optimization algorithms).
- The model uses that knowledge to influence its medium-term training trajectory, even if the effects wash out in the long term.
Below I give some potential examples of gradient hacking, divided into those which exploit RL credit assignment and those which exploit gradient descent itself. My concern is that models might use techniques like these either to influence which goals they develop, or to fool our interpretability techniques. Even if those effects don't last in the long term, they might last until the model is smart enough to misbehave in other ways (e.g. specification gaming, or reward tampering), or until it's deployed in the real world, especially in the RL examples, since convergence to a global optimum seems unrealistic (and ill-defined) for RL policies trained on real-world data. However, since gradient hacking isn't very well-understood right now, both the definition above and the examples below should only be considered preliminary.
Source: https://www.alignmentforum.org/posts/EeAgytDZbDjRznPMA/gradient-hacking-definitions-and-examples
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.
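A toy sketch of the gap this definition points at: local optimizers only see local slope, not the global shape of the loss landscape. Below, gradient descent settles into a nearby shallow basin even though a deeper one exists. The loss function is invented purely for illustration and is not from the original post; gradient hacking would be a model exploiting exactly this kind of blind spot.

```python
import torch

def loss_fn(x):
    # Two basins: a shallow local minimum near x = 1 and a deeper one near x = -2.
    return (x - 1) ** 2 * (x + 2) ** 2 + 0.5 * x

x = torch.tensor(1.2, requires_grad=True)  # start inside the shallow basin
opt = torch.optim.SGD([x], lr=0.01)

for _ in range(500):
    opt.zero_grad()
    loss_fn(x).backward()  # the optimizer only ever sees the local gradient
    opt.step()

# Converges near the shallow minimum around x = 1, never reaching the
# deeper basin near x = -2 that a non-local view of the landscape reveals.
print(x.item())
```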

Jan 4, 2025 • 16min
ABS: Scanning Neural Networks for Back-Doors by Artificial Brain Stimulation
This paper presents a technique to scan neural network based AI models to determine if they are trojaned. Pre-trained AI models may contain back-doors that are injected through training or by transforming inner neuron weights. These trojaned models operate normally when regular inputs are provided, and misclassify to a specific output label when the input is stamped with a special pattern called a trojan trigger. We develop a novel technique that analyzes inner neuron behaviors by determining how output activations change when we introduce different levels of stimulation to a neuron. Neurons that substantially elevate the activation of a particular output label regardless of the provided input are considered potentially compromised. The trojan trigger is then reverse-engineered through an optimization procedure using the stimulation analysis results, to confirm that a neuron is truly compromised. We evaluate our system, ABS, on 177 models trojaned with various attack methods that target both the input space and the feature space, with various trojan trigger sizes and shapes, together with 144 benign models trained with different data and initial weight values. These models belong to 7 different model structures and 6 different datasets, including some complex ones such as ImageNet, VGG-Face and ResNet110. Our results show that ABS is highly effective, achieving over 90% detection rate in most cases (and often 100%) when only one input sample is provided for each output label. It substantially outperforms the state-of-the-art technique Neural Cleanse, which requires many input samples and small trojan triggers to achieve good performance.
Source: https://www.cs.purdue.edu/homes/taog/docs/CCS19.pdf
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.
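A minimal sketch of the stimulation analysis at the core of ABS: clamp one inner neuron to increasing activation levels and watch how each output logit responds. The toy model, layer index, and sweep values below are illustrative assumptions; the actual system adds trigger reverse-engineering on top of this scan.

```python
import torch
import torch.nn as nn

model = nn.Sequential(             # stand-in for a pre-trained classifier
    nn.Flatten(),
    nn.Linear(28 * 28, 128), nn.ReLU(),
    nn.Linear(128, 10),
)
x = torch.randn(1, 1, 28, 28)      # ABS needs only one sample per output label

def logits_with_neuron_clamped(neuron: int, level: float) -> torch.Tensor:
    def clamp_hook(module, inp, out):
        out = out.clone()
        out[:, neuron] = level     # "stimulate": override this neuron's activation
        return out
    handle = model[2].register_forward_hook(clamp_hook)  # hook the hidden ReLU layer
    try:
        with torch.no_grad():
            return model(x)
    finally:
        handle.remove()

# Sweep stimulation levels. A compromised neuron would push one output
# label's logit up monotonically, regardless of the provided input.
for level in [0.0, 5.0, 10.0, 20.0]:
    print(level, logits_with_neuron_clamped(neuron=7, level=level))
```

Repeating the sweep over all hidden neurons and inputs, and flagging neurons whose stimulation elevates a single label, is what narrows the search before the optimization step that reconstructs the trigger.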

Jan 4, 2025 • 5min
Become a Person who Actually Does Things
A narration about moving from thinking to doing and breaking the habit of postponing action. It contrasts planners with people who actually execute and frames action as a trainable skill. Practical prompts encourage fixing small problems today and adopting an identity of doing. The piece emphasizes building habits that prepare you to seize bigger opportunities.


