From Atari to ChatGPT: How AI Learned to Follow Instructions
Mar 9, 2026

A lively dive into how language models evolved from game-playing systems to instruction-following chatbots. The hosts explore why next-token prediction can feel conversational and where that view falls short. The conversation covers human preference labeling, reward models, and how small labeler pools shape model behavior and biases. It also looks at scaling feedback and why bigger models do not always follow instructions better.
Language Models Are Predictors Not Command Followers
- GPT-style models predict the next token rather than explicitly following instructions.
- Katie explains that a model like ChatGPT completes text (e.g., after "you are listening to" it will output "Linear Digressions"), which accounts for its autocomplete-like behavior.
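The completion behavior described above can be sketched with a toy bigram model (a hypothetical illustration, not anything from the episode): like a GPT-style model, it only predicts a likely next token given the preceding text, with no separate notion of "following an instruction".

```python
from collections import Counter, defaultdict

# Toy corpus echoing Katie's example; the repetition gives the bigram
# counts a clear "most likely next token" at each step.
corpus = ("you are listening to linear digressions "
          "you are listening to linear digressions").split()

# Count which token follows each token in the corpus.
next_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    next_counts[prev][nxt] += 1

def complete(prompt, n_tokens=2):
    """Greedily append the most frequent next token, n_tokens times."""
    tokens = prompt.split()
    for _ in range(n_tokens):
        last = tokens[-1]
        if last not in next_counts:
            break  # unseen context: nothing to predict
        tokens.append(next_counts[last].most_common(1)[0][0])
    return " ".join(tokens)

print(complete("you are listening to"))
# → you are listening to linear digressions
```

Even this trivial predictor "answers" the prompt by continuing it, which is exactly why next-token prediction can feel conversational without any instruction-following objective.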
Scale Doesn't Fix Misaligned Objectives
- The model's training objective (next-token prediction) differs from a helpfulness objective, causing misalignment despite scale.
- Ben notes that bigger models and more data (e.g., GPT-3's 175B parameters) didn't make models reliably follow user instructions.
Pairwise Preferences Beat Absolute Scores
- Humans struggle to score outputs on absolute scales but are good at pairwise preferences.
- Ben highlights preference comparisons as the core idea that enabled instruction following via reward modeling.
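The pairwise idea above is commonly formalized with a Bradley-Terry-style loss; the sketch below is a minimal illustration of that standard formulation (not code from the episode). The reward model only needs to score the preferred response higher than the rejected one, never to assign an absolute quality number.

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-probability that the chosen response wins the comparison.

    P(chosen beats rejected) = sigmoid(r_chosen - r_rejected), so only the
    margin between the two scores matters, not their absolute values.
    """
    p_chosen = 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))
    return -math.log(p_chosen)

# The loss shrinks as the margin grows ...
assert preference_loss(2.0, 0.0) < preference_loss(1.0, 0.0)
# ... and equals ln(2) when the model can't tell the responses apart.
assert abs(preference_loss(1.0, 1.0) - math.log(2)) < 1e-9
# Shifting both scores by a constant changes nothing: absolute scale is irrelevant.
assert abs(preference_loss(5.0, 3.0) - preference_loss(2.0, 0.0)) < 1e-9
```

The shift-invariance in the last assertion is the point of the snippet: humans only have to say which output they prefer, and the reward model learns a consistent ranking from those comparisons.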
