
Why LLMs Are Plausibility Engines, Not Truth Engines | Dan Klein
Chain of Thought | AI Agents, Infrastructure & Engineering
Every few weeks at Microsoft, someone would build an AI prototype that blew everyone's minds. Three months later? Dead. "We can never ship that." Dan Klein watched this happen for five years before he decided to do something about it.
Dan is co-founder and CTO of Scaled Cognition, a professor of computer science at UC Berkeley, and winner of the ACM Grace Murray Hopper Award. His previous startups include adap.tv (acquired by AOL for $405M) and Semantic Machines (acquired by Microsoft in 2018), where he spent five years integrating conversational AI into Microsoft's products. His former PhD students now lead AI teams at Google and OpenAI and research groups at Stanford and MIT.
At Scaled Cognition, Dan's team built APT1 (the Agentic Pre-trained Transformer) for under $11 million. It's a model designed for actions, not tokens, with structural guarantees that go beyond prompt-and-pray.
Dan makes the case that current LLMs are plausibility engines, not truth engines, and that the gap between demo and production is where most AI projects die. In this conversation, he and Conor cover:
- Why prompting is a fundamentally unreliable control surface for production AI
- How APT1's architecture gives actions and information first-class status instead of treating everything as tokens
- The specific failure modes that kill enterprise AI prototypes within three months
- Why stacking multiple models to check each other produces correlated errors, not reliability (see the toy simulation after this list)
- How Scaled Cognition applied RL to conversational AI when there's no zero-sum winner
- Why every S-curve in AI gets mistaken for an exponential — and what comes after the current plateau
- The societal risk of systems that produce output indistinguishable from truth
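To make the correlated-errors point concrete, here is a minimal Python sketch (ours, not from the episode; the 20% miss rate and the "hard input" split are made-up assumptions) comparing two checkers whose mistakes are independent against two checkers that share a blind spot:

```python
import random

# Toy illustration: a second model that "checks" the first helps far
# less when the two models' errors are correlated. Each checker misses
# an error with probability ~0.2 overall in both scenarios; only the
# correlation structure differs.

random.seed(0)
N = 100_000

def trial(correlated: bool) -> bool:
    """Return True if at least one of two checkers catches the error."""
    if correlated:
        # Shared failure mode: on "hard" inputs (20% of cases) both
        # checkers miss most of the time. Per-checker miss rate is
        # still 0.2 * 0.9 + 0.8 * 0.025 = 0.2 overall.
        hard = random.random() < 0.2
        p_miss = 0.9 if hard else 0.025
        a_missed = random.random() < p_miss
        b_missed = random.random() < p_miss
    else:
        # Independent failure modes: each checker misses 20% of errors,
        # on unrelated inputs.
        a_missed = random.random() < 0.2
        b_missed = random.random() < 0.2
    return not (a_missed and b_missed)

for mode in (False, True):
    caught = sum(trial(mode) for _ in range(N))
    label = "correlated" if mode else "independent"
    print(f"{label}: error caught {caught / N:.1%} of the time")
```

Under these assumed numbers, independent checkers catch roughly 96% of errors (1 - 0.2 x 0.2), while the correlated pair catches only about 84%, because the misses cluster on the same hard inputs. That gap is the statistical core of Dan's argument against multi-model checking as a reliability strategy.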
Chapters
(0:00) Cold open: RL is about doubling down on what works
(0:28) Introducing Dan Klein and Scaled Cognition
(2:53) The demo-to-production gap: why AI prototypes die
(5:40) Why prompting is not a real control surface
(8:06) Modular decomposition vs. end-to-end optimization
(10:55) Are LLMs fundamentally mismatched with how we use them?
(14:26) What's wrong with benchmarks today
(20:27) APT1: building a model for actions, not tokens
(24:14) What makes data truly agentic
(28:02) Hallucinations as an iceberg — visible vs. undetectable
(34:16) Building a prototype model for under $11 million
(39:57) Applying RL to conversations without a zero-sum winner
(43:31) LLMs as a condensation of the web — and what happens when it runs out
(50:07) Reasoning models: where they work and where they don't
(53:04) Early deployments in regulated industries
(57:14) Why multi-model checking fails
(1:00:34) The minimum bar for trustworthy agentic systems
(1:04:07) Societal risk: when AI output is indistinguishable from truth
(1:13:33) Where Dan is inspired in AI research today
Connect with Dan Klein:
- Scaled Cognition: https://scaledcognition.com
- LinkedIn: https://www.linkedin.com/in/dan-klein/
- UC Berkeley NLP Group: https://nlp.cs.berkeley.edu
Connect with Conor:
- Newsletter: https://newsletter.chainofthought.show/
- Twitter/X: https://x.com/ConorBronsdon
- LinkedIn: https://www.linkedin.com/in/conorbronsdon/
- YouTube: https://www.youtube.com/@ConorBronsdon
More episodes: https://chainofthought.show
Thanks to Galileo — download their free 165-page guide to mastering multi-agent systems at galileo.ai/mastering-multi-agent-systems