Benchmarks: wrong metrics and overfitting risks

Dan critiques common benchmarks, dataset overfitting, and the difference between average accuracy and consistency.

Episode notes

Every few weeks at Microsoft, someone would build an AI prototype that blew everyone's minds. Three months later? Dead. "We can never ship that." Dan Klein watched this happen for five years before he decided to do something about it.

Dan is co-founder and CTO of Scaled Cognition, a professor of computer science at UC Berkeley, and winner of the ACM Grace Murray Hopper Award. His previous startups include adap.tv (acquired by AOL for $405M) and Semantic Machines (acquired by Microsoft in 2018); after that acquisition he spent five years at Microsoft integrating conversational AI. His PhD students now run AI teams at Google, Stanford, MIT, and OpenAI.

At Scaled Cognition, Dan's team built APT1 (the Agentic Pre-trained Transformer) for under $11 million. It's a model designed for actions, not tokens, with structural guarantees that go beyond prompt-and-pray.

Dan makes the case that current LLMs are plausibility engines, not truth engines, and that the gap between demo and production is where most AI projects die.

Why prompting is a fundamentally unreliable control surface for production AI
How APT1's architecture gives actions and information first-class status instead of treating everything as tokens
The specific failure modes that kill enterprise AI prototypes within three months
Why stacking multiple models to check each other produces correlated errors, not reliability
How Scaled Cognition applied RL to conversational AI when there's no zero-sum winner
Why every S-curve in AI gets mistaken for an exponential — and what comes after the current plateau
The societal risk of systems that produce output indistinguishable from truth

Chapters
(0:00) Cold open: RL is about doubling down on what works
(0:28) Introducing Dan Klein and Scaled Cognition
(2:53) The demo-to-production gap: why AI prototypes die
(5:40) Why prompting is not a real control surface
(8:06) Modular decomposition vs. end-to-end optimization
(10:55) Are LLMs fundamentally mismatched with how we use them?
(14:26) What's wrong with benchmarks today
(20:27) APT1: building a model for actions, not tokens
(24:14) What makes data truly agentic
(28:02) Hallucinations as an iceberg — visible vs. undetectable
(34:16) Building a prototype model for under $11 million
(39:57) Applying RL to conversations without a zero-sum winner
(43:31) LLMs as a condensation of the web — and what happens when it runs out
(50:07) Reasoning models: where they work and where they don't
(53:04) Early deployments in regulated industries
(57:14) Why multi-model checking fails
(1:00:34) The minimum bar for trustworthy agentic systems
(1:04:07) Societal risk: when AI output is indistinguishable from truth
(1:13:33) Where Dan is inspired in AI research today

Connect with Dan Klein:

Scaled Cognition: https://scaledcognition.com
LinkedIn: https://www.linkedin.com/in/dan-klein/
UC Berkeley NLP Group: https://nlp.cs.berkeley.edu

Connect with Conor:

Newsletter: https://newsletter.chainofthought.show/
Twitter/X: https://x.com/ConorBronsdon
LinkedIn: https://www.linkedin.com/in/conorbronsdon/
YouTube: https://www.youtube.com/@ConorBronsdon

More episodes: https://chainofthought.show

Thanks to Galileo — download their free 165-page guide to mastering multi-agent systems at galileo.ai/mastering-multi-agent-systems
