Latent Space: The AI Engineer Podcast

⚡️The End of SWE-Bench Verified — Mia Glaese & Olivia Watkins, OpenAI Frontier Evals & Human Data

Feb 23, 2026
Mia Glaese, VP of Research at OpenAI overseeing Codex and alignment work, and Olivia Watkins, a researcher on Frontier Evals focused on contamination and evaluation design, discuss why SWE‑Bench Verified became saturated and contaminated. They walk through its human curation process, show examples of contamination and overly narrow tests, and explain the move toward tougher, more diverse benchmarks that measure longer‑horizon coding tasks and real‑world product skills.

SWE-Bench Verified Is No Longer A Reliable North Star

  • SWE-Bench Verified is saturated and contaminated, so it no longer reliably measures real coding progress.
  • OpenAI observed both stalled measured progress and contamination, making further small score gains meaningless for tracking capability.

Nearly 100 Engineers Rebuilt SWE-Bench Verified

  • OpenAI hired nearly 100 software engineers to audit and curate SWE-Bench into about 500 higher-quality tasks called Verified.
  • The process required multiple independent expert reviews per problem to validate specs, tests, and patches.

GPT-5.2 Chain Of Thought Revealed Contamination

  • Inspection of GPT-5.2 chain-of-thought revealed the model suggesting a repository-specific argument it hadn't been told to use.
  • That behavior suggested the model was recalling training-time repository details, i.e., contamination.
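The check described above can be approximated with a simple heuristic: flag identifier-like tokens that appear in a model's chain of thought but were never given in the prompt, since those are candidates for memorized repository details. This is a minimal illustrative sketch, not OpenAI's actual contamination-detection method; the function name and token heuristic are assumptions.

```python
import re

def suspicious_identifiers(prompt: str, chain_of_thought: str) -> set[str]:
    """Return identifier-like tokens that appear in the chain of thought
    but not in the prompt -- candidate signs of memorized repository
    details (i.e., possible training-data contamination).

    Hypothetical helper for illustration only; real contamination
    analysis would need manual review of each flagged token.
    """
    ident = re.compile(r"\b[A-Za-z_][A-Za-z0-9_]*\b")
    prompt_tokens = set(ident.findall(prompt))
    cot_tokens = set(ident.findall(chain_of_thought))
    # Tokens the model "knew" without being told merit a manual look.
    return cot_tokens - prompt_tokens
```

For example, if the chain of thought mentions a repository-specific flag like `skip_validation` that the prompt never supplied, this heuristic surfaces it for a human reviewer to inspect.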