Mia Glaese, VP of Research at OpenAI who oversees Codex and alignment work, and Olivia Watkins, a researcher on the Frontier Evals team focused on contamination and evaluation design, discuss why SWE-Bench Verified became saturated and contaminated. They walk through its human curation process, show examples of contamination and overly narrow tests, and explain the move toward tougher, more diverse benchmarks that measure longer-horizon coding tasks and real-world product skills.
SWE-Bench Verified Is No Longer A Reliable North Star
SWE-Bench Verified is saturated and contaminated, so it no longer reliably measures real coding progress.
OpenAI found stalls in measured progress and contamination that make further small improvements meaningless for capability tracking.
Nearly 100 Engineers Rebuilt SWE-Bench Into Verified
OpenAI hired nearly 100 software engineers to audit and curate SWE-Bench into about 500 higher-quality tasks called Verified.
The process required multiple independent expert reviews per problem to validate specs, tests, and patches.
GPT-5.2 Chain Of Thought Revealed Contamination
Inspecting GPT-5.2's chain of thought revealed the model suggesting a repository-specific argument it had never been told about.
That behavior suggested the model was recalling training-time repository details, i.e., contamination.
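The contamination signal in that last snip (a model proposing a repository-specific flag that appears nowhere in its prompt) lends itself to a simple automated check. The episode does not describe OpenAI's actual contamination auditor, so the following Python sketch is purely illustrative: the function name, the regex heuristic, and the example flag `--update-data` are assumptions, not details from the blog post or the conversation.

```python
import re

# Crude heuristic, not OpenAI's contamination auditor: flag identifiers that a
# model's reasoning trace mentions even though they never appear in the task
# prompt -- e.g. a repository-specific CLI flag the model was never told about,
# which hints at training-time exposure to the repo.
IDENTIFIER = re.compile(r"-{0,2}[A-Za-z_][A-Za-z0-9_-]{2,}")


def suspicious_identifiers(prompt: str, reasoning_trace: str, repo_vocab: set[str]) -> set[str]:
    """Return repo-specific identifiers the trace uses but the prompt never mentions.

    repo_vocab is assumed to hold identifiers mined from the repository itself
    (flag names, function names, config keys).
    """
    prompt_ids = {m.group(0).lower() for m in IDENTIFIER.finditer(prompt)}
    trace_ids = {m.group(0).lower() for m in IDENTIFIER.finditer(reasoning_trace)}
    repo_ids = {v.lower() for v in repo_vocab}
    return (trace_ids - prompt_ids) & repo_ids


# Hypothetical example: the trace proposes "--update-data", a repo-specific flag
# that the prompt never mentions.
flags = suspicious_identifiers(
    prompt="Fix the failing regression test for the plotting module.",
    reasoning_trace="I could pass --update-data to regenerate the reference images.",
    repo_vocab={"--update-data", "plot_directive"},
)
print(flags)  # {'--update-data'}
```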
Olivia Watkins (Frontier Evals team) and Mia Glaese (VP of Research at OpenAI, leading the Codex, human data, and alignment teams) discuss a new blog post (https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/) arguing that SWE-Bench Verified, long treated as a key "North Star" coding benchmark, has become saturated and highly contaminated, making it less useful for measuring real coding progress. SWE-Bench Verified originated as a major OpenAI-led cleanup of the original Princeton SWE-Bench benchmark: a large human review effort in which nearly 100 software engineers, with multiple independent reviews per problem, curated ~500 higher-quality tasks. Recent findings, however, show that many remaining failures reflect unfair or overly narrow tests (e.g., requiring specific naming or unspecified implementation details) rather than true model inability, and cite examples suggesting contamination, such as models recalling repository-specific implementation details or task identifiers. Going forward, OpenAI plans to stop reporting SWE-Bench Verified and instead focus on SWE-Bench Pro (from Scale), which is harder, more diverse (more repos and languages), includes longer tasks (1–4 hours and 4+ hours), and shows substantially less evidence of contamination under their "contamination auditor agent" analysis. We also discuss what future coding/agent benchmarks should measure beyond pass/fail tests (longer-horizon tasks, open-ended design decisions, code quality and maintainability, and real-world product-building), along with the tradeoffs between fast automated grading and human-intensive evaluation; a minimal sketch of that pass/fail grading loop appears after the chapter list below.

Chapters:
00:00 Meet the Frontier Evals Team
00:56 Why SWE-Bench Stalled
01:47 How Verified Was Built
04:32 Contamination in the Wild
06:16 Unfair Tests and Narrow Specs
08:40 When Benchmarks Saturate
10:28 Switching to SWE-Bench Pro
12:31 What Great Coding Evals Measure
18:17 Beyond Tests, Dollars, and Autonomy
21:49 Preparedness and Future Directions
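For context on the "fast automated grading" side of that tradeoff: SWE-Bench-style benchmarks score a submission by applying the model's patch to the repository and re-running a designated set of tests. The harness below is a deliberately minimal sketch, not the real SWE-Bench or SWE-Bench Pro tooling; the paths, the plain `git apply` / `pytest` invocation, and the example test ID are assumptions (the real harnesses run each task in pinned, containerized environments).

```python
import subprocess
from pathlib import Path


def grade_patch(repo_dir: Path, patch_file: Path, fail_to_pass: list[str]) -> bool:
    """Apply a model-generated patch, then re-run the previously failing tests.

    Returns True only if every designated test passes after the patch.
    """
    # Apply the candidate patch; a patch that does not apply counts as unresolved.
    applied = subprocess.run(
        ["git", "apply", str(patch_file)], cwd=repo_dir, capture_output=True
    )
    if applied.returncode != 0:
        return False

    # Re-run only the tests the reference fix is known to turn from fail to pass.
    tests = subprocess.run(
        ["python", "-m", "pytest", "-q", *fail_to_pass], cwd=repo_dir, capture_output=True
    )
    return tests.returncode == 0


# Usage sketch (hypothetical repo, patch, and test ID):
# resolved = grade_patch(
#     Path("repos/astropy"),
#     Path("patches/task_0421.diff"),
#     ["astropy/io/fits/tests/test_header.py::test_roundtrip"],
# )
```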
This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe