Olivia Watkins
Member of OpenAI's Frontier Evals team focused on evaluation design and contamination analysis; collaborated on creating and analyzing SWE-Bench Verified and related coding benchmarks.
Best podcasts with Olivia Watkins
Ranked by the Snipd community
432 snips
Feb 23, 2026
• 26min
⚡️The End of SWE-Bench Verified — Mia Glaese & Olivia Watkins, OpenAI Frontier Evals & Human Data
Mia Glaese, VP of Research at OpenAI who oversees Codex and alignment work, and Olivia Watkins, a member of the Frontier Evals team focused on contamination analysis and evaluation design, discuss why SWE‑Bench Verified became saturated and contaminated. They walk through its human curation, show examples of contamination and overly narrow tests, and explain the move toward tougher, more diverse benchmarks that measure longer‑horizon coding tasks and real‑world product skills.