Olivia Watkins

Member of OpenAI's Frontier Evals team focused on evaluation design and contamination analysis; collaborated on creating and analyzing SWE-Bench Verified and related coding benchmarks.

Best podcasts with Olivia Watkins

Ranked by the Snipd community

432 snips

Feb 23, 2026 • 26min

⚡️The End of SWE-Bench Verified — Mia Glaese & Olivia Watkins, OpenAI Frontier Evals & Human Data

Mia Glaese, VP of Research at OpenAI, who oversees Codex and alignment work, and Olivia Watkins, a member of the Frontier Evals team focused on contamination and evaluation design, discuss why SWE-Bench Verified became saturated and contaminated. They walk through its human curation, show examples of contamination and overly narrow tests, and explain the move toward tougher, more diverse benchmarks that measure longer-horizon coding tasks and real-world product skills.
