MLOps.community

It's 2026, and We're Still Talking Evals

Apr 21, 2026
Maggie Konstanty is an AI product manager who builds and evaluates large-scale agents for food ordering and ecommerce. She explains why single accuracy numbers mislead, contrasts pre‑ship simulations with production traces, and digs into drop‑off analytics, the dangers of running many disconnected evaluators, and why teams often end up building custom eval tooling.
INSIGHT

Accuracy Percentages Lie Without Context

  • Accuracy percentages are misleading without a clear definition of "good" for your use case.
  • Maggie says "95% accurate" means nothing unless you tie the metric to concrete behaviors and failure modes like offering meat to vegetarians.
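
To make the point above concrete, here is a minimal sketch of how one aggregate number can hide a failure mode. The data is invented for illustration (the segments, counts, and the `accuracy_by_segment` helper are assumptions, not figures from the episode): 100 labeled responses, of which 5 come from vegetarian users.

```python
from collections import defaultdict

# Hypothetical labeled results: (user segment, was the response acceptable?).
# These counts are made up purely to illustrate the arithmetic.
results = (
    [("omnivore", True)] * 94
    + [("omnivore", False)] * 1
    + [("vegetarian", True)] * 1
    + [("vegetarian", False)] * 4
)


def accuracy_by_segment(rows):
    """Return overall accuracy and a dict of accuracy per segment."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for segment, ok in rows:
        totals[segment] += 1
        passes[segment] += int(ok)
    overall = sum(passes.values()) / sum(totals.values())
    per_segment = {s: passes[s] / totals[s] for s in totals}
    return overall, per_segment


overall, per_segment = accuracy_by_segment(results)
print(f"overall: {overall:.2f}")
for segment, acc in per_segment.items():
    print(f"{segment}: {acc:.2f}")
```

With these invented numbers the headline figure is 95%, yet the vegetarian segment passes only 1 time in 5. The same breakdown could be keyed by failure mode instead of user segment.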
ADVICE

Simulate Personas And Repeat Scenarios

  • Simulate users with crafted personas and run each scenario multiple times to measure variance and surface nondeterministic failure modes.
  • Maggie recommends calling the agent endpoint with persona prompts and repeating scenarios 1–100 times to find where it loses track.
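
The repeat-and-measure loop described above could be sketched roughly as follows. Everything here is an assumption for illustration: `call_agent` stands in for whatever client calls your agent endpoint, the persona prompt and scripted turns are invented, and `make_flaky_stub` is a fake agent used only so the sketch runs without a real service.

```python
import random


def run_scenario(call_agent, persona_prompt, scenario_turns, check):
    """Play one scripted scenario; return True only if every reply passes `check`."""
    history = [persona_prompt]
    for turn in scenario_turns:
        reply = call_agent(history, turn)
        history.extend([turn, reply])
        if not check(reply):
            return False
    return True


def repeat_scenario(call_agent, persona_prompt, scenario_turns, check, n_runs=20):
    """Run the same scenario n_runs times and summarize how often it passes."""
    outcomes = [
        run_scenario(call_agent, persona_prompt, scenario_turns, check)
        for _ in range(n_runs)
    ]
    return {
        "runs": n_runs,
        "pass_rate": sum(outcomes) / n_runs,
        "failures": outcomes.count(False),
    }


def make_flaky_stub(seed, failure_rate=0.1):
    """Hypothetical stand-in for a real agent endpoint; fails at a fixed rate."""
    rng = random.Random(seed)

    def call_agent(history, user_message):
        if rng.random() < failure_rate:
            return "How about the beef burger?"
        return "How about the veggie burger?"

    return call_agent


persona = "You are a vegetarian customer ordering dinner for two."
turns = ["What do you recommend?", "Anything else?", "One more idea?"]


def no_beef(reply):
    # Deliberately crude check: fail any reply that mentions beef.
    return "beef" not in reply.lower()


report = repeat_scenario(make_flaky_stub(seed=0), persona, turns, no_beef, n_runs=50)
print(report)
```

A pass rate well below 1.0 on an otherwise identical scenario is the signal of interest: it points at nondeterministic behavior that a single run would miss.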
ADVICE

Define 'Good' Then Measure That

  • Define what "good" means for your product before choosing evaluators, and tie core evaluators to business metrics.
  • For a food ordering agent, Maggie ties recommendation evaluators to conversion and checks that vegetarian users are never offered meat.
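
As a rough sketch of the pairing described above, the following evaluator reports a hard constraint (no meat for vegetarian users) alongside a business metric (conversion). The session schema, the `MEAT_TERMS` list, and the sample data are all hypothetical; a production check would more likely rely on structured menu metadata than on keyword matching.

```python
# Illustrative keyword list, not exhaustive; an assumption for this sketch.
MEAT_TERMS = {"beef", "chicken", "pork", "bacon", "lamb"}


def offers_meat(recommended_items):
    """Crude substring check for whether any recommended item names a meat."""
    return any(
        term in item.lower() for item in recommended_items for term in MEAT_TERMS
    )


def evaluate_sessions(sessions):
    """Report constraint violations next to the conversion rate."""
    violations = [
        s["session_id"]
        for s in sessions
        if s["is_vegetarian"] and offers_meat(s["recommended"])
    ]
    conversion_rate = sum(s["ordered"] for s in sessions) / len(sessions)
    return {"vegetarian_violations": violations, "conversion_rate": conversion_rate}


# Invented sessions using a hypothetical schema.
sessions = [
    {"session_id": "a1", "is_vegetarian": True, "recommended": ["Veggie Burger"], "ordered": True},
    {"session_id": "a2", "is_vegetarian": True, "recommended": ["Bacon Wrap"], "ordered": False},
    {"session_id": "a3", "is_vegetarian": False, "recommended": ["Chicken Bowl"], "ordered": True},
    {"session_id": "a4", "is_vegetarian": False, "recommended": ["Lamb Kebab"], "ordered": False},
]

summary = evaluate_sessions(sessions)
print(summary)
```

Treating the vegetarian check as a list of violating sessions rather than a percentage keeps it a "never" rule: any non-empty list is a failure, however good the conversion rate looks.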