Behind the Craft

Complete Beginner's Course on AI Evaluations: Step by Step (2025) | Aman Khan

Aug 24, 2025
Aman Khan, Head of Product at Arize, specializes in AI evaluations and large language models. He explains how to create AI evaluations, including four essential types every product manager should know. The episode features a live demo of building evals for a customer support agent and discusses the importance of a golden dataset and of aligning AI with human judgment. Aman emphasizes effective prompt crafting and the iterative process needed to improve AI performance in real-world applications, particularly customer service.
ANECDOTE

Model Bad At Simple Math

  • A real example exposed a model math error: the model claimed that 45 minutes was past a 60-minute cancellation window.
  • Aman labeled that output as bad and flagged it for deeper investigation.
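
The check behind this label can be sketched in a few lines of Python. This is only an illustration, not something shown in the episode: the function names are hypothetical, and treating the window boundary as inclusive is an assumption.

```python
def within_cancellation_window(elapsed_minutes: float, window_minutes: float) -> bool:
    """Return True if the elapsed time is still inside the cancellation window.

    Assumption: the boundary is inclusive (exactly 60 minutes still counts).
    """
    return elapsed_minutes <= window_minutes


def label_output(model_says_past_window: bool,
                 elapsed_minutes: float,
                 window_minutes: float) -> str:
    """Label a model claim 'good' if it matches the arithmetic, else 'bad'."""
    actually_past = not within_cancellation_window(elapsed_minutes, window_minutes)
    return "good" if model_says_past_window == actually_past else "bad"


# The model claimed 45 minutes was past a 60-minute window.
print(label_output(True, 45, 60))  # prints "bad"
```

A deterministic check like this only works when the facts are structured; for free-text answers a human label or an LLM judge is still needed.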
ADVICE

Annotate Failures And Iterate

  • Record notes against failing examples and use them to inform prompt and policy fixes.
  • Iterate prompts with concrete changes like 'check your math' or tone constraints.
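
One way to picture this loop is a small Python sketch that stores a note per failing example and turns recurring notes into prompt changes. Everything here is hypothetical (the `Annotation` record, the `FIXES` table, the base prompt); the episode does not prescribe a data format.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    example_id: int
    label: str  # "good" or "bad"
    note: str   # free-text reason written by the reviewer


BASE_PROMPT = "You are a customer support agent. Answer using the policy provided."

# Hypothetical mapping from a keyword in a reviewer note to a prompt fix.
FIXES = {
    "math": "Check your math before answering.",
    "tone": "Keep a polite, empathetic tone.",
}


def revise_prompt(base: str, annotations: list[Annotation]) -> str:
    """Append one instruction per failure theme found in the 'bad' notes."""
    additions: list[str] = []
    for ann in annotations:
        if ann.label != "bad":
            continue
        for keyword, fix in FIXES.items():
            if keyword in ann.note.lower() and fix not in additions:
                additions.append(fix)
    return "\n".join([base, *additions])


notes = [
    Annotation(1, "bad", "Math error: 45 is less than 60"),
    Annotation(2, "good", ""),
]
print(revise_prompt(BASE_PROMPT, notes))
```

Keyword matching is a crude stand-in for reading the notes yourself; the point is that each prompt change traces back to a recorded failure.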
ANECDOTE

From Sheet To Eval Platform

  • Aman uploaded the spreadsheet as a CSV to Arize to run prompts and evals at scale.
  • The platform replayed prompts across many rows and attached eval criteria as LLM-judge prompts.
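
The replay pattern can be sketched without any platform. The Python below is a minimal illustration and does not use or represent the Arize API: `fake_model` and `fake_judge` are deterministic stand-ins for real LLM calls, and the column names, templates, and sample rows are assumptions. It also assumes the human labels in the sheet were made on the same responses.

```python
import csv
import io

# Assumed sheet layout: one question per row plus a human label.
SHEET = (
    "question,human_label\n"
    "Can I cancel 45 minutes after booking?,good\n"
    "Can I cancel 90 minutes after booking?,good\n"
)

PROMPT_TEMPLATE = (
    "Policy: cancellations are allowed for 60 minutes after booking.\n"
    "Customer: {question}"
)
JUDGE_TEMPLATE = "Question: {question}\nResponse: {response}\nAnswer good or bad."


def fake_model(prompt: str) -> str:
    """Stand-in for the support agent; a real run would call an LLM."""
    if "45 minutes" in prompt:
        return "Yes, 45 minutes is within the 60-minute window."
    return "No, the window has passed."


def fake_judge(prompt: str) -> str:
    """Stand-in for an LLM judge; deliberately naive so it can disagree."""
    return "good" if "is within" in prompt else "bad"


def run_evals(csv_text, prompt_template, judge_template, call_model, call_judge):
    """Replay the prompt on every row, then attach a judge label to each output."""
    results = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        response = call_model(prompt_template.format(**row))
        verdict = call_judge(
            judge_template.format(question=row["question"], response=response)
        )
        results.append({**row, "response": response, "judge_label": verdict})
    return results


results = run_evals(SHEET, PROMPT_TEMPLATE, JUDGE_TEMPLATE, fake_model, fake_judge)
agreement = sum(r["judge_label"] == r["human_label"] for r in results) / len(results)
print(f"judge/human agreement: {agreement:.0%}")
```

When agreement is low, as with the naive judge here, the judge prompt is the thing to revise before trusting its scores on more rows.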