The InfoQ Podcast

Elena Samuylova on Large Language Model (LLM) Based Application Evaluation and LLM as a Judge

Oct 6, 2025
Elena Samuylova, Co-founder and CEO of Evidently AI, shares her expertise in LLM-powered application evaluation. She highlights the importance of distinguishing between model and system evaluations. Elena introduces the concept of using an LLM as a judge, discussing its benefits and limitations. She emphasizes the workflow for LLM evaluation, including iterative checks and stress testing. Furthermore, she advises on designing custom LLM judges and stresses the significance of team roles in this process, encouraging developers to adapt their skills as the field evolves.
INSIGHT

Model Vs System Evaluation

  • Evaluating an LLM model is different from evaluating an LLM-powered system that includes prompts and integrations.
  • Builders must test the whole system on their use case, not only benchmark model scores.
INSIGHT

Using LLMs As Evaluators

  • LLM-as-judge uses an LLM to classify qualities of another LLM's output rather than to re-answer the prompt.
  • This technique can scale human labeling but requires careful validation and design.
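The idea above can be sketched in a few lines of Python. This is a minimal, hedged illustration, not Evidently AI's implementation: `call_llm` is a hypothetical stub standing in for a real model API, and the prompt template and labels are assumptions for the example.

```python
# Minimal sketch of an LLM-as-judge evaluator.
# `call_llm` is a hypothetical stub; a real system would call a model API.

JUDGE_PROMPT = """You are an evaluator. Classify the RESPONSE to the QUESTION.
Answer with exactly one label: CORRECT or INCORRECT.

QUESTION: {question}
RESPONSE: {response}
"""

def call_llm(prompt: str) -> str:
    # Stub standing in for a real judge model; replace with your provider's client.
    return "CORRECT" if "Paris" in prompt else "INCORRECT"

def judge(question: str, response: str) -> str:
    """Ask the judge LLM to classify another LLM's output, not re-answer it."""
    verdict = call_llm(JUDGE_PROMPT.format(question=question, response=response))
    return verdict.strip().upper()

print(judge("What is the capital of France?", "Paris"))      # CORRECT
print(judge("What is the capital of France?", "Marseille"))  # INCORRECT
```

Note that the judge is asked to classify, not to answer: validating the judge's labels against human labels, as the episode stresses, is what makes this scaling trick trustworthy.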
ADVICE

Make Evaluation A Continuous Cycle

  • Start evaluation early: iterate with manual "vibe checks", then add automated scoring during experiments.
  • Before production, expand to stress testing, red teaming, regression testing, and continuous monitoring.
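The pre-production regression step can be sketched as a simple gate: score the application's outputs on a fixed eval set and fail the release if the pass rate drops. This is a hedged illustration under stated assumptions; `run_app` and `judge_ok` are hypothetical stand-ins for your pipeline and your automated check or LLM judge.

```python
# Sketch of a regression gate before release (assumed names, not a real framework).

EVAL_SET = [
    {"question": "2 + 2?", "expected": "4"},
    {"question": "Capital of Japan?", "expected": "Tokyo"},
]

def run_app(question: str) -> str:
    # Stand-in for the LLM-powered application under test.
    return {"2 + 2?": "4", "Capital of Japan?": "Tokyo"}.get(question, "")

def judge_ok(response: str, expected: str) -> bool:
    # Stand-in for an automated check or LLM judge verdict.
    return expected.lower() in response.lower()

def regression_check(threshold: float = 0.9) -> bool:
    """Fail the release if the pass rate on the eval set drops below threshold."""
    passed = sum(judge_ok(run_app(c["question"]), c["expected"]) for c in EVAL_SET)
    return passed / len(EVAL_SET) >= threshold

print(regression_check())  # True when all cases pass
```

The same harness can keep running in production as a monitoring loop, closing the continuous cycle the episode describes.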