
Elena Samuylova on Large Language Model (LLM) Based Application Evaluation and LLM as a Judge
Oct 6, 2025
Elena Samuylova, Co-founder and CEO of Evidently AI, shares her expertise in evaluating LLM-powered applications. She distinguishes model evaluation from system evaluation, introduces the concept of using an LLM as a judge, and discusses its benefits and limitations. She outlines a workflow for LLM evaluation that includes iterative checks and stress testing, advises on designing custom LLM judges, and stresses the importance of team roles in the process, encouraging developers to adapt their skills as the field evolves.
AI Snips
Model vs. System Evaluation
- Evaluating an LLM model is different from evaluating an LLM-powered system that includes prompts and integrations.
- Builders must test the whole system on their use case, not only benchmark model scores.
Using LLMs As Evaluators
- LLM-as-judge uses an LLM to classify qualities of another LLM's output rather than to re-answer the prompt.
- This technique can scale human labeling but requires careful validation and design.
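The LLM-as-judge idea above can be sketched in a few lines. This is an illustrative sketch only, not Evidently AI's implementation: the prompt wording and the `call_llm` function are hypothetical stand-ins for a real chat-model API. The key point it demonstrates is that the judge classifies an existing answer instead of re-answering the question.

```python
# Hypothetical LLM-as-judge sketch. `call_llm` is a stand-in for any real
# chat-model API call; here it is stubbed so the sketch runs offline.

JUDGE_PROMPT = """You are evaluating an assistant's answer, not answering yourself.
Question: {question}
Answer: {answer}
Is the answer on-topic and appropriate? Reply with exactly one word: PASS or FAIL."""

def call_llm(prompt: str) -> str:
    # Placeholder for a real model call (e.g. an HTTP request to a chat API).
    # Stubbed to return PASS so the example is self-contained.
    return "PASS"

def judge(question: str, answer: str) -> bool:
    """Classify one output; True when the judge labels it PASS."""
    verdict = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    return verdict.strip().upper() == "PASS"
```

As the snip cautions, such a judge should itself be validated, typically by checking its labels against a sample of human-labeled outputs, before it is trusted at scale.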
Make Evaluation A Continuous Cycle
- Start evaluation early: iterate with manual 'vibe checks', then add automated scoring during experiments.
- Before production, expand to stress testing, red teaming, regression testing, and monitoring.
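The regression-testing step above can be sketched as re-scoring a fixed test set after every prompt or model change. This is a minimal illustration under assumed names: `score`, `REGRESSION_SET`, and the threshold are all hypothetical, and the scorer is stubbed where a real metric or LLM judge would go.

```python
# Minimal regression-check sketch: re-score a fixed set of cases after each
# change and fail if quality drops below a threshold. All names are
# illustrative, not from any specific library.

def score(question: str, answer: str) -> float:
    # Stand-in scorer; replace with an automated metric or an LLM-as-judge call.
    return 1.0 if answer else 0.0

REGRESSION_SET = [
    ("What is 2+2?", "4"),
    ("Name a primary color.", "Blue"),
]

def run_regression(threshold: float = 0.9) -> bool:
    """Return True when the mean score over the fixed test set meets the bar."""
    scores = [score(q, a) for q, a in REGRESSION_SET]
    return sum(scores) / len(scores) >= threshold
```

Running this check in CI after each change gives the continuous cycle the snip describes: experiments feed the test set, and production monitoring surfaces new failure cases to add to it.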
