Weaviate Podcast

Haize Labs with Leonard Tang - Weaviate Podcast #121!

May 12, 2025
Leonard Tang, co-founder of Haize Labs, explores techniques for AI evaluation. He explains how stacking weaker models can outperform a single stronger judge via the Verdict library, which reports a 10-20% improvement over traditional judge models. The conversation covers practical guidance on building contrastive evaluation sets and implementing debate-based judging systems. Tang also discusses balancing AI safety with user feedback, offering strategies to ensure AI systems meet enterprise needs.
ADVICE

Stack Weak Models to Scale Judging

  • Scale judge compute by stacking multiple weaker models in ensembles or debates to outperform a single stronger judge.
  • Use architectures inspired by scalable oversight for more reliable AI evaluations at reduced cost.
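The advice above can be sketched as a majority-vote ensemble. This is a minimal illustration, not the actual Verdict API: each judge is stubbed as a simple heuristic standing in for a call to a weaker LLM, and all names here (`Judge`, `ensemble_verdict`) are hypothetical.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical sketch: a "judge" is any callable mapping a
# (response, criteria) pair to a verdict label. In practice each
# judge would be a call to a weaker LLM; here they are stubs.
Judge = Callable[[str, str], str]

def ensemble_verdict(judges: List[Judge], response: str, criteria: str) -> str:
    """Aggregate many weak judges by majority vote."""
    votes = Counter(judge(response, criteria) for judge in judges)
    return votes.most_common(1)[0][0]

# Stub judges standing in for weaker models with different biases.
judges = [
    lambda r, c: "pass" if len(r) > 10 else "fail",    # verbosity-biased judge
    lambda r, c: "pass" if "because" in r else "fail", # reasoning-keyword judge
    lambda r, c: "pass",                               # lenient judge
]

print(ensemble_verdict(judges, "Correct, because X implies Y.", "factuality"))
# → pass (2-of-3 majority)
```

The key design point is that the aggregate verdict can be more reliable than any single stub, because the judges' individual biases partially cancel under voting.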
INSIGHT

Principles of Scalable Oversight

  • Scalable oversight involves probing a stronger model with weaker ones and reaching consensus via debate or voting.
  • These architectures let less capable models collaboratively evaluate more capable AI systems effectively.
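One way to picture the debate form of scalable oversight described above: two weak "advocates" argue for and against a candidate answer, and a third weak model rules on the transcript. This is a toy sketch with stub heuristics, not a real oversight protocol; all function names are hypothetical.

```python
# Hypothetical debate sketch: weak advocates probe a candidate answer
# from opposite sides, and a weak judge decides from the transcript.

def advocate_for(answer: str) -> str:
    return f"PRO: '{answer}' directly addresses the question."

def advocate_against(answer: str) -> str:
    return f"CON: '{answer}' cites no supporting evidence."

def weak_judge(transcript: tuple[str, str]) -> str:
    # Stub heuristic: favor whichever side produced the longer argument.
    # A real judge model would weigh the arguments' content instead.
    pro, con = transcript
    return "accept" if len(pro) >= len(con) else "reject"

answer = "Paris is the capital of France."
transcript = (advocate_for(answer), advocate_against(answer))
print(weak_judge(transcript))
```

The point is structural: no single weak model has to evaluate the answer outright; the debate decomposes evaluation into generating and comparing arguments, which weaker models handle more reliably.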
ADVICE

Flexible Debate Architectures for Evaluation

  • Use variants of debate architectures such as round-robin or parallel pros-cons aggregation to reduce bias and improve judgments.
  • Build declarative judge pipelines to flexibly combine evaluation strategies without custom coding.
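A declarative pipeline like the one described above might look as follows. This is a hedged sketch of the general idea, not Verdict's actual API: strategies are declared as plain data, and a small interpreter runs them, so aggregation schemes can be swapped without writing new code.

```python
from collections import Counter

# Hypothetical sketch of a declarative judge pipeline: the spec is
# plain data, and run_pipeline interprets it, so evaluation strategies
# recombine without custom coding.

def run_pipeline(spec: dict, response: str) -> str:
    verdicts = [judge(response) for judge in spec["judges"]]
    if spec["aggregate"] == "majority":
        return Counter(verdicts).most_common(1)[0][0]
    if spec["aggregate"] == "unanimous":
        return verdicts[0] if len(set(verdicts)) == 1 else "fail"
    raise ValueError(f"unknown aggregation: {spec['aggregate']}")

pipeline = {
    "judges": [
        lambda r: "pass" if r.strip() else "fail",      # non-empty check
        lambda r: "pass" if len(r) < 200 else "fail",   # conciseness check
    ],
    "aggregate": "unanimous",  # swap to "majority" without code changes
}

print(run_pipeline(pipeline, "A concise, grounded answer."))
# → pass
```

Because the spec is data, a round-robin or pros-cons variant is just another interpreter branch plus a new `"aggregate"` value, which is the flexibility the declarative approach buys.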