Weaviate Podcast

Haize Labs with Leonard Tang - Weaviate Podcast #121!

May 12, 2025
Leonard Tang, co-founder of Haize Labs, explores techniques for AI evaluation. He explains how stacking weaker models can outperform a single stronger judge via the Verdict library, which reports a 10-20% improvement over traditional judge models. The conversation covers practical guidance on building contrastive evaluation sets and implementing debate-based judging systems. Tang also discusses balancing AI safety with user feedback, offering strategies to ensure AI systems meet enterprise needs.
ADVICE

Stack Weak Models to Scale Judging

  • Scale judge compute by stacking multiple weaker models in ensembles or debates to outperform a single stronger judge.
  • Use architectures inspired by scalable oversight for more reliable AI evaluations at reduced cost.
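The advice above can be sketched as a majority-vote ensemble. This is a minimal illustration, not the actual Verdict API: each judge is stubbed as a simple heuristic standing in for a call to a weaker LLM, and all names here (`Judge`, `ensemble_verdict`) are hypothetical.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical sketch: a "judge" is any callable mapping a
# (response, criteria) pair to a verdict label. In practice each
# judge would be a call to a weaker LLM; here they are stubs.
Judge = Callable[[str, str], str]

def ensemble_verdict(judges: List[Judge], response: str, criteria: str) -> str:
    """Aggregate many weak judges by majority vote."""
    votes = Counter(judge(response, criteria) for judge in judges)
    return votes.most_common(1)[0][0]

# Stub judges standing in for weaker models with different biases.
judges = [
    lambda r, c: "pass" if len(r) > 10 else "fail",    # verbosity-biased judge
    lambda r, c: "pass" if "because" in r else "fail", # reasoning-keyword judge
    lambda r, c: "pass",                               # lenient judge
]

print(ensemble_verdict(judges, "Correct, because X implies Y.", "factuality"))
# → pass (2-of-3 majority)
```

The key design point is that the aggregate verdict can be more reliable than any single stub, because the judges' individual biases partially cancel under voting.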
INSIGHT

Principles of Scalable Oversight

  • Scalable oversight involves probing a stronger model with weaker ones and reaching consensus via debate or voting.
  • These architectures let less capable models collaboratively evaluate more capable AI systems effectively.
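One way to picture the debate form of scalable oversight described above: two weak "advocates" argue for and against a candidate answer, and a third weak model rules on the transcript. This is a toy sketch with stub heuristics, not a real oversight protocol; all function names are hypothetical.

```python
# Hypothetical debate sketch: weak advocates probe a candidate answer
# from opposite sides, and a weak judge decides from the transcript.

def advocate_for(answer: str) -> str:
    return f"PRO: '{answer}' directly addresses the question."

def advocate_against(answer: str) -> str:
    return f"CON: '{answer}' cites no supporting evidence."

def weak_judge(transcript: tuple[str, str]) -> str:
    # Stub heuristic: favor whichever side produced the longer argument.
    # A real judge model would weigh the arguments' content instead.
    pro, con = transcript
    return "accept" if len(pro) >= len(con) else "reject"

answer = "Paris is the capital of France."
transcript = (advocate_for(answer), advocate_against(answer))
print(weak_judge(transcript))
```

The point is structural: no single weak model has to evaluate the answer outright; the debate decomposes evaluation into generating and comparing arguments, which weaker models handle more reliably.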
ADVICE

Flexible Debate Architectures for Evaluation

  • Use variants of debate architectures such as round-robin or parallel pros-cons aggregation to reduce bias and improve judgments.
  • Build declarative judge pipelines to flexibly combine evaluation strategies without custom coding.
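A declarative pipeline like the one described above might look as follows. This is a hedged sketch of the general idea, not Verdict's actual API: strategies are declared as plain data, and a small interpreter runs them, so aggregation schemes can be swapped without writing new code.

```python
from collections import Counter

# Hypothetical sketch of a declarative judge pipeline: the spec is
# plain data, and run_pipeline interprets it, so evaluation strategies
# recombine without custom coding.

def run_pipeline(spec: dict, response: str) -> str:
    verdicts = [judge(response) for judge in spec["judges"]]
    if spec["aggregate"] == "majority":
        return Counter(verdicts).most_common(1)[0][0]
    if spec["aggregate"] == "unanimous":
        return verdicts[0] if len(set(verdicts)) == 1 else "fail"
    raise ValueError(f"unknown aggregation: {spec['aggregate']}")

pipeline = {
    "judges": [
        lambda r: "pass" if r.strip() else "fail",      # non-empty check
        lambda r: "pass" if len(r) < 200 else "fail",   # conciseness check
    ],
    "aggregate": "unanimous",  # swap to "majority" without code changes
}

print(run_pipeline(pipeline, "A concise, grounded answer."))
# → pass
```

Because the spec is data, a round-robin or pros-cons variant is just another interpreter branch plus a new `"aggregate"` value, which is the flexibility the declarative approach buys.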