Equity

The PhD students who became the judges of the AI industry

Mar 18, 2026
Wei-Lin Chiang, Arena co-founder and CTO, built evaluation systems for LLMs and agents; Anastasios Angelopoulos, Arena co-founder and former UC Berkeley PhD, created benchmarking platforms. They discuss how Arena measures real-world intelligence, preserves reproducibility and neutrality despite big-lab funding, and expands from chat to agents, coding, and expert leaderboards.
INSIGHT

Why Continuous User Data Beats Static Benchmarks

  • Static benchmarks lose value over time because models can memorize their questions and overfit to them.
  • Arena instead draws on a continuous stream of millions of real user interactions, so the test distribution keeps refreshing and models cannot "train to the test."
ADVICE

Publish Pipeline And Confidence Intervals

  • For reproducibility, open-source the evaluation pipeline and publish confidence intervals so leaderboard rankings are statistically interpretable.
  • Arena publishes its pipeline and collects data at a scale where its ranking estimator converges, reporting reliability metrics alongside scores.
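The confidence-interval advice can be illustrated with a toy bootstrap over pairwise "battle" votes, the kind of head-to-head data Arena-style leaderboards aggregate. The model names, vote counts, and win probabilities below are invented for illustration; this is a minimal sketch, not Arena's actual estimator.

```python
import random

random.seed(0)
MODELS = ["model-x", "model-y", "model-z"]
# Hypothetical "true" probabilities that the first model beats the second.
TRUE_WIN = {("model-x", "model-y"): 0.65,
            ("model-x", "model-z"): 0.80,
            ("model-y", "model-z"): 0.70}

def simulate_battles(n):
    """Generate n synthetic head-to-head votes (model_a, model_b, winner)."""
    battles = []
    for _ in range(n):
        a, b = random.sample(MODELS, 2)
        p = TRUE_WIN.get((a, b), 1 - TRUE_WIN.get((b, a), 0.5))
        battles.append((a, b, a if random.random() < p else b))
    return battles

def win_rates(battles):
    """Fraction of battles each model won, out of the battles it appeared in."""
    wins = {m: 0 for m in MODELS}
    games = {m: 0 for m in MODELS}
    for a, b, w in battles:
        games[a] += 1
        games[b] += 1
        wins[w] += 1
    return {m: wins[m] / games[m] for m in MODELS if games[m]}

def bootstrap_ci(battles, model, iters=500, alpha=0.05):
    """Percentile bootstrap CI: resample battles with replacement,
    recompute the win rate each time, and take the alpha/2 quantiles."""
    samples = []
    for _ in range(iters):
        resampled = [random.choice(battles) for _ in range(len(battles))]
        samples.append(win_rates(resampled)[model])
    samples.sort()
    lo = samples[int(alpha / 2 * iters)]
    hi = samples[int((1 - alpha / 2) * iters) - 1]
    return lo, hi

battles = simulate_battles(2000)
rates = win_rates(battles)
for m in MODELS:
    lo, hi = bootstrap_ci(battles, m)
    print(f"{m}: win rate {rates[m]:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Publishing the interval, not just the point estimate, is what makes two adjacent leaderboard entries interpretable: if their intervals overlap heavily, the ranking between them is not yet statistically settled.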
INSIGHT

Require Production Parity For Public Rankings

  • Arena requires that publicly evaluated models be identical to what providers actually ship, preventing specially tuned "benchmark" variants.
  • To be listed on the public leaderboard, providers must expose the same production endpoint their users get, preserving neutrality.