
Equity: The PhD students who became the judges of the AI industry
Mar 18, 2026
Guests: Wei-Lin Chiang, Arena co-founder and CTO, who built evaluation systems for LLMs and agents, and Anastasios Angelopoulos, Arena co-founder and former UC Berkeley PhD, who created benchmarking platforms. They discuss how Arena measures real-world intelligence, how it preserves reproducibility and neutrality despite funding from big labs, and how it is expanding from chat to agents, coding, and expert leaderboards.
AI Snips
Why Continuous User Data Beats Static Benchmarks
- Static benchmarks lose value over time because models can memorize their questions and overfit to them.
- Arena instead draws on a continuous stream of millions of real-user interactions, so the test distribution keeps refreshing and models cannot "train to the test" (a minimal sketch of the idea follows below).
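
To make the contrast concrete, here is a minimal Python sketch (not Arena's actual code) of Elo-style ratings updated online from a stream of pairwise user votes; the model names and the K step size are placeholder assumptions. Because new votes keep arriving, there is no fixed question set for a model to memorize.

```python
# Minimal sketch, not Arena's implementation: Elo-style ratings updated online
# from a stream of real-user pairwise votes. Model names and K are placeholders.
from collections import defaultdict

K = 4.0                                # assumed update step size
ratings = defaultdict(lambda: 1000.0)  # every model starts at a common baseline

def record_vote(winner: str, loser: str) -> None:
    """Update both ratings after one user prefers `winner` over `loser`."""
    expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
    ratings[winner] += K * (1.0 - expected_win)
    ratings[loser] -= K * (1.0 - expected_win)

# Each incoming vote nudges the leaderboard; there is no static test set to overfit.
record_vote("model-a", "model-b")
record_vote("model-b", "model-c")
print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```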
Publish Pipeline And Confidence Intervals
- To keep results reproducible, open-source the evaluation pipeline and publish confidence intervals so leaderboard rankings are statistically interpretable.
- Arena publishes its pipeline, reports reliability metrics, and collects data at a scale where its rating estimator converges (illustrated in the sketch below).
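
As a hedged illustration of the confidence-interval point (this is not the published pipeline), the snippet below bootstraps an interval for a model's overall win rate over recorded battles; the outcome data is made up, and the example shows the interval tightening as the number of battles grows.

```python
# Hedged sketch, not Arena's published pipeline: bootstrap a confidence interval
# for a model's win rate over pairwise battles. The outcome data is fabricated
# purely to show how the interval narrows as the sample grows.
import random

def bootstrap_win_rate_ci(outcomes, n_resamples=2000, alpha=0.05):
    """outcomes: 1 if the model won a battle, 0 if it lost."""
    stats = []
    for _ in range(n_resamples):
        sample = random.choices(outcomes, k=len(outcomes))  # resample with replacement
        stats.append(sum(sample) / len(sample))
    stats.sort()
    lower = stats[int(alpha / 2 * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

few_battles = [1, 0, 1, 1, 0] * 10      # 50 battles -> wide interval
many_battles = [1, 0, 1, 1, 0] * 2000   # 10,000 battles -> narrow interval
print(bootstrap_win_rate_ci(few_battles))
print(bootstrap_win_rate_ci(many_battles))
```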
Require Production Parity For Public Rankings
- Arena requires that publicly evaluated models be identical to what providers actually ship, ruling out specialized benchmark-only variants.
- To preserve neutrality, providers must expose the same production endpoint their users get in order to be listed on the public leaderboard (see the sketch below).
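
Purely as an illustration of the parity idea (the endpoint URLs, request shape, and `model` response field are hypothetical, not Arena's actual process), the sketch below spot-checks that the endpoint being ranked reports the same model identifier as the provider's production endpoint.

```python
# Illustrative only: spot-check that the publicly ranked endpoint and the
# provider's production endpoint report the same model identifier.
# URLs, payload shape, and the "model" field are hypothetical assumptions.
import requests

def same_served_model(eval_url: str, prod_url: str, prompt: str, api_key: str) -> bool:
    headers = {"Authorization": f"Bearer {api_key}"}
    payload = {"messages": [{"role": "user", "content": prompt}]}
    eval_model = requests.post(eval_url, json=payload, headers=headers).json().get("model")
    prod_model = requests.post(prod_url, json=payload, headers=headers).json().get("model")
    return eval_model is not None and eval_model == prod_model
```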

