
Equity: The PhD students who became the judges of the AI industry
Mar 18, 2026
Guests: Wei-Lin Chiang, Arena co-founder and CTO, who built evaluation systems for LLMs and agents, and Anastasios Angelopoulos, Arena co-founder and former UC Berkeley PhD, who created benchmarking platforms. They discuss how Arena measures real-world intelligence, how it preserves reproducibility and neutrality despite funding from big labs, and how it is expanding from chat to agents, coding, and expert leaderboards.
AI Snips
Why Continuous User Data Beats Static Benchmarks
- Static benchmarks lose value over time because models can memorize their questions and overfit to them.
- Arena instead draws on a continuous stream of millions of real-user interactions, so the test distribution keeps refreshing and models cannot "train to the test" (a minimal sketch of the idea follows below).
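
To make the contrast concrete, here is a minimal Python sketch (not Arena's actual code) of Elo-style ratings updated online from a stream of pairwise user votes; the model names and the K step size are placeholder assumptions. Because new votes keep arriving, there is no fixed question set for a model to memorize.

```python
# Minimal sketch, not Arena's implementation: Elo-style ratings updated online
# from a stream of real-user pairwise votes. Model names and K are placeholders.
from collections import defaultdict

K = 4.0                                # assumed update step size
ratings = defaultdict(lambda: 1000.0)  # every model starts at a common baseline

def record_vote(winner: str, loser: str) -> None:
    """Update both ratings after one user prefers `winner` over `loser`."""
    expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
    ratings[winner] += K * (1.0 - expected_win)
    ratings[loser] -= K * (1.0 - expected_win)

# Each incoming vote nudges the leaderboard; there is no static test set to overfit.
record_vote("model-a", "model-b")
record_vote("model-b", "model-c")
print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```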
Publish Pipeline And Confidence Intervals
- To keep results reproducible, open-source the evaluation pipeline and publish confidence intervals so leaderboard rankings are statistically interpretable.
- Arena publishes its pipeline, reports reliability metrics, and collects data at a scale where its rating estimator converges (illustrated in the sketch below).
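
As a hedged illustration of the confidence-interval point (this is not the published pipeline), the snippet below bootstraps an interval for a model's overall win rate over recorded battles; the outcome data is made up, and the example shows the interval tightening as the number of battles grows.

```python
# Hedged sketch, not Arena's published pipeline: bootstrap a confidence interval
# for a model's win rate over pairwise battles. The outcome data is fabricated
# purely to show how the interval narrows as the sample grows.
import random

def bootstrap_win_rate_ci(outcomes, n_resamples=2000, alpha=0.05):
    """outcomes: 1 if the model won a battle, 0 if it lost."""
    stats = []
    for _ in range(n_resamples):
        sample = random.choices(outcomes, k=len(outcomes))  # resample with replacement
        stats.append(sum(sample) / len(sample))
    stats.sort()
    lower = stats[int(alpha / 2 * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

few_battles = [1, 0, 1, 1, 0] * 10      # 50 battles -> wide interval
many_battles = [1, 0, 1, 1, 0] * 2000   # 10,000 battles -> narrow interval
print(bootstrap_win_rate_ci(few_battles))
print(bootstrap_win_rate_ci(many_battles))
```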
Require Production Parity For Public Rankings
- Arena requires that publicly evaluated models be identical to what providers actually ship, ruling out specialized benchmark-only variants.
- To preserve neutrality, providers must expose the same production endpoint their users get in order to be listed on the public leaderboard (see the sketch below).
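
Purely as an illustration of the parity idea (the endpoint URLs, request shape, and `model` response field are hypothetical, not Arena's actual process), the sketch below spot-checks that the endpoint being ranked reports the same model identifier as the provider's production endpoint.

```python
# Illustrative only: spot-check that the publicly ranked endpoint and the
# provider's production endpoint report the same model identifier.
# URLs, payload shape, and the "model" field are hypothetical assumptions.
import requests

def same_served_model(eval_url: str, prod_url: str, prompt: str, api_key: str) -> bool:
    headers = {"Authorization": f"Bearer {api_key}"}
    payload = {"messages": [{"role": "user", "content": prompt}]}
    eval_model = requests.post(eval_url, json=payload, headers=headers).json().get("model")
    prod_model = requests.post(prod_url, json=payload, headers=headers).json().get("model")
    return eval_model is not None and eval_model == prod_model
```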

