The Information Bottleneck

EP26: Measuring Intelligence in the Wild - Arena and the Future of AI Evaluation

Feb 24, 2026
Anastasios Angelopoulos, co-founder and CEO of Arena AI and a theoretical statistician, explains why static benchmarks fail and how large-scale human-preference leaderboards work. He discusses style control versus substance, measuring AI-generated "slop," tool-use and code evaluation, and how real-user testing and rigorous statistics shape model leaderboards and pre-release testing.
AI Snips
INSIGHT

Style Versus Substance Is A Hard Causal Problem

  • Disentangling style from substance is a hard causal-inference problem, and the style-control regressions rely on pre-specified style features (see the sketch after this list).
  • Arena acknowledges the approach is imperfect and runs active research on better feature representations and causal methods.
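To make the regression setup concrete, here is a minimal Python sketch of a Bradley-Terry-style logistic regression that jointly fits per-model skill and coefficients for pre-specified style features. The battle data, skill values, and style features are all simulated for illustration; this is the general technique, not Arena's actual pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_battles, n_models, n_style = 500, 4, 2

# Each battle pits model_a against model_b; the design matrix gets +1 in
# model_a's column and -1 in model_b's (a Bradley-Terry encoding).
model_a = rng.integers(0, n_models, n_battles)
model_b = (model_a + rng.integers(1, n_models, n_battles)) % n_models
X_models = np.zeros((n_battles, n_models))
X_models[np.arange(n_battles), model_a] = 1.0
X_models[np.arange(n_battles), model_b] = -1.0

# Pre-specified style features, encoded as differences between the two
# responses (e.g., length difference, header-count difference). Simulated.
X_style = rng.normal(size=(n_battles, n_style))

# Simulate votes where both true skill and style sway the outcome.
true_skill = np.array([0.0, 0.5, 1.0, 1.5])
true_style = np.array([0.8, -0.3])
logits = X_models @ true_skill + X_style @ true_style
y = (rng.random(n_battles) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

# Jointly fitting skill and style coefficients lets the style columns absorb
# stylistic preference, so the skill coefficients are style-controlled.
X = np.hstack([X_models, X_style])
fit = LogisticRegression(fit_intercept=False, C=10.0, max_iter=1000).fit(X, y)
print("style-controlled skills:", np.round(fit.coef_[0, :n_models], 2))
print("style effects:", np.round(fit.coef_[0, n_models:], 2))
```

The hard causal part is exactly what this sketch glosses over: if the chosen style features misrepresent what voters actually react to, the skill estimates remain confounded.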
ADVICE

Prioritize Expert Voters And Incentivize Quality

  • When collecting human-preference data, prioritize identifying high-quality voters and incentivizing careful votes to reduce noise (a weighting sketch follows this list).
  • Arena curates expert users and plans private/personal leaderboards to focus votes on topics users actually know.
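One simple way to operationalize voter quality, sketched below: estimate each voter's agreement rate on "gold" prompts where a clearly better answer is known, shrink it toward a prior, and use the result as a vote weight. The voter names, data, and gold-prompt mechanism are invented for illustration, not Arena's system.

```python
# Hypothetical vote log: per-voter agreement (1/0) with the known-better
# answer on a handful of gold prompts. Names and data are invented.
votes_on_gold = {
    "voter_a": [1, 1, 1, 1, 0],  # 4/5 agreement
    "voter_b": [1, 0, 0, 1, 0],  # 2/5, likely noisy
    "voter_c": [1, 1, 1, 1, 1],  # 5/5
}

def voter_weight(agreements, prior_hits=1, prior_misses=1):
    """Agreement rate shrunk toward 0.5 by a Beta(1, 1) prior, so voters
    with only a few gold votes are not over-trusted."""
    hits = sum(agreements)
    n = len(agreements)
    return (hits + prior_hits) / (n + prior_hits + prior_misses)

weights = {v: voter_weight(a) for v, a in votes_on_gold.items()}
print(weights)  # ~{'voter_a': 0.71, 'voter_b': 0.43, 'voter_c': 0.86}
```

Such weights could then be passed as sample_weight when fitting the preference model, downweighting noisy voters rather than discarding them outright.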
INSIGHT

Agentic Tool Use Reveals Execution Weaknesses

  • Arena's Code Arena supports agentic tool use: models plan and execute multi-step tool calls, giving a richer evaluation than single-turn generation (a minimal loop sketch follows this list).
  • Failed tool calls are visible because they often block the final artifact (such as compilable code), which strongly affects preference votes.
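The sketch below shows the shape of such an evaluation loop: a stub model plans tool calls, a harness executes them, and the episode is scored by whether the final artifact compiles. The next_action interface, write_file tool, and StubModel are hypothetical stand-ins, not Code Arena's API.

```python
import os
import subprocess
import sys
import tempfile

def run_tool(name, args):
    """Hypothetical tool executor; a real harness would expose shell,
    search, file I/O, etc., and return their outputs to the model."""
    if name == "write_file":
        with open(args["path"], "w") as f:
            f.write(args["content"])
        return f"wrote {args['path']}"
    return f"unknown tool: {name}"

def evaluate_episode(model, max_steps=8):
    """Run a multi-step tool loop and score the final artifact."""
    history = []
    for _ in range(max_steps):
        action = model.next_action(history)  # model plans the next step
        if action["type"] == "final":
            artifact = action["code"]
            break
        result = run_tool(action["name"], action["args"])
        history.append((action, result))  # failed calls surface in history
    else:
        return {"compiled": False, "reason": "no final artifact"}

    # A failed tool call upstream typically shows up here as a non-compiling
    # artifact, which is what sways preference votes against the model.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(artifact)
        path = f.name
    proc = subprocess.run([sys.executable, "-m", "py_compile", path],
                          capture_output=True, text=True)
    os.unlink(path)
    return {"compiled": proc.returncode == 0, "reason": proc.stderr.strip()}

class StubModel:
    """Toy stand-in for a model: one tool call, then a final artifact."""
    def __init__(self):
        self.step = 0

    def next_action(self, history):
        self.step += 1
        if self.step == 1:
            path = os.path.join(tempfile.gettempdir(), "notes.txt")
            return {"type": "tool", "name": "write_file",
                    "args": {"path": path, "content": "plan: print greeting"}}
        return {"type": "final", "code": "print('hello')"}

print(evaluate_episode(StubModel()))  # {'compiled': True, 'reason': ''}
```

Because a single bad tool call can derail everything downstream, this setup naturally penalizes execution weaknesses that single-turn generation benchmarks never exercise.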