Machine Learning Street Talk (MLST)

The Secret Engine of AI - Prolific [Sponsored] (Sara Saab, Enzo Blindow)

Oct 18, 2025

Sara Saab, VP of Product at Prolific with a background in cognitive science, and Enzo Blindow, VP of Data and AI at Prolific and an expert in economics, discuss the pivotal role of human feedback in AI. They stress that non-deterministic AI systems require human oversight more than ever, since optimizing for benchmarks can give a misleading picture of real-world usability. Exploring the ecological context of intelligence, they advocate for a participatory approach to evaluation that captures social norms and emphasizes the importance of cultural alignment.

01:19:39

AI Snips

INSIGHT

Subjective 'Vibes' Need Representative Measures

'Vibes' are meaningful but hard to quantify; human opinion is necessary to measure subjective qualities like agreeableness.
Representative, stratified human samples and clear scales are required to avoid selection bias.
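
As a rough illustration of why stratification matters, the sketch below averages ratings within each demographic stratum and then weights the strata by their share of the target population. The function name, the strata, the 1-7 scale, and all numbers are hypothetical; nothing here describes how Prolific actually computes its scores.

```python
from collections import defaultdict


def stratified_mean(ratings, population_share):
    """Average ratings per stratum, then weight each stratum by its population share.

    ratings: iterable of (stratum, score) pairs from human raters.
    population_share: dict mapping stratum -> share of the target population.
    """
    scores_by_stratum = defaultdict(list)
    for stratum, score in ratings:
        scores_by_stratum[stratum].append(score)

    # Refuse to report a score if a stratum has no raters at all.
    missing = set(population_share) - set(scores_by_stratum)
    if missing:
        raise ValueError(f"no ratings collected for strata: {sorted(missing)}")

    total_share = sum(population_share.values())
    return sum(
        (share / total_share)
        * (sum(scores_by_stratum[s]) / len(scores_by_stratum[s]))
        for s, share in population_share.items()
    )


# Hypothetical data: younger raters are over-represented in the raw sample.
ratings = [("18-34", 6), ("18-34", 6), ("18-34", 6), ("35+", 2)]
shares = {"18-34": 0.5, "35+": 0.5}

naive = sum(score for _, score in ratings) / len(ratings)
weighted = stratified_mean(ratings, shares)
print(naive, weighted)
```

In this toy sample the unweighted mean is pulled toward the over-represented group, while the weighted figure treats both strata according to their assumed population share.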

INSIGHT

Evaluation Should Be Solution-Agnostic

Measurement is solution-agnostic: good evaluation lets teams optimize across architectures and datasets.
Agreeing on robust success metrics is the key constraint, not the modeling approach.

ADVICE

Design Leaderboards To Resist Gaming

Reduce gaming of leaderboards by balancing transparency and obfuscation, e.g., controlled private evals or noisy responses.
Consider differential-noise techniques or equal private access to limit gaming while preserving accountability.
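
One possible reading of "noisy responses" is that a private evaluation service returns each score with random noise added, so that repeated submissions reveal little about individual test items. The sketch below assumes Laplace noise (the distribution commonly used in differential privacy) and makes no claim to a formal privacy guarantee; `noisy_score`, the noise scale, and the score values are all hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_score(true_score, scale, rng):
    """Return the private eval score with noise added before it is shown."""
    return true_score + laplace_noise(scale, rng)


rng = random.Random(0)  # seeded only to make the example reproducible
true_score = 0.80       # hypothetical accuracy on a private test set

# A single response hides the exact score from the submitter...
single_response = noisy_score(true_score, 0.02, rng)

# ...but averaging many responses recovers it, which is why a real service
# would also need to cap or budget the number of queries per submitter.
responses = [noisy_score(true_score, 0.02, rng) for _ in range(10_000)]
mean_response = sum(responses) / len(responses)
print(single_response, mean_response)
```

The averaging step shows the trade-off named in the snip: noise alone limits what one query leaks, but accountability and query limits still have to come from how access is controlled.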

Why humans remain essential in AI development

01:53 • 2min

Balancing speed, cost, and quality in human-in-the-loop systems

04:11 • 49sec

Do machines really understand? The case for human verifiers

05:00 • 1min

Placing humans where they matter most in ML pipelines

06:18 • 1min

Can LLMs ever truly understand the world?

07:20 • 3min

Intelligence as an ecology, not a tower

10:47 • 1min

Human evaluation as participatory stakes

11:49 • 3min

Ethical and philosophical stakes of building AI

15:15 • 2min

Orchestration and the future of human-AI work

17:06 • 1min

Is human data a transitory training phase?

18:10 • 2min

Why benchmark scores can mislead on usability

19:57 • 49sec

Measuring subjective qualities like vibe and agreeableness

20:46 • 2min

Taxonomy and adaptive evaluation for diverse capabilities

22:54 • 1min

Benchmarking softer skills risks regressions elsewhere

24:02 • 2min

Why shared, robust success measures matter

25:40 • 2min

Preventing gaming of leaderboards and private evals

27:27 • 2min

Verifying evaluations: references, experts, and consensus

29:24 • 1min

Constructing a global rubric for 'good' is hard

30:35 • 39sec

Risks of misdesigning foundational models

31:14 • 2min

Can evaluations be transferable across model lineages?

33:17 • 1min

How constitutional AI and separation of powers scale alignment

34:31 • 3min

Treating human feedback as infrastructure

37:46 • 2min

Orchestrating experts: matching, verification, and incentives

39:48 • 1min

Validating scarce expertise at scale

41:08 • 5min

Designing humane work experiences for evaluators

45:51 • 2min

Reputation systems and matching engines to improve quality

47:30 • 3min

Anthropic's agentic misalignment experiments and blackmail

50:40 • 3min

Emergent utility functions and divergence from human intentions

53:15 • 37sec

Evaluation itself influences model behavior

53:52 • 4min

LM Arena, Chatbot Arena limits, and representativeness

57:52 • 2min

Grok 4: top benchmarks but poor conversational feel

01:00:16 • 2min

The Humane leaderboard: stratified human evaluations

01:02:03 • 2min

Cultural alignment varies across demographics

01:04:06 • 2min

Facts, policy, and personalization on an evaluation spectrum

01:06:16 • 2min

Quality over quantity: improving label fidelity

01:08:30 • 2min

Collective intelligence and durable cultural strata

01:10:47 • 1min

Apollo maturity curve for evaluation rigor

01:11:52 • 2min

Indirect impacts: models acting on humans require different metrics

01:13:58 • 2min

AI's self-perceived goals vs human expectations

01:16:00 • 3min

Layered oversight: agents, humans, and conditional evaluation

We sat down with Sara Saab (VP of Product at Prolific) and Enzo Blindow (VP of Data and AI at Prolific) to explore the critical role of human evaluation in AI development and the challenges of aligning AI systems with human values. Prolific is a human annotation and orchestration platform for AI used by many of the major AI labs. This is a sponsored show in partnership with Prolific.

**SPONSOR MESSAGES**

—

cyber•Fund https://cyber.fund/?utm_source=mlst is a founder-led investment firm accelerating the cybernetic economy

Oct SF conference - https://dagihouse.com/?utm_source=mlst - Joscha Bach keynoting(!) + OAI, Anthropic, NVDA,++

Hiring a SF VC Principal: https://talent.cyber.fund/companies/cyber-fund-2/jobs/57674170-ai-investment-principal#content?utm_source=mlst

Submit investment deck: https://cyber.fund/contact?utm_source=mlst

—

While technologists want to remove humans from the loop for speed and efficiency, these non-deterministic AI systems actually require more human oversight than ever before. Prolific's approach is to put "well-treated, verified, diversely demographic humans behind an API" - making human feedback as accessible as any other infrastructure service.

When AI models like Grok 4 achieve top scores on technical benchmarks but feel awkward or problematic to use in practice, it exposes the limitations of our current evaluation methods. The guests argue that optimizing for benchmarks may actually weaken model performance in other crucial areas, like cultural sensitivity or natural conversation.

We also discuss Anthropic's research showing that frontier AI models, when given goals and access to information, independently arrived at solutions involving blackmail - without any prompting toward unethical behavior. Even more concerning, the more sophisticated the model, the more susceptible it was to this "agentic misalignment."

Enzo and Sara present Prolific's "Humane" leaderboard as an alternative to existing benchmarking systems. By stratifying evaluations across diverse demographic groups, they reveal that different populations have vastly different experiences with the same AI models.

Looking ahead, the guests imagine a world where humans take on coaching and teaching roles for AI systems - similar to how we might correct a child or review code. This also raises important questions about working conditions and the evolution of labor in an AI-augmented world. Rather than replacing humans entirely, we may be moving toward more sophisticated forms of human-AI collaboration.

As AI tech becomes more powerful and general-purpose, the quality of human evaluation becomes more critical, not less. We need more representative evaluation frameworks that capture the messy reality of human values and cultural diversity.

Visit Prolific:

https://www.prolific.com/

Sara Saab (VP Product):

https://uk.linkedin.com/in/sarasaab

Enzo Blindow (VP Data & AI):

https://uk.linkedin.com/in/enzoblindow

TRANSCRIPT:

https://app.rescript.info/public/share/xZ31-0kJJ_xp4zFSC-bunC8-hJNkHpbm7Lg88RFcuLE

TOC:

[00:00:00] Intro & Background

[00:03:16] Human-in-the-Loop Challenges

[00:17:19] Can AIs Understand?

[00:32:02] Benchmarking & Vibes

[00:51:00] Agentic Misalignment Study

[01:03:00] Data Quality vs Quantity

[01:16:00] Future of AI Oversight

REFS:

Anthropic Agentic Misalignment

https://www.anthropic.com/research/agentic-misalignment

Value Compass

https://arxiv.org/pdf/2409.09586

Reasoning Models Don’t Always Say What They Think (Anthropic)

https://www.anthropic.com/research/reasoning-models-dont-say-think

https://assets.anthropic.com/m/71876fabef0f0ed4/original/reasoning_models_paper.pdf

Apollo research - science of evals blog post

https://www.apolloresearch.ai/blog/we-need-a-science-of-evals

Leaderboard Illusion

https://www.youtube.com/watch?v=9W_OhS38rIE MLST video

The Leaderboard Illusion [2025]

Shivalika Singh et al.

https://arxiv.org/abs/2504.20879

(Truncated, full list on YT)