Latent Space: The AI Engineer Podcast

Captaining IMO Gold, Deep Think, On-Policy RL, Feeling the AGI in Singapore — Yi Tay

Jan 23, 2026
Yi Tay is a DeepMind researcher who co-led the IMO Gold project and built the Reasoning & AGI team in Singapore. He recounts training Gemini Deep Think, the live IMO Gold push, the shift from symbolic systems to end-to-end RL, debates on on-policy versus off-policy learning, and the role of self-consistency and data efficiency in unlocking reasoning.

On-Policy Training Mirrors Human Learning

  • On-policy RL trains models on their own generations and rewards them for outcomes rather than copying others' trajectories.
  • Yi Tay says this mirrors human learning and generalizes better than imitation learning.
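The contrast above can be sketched with a toy policy-gradient loop: the model samples from its own distribution and is reinforced only by the outcome reward, rather than imitating a fixed reference trajectory. This is a minimal illustrative sketch, not the Deep Think training setup; the two-action policy, the reward function, and all hyperparameters are invented for the example.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def outcome_reward(action):
    # Hypothetical outcome reward: only action 1 (the "correct answer") pays off.
    return 1.0 if action == 1 else 0.0

def train_on_policy(steps=2000, lr=0.1, seed=0):
    """REINFORCE on the policy's own samples (on-policy, outcome-rewarded)."""
    rng = random.Random(seed)
    logits = [0.0, 0.0]
    for _ in range(steps):
        probs = softmax(logits)
        # On-policy: the action comes from the model's own current distribution.
        action = 0 if rng.random() < probs[0] else 1
        r = outcome_reward(action)
        # Policy gradient of log pi(action): (1[i == action] - p_i), scaled by reward.
        for i in range(2):
            grad = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * r * grad
    return softmax(logits)
```

Because the update is driven by the reward on the model's own generations, probability mass shifts toward the rewarded behavior; an imitation learner would instead copy whatever trajectories the dataset happened to contain.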

Update Quickly When Core Beliefs Break

  • When you're proven wrong, update aggressively instead of minimally adjusting priors from habit or stubbornness.
  • Yi Tay recommends increasing your learning rate when key assumptions fail.

Self-Consistency Unlocks Better Reasoning

  • Self-consistency and sampling multiple chains are fundamental for reasoning improvements and are used during RL training.
  • Yi Tay describes parallel reasoning, majority-voting variants and internal verification as core techniques.
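The majority-voting variant mentioned above can be sketched in a few lines: sample several reasoning chains independently, extract each chain's final answer, and return the most common one. The sampling and answer-extraction steps are stubbed out here; the example answers are invented for illustration.

```python
from collections import Counter

def self_consistency(answers):
    """Majority vote over final answers from independently sampled chains.

    `answers` stands in for the result of sampling N reasoning chains from
    the model and extracting each chain's final answer.
    """
    return Counter(answers).most_common(1)[0][0]

# Hypothetical run: five sampled chains whose final answers disagree.
chains = ["42", "41", "42", "42", "43"]
voted = self_consistency(chains)  # majority answer wins
```

The intuition is that independent chains tend to agree on correct answers more often than on any single wrong one, so the vote filters out idiosyncratic reasoning errors; internal verification plays a similar filtering role without needing multiple samples.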