Scaling Environments and Expert Rewards

Olive describes scaling diverse training environments and using in-house expert developers as reward models.

Episode notes

Olive Song from MiniMax shares how her team trains the M series frontier open-weight models using reinforcement learning, tight product feedback loops, and systematic environment perturbations. This crossover episode weaves together her AI Engineer Conference talk and an in-depth interview from the Inference podcast. Listeners will learn about interleaved thinking for long-horizon agentic tasks, fighting reward hacking, and why they moved RL training to FP32 precision. Olive also offers a candid look at debugging real-world LLM failures and how MiniMax uses AI agents to track the fast-moving AI landscape.

Nathan uses Granola to uncover blind spots in conversations and AI research. Try it at granola.ai/tcr with code TCR. If you’re already using it, test his blind spot recipe here: https://bit.ly/granolablindspot

LINKS:

Conference Talk (AI Engineer, Dec 2025) – https://www.youtube.com/watch?v=lY1iFbDPRlw
Interview (Turing Post, Jan 2026) – https://www.youtube.com/watch?v=GkUMqWeHn40

Sponsors:

Claude:

Claude is the AI collaborator that understands your entire workflow, from drafting and research to coding and complex problem-solving. Start tackling bigger problems with Claude and unlock Claude Pro’s full capabilities at https://claude.ai/tcr

Tasklet:

Tasklet is an AI agent that automates your work 24/7; just describe what you want in plain English and it gets the job done. Try it for free and use code COGREV for 50% off your first month at https://tasklet.ai

CHAPTERS:

(00:00) About the Episode

(04:15) MiniMax M2 presentation (Part 1)

(17:59) Sponsors: Claude | Tasklet

(21:22) MiniMax M2 presentation (Part 2)

(21:26) Research life and culture

(26:27) Alignment, safety and feedback

(32:01) Long-horizon coding agents

(35:57) Open models and evaluation

(43:29) M2.2 and researcher goals

(48:16) Continual learning and AGI

(52:58) Closing musical summary