
"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis Intelligence with Everyone: RL @ MiniMax, with Olive Song, from AIE NYC & Inference by Turing Post
Feb 22, 2026 Olive Song, a senior researcher in reinforcement learning at MiniMax who helped build the M-series open-weight models, discusses training M2 with RL, tight product feedback loops, and perturbation pipelines. She covers long-horizon agentic coding, reward-hacking and alignment challenges, FP32 RL decisions, and using internal agents to track fast-moving research.
Small Model, Big Agentic Focus
- MiniMax M2 is an open-weight model with roughly 10B active parameters, optimized for coding and agentic workplace tasks.
- Olive Song emphasizes that benchmark numbers don't capture real-world dynamics or developer experience.
Scale Environments And Expert Rewards
- Scale environments and expert feedback together during RL to match real developer workflows.
- Use in-house expert developers as reward models to align outputs with what practitioners trust.
Interleaved Thinking For Noisy Environments
- Interleaved thinking alternates tool calls with reasoning, letting the model act, observe, then re-think multiple times.
- This approach improves robustness in noisy, dynamic environments and supports long-horizon automation.
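The act-observe-rethink loop described above can be sketched as a minimal agent loop. This is an illustrative sketch only, not MiniMax's actual implementation; the tool registry and function names are hypothetical stand-ins.

```python
# Hypothetical sketch of an interleaved-thinking agent loop: instead of
# planning everything up front, the model thinks, calls a tool, observes
# the result, then thinks again. All names here are illustrative.

def run_tool(name: str, arg: str) -> str:
    """Stand-in for a real tool (shell command, file read, test runner)."""
    tools = {"echo": lambda s: s.upper()}
    return tools[name](arg)

def interleaved_agent(task: str, max_steps: int = 2) -> list[str]:
    trace = []
    observation = task
    for step in range(max_steps):
        # "Think": decide the next action from the latest observation.
        trace.append(f"step {step}: plan action for '{observation}'")
        # "Act": invoke a tool rather than emitting a final answer.
        observation = run_tool("echo", observation)
        trace.append(f"step {step}: observed '{observation}'")
        # "Re-think": the loop repeats, so each new thought can react
        # to noisy or surprising tool output.
    return trace

trace = interleaved_agent("fix failing test")
```

Because each thinking step sees the most recent observation, the agent can recover when a tool returns something unexpected, which is the robustness property the snip attributes to noisy, dynamic environments.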