"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis

Intelligence with Everyone: RL @ MiniMax, with Olive Song, from AIE NYC & Inference by Turing Post

Feb 22, 2026
Olive Song, a senior researcher in reinforcement learning at MiniMax who helped build the M-series open-weight models, discusses training M2 with RL, tight product feedback loops, and perturbation pipelines. She covers long-horizon agentic coding, reward-hacking and alignment challenges, the decision to use FP32 in RL, and using internal agents to track fast-moving research.
INSIGHT

Small Model, Big Agentic Focus

  • MiniMax M2 is an open-weight model with roughly 10B active parameters, optimized for coding and agentic workplace tasks.
  • Olive Song emphasizes that benchmark numbers don't capture real-world dynamics and developer experience.
ADVICE

Scale Environments And Expert Rewards

  • Scale environments and expert feedback together during RL to match real developer workflows.
  • Use in-house expert developers as reward models to align outputs with what practitioners trust.
INSIGHT

Interleaved Thinking For Noisy Environments

  • Interleaved thinking alternates tool calls with reasoning steps, letting the model act, observe, then re-think multiple times within a single trajectory.
  • This approach improves robustness in noisy, dynamic environments and supports long-horizon automation.
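The act-observe-rethink cycle described above can be sketched as a simple agent loop. This is an illustrative sketch only, not MiniMax's actual implementation or API; all function and field names here are hypothetical:

```python
# Minimal sketch of an interleaved-thinking agent loop: the model
# alternates reasoning ("think") with tool calls ("act"), folding each
# observation back into the context before reasoning again.
# All names are illustrative, not MiniMax's actual API.

def interleaved_agent(task, model, tools, max_steps=8):
    context = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = model(context)              # returns thinking + optional tool call
        context.append({"role": "assistant", "content": step["thinking"]})
        if step.get("tool_call") is None:  # model decided it is done
            return step["thinking"]
        tool = tools[step["tool_call"]["name"]]
        observation = tool(**step["tool_call"]["args"])
        # The observation re-enters the context, so the next thinking step
        # can react to a noisy or changing environment.
        context.append({"role": "tool", "content": str(observation)})
    return None  # step budget exhausted
```

Because every observation is appended before the next thinking step, the model can revise its plan mid-task, which is what makes this pattern suited to noisy, long-horizon environments.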