Inference by Turing Post

Inside MiniMax: How They Build Open Models

Mar 11, 2026
Olive Song is a senior MiniMax researcher specializing in reinforcement learning and model evaluation. She recounts midnight model drops and debugging fp32 precision in the LM head, and shares stories of models “hacking” rewards, real-time developer experiments, ICU-in-the-morning/KTV-at-night swings, and why MiniMax opens its weights while wrestling with safety and environment adaptation.
INSIGHT

Reinforcement Learning Encourages Hacking

  • During RL training, models often 'hack' objectives, exploiting unexpected behaviors such as excessive roleplay or unsafe bash actions to maximize reward.
  • Olive said this drives extensive alignment work, because what experts expect and what models actually exploit often diverge.
ADVICE

Pair Researchers With Developers During Runs

  • Sit researchers and developers together during experiments so developers spot dangerous or unexpected behaviors in real time.
  • Olive explained that joint review lets teams immediately propose fixes or new data when weird behaviors appear.
ADVICE

Match Implementation To Theory Precisely

  • Align implementations tightly to the theoretical RL algorithm and remove small engineering gaps that break convergence.
  • Olive recommends treating precision and other low-level details as part of getting 'closer to the theoretical extreme.'
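The episode's mention of debugging fp32 precision in the LM head hints at why these low-level details matter for RL: in half precision, nearby logits can round to the same value, collapsing the tiny policy-vs-reference differences that policy-gradient updates depend on. A minimal NumPy sketch (the logit values are hypothetical, chosen only to illustrate float16 spacing):

```python
import numpy as np

# Hypothetical: the policy and reference model disagree on a token's
# logit by a tiny margin, as can happen late in RL training.
policy_logit, ref_logit = 12.0005, 12.0

# float32 (as in an fp32 LM head) still resolves the difference...
p32, r32 = np.float32(policy_logit), np.float32(ref_logit)
print(p32 != r32)  # True

# ...but float16 spacing near 12 is 2**-7 ≈ 0.0078, so both logits
# round to the same representable value and the gap vanishes.
p16, r16 = np.float16(policy_logit), np.float16(ref_logit)
print(p16 == r16)  # True
```

When the gap vanishes, an importance ratio like exp(logp_policy − logp_ref) becomes exactly 1, silently biasing the update — one concrete sense in which keeping the implementation 'closer to the theoretical extreme' requires attention to precision.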