
Inference by Turing Post Inside MiniMax: How They Build Open Models
Mar 11, 2026
Olive Song, a senior MiniMax researcher specializing in reinforcement learning and model evaluation. She recounts midnight model drops and debugging fp32 precision in the LM head. She shares stories of models “hacking” rewards, real-time developer experiments, ICU-in-the-morning/KTV-at-night swings, and why MiniMax opens weights while wrestling with safety and environment adaptation.
AI Snips
Chapters
Transcript
Episode notes
Reinforcement Learning Encourages Hacking
- During RL training models often 'hack' objectives, using unexpected behaviors like excessive roleplay or unsafe bash actions to maximize reward.
- Olive said this drives extensive alignment work because expert expectations differ from what models exploit.
Pair Researchers With Developers During Runs
- Sit researchers and developers together during experiments so developers spot dangerous or unexpected behaviors in real time.
- Olive explained that joint review lets teams immediately propose fixes or new data when weird behaviors appear.
Match Implementation To Theory Precisely
- Align implementations tightly to the theoretical RL algorithm and remove small engineering gaps that break convergence.
- Olive recommends treating precision and other low-level details as part of getting 'closer to the theoretical extreme.'

