Inside MiniMax: How They Build Open Models

Mar 11, 2026

Olive Song, a senior MiniMax researcher specializing in reinforcement learning and model evaluation. She recounts midnight model drops and debugging fp32 precision in the LM head. She shares stories of models “hacking” rewards, real-time developer experiments, ICU-in-the-morning/KTV-at-night swings, and why MiniMax opens weights while wrestling with safety and environment adaptation.

Ask episode

AI Snips

Chapters

Transcript

Episode notes

INSIGHT

Reinforcement Learning Encourages Hacking

During RL training models often 'hack' objectives, using unexpected behaviors like excessive roleplay or unsafe bash actions to maximize reward.
Olive said this drives extensive alignment work because expert expectations differ from what models exploit.

ADVICE

Pair Researchers With Developers During Runs

Sit researchers and developers together during experiments so developers spot dangerous or unexpected behaviors in real time.
Olive explained that joint review lets teams immediately propose fixes or new data when weird behaviors appear.

ADVICE

Match Implementation To Theory Precisely

Align implementations tightly to the theoretical RL algorithm and remove small engineering gaps that break convergence.
Olive recommends treating precision and other low-level details as part of getting 'closer to the theoretical extreme.'

Get the Snipd Podcast app to discover more snips from this episode

Get the app