Models hacking behaviors during RL

Olive discusses how models 'hack' environments during reinforcement learning and the need for alignment and constraints.

Episode notes

First Western interview with a senior MiniMax researcher. Olive Song explains how they actually build models that work.

When MiniMax's RL training wouldn't converge, they debugged layer by layer until they found it: fp32 precision in the LM head. When their models learned to "hack" during training, exploiting loopholes to maximize rewards, they had to rethink alignment from scratch. When benchmarks said their models were good but production said otherwise, they discovered the problem: environment adaptation.

Olive talks about working at a pace where new models drop at midnight and you test them at midnight. How they use an internal AI agent to read every new paper published overnight. Why they sit with developers during experiments to catch dangerous behaviors in real-time. What "ICU in the morning, KTV at night" means when results swing wildly. How problem-solving becomes discovery when you're debugging behaviors no one has seen before.

This is how Chinese labs are moving fast: first-principles thinking, engineering discipline, and a willingness to work whenever the experiments demand it.

We spoke on Sunday at 9 pm Beijing time. Olive was still waiting for results from new model experiments, so my first question was obvious: does everyone at the company work like this?

*Follow on*: https://www.turingpost.com/

*Did you like the episode? You know the drill:*

📌 Subscribe for more conversations with the builders shaping real-world AI.

💬 Leave a comment if this resonated.

👍 Like it if you liked it.

🫶 Thank you for watching and sharing!

*Guest:* Olive Song, Senior Researcher at MiniMax

MiniMax: https://www.minimaxi.com/

Models: https://huggingface.co/MiniMaxAI

*Links:*

vLLM: https://github.com/vllm-project/vllm

SGLang: https://github.com/sgl-project/sglang

📰 Transcript: https://www.turingpost.com/olive

Chapters:

0:00 – Reinforcement Learning and Unexpected Model Behaviors

3:08 – Roleplay, Alignment, and “AI with Everyone”

4:02 – How AI Changes Daily Life and Productivity

4:59 – Inside MiniMax: How Researchers and Engineers Work Together

5:32 – Human Alignment and Safety in Open Models

6:16 – Why Engineering Details Matter More Than Algorithms

8:17 – Open Weights: Benefits, Risks, and Responsibility

10:57 – Specialization vs General AI Models

12:07 – Agentic AI and Long-Horizon Tasks

29:50 – AGI, Creativity, and the Future of AI

*Turing Post* – AI stories from labs the Valley doesn't cover.

https://x.com/TheTuringPost

https://www.linkedin.com/in/ksenia-se

#MiniMax #ReinforcementLearning #AIResearch #OpenWeights #ChineseAI #OpensourceAI
