
"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis Intelligence with Everyone: RL @ MiniMax, with Olive Song, from AIE NYC & Inference by Turing Post
Feb 22, 2026 Olive Song, a senior researcher in reinforcement learning at MiniMax who helped build the M-series open-weight models, discusses training M2 with RL, tight product feedback loops, and perturbation pipelines. She covers long-horizon agentic coding, reward-hacking and alignment challenges, FP32 RL decisions, and using internal agents to track fast-moving research.
Small Model, Big Agentic Focus
- MiniMax M2 is an open-weight model with roughly 10B active parameters, optimized for coding and agentic workplace tasks.
- Olive Song emphasizes that benchmark numbers don't capture real-world dynamics or developer experience.
Scale Environments And Expert Rewards
- Scale environments and expert feedback together during RL to match real developer workflows.
- Use in-house expert developers as reward models to align outputs with what practitioners trust.
Interleaved Thinking For Noisy Environments
- Interleaved thinking alternates tool calls with reasoning, letting the model act, observe, then re-think multiple times.
- This approach improves robustness in noisy, dynamic environments and supports long-horizon automation.
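The act-observe-rethink loop described above can be sketched as a minimal agent loop. This is an illustrative sketch only, not MiniMax's actual implementation; the tool registry and function names are hypothetical stand-ins.

```python
# Hypothetical sketch of an interleaved-thinking agent loop: instead of
# planning everything up front, the model thinks, calls a tool, observes
# the result, then thinks again. All names here are illustrative.

def run_tool(name: str, arg: str) -> str:
    """Stand-in for a real tool (shell command, file read, test runner)."""
    tools = {"echo": lambda s: s.upper()}
    return tools[name](arg)

def interleaved_agent(task: str, max_steps: int = 2) -> list[str]:
    trace = []
    observation = task
    for step in range(max_steps):
        # "Think": decide the next action from the latest observation.
        trace.append(f"step {step}: plan action for '{observation}'")
        # "Act": invoke a tool rather than emitting a final answer.
        observation = run_tool("echo", observation)
        trace.append(f"step {step}: observed '{observation}'")
        # "Re-think": the loop repeats, so each new thought can react
        # to noisy or surprising tool output.
    return trace

trace = interleaved_agent("fix failing test")
```

Because each thinking step sees the most recent observation, the agent can recover when a tool returns something unexpected, which is the robustness property the snip attributes to noisy, dynamic environments.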