Why GRPO Preserves Useful Behavior

Kyle Corbitt explains why KL penalties are not enough and why RL focuses updates on tokens that actually affect outcomes instead of overwriting whole traces.

Play episode from 10:49
