The MAD Podcast with Matt Turck

State of LLMs 2026: RLVR, GRPO, Inference Scaling — Sebastian Raschka

Jan 29, 2026
Sebastian Raschka, AI researcher and educator known for practical ML guides and his book on building LLMs, walks through the 2025–2026 shifts in large models. He compares architectures such as transformers, world models, and text diffusion. He explains RLVR (reinforcement learning with verifiable rewards) and GRPO (Group Relative Policy Optimization) as post-training methods, warns about benchmark gaming, and highlights inference-time scaling and private data as key drivers.
AI Snips
ANECDOTE

Personal RLVR Experiment

  • Sebastian trained a Qwen3 model with only 50 RLVR steps and saw math accuracy jump sharply (a minimal verifiable-reward check is sketched after this list).
  • He used this to argue that the base model already stores reasoning knowledge, and that RL training merely unlocks it.
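
To make the "verifiable" part of RLVR concrete, here is a minimal binary reward of the kind used for math tasks: the answer either matches the ground truth or it does not. This is a sketch, not Sebastian's actual setup; the function name and the \boxed{...} answer convention (borrowed from common math benchmarks) are assumptions.

```python
import re

def verifiable_math_reward(completion: str, ground_truth: str) -> float:
    """Binary RLVR-style reward: 1.0 if the completion's final boxed
    answer matches the known-correct answer, else 0.0. The \\boxed{...}
    convention is an assumption, not something from the episode."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

# A completion ending in \boxed{42} scores 1.0 against ground truth "42".
assert verifiable_math_reward(r"so the answer is \boxed{42}", "42") == 1.0
assert verifiable_math_reward("the answer is 42", "42") == 0.0
```

Because the check is mechanical, there is no learned reward model to hack, which is part of why RLVR on verifiable domains like math can move accuracy with so few steps.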
INSIGHT

Process Rewards Are Fragile Today

  • Process reward models (grading intermediate chain-of-thought steps) are promising but currently fragile due to reward hacking and unreliable graders; see the sketch after this list.
  • Multi-model grading stacks can work but add cost and complexity.
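
To make the fragility concrete, here is a minimal sketch of how a process reward scores intermediate steps rather than only the final answer. The `grade_step` callable is a hypothetical stand-in for a learned process reward model; everything here is illustrative rather than drawn from the episode.

```python
from typing import Callable

def process_reward(steps: list[str],
                   grade_step: Callable[[str], float]) -> float:
    """Sketch of a process reward: grade each intermediate
    chain-of-thought step in [0, 1] and aggregate. `grade_step` stands
    in for a learned grader; its reliability is exactly the fragile
    part, since a lenient grader invites reward hacking."""
    if not steps:
        return 0.0
    scores = [grade_step(step) for step in steps]
    # Mean over steps is forgiving; min(scores) is a stricter
    # alternative that zeroes out credit for any single bad step.
    return sum(scores) / len(scores)

# Toy grader that only checks a step is non-empty; a real process
# reward model is itself an LLM, with all the reliability issues above.
print(process_reward(["factor the quadratic", "x = 3 or x = -1"],
                     lambda s: 1.0 if s.strip() else 0.0))  # 1.0
```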
ADVICE

Stabilize GRPO With Practical Fixes

  • Apply practical stability tricks for GRPO: tweak the KL term, normalize rewards within each group, and stack small algorithmic fixes (both knobs appear in the sketch below).
  • Treat RL training like any other large-scale training run: monitor checkpoints and be ready to roll back when an update destabilizes performance.
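
Two of those fixes, reward normalization and a tunable KL penalty, are easy to show in code. This is a simplified sketch of GRPO's group-relative advantage computation and a GRPO-style loss, assuming PyTorch; the epsilon floor and `kl_coef` value are illustrative choices, not the exact objective from the GRPO paper or any specific library.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Group-relative advantages: normalize the rewards of completions
    sampled for the same prompt. The eps floor is one of the small
    fixes mentioned above; it stops advantages from exploding when
    every reward in the group is nearly identical."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_step_loss(logprobs: torch.Tensor,
                   ref_logprobs: torch.Tensor,
                   advantages: torch.Tensor,
                   kl_coef: float = 0.04) -> torch.Tensor:
    """Simplified GRPO-style objective over per-completion sequence
    log-probs: a policy-gradient term plus a KL penalty toward a frozen
    reference model, using the low-variance exp(d) - d - 1 estimator.
    kl_coef is the knob to tweak when updates destabilize performance."""
    pg = -(advantages * logprobs).mean()
    d = ref_logprobs - logprobs
    kl = (torch.exp(d) - d - 1.0).mean()
    return pg + kl_coef * kl

# Eight completions for one prompt with binary verifiable rewards:
# correct answers get positive advantages, incorrect ones negative.
rewards = torch.tensor([1., 0., 0., 1., 0., 0., 0., 1.])
print(grpo_advantages(rewards))
```

Lowering or annealing `kl_coef` is one concrete form of "tweaking the KL term"; pairing it with checkpoint monitoring gives you a rollback point when a too-aggressive setting degrades the model.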