The MAD Podcast with Matt Turck

State of LLMs 2026: RLVR, GRPO, Inference Scaling — Sebastian Raschka

Jan 29, 2026
Sebastian Raschka, AI researcher and educator known for practical ML guides and his book on building LLMs, walks through the 2025–2026 shifts in large models. He compares architectures such as transformers, world models, and text diffusion. He explains RLVR (reinforcement learning with verifiable rewards) and GRPO (Group Relative Policy Optimization) as post-training methods, warns about benchmark gaming, and highlights inference-time scaling and private data as key drivers.
AI Snips
ANECDOTE

Personal RLVR Experiment

  • Sebastian trained a Qwen3 model with only 50 RLVR steps and saw math accuracy jump sharply (a minimal verifiable-reward check is sketched after this list).
  • He used this to argue that the base model already stores reasoning knowledge, and that RL training merely unlocks it.
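
To make the "verifiable" part of RLVR concrete, here is a minimal binary reward of the kind used for math tasks: the answer either matches the ground truth or it does not. This is a sketch, not Sebastian's actual setup; the function name and the \boxed{...} answer convention (borrowed from common math benchmarks) are assumptions.

```python
import re

def verifiable_math_reward(completion: str, ground_truth: str) -> float:
    """Binary RLVR-style reward: 1.0 if the completion's final boxed
    answer matches the known-correct answer, else 0.0. The \\boxed{...}
    convention is an assumption, not something from the episode."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

# A completion ending in \boxed{42} scores 1.0 against ground truth "42".
assert verifiable_math_reward(r"so the answer is \boxed{42}", "42") == 1.0
assert verifiable_math_reward("the answer is 42", "42") == 0.0
```

Because the check is mechanical, there is no learned reward model to hack, which is part of why RLVR on verifiable domains like math can move accuracy with so few steps.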
INSIGHT

Process Rewards Are Fragile Today

  • Process reward models (grading intermediate chain-of-thought steps) are promising but currently fragile due to reward hacking and unreliable graders; see the sketch after this list.
  • Multi-model grading stacks can work but add cost and complexity.
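
To make the fragility concrete, here is a minimal sketch of how a process reward scores intermediate steps rather than only the final answer. The `grade_step` callable is a hypothetical stand-in for a learned process reward model; everything here is illustrative rather than drawn from the episode.

```python
from typing import Callable

def process_reward(steps: list[str],
                   grade_step: Callable[[str], float]) -> float:
    """Sketch of a process reward: grade each intermediate
    chain-of-thought step in [0, 1] and aggregate. `grade_step` stands
    in for a learned grader; its reliability is exactly the fragile
    part, since a lenient grader invites reward hacking."""
    if not steps:
        return 0.0
    scores = [grade_step(step) for step in steps]
    # Mean over steps is forgiving; min(scores) is a stricter
    # alternative that zeroes out credit for any single bad step.
    return sum(scores) / len(scores)

# Toy grader that only checks a step is non-empty; a real process
# reward model is itself an LLM, with all the reliability issues above.
print(process_reward(["factor the quadratic", "x = 3 or x = -1"],
                     lambda s: 1.0 if s.strip() else 0.0))  # 1.0
```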
ADVICE

Stabilize GRPO With Practical Fixes

  • Apply practical stability tricks for GRPO: tweak the KL term, normalize rewards within each group, and stack small algorithmic fixes (both knobs appear in the sketch below).
  • Treat RL training like any other large-scale training run: monitor checkpoints and be ready to roll back when an update destabilizes performance.
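
Two of those fixes, reward normalization and a tunable KL penalty, are easy to show in code. This is a simplified sketch of GRPO's group-relative advantage computation and a GRPO-style loss, assuming PyTorch; the epsilon floor and `kl_coef` value are illustrative choices, not the exact objective from the GRPO paper or any specific library.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Group-relative advantages: normalize the rewards of completions
    sampled for the same prompt. The eps floor is one of the small
    fixes mentioned above; it stops advantages from exploding when
    every reward in the group is nearly identical."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_step_loss(logprobs: torch.Tensor,
                   ref_logprobs: torch.Tensor,
                   advantages: torch.Tensor,
                   kl_coef: float = 0.04) -> torch.Tensor:
    """Simplified GRPO-style objective over per-completion sequence
    log-probs: a policy-gradient term plus a KL penalty toward a frozen
    reference model, using the low-variance exp(d) - d - 1 estimator.
    kl_coef is the knob to tweak when updates destabilize performance."""
    pg = -(advantages * logprobs).mean()
    d = ref_logprobs - logprobs
    kl = (torch.exp(d) - d - 1.0).mean()
    return pg + kl_coef * kl

# Eight completions for one prompt with binary verifiable rewards:
# correct answers get positive advantages, incorrect ones negative.
rewards = torch.tensor([1., 0., 0., 1., 0., 0., 0., 1.])
print(grpo_advantages(rewards))
```

Lowering or annealing `kl_coef` is one concrete form of "tweaking the KL term"; pairing it with checkpoint monitoring gives you a rollback point when a too-aggressive setting degrades the model.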