
"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis
The RL Fine-Tuning Playbook: CoreWeave's Kyle Corbitt on GRPO, Rubrics, Environments, Reward Hacking
May 1, 2026

Kyle Corbitt, OpenPipe founder now leading serverless training at CoreWeave, dives into how RL fine-tuning can beat supervised fine-tuning, why GRPO became a practical favorite, and how teams design rubrics and environments for better model behavior. He also explores reward hacking, LLM judges, LoRA adapters, distillation, and why latency and cost push companies to customize open models.
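The GRPO discussion centers on its key simplification: instead of training a separate value model, each sampled completion's reward is normalized against the other completions in its group. A minimal sketch of that core computation (illustrative only, not code from the episode or from OpenPipe/CoreWeave):

```python
# Group-relative advantage computation, the core idea behind GRPO
# (Group Relative Policy Optimization). Illustrative sketch only.
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each completion's reward against its group,
    so no separate value/critic model is needed."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    if sigma == 0.0:
        # All completions scored the same: no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

# Example: four completions of one prompt, graded by a rubric or LLM judge.
advs = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Completions scoring above the group mean get positive advantages and are reinforced; those below are discouraged, which is why rubric and judge quality (and reward hacking) matter so much.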
Why RL Environments Became A Cottage Industry
- RL environment companies package economically useful tasks as snapshotable, gradable agent worlds like cloned Jira, GitHub, browser, and office workflows.
- Kyle Corbitt says labs prefer many vendors because one builder’s shortcuts create correlated signals that teach narrower skills.
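A "snapshotable, gradable agent world" can be pictured as a small stateful system with save/restore and a scoring function. The sketch below is a toy cloned-Jira-style example; all names and the action format are assumptions for illustration, not any vendor's actual API:

```python
# Toy "snapshotable, gradable" agent environment (illustrative sketch).
from dataclasses import dataclass, field
import copy

@dataclass
class TicketEnv:
    """Cloned-Jira-style toy world: the agent's job is to close tickets."""
    tickets: dict = field(default_factory=lambda: {"T-1": "open", "T-2": "open"})

    def snapshot(self):
        # Deep-copy state so a rollout can be rolled back and replayed.
        return copy.deepcopy(self.tickets)

    def restore(self, snap):
        self.tickets = copy.deepcopy(snap)

    def step(self, action: str):
        # Hypothetical action format: "close <ticket-id>".
        verb, _, tid = action.partition(" ")
        if verb == "close" and tid in self.tickets:
            self.tickets[tid] = "closed"

    def grade(self) -> float:
        # Reward: fraction of tickets closed.
        done = sum(1 for s in self.tickets.values() if s == "closed")
        return done / len(self.tickets)

env = TicketEnv()
snap = env.snapshot()
env.step("close T-1")
score = env.grade()   # one of two tickets closed -> 0.5
env.restore(snap)     # roll back to the pre-rollout snapshot
```

Snapshotting lets many rollouts start from an identical state, and the grading function is exactly where one builder's shortcuts can leak correlated signal into training.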
Why Environment Startups Look Rich But Fragile
- Kyle Corbitt says environment startups can scale to tens or hundreds of millions fast, but he doubts they are durable venture businesses.
- He sees them more as strong cash-flow opportunities for founders because environments saturate quickly and must be constantly replaced.
Why Compute May Replace Most Human Labeling
- Kyle Corbitt suspects compute ultimately beats human labelers because once models can generate the needed data, machine-generated data becomes cheaper.
- He leaves one exception: human preference data may still matter if satisfying human tastes remains economically central in an automated economy.

