"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis

The RL Fine-Tuning Playbook: CoreWeave's Kyle Corbitt on GRPO, Rubrics, Environments, Reward Hacking

May 1, 2026
Kyle Corbitt, OpenPipe founder now leading serverless training at CoreWeave, dives into how RL fine-tuning can beat supervised fine-tuning, why GRPO became a practical favorite, and how teams design rubrics and environments for better model behavior. He also explores reward hacking, LLM judges, LoRA adapters, distillation, and why latency and cost push companies to customize open models.
INSIGHT

Why RL Environments Became A Cottage Industry

  • RL environment companies package economically useful tasks as snapshotable, gradable agent worlds like cloned Jira, GitHub, browser, and office workflows.
  • Kyle Corbitt says labs prefer many vendors because one builder’s shortcuts create correlated signals that teach narrower skills.
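The "snapshotable, gradable agent world" pattern above can be sketched as a minimal environment interface: checkpoint the state, let the agent act, then score the outcome deterministically. All names here (`TicketEnv`, `snapshot`, `grade`) are illustrative assumptions, not any vendor's actual API.

```python
# Toy sketch of a snapshotable, gradable agent environment,
# loosely modeled on a cloned Jira-style ticket workflow.
import copy
from dataclasses import dataclass, field

@dataclass
class TicketEnv:
    """Agent closes tickets; a deterministic grader scores the rollout."""
    open_tickets: set = field(default_factory=lambda: {"BUG-1", "BUG-2", "BUG-3"})
    closed: set = field(default_factory=set)

    def snapshot(self) -> "TicketEnv":
        # Deep-copy the whole state so a rollout can be restored and replayed,
        # which is what makes one environment reusable across many RL samples.
        return copy.deepcopy(self)

    def step(self, action: str) -> None:
        # Actions are strings like "close BUG-1"; invalid actions are no-ops.
        verb, _, ticket = action.partition(" ")
        if verb == "close" and ticket in self.open_tickets:
            self.open_tickets.remove(ticket)
            self.closed.add(ticket)

    def grade(self) -> float:
        # Reward: fraction of tickets the agent actually closed.
        total = len(self.open_tickets) + len(self.closed)
        return len(self.closed) / total if total else 0.0

env = TicketEnv()
start = env.snapshot()       # checkpoint before the rollout
env.step("close BUG-1")
env.step("close BUG-9")      # no such ticket: ignored by the environment
score = env.grade()          # 1 of 3 tickets closed
fresh = start.snapshot()     # restore the checkpoint for another rollout
```

The point of `snapshot` is the "snapshotable" property Corbitt describes: many independent rollouts can branch from the same state, which is exactly what group-based methods like GRPO need to compare candidate trajectories.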
INSIGHT

Why Environment Startups Look Rich But Fragile

  • Kyle Corbitt says environment startups can scale to tens or hundreds of millions fast, but he doubts they are durable venture businesses.
  • He sees them more as strong cash-flow opportunities for founders because environments saturate quickly and must be constantly replaced.
INSIGHT

Why Compute May Replace Most Human Labeling

  • Kyle Corbitt suspects compute ultimately beats human labelers because once models can generate the needed data, machine-generated data becomes cheaper.
  • He leaves one exception: human preference data may still matter if satisfying human tastes remains economically central in an automated economy.