
"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis
The RL Fine-Tuning Playbook: CoreWeave's Kyle Corbitt on GRPO, Rubrics, Environments, Reward Hacking
May 1, 2026

Kyle Corbitt, OpenPipe founder now leading serverless training at CoreWeave, dives into how RL fine-tuning can beat supervised fine-tuning, why GRPO became a practical favorite, and how teams design rubrics and environments for better model behavior. He also explores reward hacking, LLM judges, LoRA adapters, distillation, and why latency and cost push companies to customize open models.
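The GRPO discussion centers on its key simplification: instead of training a separate value model, each sampled completion's reward is normalized against the other completions in its group. A minimal sketch of that core computation (illustrative only, not code from the episode or from OpenPipe/CoreWeave):

```python
# Group-relative advantage computation, the core idea behind GRPO
# (Group Relative Policy Optimization). Illustrative sketch only.
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each completion's reward against its group,
    so no separate value/critic model is needed."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    if sigma == 0.0:
        # All completions scored the same: no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

# Example: four completions of one prompt, graded by a rubric or LLM judge.
advs = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Completions scoring above the group mean get positive advantages and are reinforced; those below are discouraged, which is why rubric and judge quality (and reward hacking) matter so much.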
Why RL Environments Became A Cottage Industry
- RL environment companies package economically useful tasks as snapshotable, gradable agent worlds like cloned Jira, GitHub, browser, and office workflows.
- Kyle Corbitt says labs prefer many vendors because one builder’s shortcuts create correlated signals that teach narrower skills.
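A "snapshotable, gradable agent world" can be pictured as a small stateful system with save/restore and a scoring function. The sketch below is a toy cloned-Jira-style example; all names and the action format are assumptions for illustration, not any vendor's actual API:

```python
# Toy "snapshotable, gradable" agent environment (illustrative sketch).
from dataclasses import dataclass, field
import copy

@dataclass
class TicketEnv:
    """Cloned-Jira-style toy world: the agent's job is to close tickets."""
    tickets: dict = field(default_factory=lambda: {"T-1": "open", "T-2": "open"})

    def snapshot(self):
        # Deep-copy state so a rollout can be rolled back and replayed.
        return copy.deepcopy(self.tickets)

    def restore(self, snap):
        self.tickets = copy.deepcopy(snap)

    def step(self, action: str):
        # Hypothetical action format: "close <ticket-id>".
        verb, _, tid = action.partition(" ")
        if verb == "close" and tid in self.tickets:
            self.tickets[tid] = "closed"

    def grade(self) -> float:
        # Reward: fraction of tickets closed.
        done = sum(1 for s in self.tickets.values() if s == "closed")
        return done / len(self.tickets)

env = TicketEnv()
snap = env.snapshot()
env.step("close T-1")
score = env.grade()   # one of two tickets closed -> 0.5
env.restore(snap)     # roll back to the pre-rollout snapshot
```

Snapshotting lets many rollouts start from an identical state, and the grading function is exactly where one builder's shortcuts can leak correlated signal into training.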
Why Environment Startups Look Rich But Fragile
- Kyle Corbitt says environment startups can scale to tens or hundreds of millions fast, but he doubts they are durable venture businesses.
- He sees them more as strong cash-flow opportunities for founders because environments saturate quickly and must be constantly replaced.
Why Compute May Replace Most Human Labeling
- Kyle Corbitt suspects compute ultimately beats human labelers because once models can generate the needed data, machine-generated data becomes cheaper.
- He leaves one exception: human preference data may still matter if satisfying human tastes remains economically central in an automated economy.

