From Atari to ChatGPT: How AI Learned to Follow Instructions
Linear Digressions
Scaling preference feedback with a reward model
Ben explains how a small number of human labels trains a reward model that amplifies feedback for RL training.
Play episode from 08:39
Transcript


