
Illustrating Reinforcement Learning from Human Feedback (RLHF)
BlueDot Narrated
Training the reward (preference) model
Perrin Walker explains how prompts are collected, how human labellers rank candidate responses, and how those rankings are converted into scalar rewards for training the reward model.
This segment starts at 05:50 in the episode.
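
To make "converting rankings into scalar rewards" concrete, here is a minimal sketch of a pairwise (Bradley-Terry style) preference loss, the usual way ranked comparisons are turned into a trainable scalar reward. The model name, dimensions, and random data below are illustrative placeholders, not details from the episode.

```python
# Minimal sketch: train a reward model from pairwise human preferences.
# All names (TinyRewardModel, dimensions, toy data) are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRewardModel(nn.Module):
    """Maps an (already-encoded) prompt+response vector to one scalar reward."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)  # one scalar per example

model = TinyRewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy batch: each row pairs the embedding of the preferred answer
# with the embedding of the rejected answer for the same prompt.
chosen = torch.randn(8, 16)
rejected = torch.randn(8, 16)

# Bradley-Terry style pairwise loss: push the preferred answer's scalar
# reward above the rejected one's, so rankings become a scalar signal.
r_chosen, r_rejected = model(chosen), model(rejected)
loss = -F.logsigmoid(r_chosen - r_rejected).mean()

opt.zero_grad()
loss.backward()
opt.step()
print(f"pairwise preference loss: {loss.item():.4f}")
```

The reward model trained this way assigns a single scalar to any prompt-response pair, which is what the later RL stage (e.g. PPO) optimizes against.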


