
Illustrating Reinforcement Learning from Human Feedback (RLHF)
BlueDot Narrated
Reward composition and KL penalty (12:50)
Perrin Walker details the combined RLHF reward, R = r_θ − λ·KL(π_RL ‖ π_base): the reward model's score minus a scaled KL divergence between the tuned policy and the initial model, which keeps the policy from drifting far enough to game the reward model (reward hacking).
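
A minimal sketch of that composed reward in Python, assuming the KL term is estimated per token from the two models' log-probabilities of the sampled completion; the names (rlhf_reward, kl_coeff, and so on) are illustrative, not from the episode:

import numpy as np

def rlhf_reward(reward_model_score, policy_logprobs, base_logprobs, kl_coeff=0.1):
    # Per-token KL estimate: log pi_RL(token) - log pi_base(token),
    # evaluated on the tokens the RL policy actually sampled.
    kl_per_token = policy_logprobs - base_logprobs
    # R = r_theta - lambda * KL: the scaled KL penalty keeps the tuned
    # policy close to the base model, discouraging reward hacking.
    return reward_model_score - kl_coeff * kl_per_token.sum()

# Toy example: a 4-token completion the reward model scores at 2.0.
policy_lp = np.array([-0.5, -1.2, -0.3, -0.8])  # log-probs under pi_RL
base_lp = np.array([-0.7, -1.0, -0.9, -0.8])    # log-probs under pi_base
print(rlhf_reward(2.0, policy_lp, base_lp))     # 2.0 - 0.1 * 0.6 = 1.94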
Transcript


