
BlueDot Narrated: Illustrating Reinforcement Learning from Human Feedback (RLHF)
Jan 4, 2025
A clear walkthrough of why human feedback improves language models. Short explanations of RLHF concepts and the three main training stages. Practical details on reward modeling, PPO fine-tuning with a KL penalty, and iterative training. A quick survey of open-source RLHF tools and a look at remaining safety and research challenges.
Human Preferences Replace Imperfect Metrics
- Language model quality is hard to capture with a single loss function because 'good' text is subjective and context-dependent.
- RLHF uses human feedback as a reward signal to directly optimise models for human preferences rather than proxy metrics like BLEU or ROUGE.
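In optimisation terms, the human-derived reward becomes the scalar the policy maximises. Combined with the KL penalty mentioned in the episode description, which keeps the fine-tuned model close to its reference, the shaped reward can be sketched as follows. This is a minimal illustration, not any library's implementation; the function name, the toy log-probabilities, and the `beta` value are all assumptions.

```python
import numpy as np

def shaped_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """Combine a reward-model score with a KL penalty toward the reference model.

    rm_score: scalar preference score for the full response (assumed given).
    logp_policy / logp_ref: per-token log-probs of the sampled response
    under the fine-tuned policy and the frozen reference model.
    """
    # Sample-based estimate of KL(policy || reference) on this response.
    kl = np.sum(logp_policy - logp_ref)
    return rm_score - beta * kl

# Toy numbers: the policy has drifted slightly from the reference model,
# so a small penalty is subtracted from the reward-model score.
r = shaped_reward(rm_score=1.2,
                  logp_policy=np.array([-1.0, -0.5]),
                  logp_ref=np.array([-1.1, -0.7]),
                  beta=0.1)
```

The penalty term is what prevents the policy from "reward hacking" its way to text the reward model scores highly but the reference model finds wildly improbable.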
Begin RLHF From A Versatile Pretrained Model
- Start RLHF from a pre-trained language model that already responds well to diverse instructions.
- Optionally fine-tune this base on human-generated preferred text before reward-model training, as InstructGPT did.
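That optional supervised step is ordinary next-token imitation on the preferred text. A minimal numpy sketch of the loss, with illustrative shapes and a hypothetical `sft_loss` helper (the real training loop would run inside a deep-learning framework):

```python
import numpy as np

def sft_loss(logits, target_ids):
    """Next-token cross-entropy on a human-preferred demonstration.

    logits: (seq_len, vocab) scores from the base language model
    (shapes are illustrative, not tied to any particular model).
    target_ids: the demonstration tokens the model should imitate.
    """
    # Numerically stable log-softmax over the vocabulary at each position.
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # Average negative log-likelihood of the preferred tokens.
    return -log_probs[np.arange(len(target_ids)), target_ids].mean()

rng = np.random.default_rng(0)
loss = sft_loss(rng.normal(size=(4, 10)), np.array([3, 1, 7, 2]))
```

Minimising this loss nudges the base model toward the demonstrated style before any reward model enters the picture.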
Preference Models Learn From Rankings Not Scores
- Reward models map text to a scalar score representing human preference and are trained from human rankings of model outputs.
- Rankings (pairwise comparisons) are preferred over absolute scores because annotators calibrate absolute scales inconsistently, while agreement on which of two outputs is better is much more reliable.
