BlueDot Narrated

Illustrating Reinforcement Learning from Human Feedback (RLHF)

Jan 4, 2025
A clear walkthrough of why human feedback improves language models. Short explanations of RLHF concepts and the three main training stages. Practical details on reward modeling, PPO fine-tuning with a KL penalty, and iterative training. A quick survey of open-source RLHF tools and a look at remaining safety and research challenges.
AI Snips
INSIGHT

Human Preferences Replace Imperfect Metrics

  • Language model quality is hard to capture with a conventional loss function because 'good' text is subjective and context-dependent.
  • RLHF uses human feedback as a reward signal to directly optimise models for human preferences rather than proxy metrics like BLEU or ROUGE.
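In practice, the human-preference reward is usually combined with a KL penalty that keeps the fine-tuned policy close to the original model (the "PPO fine-tuning with a KL penalty" the episode describes). A minimal sketch of that combined per-sample reward, where the function name `kl_penalized_reward` and the `beta` value are illustrative assumptions, not from the episode:

```python
def kl_penalized_reward(rm_score: float,
                        logprob_policy: float,
                        logprob_ref: float,
                        beta: float = 0.1) -> float:
    """Total RLHF reward: reward-model score minus a KL penalty.

    The penalty term (logprob_policy - logprob_ref) is a per-sample
    estimate of the KL divergence between the tuned policy and the
    frozen reference model; beta controls how hard we pull back.
    """
    return rm_score - beta * (logprob_policy - logprob_ref)

# If the policy has not drifted from the reference, there is no penalty.
no_drift = kl_penalized_reward(rm_score=1.0, logprob_policy=-2.0, logprob_ref=-2.0)

# If the policy assigns much higher probability than the reference
# (i.e. it is drifting), the reward is reduced.
drifted = kl_penalized_reward(rm_score=1.0, logprob_policy=-1.0, logprob_ref=-2.0)
```

The design choice here is the trade-off `beta` encodes: too small and the policy reward-hacks away from fluent text; too large and it barely moves from the pretrained model.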
ADVICE

Begin RLHF From A Versatile Pretrained Model

  • Start RLHF from a pre-trained language model that already responds well to diverse instructions.
  • Optionally fine-tune this base on human-generated preferred text before reward-model training, as InstructGPT did.
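The optional supervised fine-tuning step boils down to maximizing the likelihood of human-preferred demonstrations. A toy sketch of that objective, where `sft_loss` and the example log-probabilities are hypothetical, assuming per-token log-probs from the base model:

```python
def sft_loss(token_logprobs: list[float]) -> float:
    """Supervised fine-tuning objective on preferred text:
    the average negative log-likelihood of the demonstration tokens."""
    return -sum(token_logprobs) / len(token_logprobs)

# Hypothetical per-token log-probs the base model assigns to a
# human-written preferred response; training lowers this loss.
demo_logprobs = [-0.2, -1.1, -0.5]
loss = sft_loss(demo_logprobs)  # mean NLL = 0.6
```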
INSIGHT

Preference Models Learn From Rankings Not Scores

  • Reward models map text to a scalar score representing human preference and are trained from human rankings of model outputs.
  • Rankings (pairwise comparisons) are preferred over absolute scores because annotators' raw scores are noisy and poorly calibrated relative to one another, while relative judgements are more consistent.
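Training on pairwise rankings is typically done with a Bradley-Terry style objective: the reward model is penalised whenever it scores the rejected output above the chosen one. A minimal sketch, with the function name `pairwise_preference_loss` as an assumption:

```python
import math

def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log(sigmoid(r_chosen - r_rejected)).

    The loss depends only on the *margin* between the two scalar rewards,
    so no absolute score calibration is needed from annotators.
    """
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A reward model that correctly separates the preferred output gets a
# small loss; one that inverts the ranking gets a large loss.
loss_good = pairwise_preference_loss(2.0, -1.0)   # margin +3 -> small loss
loss_bad = pairwise_preference_loss(-1.0, 2.0)    # margin -3 -> large loss
```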