
BlueDot Narrated: Illustrating Reinforcement Learning from Human Feedback (RLHF)
Jan 4, 2025
A clear walkthrough of why human feedback improves language models. Short explanations of RLHF concepts and the three main training stages. Practical details on reward modeling, PPO fine-tuning with a KL penalty, and iterative training. A quick survey of open-source RLHF tools and a look at remaining safety and research challenges.
Human Preferences Replace Imperfect Metrics
- Language model quality is hard to capture with a single loss function because 'good' text is subjective and context-dependent.
- RLHF uses human feedback as a reward signal to directly optimise models for human preferences rather than proxy metrics like BLEU or ROUGE.
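In optimisation terms, the human-derived reward becomes the scalar the policy maximises. Combined with the KL penalty mentioned in the episode description, which keeps the fine-tuned model close to its reference, the shaped reward can be sketched as follows. This is a minimal illustration, not any library's implementation; the function name, the toy log-probabilities, and the `beta` value are all assumptions.

```python
import numpy as np

def shaped_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """Combine a reward-model score with a KL penalty toward the reference model.

    rm_score: scalar preference score for the full response (assumed given).
    logp_policy / logp_ref: per-token log-probs of the sampled response
    under the fine-tuned policy and the frozen reference model.
    """
    # Sample-based estimate of KL(policy || reference) on this response.
    kl = np.sum(logp_policy - logp_ref)
    return rm_score - beta * kl

# Toy numbers: the policy has drifted slightly from the reference model,
# so a small penalty is subtracted from the reward-model score.
r = shaped_reward(rm_score=1.2,
                  logp_policy=np.array([-1.0, -0.5]),
                  logp_ref=np.array([-1.1, -0.7]),
                  beta=0.1)
```

The penalty term is what prevents the policy from "reward hacking" its way to text the reward model scores highly but the reference model finds wildly improbable.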
Begin RLHF From A Versatile Pretrained Model
- Start RLHF from a pre-trained language model that already responds well to diverse instructions.
- Optionally fine-tune this base on human-generated preferred text before reward-model training, as InstructGPT did.
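That optional supervised step is ordinary next-token imitation on the preferred text. A minimal numpy sketch of the loss, with illustrative shapes and a hypothetical `sft_loss` helper (the real training loop would run inside a deep-learning framework):

```python
import numpy as np

def sft_loss(logits, target_ids):
    """Next-token cross-entropy on a human-preferred demonstration.

    logits: (seq_len, vocab) scores from the base language model
    (shapes are illustrative, not tied to any particular model).
    target_ids: the demonstration tokens the model should imitate.
    """
    # Numerically stable log-softmax over the vocabulary at each position.
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # Average negative log-likelihood of the preferred tokens.
    return -log_probs[np.arange(len(target_ids)), target_ids].mean()

rng = np.random.default_rng(0)
loss = sft_loss(rng.normal(size=(4, 10)), np.array([3, 1, 7, 2]))
```

Minimising this loss nudges the base model toward the demonstrated style before any reward model enters the picture.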
Preference Models Learn From Rankings Not Scores
- Reward models map text to a scalar score representing human preference and are trained from human rankings of model outputs.
- Rankings (pairwise comparisons) are preferred over absolute scores because annotators calibrate absolute scales inconsistently, while agreement on which of two outputs is better is much more reliable.
