
BlueDot Narrated Weak-To-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
Jan 4, 2025
The episode examines whether weak supervisors can elicit stronger models' capabilities. Experiments fine-tune GPT-family models on NLP, chess, and reward-modeling tasks using labels produced by weaker models. Simple techniques such as an auxiliary confidence loss and bootstrapping are shown to boost performance. The discussion highlights benefits, limitations, and research directions for aligning superhuman models with weak supervision.
Strong Students Can Outperform Weak Supervisors
- Strong pretrained models often generalize beyond weak supervisors when fine-tuned on weak labels, a phenomenon the paper calls weak-to-strong generalization.
- Across NLP, chess, and reward modeling, students typically outperform their weak supervisors, recovering part of the gap between weak-supervisor performance and the strong-model ceiling.
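
The "portion of the performance gap" recovered is commonly summarized as performance gap recovered (PGR): how far the weak-to-strong student's accuracy sits between the weak supervisor and the strong ceiling. A minimal sketch (function name and the example accuracies are illustrative, not from the episode):

```python
def performance_gap_recovered(weak_acc, w2s_acc, strong_acc):
    """Fraction of the weak-to-strong gap the student recovers.

    PGR = (weak-to-strong acc - weak supervisor acc)
          / (strong ceiling acc - weak supervisor acc)

    PGR = 0 means the student merely matches its weak supervisor;
    PGR = 1 means it matches a model trained on ground-truth labels.
    """
    return (w2s_acc - weak_acc) / (strong_acc - weak_acc)

# Hypothetical example: weak supervisor 60%, strong ceiling 90%,
# student trained on weak labels reaches 80% -> two-thirds of the gap recovered.
print(performance_gap_recovered(0.60, 0.80, 0.90))
```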
Naive Fine-Tuning Leaves A Large Gap
- Naive fine-tuning recovers only part of the strong model's abilities, leaving a substantial gap to models trained with ground-truth labels.
- This suggests RLHF and similar methods may scale poorly to superhuman models without new techniques.
Use Confidence Loss And Bootstrapping To Improve Transfer
- Improve weak-to-strong generalization with simple interventions like an auxiliary confidence loss, bootstrapping via intermediate models, and unsupervised fine-tuning.
- These methods can substantially increase recovered performance, especially on NLP tasks.
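
The auxiliary confidence loss mixes cross-entropy against the weak label with cross-entropy against the student's own hardened prediction, so the student is rewarded for staying confident even when it disagrees with a noisy weak supervisor. A simplified binary-classification sketch (the function name, the fixed 0.5 mixing weight, and the scalar formulation are assumptions for illustration, not the paper's exact loss):

```python
import math

def confidence_loss(p_student, weak_label, alpha=0.5):
    """Auxiliary-confidence loss sketch for binary classification.

    Blends cross-entropy against the weak label (weight 1 - alpha) with
    cross-entropy against the student's own thresholded prediction
    (weight alpha), encouraging confident predictions.
    """
    eps = 1e-7
    p = min(max(p_student, eps), 1 - eps)  # clamp for numerical stability

    def ce(target):
        # Binary cross-entropy of the student's probability against a 0/1 target.
        return -(target * math.log(p) + (1 - target) * math.log(1 - p))

    hardened = 1.0 if p > 0.5 else 0.0  # student's own confident guess
    return (1 - alpha) * ce(weak_label) + alpha * ce(hardened)
```

For a student that is 90% confident in class 1 while the weak label says 0, raising `alpha` lowers the loss: the self-consistency term partially offsets the penalty for disagreeing with the weak supervisor.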
