
BlueDot Narrated Weak-To-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
Jan 4, 2025
The episode examines whether weak supervisors can elicit stronger models' capabilities. Experiments fine-tune GPT-family models on NLP, chess, and reward-modeling tasks using labels produced by weaker models. Simple techniques such as an auxiliary confidence loss and bootstrapping are shown to boost performance. The discussion highlights benefits, limitations, and research directions for aligning superhuman models with weak supervision.
Strong Students Can Outperform Weak Supervisors
- Strong pretrained models often generalize beyond weak supervisors when fine-tuned on weak labels, a phenomenon the paper calls weak-to-strong generalization.
- Across NLP, chess, and reward modeling, students typically outperform their weak supervisors, recovering part of the gap between weak-supervisor performance and the strong-model ceiling.
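
The "portion of the performance gap" recovered is commonly summarized as performance gap recovered (PGR): how far the weak-to-strong student's accuracy sits between the weak supervisor and the strong ceiling. A minimal sketch (function name and the example accuracies are illustrative, not from the episode):

```python
def performance_gap_recovered(weak_acc, w2s_acc, strong_acc):
    """Fraction of the weak-to-strong gap the student recovers.

    PGR = (weak-to-strong acc - weak supervisor acc)
          / (strong ceiling acc - weak supervisor acc)

    PGR = 0 means the student merely matches its weak supervisor;
    PGR = 1 means it matches a model trained on ground-truth labels.
    """
    return (w2s_acc - weak_acc) / (strong_acc - weak_acc)

# Hypothetical example: weak supervisor 60%, strong ceiling 90%,
# student trained on weak labels reaches 80% -> two-thirds of the gap recovered.
print(performance_gap_recovered(0.60, 0.80, 0.90))
```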
Naive Fine-Tuning Leaves A Large Gap
- Naive fine-tuning recovers only part of the strong model's abilities, leaving a substantial gap to models trained with ground-truth labels.
- This suggests RLHF and similar methods may scale poorly to superhuman models without new techniques.
Use Confidence Loss And Bootstrapping To Improve Transfer
- Improve weak-to-strong generalization with simple interventions like an auxiliary confidence loss, bootstrapping via intermediate models, and unsupervised fine-tuning.
- These methods can substantially increase recovered performance, especially on NLP tasks.
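
The auxiliary confidence loss mixes cross-entropy against the weak label with cross-entropy against the student's own hardened prediction, so the student is rewarded for staying confident even when it disagrees with a noisy weak supervisor. A simplified binary-classification sketch (the function name, the fixed 0.5 mixing weight, and the scalar formulation are assumptions for illustration, not the paper's exact loss):

```python
import math

def confidence_loss(p_student, weak_label, alpha=0.5):
    """Auxiliary-confidence loss sketch for binary classification.

    Blends cross-entropy against the weak label (weight 1 - alpha) with
    cross-entropy against the student's own thresholded prediction
    (weight alpha), encouraging confident predictions.
    """
    eps = 1e-7
    p = min(max(p_student, eps), 1 - eps)  # clamp for numerical stability

    def ce(target):
        # Binary cross-entropy of the student's probability against a 0/1 target.
        return -(target * math.log(p) + (1 - target) * math.log(1 - p))

    hardened = 1.0 if p > 0.5 else 0.0  # student's own confident guess
    return (1 - alpha) * ce(weak_label) + alpha * ce(hardened)
```

For a student that is 90% confident in class 1 while the weak label says 0, raising `alpha` lowers the loss: the self-consistency term partially offsets the penalty for disagreeing with the weak supervisor.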
