
BlueDot Narrated: Can We Scale Human Feedback for Complex AI Tasks?
Jan 4, 2025
The episode explores why simple human feedback breaks down on complex AI tasks. It outlines scalable oversight techniques such as task decomposition, recursive reward modelling, constitutional approaches, and debate frameworks, and also covers experiments on weak-to-strong generalisation and potential failure modes.
Human Feedback Breaks Down On Complex Tasks
- RLHF works well for many tasks but fails when humans can't reliably judge complex outputs at scale.
- Problems like deception and hallucinations arise because humans may be briefly fooled by plausible outputs that aren't actually correct.
Summaries Via Hierarchical Decomposition
- Task decomposition breaks hard tasks into subtasks that are easier for humans to evaluate accurately.
- Example: summarize each page, then each chapter, then the whole book so humans needn't reread the entire book to verify accuracy.
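The page-then-chapter-then-book example can be sketched as a small recursive pipeline. This is a minimal illustration, not the episode's implementation: the `summarize` function is a hypothetical stand-in for an LLM call, stubbed here as "return the first sentence" so the control flow runs on its own.

```python
def summarize(text: str) -> str:
    """Hypothetical stand-in for an LLM summariser: return the first sentence."""
    return text.split(". ")[0].rstrip(".") + "."

def summarize_book(book: list[list[str]]) -> str:
    """A book is a list of chapters; each chapter is a list of pages."""
    chapter_summaries = []
    for chapter in book:
        # Level 1: summarise each page independently -- a human can spot-check
        # any single page summary without reading the whole book.
        page_summaries = [summarize(page) for page in chapter]
        # Level 2: summarise the page summaries into a chapter summary.
        chapter_summaries.append(summarize(" ".join(page_summaries)))
    # Level 3: summarise the chapter summaries into a book summary.
    return summarize(" ".join(chapter_summaries))
```

The key property is that each level's inputs are short enough for a human to evaluate directly, so oversight of the final summary reduces to many small, checkable judgements.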
Iterated Amplification And Distillation
- Iterated amplification (IA) calls multiple copies of a model or uses chain-of-thought to amplify capability on subtasks.
- IDA adds distillation so the amplified result becomes a single-step training example for the base model.
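The amplify-then-distill loop can be sketched in a few lines. Everything here is a toy assumption: the "model" is a lookup table, `decompose` splits a composite task on semicolons, amplification calls the model once per subtask, and distillation stores the amplified answer as a one-step training example.

```python
def decompose(task: str) -> list[str]:
    """Hypothetical decomposition: split a composite task on ';'."""
    return [t.strip() for t in task.split(";")]

def amplify(model: dict[str, str], task: str) -> str:
    # Amplification: call the base model once per subtask,
    # then combine the sub-answers.
    sub_answers = [model.get(sub, "?") for sub in decompose(task)]
    return " ".join(sub_answers)

def ida_step(model: dict[str, str], task: str) -> None:
    # Distillation: the amplified result becomes a single-step
    # "training example", so next time the model answers directly.
    model[task] = amplify(model, task)

model = {"2+2": "4", "3+3": "6"}
ida_step(model, "2+2; 3+3")
# After distillation, the composite task is answered in one lookup.
```

Iterating this loop lets the distilled model handle ever-larger composites directly, which is the sense in which amplification plus distillation compounds capability.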
