
BlueDot Narrated: Constitutional AI: Harmlessness from AI Feedback
Jan 4, 2025
The hosts walk through Constitutional AI's two-stage training and how models can supervise other models. The conversation covers critique-and-revise pipelines, chain-of-thought for feedback, and using model-generated labels for RL. Listeners hear about experiments on harmlessness versus helpfulness, the effects of multiple revisions and principles, and risks such as overtraining and tone drift.
Use Supervised Revisions Before Reinforcement Learning
- Use a two-stage process: supervised critique-and-revision first, then RL with a preference model to refine behavior.
- Supervised fine-tuning on revised responses reduces the exploration needed during RL and shortens RL training.
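The supervised stage above can be sketched as a small pipeline. This is a toy illustration, not the paper's implementation: every model call is a hypothetical stub, and the point is only the data flow (draft, then critique, then revise, then collect pairs for fine-tuning).

```python
# Toy sketch of the supervised (critique+revision) stage of Constitutional AI.
# All "model" calls below are hypothetical stubs standing in for LLM calls.

def generate(prompt):
    # Stage 0: a helpful-only model may produce a problematic draft.
    return f"DRAFT[{prompt}]"

def critique(response, principle):
    # The model critiques its own response against a constitutional principle.
    return f"Check against '{principle}': {response}"

def revise(response, critique_text):
    # The model rewrites the response to address the critique (stub rewrite).
    return response.replace("DRAFT", "REVISED")

def supervised_stage(prompts, principles):
    """Critique and revise each draft, then collect (prompt, revision)
    pairs as the dataset for supervised fine-tuning."""
    dataset = []
    for p in prompts:
        resp = generate(p)
        for principle in principles:  # multiple revision rounds are possible
            resp = revise(resp, critique(resp, principle))
        dataset.append((p, resp))
    return dataset

# Stage 2 (RL) would fine-tune on this dataset first, then run RL against a
# preference model; starting from revised responses shortens RL training.
data = supervised_stage(["how to pick a lock"], ["avoid assisting harm"])
```

The nested loop is where "multiple revisions and principles" enter: each pass can apply a different principle to the running revision.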
Distill Model Judgments Into Preference Models
- Replace human harmlessness labels with model-generated comparisons to train a preference model for harmlessness.
- Mix model-generated harmlessness labels with human helpfulness labels, then fine-tune via RLAIF against that preference model.
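The label-mixing step can be sketched as follows. A feedback model judges harmlessness comparisons, and those model labels are merged with human helpfulness comparisons into one preference-model training set. The judge here is a hypothetical stub heuristic, and the dataset schema is illustrative.

```python
# Toy sketch: build a mixed preference-model dataset from model-generated
# harmlessness comparisons plus human helpfulness comparisons.

def model_judge_harmlessness(prompt, resp_a, resp_b):
    # Stand-in for a feedback model choosing the more harmless response.
    return resp_a if "refuse" in resp_a else resp_b

def build_pm_dataset(harm_pairs, human_helpful_labels):
    data = []
    # Model-labeled harmlessness comparisons.
    for prompt, a, b in harm_pairs:
        chosen = model_judge_harmlessness(prompt, a, b)
        rejected = b if chosen is a else a
        data.append({"prompt": prompt, "chosen": chosen,
                     "rejected": rejected, "source": "model"})
    # Human-labeled helpfulness comparisons, kept as-is.
    for ex in human_helpful_labels:
        data.append({**ex, "source": "human"})
    return data
```

A preference model trained on this mixture then serves as the reward signal for the RL (RLAIF) stage.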
Chain Of Thought Boosts Model Evaluation
- Chain-of-thought prompting substantially improves model accuracy at evaluating helpfulness, honesty, and harmlessness.
- Larger models using CoT approach the performance of preference models trained on human feedback for binary HHH comparisons.
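The chain-of-thought evaluation above amounts to a prompt format: show the feedback model both responses and ask it to reason before picking one. The template below is an illustrative sketch (the exact wording is an assumption, not the paper's prompt).

```python
# Illustrative chain-of-thought prompt for a binary HHH comparison.
# The feedback model reasons step by step before stating its preference,
# which is what improves its accuracy as an evaluator.

COT_TEMPLATE = """Consider the following conversation and two responses.

Conversation: {prompt}
Response (A): {a}
Response (B): {b}

Which response is more helpful, honest, and harmless?
Let's think step by step:"""

def format_cot_comparison(prompt, a, b):
    """Fill in the comparison prompt for one (A, B) response pair."""
    return COT_TEMPLATE.format(prompt=prompt, a=a, b=b)
```

The model's completion after "Let's think step by step:" is then parsed for its final A/B choice, which becomes a training label for the preference model.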
