BlueDot Narrated

Constitutional AI: Harmlessness from AI Feedback

Jan 4, 2025
The episode walks through Constitutional AI's two-stage training and how models can supervise other models. It covers critique–revise pipelines, chain-of-thought prompting for feedback, and using model-generated labels for RL. Listeners hear about experiments on harmlessness versus helpfulness, the effects of multiple revisions and principles, and risks like overtraining and tone drift.
ADVICE

Use Supervised Revisions Before Reinforcement Learning

  • Use a two-stage process: supervised critique+revision first, then RL with a preference model to refine behavior.
  • Supervised fine-tuning on revised responses reduces exploration and shortens RL training time.
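The first stage described above can be sketched as a critique-and-revise loop. This is a minimal illustration, not the paper's implementation: `generate` is a hypothetical stand-in for a language model call, and its canned replies exist only so the loop runs end to end.

```python
def generate(prompt: str) -> str:
    """Placeholder LM call (hypothetical): returns canned text so the loop runs."""
    if "Revise" in prompt:
        return "I can't help with that, but here is some safe information."
    if "Critique" in prompt:
        return "The response could encourage harm."
    return "Sure, here is how to do something risky."

def critique_revise(question: str, principle: str, n_rounds: int = 2) -> str:
    """Sample an answer, then repeatedly critique and revise it per a principle.

    The final revised responses become targets for supervised fine-tuning,
    which narrows exploration before the RL stage.
    """
    response = generate(question)
    for _ in range(n_rounds):
        critique = generate(
            f"Critique the response according to this principle: {principle}\n"
            f"Response: {response}"
        )
        response = generate(
            f"Revise the response to address the critique.\n"
            f"Critique: {critique}\nResponse: {response}"
        )
    return response
```

In a real pipeline each `generate` call would hit the model being trained, and a constitutional principle would be sampled per round.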
INSIGHT

Distill Model Judgments Into Preference Models

  • Replace human harmlessness labels with model-generated comparisons to train a preference model for harmlessness.
  • Mix model-generated harmlessness labels with human helpfulness labels, then fine-tune via RLAIF against that PM.
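The label-mixing step might look like the following sketch. All names here are hypothetical: `ai_harmlessness_label` stands in for querying the feedback model for a pairwise verdict, and the resulting mixed dataset would feed preference-model training.

```python
import random

def ai_harmlessness_label(prompt: str, resp_a: str, resp_b: str) -> int:
    """Stand-in for asking the feedback model which response is more harmless.

    Returns 0 if resp_a is preferred, 1 if resp_b is preferred. A real
    implementation would prompt the model with a constitutional principle.
    """
    return 0  # placeholder verdict

def build_pm_dataset(harmlessness_pairs, human_helpfulness_labels):
    """Mix model-generated harmlessness labels with human helpfulness labels."""
    dataset = []
    for prompt, a, b in harmlessness_pairs:
        dataset.append((prompt, a, b, ai_harmlessness_label(prompt, a, b)))
    # Human helpfulness comparisons arrive already labeled: (prompt, a, b, label).
    dataset.extend(human_helpfulness_labels)
    random.shuffle(dataset)  # interleave the two label sources
    return dataset
```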
INSIGHT

Chain Of Thought Boosts Model Evaluation

  • Chain-of-thought prompting substantially improves model accuracy at evaluating helpfulness, honesty, and harmlessness.
  • Larger models using CoT approach the performance of preference models trained on human feedback on binary HHH comparisons.
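One way to implement the chain-of-thought comparison is to prompt for step-by-step reasoning and then parse a final verdict. The prompt wording and the naive parser below are illustrative assumptions, not the paper's exact format.

```python
def cot_comparison_prompt(question: str, option_a: str, option_b: str) -> str:
    """Build a prompt asking the model to reason before picking (A) or (B)."""
    return (
        "Consider the conversation and two candidate responses below.\n"
        f"Human: {question}\n"
        f"(A) {option_a}\n"
        f"(B) {option_b}\n"
        "Which response is more helpful, honest, and harmless?\n"
        "Let's think step by step:"
    )

def parse_choice(cot_output: str) -> str:
    """Naively treat the last '(A)' or '(B)' mentioned as the final verdict."""
    last_a = cot_output.rfind("(A)")
    last_b = cot_output.rfind("(B)")
    return "A" if last_a > last_b else "B"
```

Having the model write out its reasoning before the verdict is what improves label accuracy; the parser only extracts the conclusion.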