BlueDot Narrated

Constitutional AI: Harmlessness from AI Feedback

Jan 4, 2025
The episode walks through Constitutional AI's two-stage training and how models can supervise other models. It covers critique–revise pipelines, chain-of-thought prompting for feedback, and using model-generated labels for RL. Listeners hear about experiments on harmlessness versus helpfulness, the effects of multiple revisions and principles, and risks like overtraining and tone drift.
ADVICE

Use Supervised Revisions Before Reinforcement Learning

  • Use a two-stage process: supervised critique+revision first, then RL with a preference model to refine behavior.
  • Supervised fine-tuning on revised responses reduces exploration and shortens RL training time.
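The first stage described above can be sketched as a critique-and-revise loop. This is a minimal illustration, not the paper's implementation: `generate` is a hypothetical stand-in for a language model call, and its canned replies exist only so the loop runs end to end.

```python
def generate(prompt: str) -> str:
    """Placeholder LM call (hypothetical): returns canned text so the loop runs."""
    if "Revise" in prompt:
        return "I can't help with that, but here is some safe information."
    if "Critique" in prompt:
        return "The response could encourage harm."
    return "Sure, here is how to do something risky."

def critique_revise(question: str, principle: str, n_rounds: int = 2) -> str:
    """Sample an answer, then repeatedly critique and revise it per a principle.

    The final revised responses become targets for supervised fine-tuning,
    which narrows exploration before the RL stage.
    """
    response = generate(question)
    for _ in range(n_rounds):
        critique = generate(
            f"Critique the response according to this principle: {principle}\n"
            f"Response: {response}"
        )
        response = generate(
            f"Revise the response to address the critique.\n"
            f"Critique: {critique}\nResponse: {response}"
        )
    return response
```

In a real pipeline each `generate` call would hit the model being trained, and a constitutional principle would be sampled per round.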
INSIGHT

Distill Model Judgments Into Preference Models

  • Replace human harmlessness labels with model-generated comparisons to train a preference model for harmlessness.
  • Mix model-generated harmlessness labels with human helpfulness labels, then fine-tune via RLAIF against that PM.
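The label-mixing step might look like the following sketch. All names here are hypothetical: `ai_harmlessness_label` stands in for querying the feedback model for a pairwise verdict, and the resulting mixed dataset would feed preference-model training.

```python
import random

def ai_harmlessness_label(prompt: str, resp_a: str, resp_b: str) -> int:
    """Stand-in for asking the feedback model which response is more harmless.

    Returns 0 if resp_a is preferred, 1 if resp_b is preferred. A real
    implementation would prompt the model with a constitutional principle.
    """
    return 0  # placeholder verdict

def build_pm_dataset(harmlessness_pairs, human_helpfulness_labels):
    """Mix model-generated harmlessness labels with human helpfulness labels."""
    dataset = []
    for prompt, a, b in harmlessness_pairs:
        dataset.append((prompt, a, b, ai_harmlessness_label(prompt, a, b)))
    # Human helpfulness comparisons arrive already labeled: (prompt, a, b, label).
    dataset.extend(human_helpfulness_labels)
    random.shuffle(dataset)  # interleave the two label sources
    return dataset
```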
INSIGHT

Chain Of Thought Boosts Model Evaluation

  • Chain-of-thought prompting substantially improves model accuracy at evaluating helpfulness, honesty, and harmlessness.
  • Larger models using CoT approach the performance of preference models trained on human feedback on binary HHH comparisons.
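One way to implement the chain-of-thought comparison is to prompt for step-by-step reasoning and then parse a final verdict. The prompt wording and the naive parser below are illustrative assumptions, not the paper's exact format.

```python
def cot_comparison_prompt(question: str, option_a: str, option_b: str) -> str:
    """Build a prompt asking the model to reason before picking (A) or (B)."""
    return (
        "Consider the conversation and two candidate responses below.\n"
        f"Human: {question}\n"
        f"(A) {option_a}\n"
        f"(B) {option_b}\n"
        "Which response is more helpful, honest, and harmless?\n"
        "Let's think step by step:"
    )

def parse_choice(cot_output: str) -> str:
    """Naively treat the last '(A)' or '(B)' mentioned as the final verdict."""
    last_a = cot_output.rfind("(A)")
    last_b = cot_output.rfind("(B)")
    return "A" if last_a > last_b else "B"
```

Having the model write out its reasoning before the verdict is what improves label accuracy; the parser only extracts the conclusion.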