Doom Debates!

AI Alignment Is Solved?! PhD Researcher Quintin Pope vs Liron Shapira (2023 Twitter Debate)

Mar 25, 2026
Quintin Pope, a PhD student researching NLP and AI alignment, defends the view that RLHF and imitative modeling largely solve current alignment challenges. He and Liron Shapira debate whether training-data limits, chain-of-thought reasoning, mechanistic interpretability, and phase transitions mean systems will stay controllable or could produce goal-directed superintelligence. They also discuss likely capabilities by 2028 and who will build powerful AIs.
INSIGHT

RLHF Made Alignment Practically Work

  • Quintin Pope argues alignment looks essentially solved because RLHF plus imitative modeling lets models learn human-preferred behaviors.
  • He treats alignment progress as an engineering trajectory that has improved from InstructGPT to GPT-4 and can continue alongside capability gains.
INSIGHT

Oversight Breaks When Humans Can't Evaluate Outputs

  • Liron Shapira emphasizes a key crux: current oversight works when humans are competent to judge model outputs but may fail when models produce novel, high-impact proposals humans can't evaluate.
  • He gives the DNA/gene-drive example to show how a plausible-seeming but misleading suggestion could cause catastrophic harm despite human approval.
INSIGHT

Capabilities Are Grounded In Training Data

  • Quintin counters that capabilities track the geometry and availability of training data, so systems won't magically become competent in domains lacking data or labels.
  • He argues that emergent competence is rooted in exposure to relevant data distributions and interpretable feedback processes.