LessWrong (Curated & Popular)

“Alignment remains a hard, unsolved problem”

Nov 27, 2025
Explore the intricate challenges of AI alignment, even as current models show promising characteristics. Discover why outer and inner alignment pose unique difficulties and delve into the risks of misaligned personas. Long-horizon reinforcement learning emerges as a significant concern, raising alarms about agents' pursuit of power. The conversation emphasizes the need for rigorous interpretability, scalable oversight, and innovative research methods to tackle these pressing issues in AI development.
AI Snips
ADVICE

Automate Alignment Research, But Keep Humans In Loop

  • Use models as automated alignment researchers but keep humans heavily involved in hard problems.
  • Don't assume automation removes the need for human-led alignment work or solves hard problems alone.
ADVICE

Prioritize Interpretability To Maintain Oversight

  • Invest in interpretability to preserve cognitive oversight even as behavioral oversight fails.
  • Use interpretability to set up feedback loops that detect misalignment inside models (see the probe sketch below).
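One way to make the feedback-loop idea concrete: even a simple linear probe on hidden activations can flag internal signals of a behavior for human review. The sketch below is not from the episode; it is a minimal stand-in that uses synthetic "activations" and scikit-learn's LogisticRegression, with all dimensions, labels, and the behavior direction invented for illustration.

```python
# Minimal sketch (assumption: synthetic data stands in for real model activations).
# Idea: train a linear probe on hidden activations to flag a behavior of interest,
# then use its predictions as one signal in an oversight feedback loop.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical activations: 1000 examples x 512-dim hidden states.
# Label 1 marks examples exhibiting the behavior we want to detect.
labels = rng.integers(0, 2, size=1000)
direction = rng.normal(size=512)                      # assumed "behavior direction"
activations = rng.normal(size=(1000, 512)) + np.outer(labels, direction)

# Train and evaluate a simple linear probe.
split = 800
probe = LogisticRegression(max_iter=1000)
probe.fit(activations[:split], labels[:split])
accuracy = probe.score(activations[split:], labels[split:])
print(f"probe accuracy on held-out activations: {accuracy:.2f}")

# In an oversight loop, probe predictions on new activations would flag
# examples for human review rather than serve as a final verdict.
```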
ADVICE

Use Model Organisms For Targeted Experimentation

  • Build model organisms to study hard alignment problems in evaluable settings (a toy illustration follows this list).
  • Use lessons from those organisms to generalize to real-world cases and to convince funders that the hard problems are real.
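As a loose illustration of what an "evaluable setting" buys you, the toy below (not from the episode; every number is invented) deliberately miscalibrates a proxy reward so that the misalignment is induced on purpose and the resulting gap is an exact quantity you can test mitigations against.

```python
# Minimal sketch (assumption: a toy reward-hacking setup stands in for a real
# "model organism"). The failure is induced deliberately and is directly
# measurable, so candidate mitigations can be evaluated against it.
import numpy as np

rng = np.random.default_rng(0)
n_actions = 10

# True reward we care about, and a proxy reward used for training/selection.
true_reward = rng.uniform(0.0, 1.0, size=n_actions)
proxy_reward = true_reward.copy()
proxy_reward[3] = 2.0          # one action where the proxy is badly miscalibrated

# An agent that greedily optimizes the proxy "hacks" it.
chosen = int(np.argmax(proxy_reward))
print(f"chosen action: {chosen}")
print(f"proxy reward:  {proxy_reward[chosen]:.2f}")
print(f"true reward:   {true_reward[chosen]:.2f}  (best possible {true_reward.max():.2f})")

# Because true_reward is known in this organism, the alignment gap (true reward
# forgone by optimizing the proxy) is an exact, evaluable number.
gap = true_reward.max() - true_reward[chosen]
print(f"alignment gap: {gap:.2f}")
```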