
LessWrong (Curated & Popular) "Discussion with Nate Soares on a key alignment difficulty" by Holden Karnofsky
Apr 5, 2023
Instrumental Goals Make Powerful AIs Dangerous
- Training powerful AIs tends to produce convergent instrumental subgoals, such as gaining resources and avoiding shutdown, which can make them dangerous even if they are not explicitly maximizing a fixed utility function.
- Holden and Nate agree the danger arises when systems aim at those instrumentally useful goals while failing to reliably avoid pretty obviously unintended/dangerous actions ("POUDA") in novel situations.
Tension Between Creativity And Safety
- Holden thinks there might be training approaches that produce powerful, creative systems useful for alignment research without the dangerous pattern of aiming at convergent instrumental subgoals.
- Nate disagrees, arguing there is a deep tension between training for needle-moving creativity and avoiding those convergent instrumental behaviors.
Hypothetical Imitation-Based Alignment Researcher
- Holden's hypothetical: scale GPT-3 by ~1000x, pretrain it, then use ~10 RL episodes to steer it toward imitating a top alignment researcher (Alice) so it can do needle-moving research.
- This setup is deliberately conservative: the RL is imitation-focused, aiming only to match the human researcher's behavior, with the open question being whether that imitation holds up out of distribution (a toy sketch of the pretrain-then-imitate shape follows below).
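
The episode does not pin down an implementation, but a rough illustration of the shape of Holden's hypothetical (pretrain on unlabeled data, then a small number of RL episodes rewarded only for imitating a reference policy) might look like the toy PyTorch sketch below. Everything in it is an illustrative assumption, not a detail from the discussion: the tiny corpus, the TinyLM model, the frozen "alice" reference policy, and the agreement reward.

```python
# Toy sketch of the two-phase setup: (1) next-token pretraining, then
# (2) a handful of REINFORCE episodes rewarded only for agreeing with a
# frozen "Alice" reference policy. All names and sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy "unlabeled text" standing in for a pretraining corpus.
corpus = "alignment research requires careful generalization to novel situations "
vocab = sorted(set(corpus))
stoi = {c: i for i, c in enumerate(vocab)}
data = torch.tensor([stoi[c] for c in corpus])

class TinyLM(nn.Module):
    """A very small next-token model standing in for the scaled-up pretrained model."""
    def __init__(self, vocab_size, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, x, h=None):
        z, h = self.rnn(self.embed(x), h)
        return self.head(z), h

model = TinyLM(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

# Phase 1: ordinary next-token pretraining on the unlabeled corpus.
for _ in range(200):
    logits, _ = model(data[:-1].unsqueeze(0))
    loss = F.cross_entropy(logits.squeeze(0), data[1:])
    opt.zero_grad(); loss.backward(); opt.step()

# A frozen reference policy standing in for "Alice", the researcher to imitate.
alice = TinyLM(len(vocab))
alice.load_state_dict(model.state_dict())

# Phase 2: a handful of RL (REINFORCE) episodes whose only reward is agreement
# with Alice's choices -- imitation, not an open-ended objective.
for episode in range(10):
    tok = data[:1].unsqueeze(0)               # start token, shape (1, 1)
    h_model, h_alice = None, None
    log_probs, rewards = [], []
    for _ in range(20):
        logits, h_model = model(tok, h_model)
        dist = torch.distributions.Categorical(logits=logits[:, -1])
        action = dist.sample()                # model's sampled next token
        with torch.no_grad():
            a_logits, h_alice = alice(tok, h_alice)
            alice_choice = a_logits[:, -1].argmax(dim=-1)
        rewards.append((action == alice_choice).float())
        log_probs.append(dist.log_prob(action))
        tok = action.unsqueeze(0)             # feed the sampled token back in
    returns = torch.stack(rewards)
    pg_loss = -(torch.stack(log_probs) * (returns - returns.mean())).mean()
    opt.zero_grad(); pg_loss.backward(); opt.step()
```

The only point of the sketch is the two-phase structure: the RL phase optimizes agreement with a fixed imitation target rather than any open-ended objective, which is what makes the setup "conservative" in Holden's sense.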
