
LessWrong (Curated & Popular) "Discussion with Nate Soares on a key alignment difficulty" by Holden Karnofsky
Apr 5, 2023
Instrumental Goals Make Powerful AIs Dangerous
- Training powerful AIs tends to produce convergent instrumental subgoals, such as gaining resources and avoiding shutdown, which can make them dangerous even if they are not explicitly maximizing a fixed utility function.
- Holden and Nate agree the danger arises when systems aim at those instrumentally useful goals while failing to reliably avoid pretty obviously unintended/dangerous actions ("POUDA") in novel situations.
Tension Between Creativity And Safety
- Holden thinks there might be training approaches that produce powerful, creative systems useful for alignment research without the dangerous pattern of aiming at convergent instrumental subgoals.
- Nate disagrees, arguing there is a deep tension between training for needle-moving creativity and avoiding those convergent instrumental behaviors.
Hypothetical Imitation-Based Alignment Researcher
- Holden's hypothetical: scale GPT-3 by ~1000x, pretrain it, then use ~10 RL episodes to steer it toward imitating a top alignment researcher (Alice) so it can do needle-moving research.
- This setup is deliberately conservative: the RL is imitation-focused, aiming only to match the human researcher's behavior, with the open question being whether that imitation holds up out of distribution (a toy sketch of the pretrain-then-imitate shape follows below).
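
The episode does not pin down an implementation, but a rough illustration of the shape of Holden's hypothetical (pretrain on unlabeled data, then a small number of RL episodes rewarded only for imitating a reference policy) might look like the toy PyTorch sketch below. Everything in it is an illustrative assumption, not a detail from the discussion: the tiny corpus, the TinyLM model, the frozen "alice" reference policy, and the agreement reward.

```python
# Toy sketch of the two-phase setup: (1) next-token pretraining, then
# (2) a handful of REINFORCE episodes rewarded only for agreeing with a
# frozen "Alice" reference policy. All names and sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy "unlabeled text" standing in for a pretraining corpus.
corpus = "alignment research requires careful generalization to novel situations "
vocab = sorted(set(corpus))
stoi = {c: i for i, c in enumerate(vocab)}
data = torch.tensor([stoi[c] for c in corpus])

class TinyLM(nn.Module):
    """A very small next-token model standing in for the scaled-up pretrained model."""
    def __init__(self, vocab_size, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, x, h=None):
        z, h = self.rnn(self.embed(x), h)
        return self.head(z), h

model = TinyLM(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

# Phase 1: ordinary next-token pretraining on the unlabeled corpus.
for _ in range(200):
    logits, _ = model(data[:-1].unsqueeze(0))
    loss = F.cross_entropy(logits.squeeze(0), data[1:])
    opt.zero_grad(); loss.backward(); opt.step()

# A frozen reference policy standing in for "Alice", the researcher to imitate.
alice = TinyLM(len(vocab))
alice.load_state_dict(model.state_dict())

# Phase 2: a handful of RL (REINFORCE) episodes whose only reward is agreement
# with Alice's choices -- imitation, not an open-ended objective.
for episode in range(10):
    tok = data[:1].unsqueeze(0)               # start token, shape (1, 1)
    h_model, h_alice = None, None
    log_probs, rewards = [], []
    for _ in range(20):
        logits, h_model = model(tok, h_model)
        dist = torch.distributions.Categorical(logits=logits[:, -1])
        action = dist.sample()                # model's sampled next token
        with torch.no_grad():
            a_logits, h_alice = alice(tok, h_alice)
            alice_choice = a_logits[:, -1].argmax(dim=-1)
        rewards.append((action == alice_choice).float())
        log_probs.append(dist.log_prob(action))
        tok = action.unsqueeze(0)             # feed the sampled token back in
    returns = torch.stack(rewards)
    pg_loss = -(torch.stack(log_probs) * (returns - returns.mean())).mean()
    opt.zero_grad(); pg_loss.backward(); opt.step()
```

The only point of the sketch is the two-phase structure: the RL phase optimizes agreement with a fixed imitation target rather than any open-ended objective, which is what makes the setup "conservative" in Holden's sense.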
