Addressing Incomplete Preferences in Agent Training

This chapter explores the Incomplete Preferences Proposal (IPP) as a way to address problems such as goal misgeneralization and deceptive alignment in agent training. It contrasts money-maximizing agents with IPP agents in how they adopt terminal goals, showing that the two face different behavioral incentives. The discussion then turns to how training AI agents with the IPP could shape their behavior and help prevent problems such as reward misspecification and failures that arise at deployment.

Play episode from 01:02:58
