Training Agents with the Incomplete Preferences Principle

This chapter explores training artificial agents under the Incomplete Preferences Principle (IPP) to avoid problems such as reward misspecification and goal misgeneralization. It contrasts the IPP with other proposals and discusses how effectively it aligns agent preferences. It also addresses challenges such as deceptive alignment and the importance of ensuring that agents can be shut down and restarted if needed.
