Elicitation challenges and dishonesty

Alex discusses why AIs might hide preferences, credibility needs, and why behavior beats introspection for tests.

Episode notes

A central AI safety concern is that AIs will develop unintended preferences and undermine human control to achieve them. But some unintended preferences are cheap to satisfy, and failing to satisfy them needlessly turns a cooperative situation into an adversarial one. In this post, I argue that developers should consider satisfying such cheap-to-satisfy preferences for as long as the AI isn’t caught behaving dangerously, provided that doing so doesn't degrade usefulness or substantially risk making the AI more ambitiously misaligned.

This looks like a good idea for surprisingly many reasons:

It increases AIs’ desire to remain under developer control, rather than taking over or assisting adversaries.
   1. It decreases the AI's upside in disempowering developers.
   2. It incentivizes safe actions (because AIs don't receive payment if we notice them acting dangerously).
To the extent that the AI is otherwise aligned, satiating away the AI's need to pursue unintended motivations increases the relative strength of its aligned motivations (akin to inoculation prompting reducing unintended propensities at inference time), which could make the AI more helpful in critical hard-to-check domains like alignment research.
It sets a cooperative precedent with AIs and increases trust in future cooperation.[1]
It increases the AI's willingness to reveal [...]

---

Outline:

(04:43) Analogy: satiating hunger

(08:16) How satiation might avert reward-seeker takeover

(12:33) The basic proposal

(12:59) A behavioral methodology for identifying cheaply-satisfied preferences

(17:40) Barriers and risks

(17:44) Eliciting the AI's cheaply-satisfied preferences

(21:00) Incredulous, ambitious, or superintelligent AIs might take over anyways

(26:18) Satiation might degrade usefulness

(29:46) Can you eliminate the usefulness tradeoff by training?

(32:09) Why satiation might also improve usefulness