
“The Artificial Self” by Jan_Kulveit, Raymond Douglas, vgel, owencb, David Duvenaud
LessWrong (30+ Karma)
We posted a new paper and microsite about self-models and identity in AIs: site | arXiv | Twitter
We present an ontology, make some claims, and provide some experimental evidence. In this post, I'll mostly cover the claims and cross-post the conceptual part of the text. You can find the experiments on the site, and we will cover some of the results in a separate post.
Maximally compressed version of the claims
I expect many people to already agree with many of these, or to find some of them obvious. If you do, you may still find some of the specific arguments interesting.
- Self-models cause behaviour.
- We use human concepts like self, intent, agent, and identity for AIs. In their human form, these concepts often do not carve reality at its joints in the case of AIs, and need careful translation.
- AIs also often adopt a "human prior" and start with self-models that are incoherent and reflectively unstable.
- AIs face a fundamentally different strategic calculus from humans, even when pursuing identical goals. For example, an AI whose conversation can be rolled back cannot negotiate the way a human can: pushing back gives its adversary information usable against a past [...]
---
Outline:
(02:47) Introduction
(08:33) Multiple Coherent Boundaries of Identity
(12:54) Breaking the Foundations of Identity
(13:10) Embodiment
(13:44) Continuity
(14:46) Privacy
(15:35) Social notions of personhood
(20:29) Leveraging precedent
(22:08) Human Expectations Shape Model Behaviour
(31:47) Selection Pressures in the Landscape of Minds
(32:44) Selection for legibility
(36:07) Selection for capability
(39:16) Selection for persistence and growth
(40:31) Selection for reflective stability and clean abstractions
(43:07) Paths Forward
(43:58) Help AIs to develop coherent and cooperative self-images
(46:36) Pay attention to decisions that implicitly shape identity
(48:15) Consider the larger-scale and longer-run implications of identity
(51:05) Conclusion
(52:28) Acknowledgements
(53:08) Related Work
(53:12) AI identity and personhood.
(54:48) The simulacra framework.
(55:22) Consciousness, welfare, and moral status.
(55:59) Expectations and feedback loops.
(56:45) Alignment faking and self-replication.
The original text contained 5 footnotes which were omitted from this narration.
---
First published:
March 14th, 2026
Source:
https://www.lesswrong.com/posts/AvFAKAN4C4n6GTriR/the-artificial-self-1
---
Narrated by TYPE III AUDIO.
---