AF - Is scheming more likely in models trained to have long-term goals? (Sections 2.2.4.1-2.2.4.2 of "Scheming AIs") by Joe Carlsmith

Nov 30, 2023

Ask episode

Chapters

Transcript

Episode notes

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Is scheming more likely in models trained to have long-term goals? (Sections 2.2.4.1-2.2.4.2 of "Scheming AIs"), published by Joe Carlsmith on November 30, 2023 on The AI Alignment Forum.
This is Sections 2.2.4.1-2.2.4.2 of my report "Scheming AIs: Will AIs fake alignment during training in order to get power?". There's also a summary of the full report here (audio here). The summary covers most of the main points and technical terms, and I'm hoping that it will provide much of the context necessary to understand individual sections of the report on their own.
Audio version of this section here, or search "Joe Carlsmith Audio" on your podcast app.
What if you intentionally train models to have long-term goals?
In my discussion of beyond-episode goals thus far, I haven't been attending very directly to the length of the episode, or to whether the humans are setting up training specifically in order to incentivize the AI to learn to accomplish long-horizon tasks. Do those factors make a difference to the probability that the AI ends up with the sort of the beyond-episode goals necessary for scheming?
Yes, I think they do. But let's distinguish between two cases, namely:
Training the model on long (but not: indefinitely long) episodes, and
Trying to use short episodes to create a model that optimizes over long (perhaps: indefinitely long) time horizons.
I'll look at each in turn.
Training the model on long episodes
In the first case, we are specifically training our AI using fairly long episodes - say, for example, a full calendar month. That is: in training, in response to an action at t1, the AI receives gradients that causally depend on the consequences of its action a full month after t1, in a manner that directly punishes the model for ignoring those consequences in choosing actions at t1.
Now, importantly, as I discussed in the section on "non-schemers with schemer-like traits," misaligned non-schemers with longer episodes will generally start to look more and more like schemers. Thus, for example, a reward-on-the-episode seeker, here, would have an incentive to support/participate in efforts to seize control of the reward process that will pay off within a month.
But also, importantly: a month is still different from, for example, a trillion years. That is, training a model on longer episodes doesn't mean you are directly pressuring it to care, for example, about the state of distant galaxies in the year five trillion.
Indeed, on my definition of the "incentivized episode," no earthly training process can directly punish a model for failing to care on such a temporal scope, because no gradients the model receives can depend (causally) on what happens over such timescales. And of course, absent training-gaming, models that sacrifice reward-within-the-month for more-optimal-galaxies-in-year-five-trillion will get penalized by training.
In this sense, the most basic argument against expecting beyond episode-goals (namely, that training provides no direct pressure to have them, and actively punishes them, absent training-gaming, if they ever lead to sacrificing within-episode reward for something longer-term) applies to both "short" (e.g., five minutes) and "long" (e.g., a month, a year, etc) episodes in equal force.
However, I do still have some intuition that once you're training a model on fairly long episodes, the probability that it learns a beyond-episode goal goes up at least somewhat.
The most concrete reason I can give for this is that, to the extent we're imagining a form of "messy goal-directedness" in which, in order to build a schemer, SGD needs to build not just a beyond-episode goal to which a generic "goal-achieving engine" can then be immediately directed, but rather a larger set of future-oriented heuristics, patterns of attention, beliefs, and so o...