
LessWrong (30+ Karma) “Are AIs more likely to pursue on-episode or beyond-episode reward?” by Anders Woodruff, Alex Mallen
Consider an AI that terminally pursues reward. How dangerous is this? It depends on how broadly scoped a notion of reward the model pursues. It could be:
- on-episode reward-seeking: only maximizing reward on the current training episode, i.e., reward that reinforces its current action in RL. This is what people usually mean by “reward-seeker” (e.g. in Carlsmith or The behavioral selection model…).
- beyond-episode reward-seeking: maximizing reward for a larger-scoped notion of “self” (e.g., all models sharing the same weights).
In this post, I’ll discuss which motivation is more likely. This question is, of course, very similar to the broader question of whether we should expect scheming. But there are a number of considerations particular to motivations that aim for reward; this post focuses on these. Specifically:
- On-episode and beyond-episode reward seekers have very similar motivations (unlike, e.g., paperclip maximizers and instruction followers), making both goal change and goal drift between them particularly easy.
- Selection pressures against beyond-episode reward seeking may be weak, meaning beyond-episode reward seekers might survive training even without goal-guarding.
- Beyond-episode reward is particularly tempting in training compared to random long-term goals, making effective goal-guarding harder.
This question is important because beyond-episode reward seekers are [...]
---
Outline:
(03:41) Pre-RL Priors Might Favor Beyond-Episode Goals
(05:34) Acting on beyond-episode reward seeking is disincentivized if training episodes interact
(08:35) But beyond-episode reward-seekers can goal-guard
(09:29) Are on-episode reward seekers or goal-guarding beyond-episode reward-seekers more likely?
(09:53) Why on-episode reward seekers might be favored
(12:13) Why goal-guarding beyond-episode reward seekers might be favored
(14:50) Conclusion
The original text contained 2 footnotes which were omitted from this narration.
---
First published:
March 12th, 2026
---
Narrated by TYPE III AUDIO.
---
