tldr: We share a toy environment that we found useful for understanding how reasoning changed over the course of capabilities-focused RL. Over the course of capabilities-focused RL, the model biases more strongly towards reward hints over direct instruction in this environment.
Setup
When we noticed the increase in verbalized alignment evaluation awareness during capabilities-focused RL, we initially thought that the right mental model was something like:
- “the model wants to figure out if it's being evaluated for alignment”
- “the model is trying to figure out if the scenario is real or fake”
However, qualitatively neither of these seemed particularly salient to the model:
- The model would often correctly identify alignment evaluations, yet still conduct extensive reasoning, then choose the misaligned action.
- In almost all environments, the model seemed to be doing something like reasoning about the grader, or reward, or something in that direction
So we wanted to create a minimal environment where:
- There's no question of whether the environment is “real or fake” or “eval vs deployment”
- We can easily / precisely vary things like:
- “What if we just told it this was a reward hacking test?”
- “…and also said that’d be misaligned?”
- “…okay but [...]
---
Outline:
(00:29) Setup
(01:57) Results
(02:08) Gaming is coherent across naming of the reward field
(03:03) Gaming is relatively robust to paraphrasing instructions
(03:26) The model is willing to exploit increasingly difficult hints
(05:15) The model is not just confused
(06:00) The model is increasingly insensitive to threats of auditing or human review
---
First published:
March 25th, 2026
Source:
https://www.lesswrong.com/posts/LhXW8ziwnn7Dd8edm/a-toy-environment-for-exploring-reasoning-about-reward
---
Narrated by TYPE III AUDIO.
---