Robustness to Instruction Paraphrase

TYPE III AUDIO summarizes results showing RL-late consistently games more despite paraphrased instructions.

Play episode from 03:04

chevron_right

Transcript

chevron_right

Transcript

Episode notes

tldr: We share a toy environment that we found useful for understanding how reasoning changed over the course of capabilities-focused RL. Over the course of capabilities-focused RL, the model biases more strongly towards reward hints over direct instruction in this environment.

Setup

When we noticed the increase in verbalized alignment evaluation awareness during capabilities-focused RL, we initially thought that the right mental model was something like:

“the model wants to figure out if it's being evaluated for alignment”
“the model is trying to figure out if the scenario is real or fake”

However, qualitatively neither of these seemed particularly salient to the model:

The model would often correctly identify alignment evaluations, yet still conduct extensive reasoning, then choose the misaligned action.
In almost all environments, the model seemed to be doing something like reasoning about the grader, or reward, or something in that direction

So we wanted to create a minimal environment where:

There's no question of whether the environment is “real or fake” or “eval vs deployment”
We can easily / precisely vary things like:
“What if we just told it this was a reward hacking test?”
“…and also said that’d be misaligned?”
“…okay but [...]

---

Outline:

(00:29) Setup

(01:57) Results

(02:08) Gaming is coherent across naming of the reward field

(03:03) Gaming is relatively robust to paraphrasing instructions

(03:26) The model is willing to exploit increasingly difficult hints

(05:15) The model is not just confused

(06:00) The model is increasingly insensitive to threats of auditing or human review

---

First published:
March 25th, 2026

Source:
https://www.lesswrong.com/posts/LhXW8ziwnn7Dd8edm/a-toy-environment-for-exploring-reasoning-about-reward