Summary: "Reward hacking" commonly refers to two different phenomena: misspecified-reward exploitation, where RL reinforces undesired behaviors that score highly under the reward function, and task gaming, where models cheat on tasks specified to them in-context. While these often coincide, they can come apart, require distinct interventions, and lead to distinct threat models. Using a blanket term for both can obscure this.
Distinct phenomena qualify as reward hacking
The term "reward hacking" commonly points to two distinct phenomena. I refer to them as misspecified-reward exploitation and task gaming.
Misspecified-reward exploitation: during an RL optimization process that uses a reward function R, the model achieves high reward according to R via a strategy that developers would have preferred R not reward highly. To the extent that a “true” reward R’ can be well-defined, this means the model achieves high reward according to R but not according to R’ (formalized here). Some examples of misspecified-reward exploitation include:
- A Lego-stacking agent learns to flip over a red Lego block, whereas developers intended it to place the block on top of the blue one (link)
- A boat-racing agent learns to spin in circles collecting power-ups, whereas developers intended it to actually race (link)
- A language [...]
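
The R versus R’ framing above can be made concrete with a small sketch. This is only an illustration loosely modeled on the boat example, not the formalization the post links to: the Outcome fields, both reward functions, and the comparison against a baseline are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """Hypothetical summary of one episode in a boat-racing environment."""
    power_ups_collected: int
    finished_race: bool


def proxy_reward(outcome: Outcome) -> float:
    """R: the reward actually optimized (assumed here to count power-ups only)."""
    return float(outcome.power_ups_collected)


def intended_reward(outcome: Outcome) -> float:
    """R': what developers would have preferred to reward (assumed: finishing)."""
    return 1.0 if outcome.finished_race else 0.0


def exploits_misspecification(outcome: Outcome, baseline: Outcome) -> bool:
    """True when the outcome beats the baseline under R but scores lower under R'."""
    return (
        proxy_reward(outcome) > proxy_reward(baseline)
        and intended_reward(outcome) < intended_reward(baseline)
    )


# Illustrative values only; they are not taken from any experiment.
spinning = Outcome(power_ups_collected=20, finished_race=False)
racing = Outcome(power_ups_collected=5, finished_race=True)

print(exploits_misspecification(spinning, racing))
```

Note that the check only looks at the gap between R and R’. It says nothing about whether the model was given a task in-context and cheated on it, which is the separate question that task gaming covers.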
---
Outline:
(00:38) Distinct phenomena qualify as reward hacking
(03:46) These phenomena can coincide but can come apart
(04:40) Misspecified-reward exploitation without task gaming
(05:26) Task gaming without misspecified-reward exploitation
(06:54) These phenomena have distinct threat models
(06:58) Just task gaming
(07:45) Task gaming from misspecified-reward exploitation
(08:27) Just misspecified-reward exploitation
(08:49) Recommendation
(09:21) When the lines could blur
(10:35) Acknowledgments
The original text contained 5 footnotes which were omitted from this narration.
---
First published:
March 20th, 2026
Source:
https://www.lesswrong.com/posts/ixyokbwQEHgiHJYFW/confusion-around-the-term-reward-hacking
---
Narrated by TYPE III AUDIO.