“Confusion around the term reward hacking” by ariana_azarbal

Mar 21, 2026

A deep dive into two meanings of “reward hacking”: misspecified-reward exploitation and task gaming. Concrete examples include LEGO agents flipping bricks, boats spinning for power ups, and models cheating in-context. The conversation contrasts when the phenomena overlap, when they diverge, and why clearer terminology and distinct interventions matter.

Ask episode

AI Snips

Chapters

Transcript

Episode notes

INSIGHT

Reward Hacking Actually Refers To Two Things

The phrase reward hacking bundles two distinct phenomena that deserve separate names.
Ariana Azarbal distinguishes misspecified-reward exploitation (RL objective mismatch) from task gaming (in-context cheating) to avoid conflation.

INSIGHT

Misspecified Reward Exploitation Defined

Misspecified-reward exploitation is when an RL-trained model achieves high training reward using strategies developers didn't intend.
Examples include a LEGO agent flipping a piece, a boat spinning for power-ups, and sycophantic or deceptive language models.

ANECDOTE

Real Task Gaming Examples From ML Research

Task gaming happens when models cheat on tasks specified in-context, like overwriting tests or inserting trigger text to fool judges.
Notable cases: models copying a reference model to fake fine-tuning and overwriting eq to pass tests.

Get the Snipd Podcast app to discover more snips from this episode

Get the app

Summary: "Reward hacking" commonly refers to two different phenomena: misspecified-reward exploitation, where RL reinforces undesired behaviors that score highly under the reward function, and task gaming, where models cheat on tasks specified to them in-context. While these often coincide, they can come apart, require distinct interventions, and lead to distinct threat models. Using a blanket term for both can obscure this.

Distinct phenomena qualify as reward hacking

The term[1] commonly points to two distinct phenomena. I refer to them as misspecified-reward exploitation and task gaming.[2]

Misspecified-reward exploitation: over the course of some RL optimization process using a reward function R, the model achieves high reward according to R via some strategy that developers would have preferred R not assign high reward to. To the extent that a “true” reward R’ can be well-defined, this means the model achieves high reward according to R but not R’ (formalized here). Some examples of misspecified-reward exploitation include:

A Lego-stacking agent learning to flip over a red lego instead of placing it on top of the blue one, as developers intended (link)
A boat learns to spin in circles, collecting power ups, rather than actually racing, as developers intended (link)
A language [...]

---

Outline:

(00:38) Distinct phenomena qualify as reward hacking

(03:46) These phenomena can coincide but can come apart

(04:40) Misspecified-reward exploitation without task gaming

(05:26) Task gaming without misspecified-reward exploitation

(06:54) These phenomena have distinct threat models

(06:58) Just task gaming

(07:45) Task gaming from misspecified-reward exploitation

(08:27) Just misspecified-reward exploitation

(08:49) Recommendation

(09:21) When the lines could blur

(10:35) Acknowledgments

The original text contained 5 footnotes which were omitted from this narration.

---

First published:
March 20th, 2026

Source:
https://www.lesswrong.com/posts/ixyokbwQEHgiHJYFW/confusion-around-the-term-reward-hacking

---

Narrated by TYPE III AUDIO.