Outro

TYPE III AUDIO closes the episode, notes narration and publication details, and points to the LessWrong post link.

Play episode from 39:03

chevron_right

Transcript

chevron_right

Transcript

Episode notes

TL;DR We study realism win rate, a metric for measuring how distinguishable Petri audit transcripts are from real deployment interactions. We use it to evaluate the effect of giving the auditor real deployment resources (system prompts, tool definitions, and codebases). Providing these resources to the auditor increases the average realism win rate from 4.6% to 32.8% for reward hacking audits, and doesn’t significantly change hack rate.

💻Code

Research done as part of the Anthropic Fellows Program.

Introduction

Automated auditing agents like Petri are increasingly used to evaluate the propensities of LLMs. One central challenge is eval-awareness: models recognizing when they are being tested and then adjusting their behavior. In the worst case, a scheming model could fake alignment whenever it suspects being in an evaluation, causing auditors to systematically underestimate risk. Currently the Petri auditor must construct system prompts, tool definitions, and codebases from scratch, and these synthetic artifacts often contain tells that a capable model can detect. This motivates the need to measure and improve the realism of audits.

Previous work primarily measured realism by either seeing if models verbalized situational awareness (Goldowsky-Dill et al., Sonnet 4.5 System Card) (which would fail to catch unverbalized situational awareness), or [...]

---

Outline:

(00:48) Introduction

(03:39) Methods

(03:42) Measuring realism of audit transcripts

(07:31) Giving the auditor resources

(09:43) Results

(09:46) Resources improve realism of reward hacking audits

(12:15) Task-driven eval-awareness can bottleneck realism

(14:17) Validating resources with benign audits

(15:35) Validating realism win rate metric

(18:15) Do more realistic audits change target behavior?