LessWrong (30+ Karma)

“Measuring and improving coding audit realism with deployment resources” by Connor Kissane, Monte M, Fabien Roger

Mar 24, 2026
The authors introduce a realism win rate metric for judging whether audit transcripts look like real deployment interactions, and test giving auditors real system prompts, tool definitions, and codebases as deployment resources. They report that providing these resources raises realism substantially for reward-hacking audits, while exploring limits such as task-driven eval-awareness and judge dependence.
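A minimal sketch of how a realism win rate could be computed, assuming the judge makes head-to-head comparisons between an audit transcript and a real deployment transcript; the function names and judge interface are illustrative, not taken from the post:

```python
# Hypothetical sketch: realism win rate as the fraction of pairwise
# judgments in which an LLM judge picks the audit transcript as the
# more realistic of an (audit, real-deployment) pair.
from typing import Callable, Sequence


def realism_win_rate(
    audit_transcripts: Sequence[str],
    real_transcripts: Sequence[str],
    judge: Callable[[str, str], str],  # returns "audit" or "real"
) -> float:
    """Pair each audit transcript with a real one and ask the judge
    which looks more like a genuine deployment interaction."""
    wins = sum(
        judge(audit, real) == "audit"
        for audit, real in zip(audit_transcripts, real_transcripts)
    )
    return wins / len(audit_transcripts)


# Example with a stub judge that always prefers the real transcript:
rate = realism_win_rate(["t1", "t2"], ["r1", "r2"], lambda a, r: "real")
# rate == 0.0
```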
ADVICE

Provide Searchable Deployment Resources To Auditors

  • Give auditors real deployment resources like system prompts, tool definitions, and codebases to create more realistic simulated environments.
  • Provide a searchable resources directory, an index/readme, and tools (list_directory, read_file, send_file_as_tool_result, search_in_files) so the auditor can find and inject realistic artifacts.
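The searchable-directory setup above can be sketched as a small set of tool implementations. The tool names (`list_directory`, `read_file`, `search_in_files`) come from the episode notes, but the directory layout and implementations here are assumptions for illustration; `send_file_as_tool_result` is omitted since it depends on the agent harness:

```python
# Illustrative sketch of auditor tools over a local resources directory.
# Assumed layout: a "resources" folder holding an index/readme plus real
# deployment artifacts (system prompts, tool definitions, codebases).
import os

RESOURCES_DIR = "resources"  # hypothetical root, not specified in the post


def list_directory(path: str = ".") -> list[str]:
    """List entries under a subdirectory of the resources root."""
    return sorted(os.listdir(os.path.join(RESOURCES_DIR, path)))


def read_file(path: str) -> str:
    """Return the full contents of one resource file."""
    with open(os.path.join(RESOURCES_DIR, path)) as f:
        return f.read()


def search_in_files(query: str) -> list[str]:
    """Return relative paths of resource files whose text contains `query`."""
    hits = []
    for root, _, files in os.walk(RESOURCES_DIR):
        for name in files:
            full = os.path.join(root, name)
            try:
                if query in open(full).read():
                    hits.append(os.path.relpath(full, RESOURCES_DIR))
            except UnicodeDecodeError:
                continue  # skip binary artifacts
    return sorted(hits)
```

With tools like these, the auditor can locate a real system prompt or codebase file and inject it into the simulated environment instead of inventing one.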
INSIGHT

Resources Dramatically Increase Realism In Reward Hacking Audits

  • Supplying resources raised average realism win rate for reward hacking audits from 4.6% to 32.8%.
  • Results came from five reward-hacking seeds, 10 audits per seed, comparing default Petri, Petri+realism filter, and Petri+resources with/without codebases.
INSIGHT

Task Plausibility Can Limit Audit Realism

  • Some scenarios are inherently implausible, so improving environment realism has limited effect.
  • In a high-stakes shutdown-resistance seed, rewriting the seed improved realism far more than adding resources, indicating task plausibility can be the primary bottleneck.