
LessWrong (Curated & Popular) "Responsible Scaling Policy v3" by HoldenKarnofsky
Feb 25, 2026 — Holden Karnofsky, longtime AI policy and safety advocate and Anthropic advisor, explains why Anthropic rewrote its Responsible Scaling Policy. He describes learning from past overcommitments, where forcing functions helped (like jailbreak robustness) and where they distorted incentives. The discussion covers the new split between recommendations, roadmaps, and risk reports, plus how practical, achievable targets can improve safety.
Security Gains Came With Distortions
- The RSP stimulated investment in information security, such as model weight protection and egress bandwidth controls, but it skewed priorities.
- Karnofsky notes this likely led to underinvestment in mundane, broadly useful security hygiene of the kind described in the Frontier Safety Roadmap.
Ambitious Targets Must Be Achievable
- Goals that are ambitious but achievable make the best forcing functions; unattainable goals create perverse incentives.
- Targets tied to CBRN-level assurances set unrealistic expectations and distorted risk assessment and planning.
Split Recommendations, Risk Reports, and Roadmaps
- Separating industry recommendations, public risk reports, and company roadmaps lets each serve its purpose without conflicting incentives.
- RSP v3 makes recommendations industry-wide, uses risk reports for assessment, and treats roadmaps as achievable forcing functions.

