The Don’t Worry About the Vase Podcast is a listener-supported podcast. To receive new posts and support the cost of creation, consider becoming a free or paid subscriber.
* 00:00:00 - Introduction
* 00:03:50 - A Three Act Play
* 00:04:53 - Safety Not Guaranteed
* 00:10:53 - Pliny Can Still Jailbreak Everything
* 00:13:01 - Transparency Is Good: The 212-Page System Card
* 00:14:07 - Mostly Harmless
* 00:18:47 - Mostly Honest
* 00:21:10 - Agentic Safety
* 00:23:29 - Prompt Injection
* 00:27:38 - Key Alignment Findings
* 00:38:10 - Behavioral Evidence (6.2)
* 00:43:33 - Reward Hacking and ‘Overly Agentic Actions’
* 00:45:53 - Metrics (6.2.5.2)
* 00:48:01 - All I Did It All For The GUI
* 00:49:14 - Case Studies and Targeted Evaluations Of Behaviors (6.3)
* 00:49:36 - Misrepresenting Tool Results
* 00:50:29 - Unexpected Language Switching
* 00:52:34 - The Ghost of Jones Foods
* 00:54:15 - Loss of Style Points
* 00:55:07 - White Box Model Diffing
* 00:55:26 - Model Welfare
https://open.substack.com/pub/thezvi/p/claude-opus-46-system-card-part-1?utm_campaign=post-expanded-share&utm_medium=web
Get full access to DWAtV Podcast at dwatvpodcast.substack.com/subscribe