
Introduction to Mechanistic Interpretability
BlueDot Narrated
Detecting unaligned and deceptive behaviour
Perrin Walker outlines risks such as deception, sycophancy, and reward hacking, and explains how interpretability could reveal them.
Transcript


