LessWrong (30+ Karma)

“Contrastive features elicit different perturbation responses than SAE features” by Francisco Ferreira da Silva, StefanHex

Mar 21, 2026
Researchers compare contrastive feature directions to SAE and random directions and report striking differences in model responses. They describe a perturbation-based method for identifying concept directions in activation space. The discussion covers experimental setups, robustness checks, and implications for interpretability and AI safety.
INSIGHT

Contrastive Vectors Reveal Feature-Like Directions

  • Concepts in LLMs can act like directions in activation space that elicit targeted downstream effects when perturbed.
  • Contrastive difference-of-means vectors (English→Mandarin, Python→Haskell, male→female) broke suppression at much smaller angles than SAE or random directions across Gemma, Llama, and Qwen.
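The difference-of-means construction mentioned above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the norm-preserving steering choice, and the toy shapes are all assumptions for illustration.

```python
import numpy as np

def contrastive_direction(acts_a, acts_b):
    """Difference-of-means direction between two activation sets.

    acts_a, acts_b: arrays of shape (n_samples, d_model), e.g. residual-stream
    activations collected on English vs. Mandarin prompts (hypothetical setup).
    Returns a unit vector pointing from the mean of A toward the mean of B.
    """
    direction = acts_b.mean(axis=0) - acts_a.mean(axis=0)
    return direction / np.linalg.norm(direction)

def perturb(activation, direction, alpha):
    """Nudge a single activation along `direction` by magnitude alpha,
    then rescale to the original norm (one common convention; the post's
    exact perturbation scheme may differ)."""
    steered = activation + alpha * direction
    return steered * (np.linalg.norm(activation) / np.linalg.norm(steered))
```

In an experiment like the one described, one would sweep `alpha` (or equivalently the angle between the original and perturbed activation) and watch for the point where downstream behavior changes.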
INSIGHT

SAE Features Look Like Noise To The Model

  • SAE directions behave like random directions and do not produce special downstream responses at small perturbation magnitudes.
  • This supports prior work (Lee and Heimersheim 2024) suggesting SAE features reflect dataset artifacts more than model computations.
INSIGHT

Activation Plateaus Suggest Error Correction

  • Activation plateaus show that small perturbations are suppressed, suggesting models implement error-correction or robustness to interference under superposition.
  • Prior work (Heimersheim & Mendel 2024; Schinkel & Heimersheim 2025) observed the same downstream suppression for small perturbations.
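The plateau measurement described above amounts to sweeping perturbation magnitude and checking how much the model's downstream state actually moves. A minimal sketch, with a stand-in `downstream_fn` in place of a real forward pass (all names here are hypothetical):

```python
import numpy as np

def divergence_curve(activation, direction, downstream_fn, alphas):
    """Sweep perturbation magnitude and measure downstream change.

    downstream_fn: maps a (d_model,) activation to some downstream state,
    e.g. a later layer's activations in a real model; any callable works here.
    Returns the L2 distance of each perturbed run from the unperturbed one.
    A plateau (near-zero distance at small alphas) would be the suppression /
    error-correction signature discussed in the episode.
    """
    base = downstream_fn(activation)
    return np.array([
        np.linalg.norm(downstream_fn(activation + a * direction) - base)
        for a in alphas
    ])
```

With an identity `downstream_fn` the curve grows linearly in alpha; the interesting finding is that real models instead show a flat region for small perturbations, which contrastive directions escape earlier than SAE or random ones.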