LW - SAE feature geometry is outside the superposition hypothesis by jake mendel

Jun 24, 2024

Jake Mendel, writing on LessWrong, argues that SAE feature geometry lies outside the superposition hypothesis: the specific locations of feature vectors carry rich structure that superposition alone does not describe. Understanding this geometry could mean supplementing existing theories or developing a new one. The episode explores the limitations of superposition-based interpretations and what an alternative account of neural network activation spaces might look like.

Exploring the Importance of Feature Geometry and Activation Spaces in Neural Networks

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: SAE feature geometry is outside the superposition hypothesis, published by jake mendel on June 24, 2024 on LessWrong.
Summary: Superposition-based interpretations of neural network activation spaces are incomplete. The specific locations of feature vectors contain crucial structural information beyond superposition, as seen in circular arrangements of day-of-the-week features and in the rich structures of feature UMAPs. We don't currently have good concepts for talking about this structure in feature geometry, but it is likely very important for model computation.
An eventual understanding of feature geometry might look like a hodgepodge of case-specific explanations, or supplementing superposition with additional concepts, or plausibly an entirely new theory that supersedes superposition. To develop this understanding, it may be valuable to study toy models in depth and do theoretical or conceptual work in addition to studying frontier models.
Epistemic status: Decently confident that the ideas here are directionally correct. I've been thinking these thoughts for a while, and recently got round to writing them up at a high level. Lots of people (including both SAE stans and SAE skeptics) have thought very similar things before and some of them have written about it in various places too.
Some of my views, especially the merit of certain research approaches to tackle the problems I highlight, have been presented here without my best attempt to argue for them.

What would it mean if we could fully understand an activation space through the lens of superposition?

If you fully understand something, you can explain everything about it that matters to someone else in terms of concepts you (and hopefully they) understand.
So we can measure how well I understand an activation space by how well I can communicate to you what the activation space is doing. We can test whether my explanation is good by seeing whether you can construct a functionally equivalent activation space (which need not be completely identical, of course) solely from the information I have given you. In the case of SAEs, here's what I might say:
1. The activation space contains this list of 100 million features, which I can describe concisely in words because they are monosemantic.
2. The features are embedded as vectors, and the activation vector on any input is a linear combination of the feature vectors that are related to the input.
3. As for where in the activation space each feature vector is placed, oh that doesn't really matter and any nearly orthogonal overcomplete basis will do. Or maybe if I'm being more sophisticated, I can specify the correlations between features and that's enough to pin down all the structure that matters - all the other details of the overcomplete basis are random.
Every part of this explanation is in terms of things I understand precisely. My features are described in natural language, and I know what a random overcomplete basis is (although I'm on the fence about whether a large correlation matrix counts as something that I understand).
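To make the three-part description above concrete, here is a minimal sketch of the activation space it would let you rebuild. It is an illustration under stated assumptions, not anything from the original post: the function names are hypothetical, the sizes (200 features in 64 dimensions) are toy stand-ins for the 100 million features mentioned above, and "any nearly orthogonal overcomplete basis" is modeled as random unit vectors.

```python
import math
import random


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def random_unit_vector(dim, rng):
    """Sample a direction uniformly at random by normalizing a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def make_overcomplete_basis(n_features, dim, seed=0):
    """Assumption: feature placement is random, as in point 3 of the description."""
    rng = random.Random(seed)
    return [random_unit_vector(dim, rng) for _ in range(n_features)]


def activation(basis, active):
    """Point 2: the activation is a linear combination of the active feature vectors.

    `active` maps feature index -> coefficient for the features present on an input.
    """
    dim = len(basis[0])
    out = [0.0] * dim
    for idx, coeff in active.items():
        for j in range(dim):
            out[j] += coeff * basis[idx][j]
    return out


def max_abs_cosine(basis):
    """Largest |cosine similarity| between distinct features (interference bound)."""
    best = 0.0
    for i in range(len(basis)):
        for j in range(i + 1, len(basis)):
            best = max(best, abs(dot(basis[i], basis[j])))
    return best


# Toy sizes: more features than dimensions, so the basis is overcomplete.
basis = make_overcomplete_basis(n_features=200, dim=64, seed=0)

# An input on which only two (hypothetical) features are active.
x = activation(basis, {3: 1.5, 17: 0.8})

# Projecting onto a feature direction recovers its coefficient up to
# interference from the other active feature.
readout = dot(x, basis[3])
interference_bound = 0.8 * max_abs_cosine(basis)
```

In this picture, changing the seed gives a different but "functionally equivalent" space: nothing in the description cares which particular directions the features received. The rest of the post argues that this is exactly the information that matters.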

The placement of each feature vector in the activation space matters

Why might this description be insufficient? First, there is the pesky problem of SAE reconstruction errors, which are parts of activation vectors that are missed when we give this description.
Second, not all features seem monosemantic, and even for the most monosemantic features it is hard to find semantic descriptions that have both high sensitivity and high specificity, let alone descriptions that allow us to predict the quantitative values that activating features take on a particular input.
But let's suppose that these issues have been solved: SAE improvements lead to perfect reconstruction and extremely monosemantic features, and new autointerp techniques lea...