LW - Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders by Evan Anders

Feb 27, 2024

Ask episode

Chapters

Transcript

Episode notes

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders, published by Evan Anders on February 27, 2024 on LessWrong.
Note: The second figure in this post originally contained a bug pointed out by @LawrenceC, which has since been fixed.
Summary
Sparse Autoencoders (SAEs) reveal interpretable features in the activation spaces of language models, but SAEs don't reconstruct activations perfectly. We lack good metrics for evaluating which parts of model activations SAEs fail to reconstruct, which makes it hard to evaluate SAEs themselves. In this post, we argue that SAE reconstructions should be tested using well-established benchmarks to help determine what kinds of tasks they degrade model performance on.
We stress-test a recently released set of SAEs for each layer of the gpt2-small residual stream using randomly sampled tokens from Open WebText and the Lambada benchmark where the model must predict a specific next token.
The SAEs perform well on prompts with context sizes up to the training context size, but their performance degrades on longer prompts.
In contexts shorter than or equal to the training context, the SAEs that we study generally perform well. We find that the performance of our late-layer SAEs is worse than early-layer SAEs, but since the SAEs all have the same width, this may just be because there are more features to resolve in later layers and our SAEs don't resolve them.
In contexts longer than the training context, SAE performance is poor in general, but it is poorest in earlier layers and best in later layers.
Introduction
Last year,
Anthropic and
EleutherAI/Lee Sharkey's MATS stream showed that sparse autoencoders (SAEs) can decompose language model activations into human-interpretable features. This has led to a significant uptick in the number of people training SAEs and analyzing models with them. However, SAEs are not perfect autoencoders and we still lack a thorough understanding of where and how they miss information. But how do we know if an SAE is "good" other than the fact that it has features we can understand?
SAEs try to reconstruct activations in language models - but they don't do this perfectly. Imperfect activation reconstruction can lead to substantial downstream cross-entropy (CE) loss increases. Generally "good" SAEs retrieve 80-99% of the CE loss (compared to a generous baseline of zero ablation), but only retrieving 80% of the CE loss is enough to substantially degrade the performance of a model to that of a much smaller model (per
scaling laws).
The second basic metric often used in SAE evaluation is the average per-token ℓ0 norm of the hidden layer of the autoencoder. Generally this is something in the range of ~10-60 in a "good" autoencoder, which means that the encoder is sparse. Since we don't know how many features are active per token in natural language, it's useful to at least ask how changes in ℓ0 relate to changes in SAE loss values.
If high-loss data have drastically different ℓ0 from the SAE's average performance during training, that can be evidence of either off-distribution data (compared to the training data) or some kind of data with more complex information.
The imperfect performance of SAEs on these metrics could be explained in a couple of ways:
The fundamental assumptions of SAEs are mostly right, but we're bad at training SAEs. Perhaps if we learn to train better SAEs, these problems will become less bad.
Perhaps we need to accept higher ℓ0 norms (more features active per token). This would not be ideal for interpretability, though.
Perhaps there's part of the signal which is dense or hard for an SAE to learn and so we are systematically missing some kind of information. Maybe a more sophisticated sparsity enforcement could help with this.
The fundamental assumption...