Turn-averaged SAEs reconstruct average activations over conversation turns to represent high-level turn characteristics with a fixed number of features, simplifying long-context interpretability compared to per-token SAEs.
Language Model Circuits Are Sparse in the Neuron Basis
1 Pith paper cite this work. Polarity classification is still indexing.
abstract
The high-level concepts that a neural network uses to perform computation need not be aligned to individual neurons (Smolensky, 1986). Language model interpretability research has thus turned to techniques which decompose the neuron basis into more interpretable units of model computation, such as sparse autoencoders (SAEs). However, not all neuron-based representations are uninterpretable. For the first time, we empirically show that MLP neurons are as sparse a feature basis as SAEs. We use this finding to develop an end-to-end gradient-based attribution pipeline for circuit tracing on the MLP neuron basis, which surfaces causally effective neurons on a variety of tasks. On a standard subject-verb agreement benchmark (Marks et al., 2025), a circuit of $\approx 10^2$ MLP neurons is enough to control model behaviour. On the multi-hop city-state-capital task from (Lindsey et al., 2025), we find a circuit in which small sets of neurons encode specific latent reasoning steps (e.g. mapping a city to its state), and can be steered to change the model's output. This work thus advances automated interpretability of language models without imposing additional training costs.
fields
cs.CL 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Turn-Averaged SAEs for Feature Discovery and Long-Context Attribution
Turn-averaged SAEs reconstruct average activations over conversation turns to represent high-level turn characteristics with a fixed number of features, simplifying long-context interpretability compared to per-token SAEs.