Language Model Circuits Are Sparse in the Neuron Basis

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

The high-level concepts that a neural network uses to perform computation need not be aligned to individual neurons (Smolensky, 1986). Language model interpretability research has thus turned to techniques that decompose the neuron basis into more interpretable units of model computation, such as sparse autoencoders (SAEs). However, not all neuron-based representations are uninterpretable. For the first time, we empirically show that MLP neurons are as sparse a feature basis as SAEs. We use this finding to develop an end-to-end gradient-based attribution pipeline for circuit tracing on the MLP neuron basis, which surfaces causally effective neurons on a variety of tasks. On a standard subject-verb agreement benchmark (Marks et al., 2025), a circuit of $\approx 10^2$ MLP neurons is enough to control model behaviour. On the multi-hop city-state-capital task from Lindsey et al. (2025), we find a circuit in which small sets of neurons encode specific latent reasoning steps (e.g. mapping a city to its state) and can be steered to change the model's output. This work thus advances automated interpretability of language models without imposing additional training costs.

representative citing papers

Fast & Faithful Function Vectors

cs.CL · 2026-06-03 · unverdicted · novelty 4.0

LRP-based attention head selection and distributed application improve the efficiency and accuracy of function vectors for steering LLMs compared to prior choices.

citing papers explorer

  • Turn-Averaged SAEs for Feature Discovery and Long-Context Attribution
    cs.CL · 2026-06-26 · unverdicted · none · ref 3 · internal anchor

    Turn-averaged SAEs reconstruct average activations over conversation turns to represent high-level turn characteristics with a fixed number of features, simplifying long-context interpretability compared to per-token SAEs.

  • Fast & Faithful Function Vectors
    cs.CL · 2026-06-03 · unverdicted · none · ref 39 · internal anchor

    LRP-based attention head selection and distributed application improve the efficiency and accuracy of function vectors for steering LLMs compared to prior choices.