Recognition: 1 theorem link (Lean)
The Linear Centroids Hypothesis: Features as Directions Learned by Local Experts
Pith reviewed 2026-05-11 01:09 UTC · model grok-4.3
The pith
Features in deep networks are linear directions among the centroids of local affine experts, not directions in raw activation space.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Linear Centroids Hypothesis identifies features with linear directions among a network's centroid spaces, where each vector is a centroid, i.e., a summary of a local affine expert, that characterizes the learned input-output map exactly (for piecewise-affine networks) or approximately (for smooth networks such as transformers). Replacing intermediate activations with these centroids yields a functional drop-in alternative for standard interpretability tools: empirically, it produces sparser and more downstream-useful feature dictionaries on DINO ViTs, suppresses spurious directions on controlled tasks, recovers interpretable circuits in GPT2-Large, and generates faithful gradient-based saliency maps.
What carries the argument
Centroid spaces computed from local affine experts, which summarize the network's input-output behavior and serve as the new representation in which linear feature directions are sought.
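For a piecewise-affine network, the local affine expert is exact and can be read directly off the activation pattern. A minimal sketch of this construction, assuming (for illustration only) that a centroid is the row-sum of the local input-output Jacobian; the paper's precise centroid definition may differ:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny 2-layer ReLU network: f(x) = W2 @ relu(W1 @ x + b1) + b2.
W1, b1 = rng.standard_normal((8, 4)), rng.standard_normal(8)
W2, b2 = rng.standard_normal((3, 8)), rng.standard_normal(3)

def forward(x):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

def local_affine_expert(x):
    """Exact affine map f(z) = A z + c valid on x's linear region."""
    mask = (W1 @ x + b1 > 0).astype(float)   # activation pattern picks the region
    A = W2 @ (mask[:, None] * W1)            # input-output Jacobian on this region
    c = W2 @ (mask * b1) + b2                # affine offset of the expert
    return A, c

x = rng.standard_normal(4)
A, c = local_affine_expert(x)

# Inside its linear region, the affine expert reproduces the network exactly.
assert np.allclose(A @ x + c, forward(x))

# Illustrative centroid: row-sum of the Jacobian, one scalar per output unit.
centroid = A.sum(axis=1)
print(centroid.shape)  # (3,)
```

The assertion holds because masking the pre-activations reproduces ReLU exactly inside a linear region, so the expert and the network agree there; for smooth networks the same construction is only a local approximation.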
If this is right
- Feature dictionaries extracted from centroid spaces are sparser and more useful for downstream tasks than those from raw activations.
- Spurious or non-causal directions are suppressed when analysis is performed in centroid space on controlled tasks.
- Mechanistic circuits become recoverable in large language models when activations are replaced by centroids.
- Gradient-based saliency maps align more closely with model decisions when computed through centroid representations.
Where Pith is reading between the lines
- The same centroid construction could be applied during training to encourage models whose internal summaries already align with human-interpretable directions.
- If centroids prove stable across fine-tuning, they may offer a route to transfer interpretability findings between models without retraining probes from scratch.
- Extending the construction to recurrent or state-space models would test whether the local-expert view generalizes beyond feed-forward and transformer architectures.
Load-bearing premise
Centroids computed from local affine experts can replace raw activations while preserving or improving the results of standard interpretability methods.
What would settle it
A side-by-side comparison in which the same interpretability pipeline applied to centroid spaces produces measurably less sparse dictionaries, more spurious directions, or less faithful saliency maps than the identical pipeline applied to raw activations across the tested model families.
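The settling experiment is a harness more than a result: run one fixed pipeline over both representations and compare. A hedged sketch, with random placeholders standing in for real activations, centroids, and the learned dictionary (the thresholded-projection L0 probe is our simplification, not the paper's dictionary-learning method):

```python
import numpy as np

def sparsity_of_codes(X, D, thresh=0.1):
    """Mean fraction of active dictionary codes: a crude L0 sparsity probe."""
    codes = X @ D.T                          # project samples onto dictionary atoms
    return float((np.abs(codes) > thresh).mean())

rng = np.random.default_rng(1)
D = rng.standard_normal((16, 8))             # placeholder dictionary of 16 atoms
D /= np.linalg.norm(D, axis=1, keepdims=True)

activations = rng.standard_normal((100, 8))  # stand-in for raw activations
centroids = rng.standard_normal((100, 8))    # stand-in for centroid vectors

# Identical pipeline, two representations: the comparison the review calls for.
s_act = sparsity_of_codes(activations, D)
s_cen = sparsity_of_codes(centroids, D)
print(s_act, s_cen)
```

On real data, the claim would be settled by whether `s_cen` is consistently lower (sparser) than `s_act` under the same dictionary-learning procedure, with matching checks for spurious directions and saliency faithfulness.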
read the original abstract
The Linear Representation Hypothesis (LRH) identifies features of a trained deep network (DN) as linear directions in the activation spaces, i.e., output spaces of intermediate layers. This characterization decouples the input-output maps learned by a DN from the organization of feature directions in its activation spaces. We introduce the Linear Centroids Hypothesis (LCH), which instead identifies features with linear directions among a DN's centroid spaces -- where any vector denotes a centroid or summary of a local affine expert characterizing the learned input-output maps of the DN exactly (e.g., for piecewise-affine DNs) or approximately (e.g., for smooth DNs like transformers). We show that replacing intermediate activations with centroids yields a functional drop-in alternative for standard interpretability tools. Empirically, this change yields sparser, more downstream-useful feature dictionaries on DINO ViTs, suppresses spurious directions on a controlled task, recovers interpretable circuits in GPT2-Large, and produces faithful gradient-based saliency maps. LCH unifies dictionaries, probing, circuits, and saliency maps into a single geometric object grounded in the network's input-output map -- making interpretability mechanistic by construction rather than post hoc. Code to study the LCH https://github.com/ThomasWalker1/LinearCentroidsHypothesis .
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Linear Centroids Hypothesis (LCH), which identifies features in deep networks as linear directions in centroid spaces derived from local affine experts. These experts are presented as exact summaries of the input-output map for piecewise-affine networks and approximate ones for smooth networks such as transformers. The authors claim that replacing raw activations with these centroids serves as a drop-in replacement that improves feature dictionaries on DINO ViTs, suppresses spurious directions on a controlled task, recovers circuits in GPT2-Large, and produces more faithful saliency maps, thereby unifying dictionaries, probing, circuits, and saliency under a single geometric object grounded in the network's input-output map.
Significance. If the central claims hold, LCH would offer a mechanistic unification of interpretability methods by tying them directly to local characterizations of the network function rather than post-hoc analysis of activations. The open availability of code to study the hypothesis is a clear strength supporting reproducibility.
major comments (2)
- [Sections discussing LCH for smooth networks and empirical results on transformers] The claim that LCH renders interpretability 'mechanistic by construction' (abstract and introduction) rests on centroids faithfully summarizing the true input-output map. For smooth networks (DINO ViTs, GPT2-Large), the local affine experts are only approximate; the manuscript provides no quantitative bounds, error measurements, or sensitivity analysis showing that approximation deviations do not alter the recovered feature directions in the regions used for circuit recovery or saliency maps.
- [Empirical evaluation sections] The empirical claims of improvement (abstract) are load-bearing for the unification argument, yet the manuscript lacks detailed quantitative tables, ablation controls, or statistical tests comparing centroid-based versus activation-based methods on the controlled task, DINO ViTs, and GPT2-Large. Without these, it is not possible to verify that the reported gains survive standard controls or that centroids preserve utility as a drop-in replacement.
minor comments (2)
- [Notation and definitions] Clarify the precise mathematical definition of 'centroid spaces' and how centroids are extracted from local affine experts, including any hyperparameters in the approximation procedure for smooth networks.
- [Related work] Add explicit comparisons to related geometric or local-linear interpretability approaches in the related-work section to better situate the novelty of LCH.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive report. We address each major comment below and describe the revisions we will incorporate to strengthen the manuscript.
read point-by-point responses
- Referee: [Sections discussing LCH for smooth networks and empirical results on transformers] The claim that LCH renders interpretability 'mechanistic by construction' (abstract and introduction) rests on centroids faithfully summarizing the true input-output map. For smooth networks (DINO ViTs, GPT2-Large), the local affine experts are only approximate; the manuscript provides no quantitative bounds, error measurements, or sensitivity analysis showing that approximation deviations do not alter the recovered feature directions in the regions used for circuit recovery or saliency maps.
Authors: We agree that the absence of explicit quantitative bounds on the approximation error for smooth networks leaves the mechanistic claim partially reliant on empirical outcomes. The manuscript already notes that local affine experts are approximate for transformers and demonstrates through experiments on DINO ViTs and GPT2-Large that centroid-based features produce sparser dictionaries, recoverable circuits, and faithful saliency maps. To directly address the concern, we will add a dedicated subsection containing approximation-error measurements (e.g., local linearization residuals) and sensitivity analysis on the regions used for circuit and saliency experiments. revision: yes
- Referee: [Empirical evaluation sections] The empirical claims of improvement (abstract) are load-bearing for the unification argument, yet the manuscript lacks detailed quantitative tables, ablation controls, or statistical tests comparing centroid-based versus activation-based methods on the controlled task, DINO ViTs, and GPT2-Large. Without these, it is not possible to verify that the reported gains survive standard controls or that centroids preserve utility as a drop-in replacement.
Authors: We acknowledge that the current empirical presentation would be strengthened by more granular quantitative reporting. While the manuscript already reports improvements in sparsity, circuit interpretability, and saliency faithfulness, we will expand the evaluation sections with additional tables that include ablation controls (e.g., varying expert locality), full numerical metrics for all tasks, and statistical significance tests comparing centroid versus raw-activation baselines. revision: yes
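The approximation-error measurement promised in the first response can be sketched as a local linearization residual, here on a toy tanh network standing in for a transformer block (the residual definition is our assumption; the authors' proposed measurement may differ):

```python
import numpy as np

rng = np.random.default_rng(2)

# Tiny smooth (tanh) network: smooth nonlinearity means the affine expert
# is only a local approximation, unlike the exact ReLU case.
W1 = rng.standard_normal((16, 6))
W2 = rng.standard_normal((4, 16))

def f(x):
    return W2 @ np.tanh(W1 @ x)

def jacobian(x):
    s = 1.0 - np.tanh(W1 @ x) ** 2           # elementwise derivative of tanh
    return W2 @ (s[:, None] * W1)

x = rng.standard_normal(6)
d = rng.standard_normal(6)
d /= np.linalg.norm(d)                        # unit perturbation direction
J = jacobian(x)

# Linearization residual at shrinking radii around x.
resids = []
for eps in (1e-1, 1e-2, 1e-3):
    r = np.linalg.norm(f(x + eps * d) - (f(x) + J @ (eps * d)))
    resids.append(r)
    print(f"eps={eps:g}  residual={r:.2e}")
```

For a smooth map the residual shrinks roughly quadratically in the radius, which is the sense in which a local affine expert is faithful near its anchor point; reporting such residuals over the regions used for circuits and saliency would directly address the referee's concern.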
Circularity Check
No circularity: LCH is a conceptual hypothesis with empirical validation, not a tautological derivation
full rationale
The paper introduces the Linear Centroids Hypothesis as a geometric reframing of features via centroids of local affine experts that summarize the input-output map (exactly for piecewise-affine networks, approximately for smooth ones). This is not derived from or reduced to the interpretability metrics being evaluated; centroids are computed independently as functional summaries. No equations show a fitted parameter being renamed as a prediction, no self-citation chains justify the core premise, and no ansatz is smuggled in. The unification claim and empirical results (sparser dictionaries, recovered circuits, faithful saliency) rest on direct substitution experiments rather than self-referential definitions. The approximation caveat for transformers is stated explicitly and does not create a circular reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- [domain assumption] Deep networks admit a local affine expert characterization of their input-output maps (exact for piecewise-affine networks, approximate for smooth networks).
invented entities (2)
- Centroid spaces: no independent evidence
- Local affine experts: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean; IndisputableMonolith/Foundation/AlexanderDuality.lean (reality_from_one_distinction; Jcost uniqueness; match unclear). Matched paper text: "centroids ... equal to the row-sum of its input-output Jacobian ... power diagram subdivision ... features ... linear directions of the corresponding centroids"
Forward citations
Cited by 1 Pith paper
- Gradient-Direction Sensitivity Reveals Linear-Centroid Coupling Hidden by Optimizer Trajectories
Gradient-based SVD diagnostic uncovers hidden SED-LCH coupling in single and multitask settings and shows rank-3 subspace constraints speed up grokking by 2.3x.