Recognition: 1 theorem link (Lean)
The Linear Centroids Hypothesis: Features as Directions Learned by Local Experts
Pith reviewed 2026-05-11 01:09 UTC · model grok-4.3
The pith
Features in deep networks are linear directions among the centroids of local affine experts, not directions in raw activation space.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Linear Centroids Hypothesis identifies features with linear directions among a network's centroid spaces, where each vector is a centroid, i.e., a summary of a local affine expert, that characterizes the learned input-output map exactly (for piecewise-affine networks) or approximately (for smooth networks such as transformers). Replacing intermediate activations with these centroids yields a functional drop-in alternative for standard interpretability tools: empirically, it produces sparser and more downstream-useful feature dictionaries on DINO ViTs, suppresses spurious directions on controlled tasks, recovers interpretable circuits in GPT2-Large, and generates faithful gradient-based saliency maps.
What carries the argument
Centroid spaces computed from local affine experts, which summarize the network's input-output behavior and serve as the new representation in which linear feature directions are sought.
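For a piecewise-affine network, the local affine expert is exact and can be read directly off the activation pattern. A minimal sketch of this construction, assuming (for illustration only) that a centroid is the row-sum of the local input-output Jacobian; the paper's precise centroid definition may differ:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny 2-layer ReLU network: f(x) = W2 @ relu(W1 @ x + b1) + b2.
W1, b1 = rng.standard_normal((8, 4)), rng.standard_normal(8)
W2, b2 = rng.standard_normal((3, 8)), rng.standard_normal(3)

def forward(x):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

def local_affine_expert(x):
    """Exact affine map f(z) = A z + c valid on x's linear region."""
    mask = (W1 @ x + b1 > 0).astype(float)   # activation pattern picks the region
    A = W2 @ (mask[:, None] * W1)            # input-output Jacobian on this region
    c = W2 @ (mask * b1) + b2                # affine offset of the expert
    return A, c

x = rng.standard_normal(4)
A, c = local_affine_expert(x)

# Inside its linear region, the affine expert reproduces the network exactly.
assert np.allclose(A @ x + c, forward(x))

# Illustrative centroid: row-sum of the Jacobian, one scalar per output unit.
centroid = A.sum(axis=1)
print(centroid.shape)  # (3,)
```

The assertion holds because masking the pre-activations reproduces ReLU exactly inside a linear region, so the expert and the network agree there; for smooth networks the same construction is only a local approximation.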
If this is right
- Feature dictionaries extracted from centroid spaces are sparser and more useful for downstream tasks than those from raw activations.
- Spurious or non-causal directions are suppressed when analysis is performed in centroid space on controlled tasks.
- Mechanistic circuits become recoverable in large language models when activations are replaced by centroids.
- Gradient-based saliency maps align more closely with model decisions when computed through centroid representations.
Where Pith is reading between the lines
- The same centroid construction could be applied during training to encourage models whose internal summaries already align with human-interpretable directions.
- If centroids prove stable across fine-tuning, they may offer a route to transfer interpretability findings between models without retraining probes from scratch.
- Extending the construction to recurrent or state-space models would test whether the local-expert view generalizes beyond feed-forward and transformer architectures.
Load-bearing premise
Centroids computed from local affine experts can replace raw activations while preserving or improving the results of standard interpretability methods.
What would settle it
A side-by-side comparison in which the same interpretability pipeline applied to centroid spaces produces measurably less sparse dictionaries, more spurious directions, or less faithful saliency maps than the identical pipeline applied to raw activations across the tested model families.
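The settling experiment is a harness more than a result: run one fixed pipeline over both representations and compare. A hedged sketch, with random placeholders standing in for real activations, centroids, and the learned dictionary (the thresholded-projection L0 probe is our simplification, not the paper's dictionary-learning method):

```python
import numpy as np

def sparsity_of_codes(X, D, thresh=0.1):
    """Mean fraction of active dictionary codes: a crude L0 sparsity probe."""
    codes = X @ D.T                          # project samples onto dictionary atoms
    return float((np.abs(codes) > thresh).mean())

rng = np.random.default_rng(1)
D = rng.standard_normal((16, 8))             # placeholder dictionary of 16 atoms
D /= np.linalg.norm(D, axis=1, keepdims=True)

activations = rng.standard_normal((100, 8))  # stand-in for raw activations
centroids = rng.standard_normal((100, 8))    # stand-in for centroid vectors

# Identical pipeline, two representations: the comparison the review calls for.
s_act = sparsity_of_codes(activations, D)
s_cen = sparsity_of_codes(centroids, D)
print(s_act, s_cen)
```

On real data, the claim would be settled by whether `s_cen` is consistently lower (sparser) than `s_act` under the same dictionary-learning procedure, with matching checks for spurious directions and saliency faithfulness.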
read the original abstract
The Linear Representation Hypothesis (LRH) identifies features of a trained deep network (DN) as linear directions in the activation spaces, i.e., output spaces of intermediate layers. This characterization decouples the input-output maps learned by a DN from the organization of feature directions in its activation spaces. We introduce the Linear Centroids Hypothesis (LCH), which instead identifies features with linear directions among a DN's centroid spaces -- where any vector denotes a centroid or summary of a local affine expert characterizing the learned input-output maps of the DN exactly (e.g., for piecewise-affine DNs) or approximately (e.g., for smooth DNs like transformers). We show that replacing intermediate activations with centroids yields a functional drop-in alternative for standard interpretability tools. Empirically, this change yields sparser, more downstream-useful feature dictionaries on DINO ViTs, suppresses spurious directions on a controlled task, recovers interpretable circuits in GPT2-Large, and produces faithful gradient-based saliency maps. LCH unifies dictionaries, probing, circuits, and saliency maps into a single geometric object grounded in the network's input-output map -- making interpretability mechanistic by construction rather than post hoc. Code to study the LCH https://github.com/ThomasWalker1/LinearCentroidsHypothesis .
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Linear Centroids Hypothesis (LCH), which identifies features in deep networks as linear directions in centroid spaces derived from local affine experts. These experts are presented as exact summaries of the input-output map for piecewise-affine networks and approximate ones for smooth networks such as transformers. The authors claim that replacing raw activations with these centroids serves as a drop-in replacement that improves feature dictionaries on DINO ViTs, suppresses spurious directions on a controlled task, recovers circuits in GPT2-Large, and produces more faithful saliency maps, thereby unifying dictionaries, probing, circuits, and saliency under a single geometric object grounded in the network's input-output map.
Significance. If the central claims hold, LCH would offer a mechanistic unification of interpretability methods by tying them directly to local characterizations of the network function rather than post-hoc analysis of activations. The open availability of code to study the hypothesis is a clear strength supporting reproducibility.
major comments (2)
- [Sections discussing LCH for smooth networks and empirical results on transformers] The claim that LCH renders interpretability 'mechanistic by construction' (abstract and introduction) rests on centroids faithfully summarizing the true input-output map. For smooth networks (DINO ViTs, GPT2-Large), the local affine experts are only approximate; the manuscript provides no quantitative bounds, error measurements, or sensitivity analysis showing that approximation deviations do not alter the recovered feature directions in the regions used for circuit recovery or saliency maps.
- [Empirical evaluation sections] The empirical claims of improvement (abstract) are load-bearing for the unification argument, yet the manuscript lacks detailed quantitative tables, ablation controls, or statistical tests comparing centroid-based versus activation-based methods on the controlled task, DINO ViTs, and GPT2-Large. Without these, it is not possible to verify that the reported gains survive standard controls or that centroids preserve utility as a drop-in replacement.
minor comments (2)
- [Notation and definitions] Clarify the precise mathematical definition of 'centroid spaces' and how centroids are extracted from local affine experts, including any hyperparameters in the approximation procedure for smooth networks.
- [Related work] Add explicit comparisons to related geometric or local-linear interpretability approaches in the related-work section to better situate the novelty of LCH.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive report. We address each major comment below and describe the revisions we will incorporate to strengthen the manuscript.
read point-by-point responses
- Referee: [Sections discussing LCH for smooth networks and empirical results on transformers] The claim that LCH renders interpretability 'mechanistic by construction' (abstract and introduction) rests on centroids faithfully summarizing the true input-output map. For smooth networks (DINO ViTs, GPT2-Large), the local affine experts are only approximate; the manuscript provides no quantitative bounds, error measurements, or sensitivity analysis showing that approximation deviations do not alter the recovered feature directions in the regions used for circuit recovery or saliency maps.
Authors: We agree that the absence of explicit quantitative bounds on the approximation error for smooth networks leaves the mechanistic claim partially reliant on empirical outcomes. The manuscript already notes that local affine experts are approximate for transformers and demonstrates through experiments on DINO ViTs and GPT2-Large that centroid-based features produce sparser dictionaries, recoverable circuits, and faithful saliency maps. To directly address the concern, we will add a dedicated subsection containing approximation-error measurements (e.g., local linearization residuals) and sensitivity analysis on the regions used for circuit and saliency experiments. revision: yes
- Referee: [Empirical evaluation sections] The empirical claims of improvement (abstract) are load-bearing for the unification argument, yet the manuscript lacks detailed quantitative tables, ablation controls, or statistical tests comparing centroid-based versus activation-based methods on the controlled task, DINO ViTs, and GPT2-Large. Without these, it is not possible to verify that the reported gains survive standard controls or that centroids preserve utility as a drop-in replacement.
Authors: We acknowledge that the current empirical presentation would be strengthened by more granular quantitative reporting. While the manuscript already reports improvements in sparsity, circuit interpretability, and saliency faithfulness, we will expand the evaluation sections with additional tables that include ablation controls (e.g., varying expert locality), full numerical metrics for all tasks, and statistical significance tests comparing centroid versus raw-activation baselines. revision: yes
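The approximation-error measurement promised in the first response can be sketched as a local linearization residual, here on a toy tanh network standing in for a transformer block (the residual definition is our assumption; the authors' proposed measurement may differ):

```python
import numpy as np

rng = np.random.default_rng(2)

# Tiny smooth (tanh) network: smooth nonlinearity means the affine expert
# is only a local approximation, unlike the exact ReLU case.
W1 = rng.standard_normal((16, 6))
W2 = rng.standard_normal((4, 16))

def f(x):
    return W2 @ np.tanh(W1 @ x)

def jacobian(x):
    s = 1.0 - np.tanh(W1 @ x) ** 2           # elementwise derivative of tanh
    return W2 @ (s[:, None] * W1)

x = rng.standard_normal(6)
d = rng.standard_normal(6)
d /= np.linalg.norm(d)                        # unit perturbation direction
J = jacobian(x)

# Linearization residual at shrinking radii around x.
resids = []
for eps in (1e-1, 1e-2, 1e-3):
    r = np.linalg.norm(f(x + eps * d) - (f(x) + J @ (eps * d)))
    resids.append(r)
    print(f"eps={eps:g}  residual={r:.2e}")
```

For a smooth map the residual shrinks roughly quadratically in the radius, which is the sense in which a local affine expert is faithful near its anchor point; reporting such residuals over the regions used for circuits and saliency would directly address the referee's concern.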
Circularity Check
No circularity: LCH is a conceptual hypothesis with empirical validation, not a tautological derivation
full rationale
The paper introduces the Linear Centroids Hypothesis as a geometric reframing of features via centroids of local affine experts that summarize the input-output map (exactly for piecewise-affine networks, approximately for smooth ones). This is not derived from or reduced to the interpretability metrics being evaluated; centroids are computed independently as functional summaries. No equations show a fitted parameter being renamed as a prediction, no self-citation chains justify the core premise, and no ansatz is smuggled in. The unification claim and empirical results (sparser dictionaries, recovered circuits, faithful saliency) rest on direct substitution experiments rather than self-referential definitions. The approximation caveat for transformers is stated explicitly and does not create a circular reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- [domain assumption] Deep networks admit a local affine expert characterization of their input-output maps (exact for piecewise-affine networks, approximate for smooth networks).
invented entities (2)
- Centroid spaces: no independent evidence
- Local affine experts: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean; IndisputableMonolith/Foundation/AlexanderDuality.lean (reality_from_one_distinction; Jcost uniqueness; match unclear). Matched paper text: "centroids ... equal to the row-sum of its input-output Jacobian ... power diagram subdivision ... features ... linear directions of the corresponding centroids"
Forward citations
Cited by 1 Pith paper
- Gradient-Direction Sensitivity Reveals Linear-Centroid Coupling Hidden by Optimizer Trajectories
Gradient-based SVD diagnostic uncovers hidden SED-LCH coupling in single and multitask settings and shows rank-3 subspace constraints speed up grokking by 2.3x.