pith. machine review for the scientific record.

arxiv: 2605.05223 · v1 · submitted 2026-04-18 · 💻 cs.LG · cs.AI

Recognition: unknown

Structural Instability of Feature Composition

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 06:36 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords sparse autoencoders · feature composition · compositional steering · geometric instability · ReLU ratchet effect · signal cone · Gaussian mean width · CLEVR features

The pith

Feature unions in sparse autoencoders collapse beyond a threshold set by the statistical dimension of the signal cone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a geometric model of feature composition to show when simultaneously activating multiple semantic latents stops working. It treats the activation space as a sparse cone and derives a collapse threshold governed by the Gaussian mean width of that cone under spherical dictionaries. ReLU turns tiny correlation fluctuations into a one-directional drift that builds up with each added feature, producing ratchet-like interference growth. Validation on CLEVR semantic features shows that real hierarchical correlations push the collapse earlier than random baselines would predict. The result sets a concrete limit on how many features can be steered together before interference overtakes the intended signals.
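The rectification bias at the heart of that claim is easy to check numerically: for zero-mean Gaussian interference X with noise scale s, a linear readout averages to zero, while ReLU retains a drift η = E[σ(X)] = s/√(2π) per composed feature, so the accumulated drift grows linearly with the composition count. A minimal numpy sketch, where the i.i.d. interference model and the noise scale are illustrative assumptions rather than the paper's full high-bias analysis:

    import numpy as np

    rng = np.random.default_rng(0)

    # Zero-mean Gaussian "interference", one independent draw per composed
    # feature. A linear readout averages to ~0; ReLU rectifies it upward.
    n_trials, s = 100_000, 0.05   # noise scale s is an illustrative choice
    for k in (2, 8, 32):          # number of simultaneously composed features
        x = rng.normal(0.0, s, size=(n_trials, k))
        linear = x.sum(axis=1).mean()                    # symmetric noise cancels
        ratchet = np.maximum(x, 0.0).sum(axis=1).mean()  # one-way drift survives
        # E[ReLU(N(0, s^2))] = s / sqrt(2*pi), so the drift is ~ k times that
        print(f"k={k:2d}  linear={linear:+.4f}  relu={ratchet:.4f}  "
              f"predicted={k * s / np.sqrt(2 * np.pi):.4f}")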

Core claim

Modeling the activation space as a high-dimensional sparse cone manifold, the work derives an asymptotic compositional-collapse threshold under a spherical dictionary model, characterized by the Gaussian mean width of the signal cone. In the high-bias regime, ReLU rectification converts microscopic correlation-induced variance fluctuations into a systematic drift that accumulates under composition, yielding interference growth consistent with a ratchet effect. Experiments on structured semantic features from CLEVR confirm that hierarchical correlations accelerate the transition to collapse relative to random baselines.
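The abstract does not reproduce the threshold formula, but the quantity said to govern it is standard convex geometry: the statistical dimension δ(K) = E‖Π_K(g)‖², the squared Gaussian mean width in the sense of Amelunxen, Lotz, McCoy, and Tropp. A sketch of estimating it by Monte Carlo, under the assumption that the signal cone is the nonnegative hull of the active dictionary atoms, so that the cone projection reduces to nonnegative least squares:

    import numpy as np
    from scipy.optimize import nnls

    def stat_dim_estimate(D, n_draws=200, seed=0):
        """Monte Carlo estimate of delta(K) = E||Pi_K(g)||^2 for the cone
        K = {D @ c : c >= 0}, with g ~ N(0, I). Projecting g onto K is the
        nonnegative least-squares problem min_{c>=0} ||D c - g||."""
        d, k = D.shape
        rng = np.random.default_rng(seed)
        norms_sq = []
        for _ in range(n_draws):
            g = rng.standard_normal(d)
            c, _ = nnls(D, g)          # coefficients of the cone projection
            proj = D @ c
            norms_sq.append(proj @ proj)
        return float(np.mean(norms_sq))

    # Spherical dictionary model: atoms drawn uniformly on the unit sphere.
    d, k = 128, 10
    rng = np.random.default_rng(1)
    D = rng.standard_normal((d, k))
    D /= np.linalg.norm(D, axis=0)
    print(stat_dim_estimate(D))        # ~ k/2 for near-orthogonal random atoms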

What carries the argument

The sparse cone manifold representation of activation space, whose Gaussian mean width (statistical dimension) determines the asymptotic compositional-collapse threshold in the spherical dictionary model.

If this is right

  • Union-based steering remains stable only up to the statistical dimension of the signal cone before interference dominates.
  • ReLU introduces accumulating one-way drift from small correlations, so the instability grows with each additional composed feature.
  • Hierarchical correlations in real data lower the effective collapse threshold compared with independent random features.
  • Composition mechanisms must explicitly manage interference rather than relying on naive linear superposition.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Current SAE steering may therefore be limited to small numbers of simultaneous edits even when individual features are clean.
  • The ratchet mechanism could explain why certain feature combinations fail in practice despite working alone.
  • Testing the same cone-width threshold on non-ReLU activations would isolate whether the drift is activation-specific.
  • Adding explicit interference cancellation during composition could extend the usable range of union steering; a minimum-norm variant is sketched after this list.
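One concrete form such cancellation could take, continuing the toy setup above: rather than summing atoms, choose the minimum-norm vector with exactly unit pre-activation on every target, z = D_S (D_Sᵀ D_S)⁻¹ 1. This removes cross-talk among the composed features themselves, though it leaves ghost-latent interference untouched; it is an editorial illustration, not a method from the paper.

    import numpy as np

    rng = np.random.default_rng(3)
    d, m, k = 256, 512, 32
    D = rng.standard_normal((d, m))
    D /= np.linalg.norm(D, axis=0)
    S = rng.choice(m, size=k, replace=False)
    Ds = D[:, S]

    z_naive = Ds.sum(axis=1)          # plain union of target atoms
    # Minimum-norm composition with Ds^T z = 1 enforced exactly:
    z_cancel = Ds @ np.linalg.solve(Ds.T @ Ds, np.ones(k))

    for name, z in (("naive", z_naive), ("cancel", z_cancel)):
        pre = D.T @ z
        print(f"{name:6s}  target pre-act mean={pre[S].mean():.3f} "
              f"std={pre[S].std():.3f}  ||z||={np.linalg.norm(z):.2f}")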

Load-bearing premise

The activation space can be accurately modeled as a high-dimensional sparse cone manifold and the dictionary follows a spherical model.

What would settle it

Measure the number of features at which union steering collapses in an SAE and check whether that count scales with the Gaussian mean width of the corresponding signal cone as predicted.
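Under the same toy assumptions as the sketches above, the experiment reduces to a loop: sweep the composition count until a collapse criterion trips, then ask whether that count tracks the cone's statistical dimension across dictionary scales. The criterion here, spurious energy overtaking target energy, is a hypothetical stand-in for whatever the paper's protocol prescribes.

    import numpy as np

    def collapse_count(D, bias=-0.3, seed=0):
        """Smallest k at which spurious activation energy under naive union
        steering exceeds the target energy (hypothetical criterion)."""
        rng = np.random.default_rng(seed)
        d, m = D.shape
        for k in range(2, m):
            S = rng.choice(m, size=k, replace=False)
            a = np.maximum(D.T @ D[:, S].sum(axis=1) + bias, 0.0)
            mask = np.ones(m, dtype=bool)
            mask[S] = False
            if a[mask].sum() > a[S].sum():
                return k
        return m

    rng = np.random.default_rng(4)
    for d in (64, 128, 256):                 # sweep ambient dimension
        D = rng.standard_normal((d, 4 * d))  # fixed 4x overcompleteness
        D /= np.linalg.norm(D, axis=0)
        # Does the measured count scale with the cone's mean width, as predicted?
        print(d, collapse_count(D))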

Figures

Figures reproduced from arXiv: 2605.05223 by Yunpeng Zhou.

Figure 1. Mechanistic origin of the ReLU ratchet. (A) In linear systems, interference is symmetric and zero-mean (E[X] = 0), allowing noise to cancel out across features. (B) ReLU rectification induces a systematic mean drift η = E[σ(X)] > 0, transforming stochastic fluctuations into a persistent geometric bias that shifts the interference distribution toward the threshold.

Figure 2. Geometry of compositional separation. (a) Stable regime: the signal cone K_S (blue) is disjoint from the ghost polar cone K°_J (red). (b) Collapse: as density increases, K_S widens and collides with the ghost constraints, triggering the phase transition.

Figure 3. Empirical validation of the ratchet mechanism. (a) ReLU rectifies stochastic interference into a systematic bias η. (b) Spurious energy undergoes an abrupt transition as the compositional density γ approaches the threshold γ*.

Figure 4. Phase transition of compositional collapse. Theoretical prediction (blue) vs. empirical CLEVR latents (red points). The theoretical curve shows a transition at γ*; real-world structured features exhibit a correlation shift, collapsing slightly earlier than the random baseline.

Figure 5. Structure of interference. Comparison of the Gram matrix (G_ij = |⟨d_i, d_j⟩|) for (A) a random spherical dictionary and (B) learned CLEVR features. The CLEVR features exhibit significant block-diagonal structure (semantic clusters) and off-diagonal correlations, giving a higher effective coherence μ_local than the random baseline; this structural alignment accelerates the phase transition.

Figure 6. Empirical verification of the phase boundary. The transition exhibits a characteristic "tail" near γ* due to finite-size effects, aligning with the geometric threshold derived from the Gaussian mean width.
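Figure 5's coherence comparison is straightforward to replicate in spirit: compute the Gram matrix G_ij = |⟨d_i, d_j⟩| for a random spherical dictionary and for a correlated one, and compare the off-diagonal mass. The block construction below is a stand-in for learned CLEVR features, not the paper's data.

    import numpy as np

    rng = np.random.default_rng(5)
    d, m, n_blocks = 128, 256, 16

    def coherence(D):
        G = np.abs(D.T @ D)           # Gram matrix G_ij = |<d_i, d_j>|
        np.fill_diagonal(G, 0.0)
        return G.max(), G.mean()

    D_rand = rng.standard_normal((d, m))
    D_rand /= np.linalg.norm(D_rand, axis=0)

    # Correlated stand-in: atoms clustered around shared "semantic" centroids,
    # mimicking the block-diagonal structure reported for CLEVR features.
    centers = rng.standard_normal((d, n_blocks))
    D_corr = np.repeat(centers, m // n_blocks, axis=1)
    D_corr += 0.5 * rng.standard_normal((d, m))
    D_corr /= np.linalg.norm(D_corr, axis=0)

    for name, D in (("random", D_rand), ("blocked", D_corr)):
        mx, mean = coherence(D)
        print(f"{name:8s}  max coherence={mx:.3f}  mean={mean:.3f}")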
read the original abstract

Sparse Autoencoders (SAEs) have emerged as a powerful paradigm for disentangling feature superposition in transformer-based architectures, enabling precise control via activation steering. However, the theoretical foundations of compositional steering -- the simultaneous activation of distinct semantic latents -- remain under-explored. The prevailing Linear Representation Hypothesis often abstracts away non-linear interference effects that arise in overcomplete dictionaries. We present a geometric framework for analyzing the instability of feature unions. Modeling the activation space as a high-dimensional sparse cone manifold, we derive an asymptotic compositional-collapse threshold under a spherical dictionary model, characterized by the Gaussian mean width (statistical dimension) of the signal cone. We further show that, in the high-bias regime, ReLU rectification converts microscopic correlation-induced variance fluctuations into a systematic drift that accumulates under composition, yielding interference growth consistent with a ratchet effect. We validate the predicted scaling trends on structured semantic features extracted from CLEVR, where hierarchical correlations accelerate the transition relative to random baselines. Together, our results highlight geometric constraints on the scalability of union-based steering and motivate composition mechanisms that explicitly manage interference beyond naive linear superposition.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript develops a geometric framework for feature composition instability in Sparse Autoencoders. It models activation space as a high-dimensional sparse cone manifold, derives an asymptotic compositional-collapse threshold under a spherical dictionary model characterized by the Gaussian mean width (statistical dimension) of the signal cone, and shows that ReLU rectification in the high-bias regime produces a ratchet effect by converting microscopic variance fluctuations into accumulating systematic drift. Predictions are tested on structured semantic features from the CLEVR dataset, where hierarchical correlations are reported to accelerate the transition relative to random baselines.

Significance. If the derivations prove rigorous and the sparse-cone plus spherical-dictionary assumptions are shown to hold for real SAE latents, the work would supply a useful theoretical constraint on the scalability of union-based steering, connecting high-dimensional geometry to practical limits of linear superposition. The explicit use of Gaussian mean width as the characterizing quantity is a positive link to existing statistical-dimension literature. The CLEVR validation, while limited, at least demonstrates an empirical effect of hierarchical structure.

major comments (3)
  1. [Abstract] The compositional-collapse threshold is stated to be 'derived' and 'characterized by the Gaussian mean width of the signal cone,' yet no explicit formula, proof sketch, or definition of the threshold itself appears. Without these, it is impossible to determine whether the threshold is obtained parameter-free or whether the mean-width expression reduces by construction to a quantity fitted from data.
  2. [CLEVR validation] The experiments show that hierarchical correlations accelerate the observed transition relative to random baselines, but supply no direct test of whether the specific mean-width formula governs the scaling. Consequently the results do not independently confirm that real SAE latents obey the sparse-cone manifold or spherical-dictionary geometry at the scales where the asymptotics are claimed.
  3. [ReLU ratchet-effect derivation] The claim that ReLU rectification converts microscopic correlation-induced variance fluctuations into systematic drift in the high-bias regime is presented without the supporting equations or intermediate steps. This leaves open the possibility that the ratchet effect is defined circularly in terms of the same geometric model used for the threshold.
minor comments (2)
  1. [Abstract] The terms 'compositional-collapse threshold' and 'ratchet effect' are introduced without even a one-sentence gloss, reducing accessibility for readers outside the immediate sub-area.
  2. [Introduction] The manuscript would benefit from a brief comparison table or paragraph situating the Gaussian-mean-width approach against prior uses of statistical dimension in sparse recovery and dictionary learning.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which identifies important gaps in the presentation of our theoretical results and empirical validation. We address each major comment below and will make substantial revisions to the manuscript to provide the requested derivations, equations, and clarifications.

read point-by-point responses
  1. Referee: [Abstract] The compositional-collapse threshold is stated to be 'derived' and 'characterized by the Gaussian mean width of the signal cone,' yet no explicit formula, proof sketch, or definition of the threshold itself appears. Without these, it is impossible to determine whether the threshold is obtained parameter-free or whether the mean-width expression reduces by construction to a quantity fitted from data.

    Authors: We agree that the current manuscript does not include an explicit formula for the compositional-collapse threshold, a proof sketch, or a clear definition within the abstract or main text. In the revised version, we will insert a new subsection in the theoretical framework that states the threshold explicitly as a function of the Gaussian mean width (statistical dimension) of the signal cone under the spherical dictionary model. We will also provide a concise proof outline deriving the asymptotic threshold from the high-dimensional geometry of the sparse cone manifold, demonstrating that it follows parameter-free from the model assumptions rather than from data fitting. revision: yes

  2. Referee: [CLEVR validation] The experiments show that hierarchical correlations accelerate the observed transition relative to random baselines, but supply no direct test of whether the specific mean-width formula governs the scaling. Consequently the results do not independently confirm that real SAE latents obey the sparse-cone manifold or spherical-dictionary geometry at the scales where the asymptotics are claimed.

    Authors: The CLEVR experiments were intended to illustrate the qualitative effect of hierarchical correlations on accelerating the transition relative to random baselines, consistent with the predicted scaling trends. We acknowledge that they do not constitute a direct quantitative test of the specific mean-width formula or an independent confirmation that real SAE latents satisfy the sparse-cone manifold and spherical-dictionary assumptions at the relevant scales. In revision, we will add an explicit limitations discussion and include supplementary analysis estimating the mean width from the CLEVR feature statistics to compare against observed collapse points. We will also clarify the scope of the validation as supporting the directional predictions rather than fully validating the geometric model for arbitrary SAE latents. revision: partial

  3. Referee: [ReLU ratchet-effect derivation] The claim that ReLU rectification converts microscopic correlation-induced variance fluctuations into systematic drift in the high-bias regime is presented without the supporting equations or intermediate steps. This leaves open the possibility that the ratchet effect is defined circularly in terms of the same geometric model used for the threshold.

    Authors: We accept that the manuscript presents the ReLU ratchet-effect claim without the supporting equations or intermediate derivation steps. In the revised manuscript, we will expand the relevant section with a self-contained derivation that begins from the high-bias regime of the ReLU activation, shows how microscopic variance fluctuations induced by feature correlations are rectified into unidirectional drift, and demonstrates the accumulation under repeated composition. This derivation will be presented prior to and independently of the threshold result to eliminate any suggestion of circularity, with all intermediate equations included. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation applies standard geometric tools to stated model assumptions

full rationale

The paper explicitly adopts a high-dimensional sparse cone manifold model for activation space and a spherical dictionary model, then derives the compositional-collapse threshold as characterized by the Gaussian mean width (a pre-existing concept from convex geometry) of the signal cone. The ReLU ratchet effect is likewise obtained by analyzing how rectification maps microscopic fluctuations to accumulated drift under the same high-bias regime and geometry. No equation or step reduces the claimed threshold or ratchet to a fitted parameter, self-citation, or redefinition of the inputs; the CLEVR experiments are presented as separate empirical checks on scaling trends rather than part of the derivation. The chain therefore remains self-contained against external benchmarks in high-dimensional geometry and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claims rest on two modeling choices presented without independent evidence in the abstract: the sparse cone manifold representation of activation space and the spherical dictionary assumption used for the asymptotic analysis. No explicit free parameters are named; the collapse threshold and ratchet effect are derived quantities rather than fitted constants.

axioms (2)
  • domain assumption Activation space is modeled as a high-dimensional sparse cone manifold
    Invoked to derive the compositional-collapse threshold
  • domain assumption Dictionary follows a spherical model
    Enables the asymptotic analysis characterized by Gaussian mean width
invented entities (2)
  • compositional-collapse threshold no independent evidence
    purpose: Characterizes the point of instability for feature unions
    Derived from the cone manifold and spherical dictionary
  • ratchet effect from ReLU rectification no independent evidence
    purpose: Explains systematic drift accumulation from microscopic correlations
    Described for the high-bias regime

pith-pipeline@v0.9.0 · 5478 in / 1483 out tokens · 43526 ms · 2026-05-10T06:36:42.324875+00:00 · methodology

