Toy Models of Superposition
Lean Theorem · Recognition: 2 theorem links
Pith reviewed 2026-05-11 22:39 UTC · model grok-4.3
The pith
Neural networks exhibit polysemanticity because they represent additional sparse features in superposition.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the toy model, when the number of features exceeds the number of available dimensions and the features are sufficiently sparse, the network learns to represent them in superposition, producing polysemantic neurons that each activate for multiple features.
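To make the setup concrete, the following is a minimal sketch of a toy model in this spirit, in PyTorch. The architecture (hidden code h = Wx, reconstruction ReLU(WᵀWx + b), importance-weighted squared error on sparse inputs) follows the setup summarized in this review; the specific numbers (20 features, 5 dimensions, sparsity 0.95, a geometric importance curve) are illustrative choices, not values from the paper.

```python
# Minimal sketch of a superposition toy model; hyperparameters are illustrative.
import torch

n_features, n_hidden, sparsity = 20, 5, 0.95
W = torch.nn.Parameter(0.1 * torch.randn(n_hidden, n_features))
b = torch.nn.Parameter(torch.zeros(n_features))
importance = 0.9 ** torch.arange(n_features, dtype=torch.float32)
opt = torch.optim.Adam([W, b], lr=1e-3)

for step in range(10_000):
    # Each feature is active with probability 1 - sparsity, value uniform in [0, 1].
    active = (torch.rand(1024, n_features) > sparsity).float()
    x = torch.rand(1024, n_features) * active
    x_hat = torch.relu(x @ W.T @ W + b)            # h = W x, then ReLU(W^T h + b)
    loss = (importance * (x - x_hat) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```

With more features than hidden dimensions and high input sparsity, the learned columns of W end up sharing directions, which is the superposition the claim refers to.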
What carries the argument
Superposition: the encoding of multiple sparse features as overlapping, non-orthogonal linear directions within the same neuron activations.
If this is right
- Superposition emerges above a critical ratio of features to neurons (a metric for detecting it is sketched after this list).
- The optimal feature directions align with the geometry of regular polytopes.
- Adversarial examples arise naturally from the overlapping representations.
- Mechanistic interpretability must account for distributed feature storage.
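One concrete handle on the first bullet is the per-feature dimensionality measure the paper uses to detect superposition, D_i = ||W_i||² / Σ_j (Ŵ_i · W_j)²: it equals 1 when feature i owns a dedicated direction and falls below 1 under superposition (e.g., 1/2 for an antipodal pair). A sketch of that computation, under the same (n_hidden, n_features) weight layout as above:

```python
import torch

def feature_dimensionality(W: torch.Tensor) -> torch.Tensor:
    """Per-feature dimensionality D_i for a trained (n_hidden, n_features) W.

    D_i = ||W_i||^2 / sum_j (What_i . W_j)^2; values near 1 mean a dedicated
    dimension, fractional values signal superposition, near 0 means ignored.
    """
    norms = W.norm(dim=0)                        # ||W_i||
    unit = W / norms.clamp(min=1e-8)             # unit directions What_i
    overlaps = (unit.T @ W) ** 2                 # (What_i . W_j)^2 for all i, j
    return norms ** 2 / overlaps.sum(dim=1).clamp(min=1e-8)
```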
Where Pith is reading between the lines
- This framework predicts that increasing model width reduces polysemanticity for a fixed number of features (a width-sweep sketch follows this list).
- One could search for polytope structures in the activation spaces of real language models.
- It implies that pruning or editing individual neurons may affect multiple unrelated concepts simultaneously.
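The width prediction in the first bullet is cheap to probe in the toy setting. Below is a hedged sketch of such an experiment (our design, with illustrative settings): train the toy model at several hidden widths and count how many features get represented at all.

```python
import torch

def train_toy(n_features=20, n_hidden=5, sparsity=0.95, steps=5_000):
    """Train the reconstruction toy model; returns the learned weight matrix."""
    W = torch.nn.Parameter(0.1 * torch.randn(n_hidden, n_features))
    b = torch.nn.Parameter(torch.zeros(n_features))
    opt = torch.optim.Adam([W, b], lr=1e-3)
    for _ in range(steps):
        active = (torch.rand(512, n_features) > sparsity).float()
        x = torch.rand(512, n_features) * active
        loss = ((x - torch.relu(x @ W.T @ W + b)) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return W.detach()

for m in (2, 5, 10, 20, 40):
    W = train_toy(n_hidden=m)
    represented = int((W.norm(dim=0) > 0.5).sum())   # crude norm threshold
    print(f"width {m:2d}: {represented} of 20 features represented")
```

If the prediction holds, the represented count exceeds the width at small widths (superposition) and settles toward one feature per dimension as width grows past the feature count.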
Load-bearing premise
The representational dynamics in these small toy models with engineered sparse features reflect the key pressures driving feature representation in large trained networks on natural data.
What would settle it
Train a larger network on real data, extract the neuron weights, and check whether the feature directions form the vertices of a uniform polytope or exhibit the predicted phase change in superposition.
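As a first pass at the polytope half of this test, one could inspect the pairwise geometry of the learned feature directions. The diagnostic below is our own sketch, not a procedure from the paper: if the represented features sit at the vertices of a uniform polytope, their pairwise cosines should cluster around a small set of values (a pentagon in two dimensions, for instance, gives cosines near cos 72° ≈ 0.31 and cos 144° ≈ -0.81).

```python
import torch

def pairwise_cosines(W: torch.Tensor, norm_threshold: float = 0.5) -> torch.Tensor:
    """Off-diagonal cosines between represented feature directions.

    W: (n_hidden, n_features) trained weight matrix; features with small
    ||W_i|| are treated as unrepresented and dropped.
    """
    norms = W.norm(dim=0)
    kept = W[:, norms > norm_threshold]
    unit = kept / kept.norm(dim=0)
    cos = unit.T @ unit
    mask = ~torch.eye(cos.shape[0], dtype=torch.bool)
    return cos[mask]
```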
Original abstract
Neural networks often pack many unrelated concepts into a single neuron - a puzzling phenomenon known as 'polysemanticity' which makes interpretability much more challenging. This paper provides a toy model where polysemanticity can be fully understood, arising as a result of models storing additional sparse features in "superposition." We demonstrate the existence of a phase change, a surprising connection to the geometry of uniform polytopes, and evidence of a link to adversarial examples. We also discuss potential implications for mechanistic interpretability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a one-hidden-layer ReLU network trained to reconstruct a set of sparse features from their linear combination. It demonstrates that polysemanticity arises when the number of features exceeds the hidden dimension under sufficient sparsity, identifies a phase transition in this regime, establishes a connection between the learned representations and the geometry of uniform polytopes, and provides simulation evidence linking superposition to increased adversarial vulnerability. Implications for mechanistic interpretability are discussed.
Significance. If the observed dynamics generalize, the work supplies a fully specified, simulatable mechanism for polysemanticity that directly follows from the L2 reconstruction loss and the sparsity of the input features. Strengths include the explicit toy architecture with no free parameters beyond the model definition, reproducible forward simulations of the phase transition and polytope geometry, and the absence of circularity in the reported behaviors. This provides a concrete foundation for further analysis in interpretability research.
minor comments (2)
- [Abstract] The abstract states that the model shows 'evidence of a link to adversarial examples,' but the precise quantitative strength of this link (e.g., the magnitude of the interference term) could be stated more explicitly to match the simulation results.
- [Toy Model section] The loss and the input-sparsity notation could be introduced with an explicitly numbered equation at first use, to improve readability for readers unfamiliar with the setup (one candidate form is given below).
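For concreteness, one candidate form of the requested equation, assembled from the setup as summarized in this review (the symbols I_i for feature importance and S for sparsity are our notation, not necessarily the paper's):

```latex
\mathcal{L} \;=\; \mathbb{E}_{x}\!\left[\sum_{i} I_i\,\bigl(x_i - \mathrm{ReLU}(W^{\top} W x + b)_i\bigr)^{2}\right],
\qquad x_i \neq 0 \ \text{with probability } 1 - S .
```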
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the manuscript, accurate summary of our contributions, and recommendation to accept. We are pleased that the work is viewed as providing a concrete, simulatable mechanism for polysemanticity.
Circularity Check
No significant circularity; derivation self-contained in explicit toy model
full rationale
The paper defines a one-hidden-layer ReLU network with an explicit L2 reconstruction loss, trains it on synthetically generated sparse features, and reports emergent behaviors (phase transition, polytope geometry, interference) directly from those simulations. No step fits a parameter on a data subset and then labels a related quantity a 'prediction'; all results follow from forward simulation of the stated architecture and objective. No load-bearing self-citation or definitional equivalence is present; the chain reduces to the model's own equations and training dynamics.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: input features are sparse and statistically independent.
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.HierarchyEmergence (uniform scale ladder + additive closure forces φ): hierarchy_emergence_forces_phi echoes "toy models are simple ReLU networks... superposition organizes features into geometric structures such as digons, triangles, pentagons, and tetrahedrons."
Forward citations
Cited by 51 Pith papers
- WriteSAE: Sparse Autoencoders for Recurrent State
  WriteSAE is the first sparse autoencoder that factors decoder atoms into the native d_k x d_v cache write shape of recurrent models and supplies a closed-form per-token logit shift for atom substitution.
- Crafting Reversible SFT Behaviors in Large Language Models
  LCDD creates sparse carriers for SFT behaviors that SFT-Eraser can reverse, with ablations showing the sparse structure enables causal control.
- KAN: Kolmogorov-Arnold Networks
  KANs with learnable univariate spline activations on edges achieve better accuracy than MLPs with fewer parameters, faster scaling, and direct visualization for scientific discovery.
- The Linear Representation Hypothesis and the Geometry of Large Language Models
  Linear representations of high-level concepts in LLMs are formalized via counterfactuals in input and output spaces, unified under a causal inner product that enables consistent probing and steering.
- The Mechanism of Weak-to-Strong Generalization: Feature Elicitation from Latent Knowledge
  In two-layer networks, weak-to-strong training elicits the target feature direction from pre-trained subspaces and preserves correlated off-target features, unlike standard fine-tuning.
- SLIM: Sparse Latent Steering for Interpretable and Property-Directed LLM-Based Molecular Editing
  SLIM decomposes LLM hidden states via sparse autoencoders with learnable gates to enable precise, interpretable steering of molecular properties, yielding up to 42.4-point gains on the MolEditRL benchmark.
- SMIXAE: Towards Unsupervised Manifold Discovery in Language Models
  SMIXAE is a new mixture-of-autoencoders architecture that learns multidimensional manifolds directly from transformer activations, recovering known structures and identifying novel ones in Gemma 2 2B and 9B models.
- From Mechanistic to Compositional Interpretability
  Compositional interpretability defines explanations as commuting syntactic-semantic mapping pairs grounded in compositionality and minimum description length, with compressive refinement and a parsimony theorem guaran...
- What Cohort INRs Encode and Where to Freeze Them
  Optimal INR freeze depth matches highest weight stable rank layer; SAEs reveal SIREN atoms are localized while FFMLP atoms trace cohort contours with causal impact on PSNR.
- Sharp Capacity Thresholds in Linear Associative Memory: From Winner-Take-All to Listwise Retrieval
  Winner-take-all linear memory capacity scales as d² ~ n log n due to extreme values; listwise retrieval via Tail-Average Margin yields d² ~ n with exact asymptotic theory.
- Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
  Manifold steering along activation geometry induces behavioral trajectories matching the natural manifold of outputs, while linear steering produces off-manifold unnatural behaviors.
- Bucketing the Good Apples: A Method for Diagnosing and Improving Causal Abstraction
  A four-step recipe partitions the input space using interchange intervention behavior to diagnose where causal abstractions hold and to guide improvements, demonstrated by recovering a full hypothesis from scratch in ...
- Adjoint Inversion Reveals Holographic Superposition and Destructive Interference in CNN Classifiers
  CNN classifiers work by holographic superposition and destructive interference in pixel space rather than selecting cleaned features, as proven by a new adjoint inversion framework that also yields a covariance-volume...
- Sparsity as a Key: Unlocking New Insights from Latent Structures for Out-of-Distribution Detection
  Sparse autoencoders on ViT class tokens reveal stable Class Activation Profiles for in-distribution data, enabling OOD detection via divergence from core energy profiles.
- Are LLM Uncertainty and Correctness Encoded by the Same Features? A Functional Dissociation via Sparse Autoencoders
  Uncertainty and correctness in LLMs are encoded by distinct feature populations, with suppression of confounded features improving accuracy and reducing entropy.
- Cell-Based Representation of Relational Binding in Language Models
  Large language models encode relational bindings via a cell-based representation: a low-dimensional linear subspace in which each cell corresponds to an entity-relation index pair and attributes are retrieved from the...
- Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control
  Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.
- Psychological Steering of Large Language Models
  Mean-difference residual stream injections outperform personality prompting for OCEAN trait steering in most LLMs, with hybrids performing best and showing approximate linearity but non-human trait covariances.
- Can Cross-Layer Transcoders Replace Vision Transformer Activations? An Interpretable Perspective on Vision
  Cross-Layer Transcoders decompose ViT activations into sparse, depth-aware layer contributions that maintain zero-shot accuracy and enable faithful attribution of the final representation.
- Beyond Semantics: Disentangling Information Scope in Sparse Autoencoders for CLIP
  The paper proposes information scope as a new interpretability axis for SAE features in CLIP and introduces the Contextual Dependency Score to separate local from global scope features, showing they influence model pr...
- Scaling and evaluating sparse autoencoders
  K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.
- Eliciting Latent Predictions from Transformers with the Tuned Lens
  Training per-layer affine probes on frozen transformers yields more reliable latent predictions than the logit lens and enables detection of malicious inputs from prediction trajectories.
- Domain Restriction via Multi SAE Layer Transitions
  Multi-layer SAE transitions capture domain-specific signatures that distinguish OOD texts in Gemma-2 models.
- Toward Stable Value Alignment: Introducing Independent Modules for Consistent Value Guidance
  SVGT adds independent value modules and Bridge Tokens to LLMs to maintain consistent value guidance, cutting harmful outputs by over 70% in tests while preserving fluency.
- Towards Understanding Continual Factual Knowledge Acquisition of Language Models: From Theory to Algorithm
  Theoretical analysis of continual factual knowledge acquisition shows data replay stabilizes pretrained knowledge by shifting convergence dynamics while regularization only slows forgetting, leading to the STOC method...
- Decomposing and Steering Functional Metacognition in Large Language Models
  LLMs have linearly decodable functional metacognitive states that causally modulate reasoning when steered via activation interventions.
- Architecture, Not Scale: Circuit Localization in Large Language Models
  Grouped query attention produces more concentrated and stable circuits than multi-head attention across tasks and scales in Pythia and Qwen2.5 models, with a phase transition in factual recall circuits.
- A Single Neuron Is Sufficient to Bypass Safety Alignment in Large Language Models
  Suppressing one refusal neuron or amplifying one concept neuron bypasses safety alignment in LLMs from 1.7B to 70B parameters without training or prompt engineering.
- Reason to Play: Behavioral and Brain Alignment Between Frontier LRMs and Human Game Learners
  Frontier LRMs match human game-learning behavior and predict fMRI signals an order of magnitude better than RL or Bayesian agents because of their in-context game-state representations.
- The Granularity Axis: A Micro-to-Macro Latent Direction for Social Roles in Language Models
  LLMs organize prompted social roles along a dominant, stable, and causally steerable granularity axis in representation space that runs from micro to macro levels.
- When Language Overwrites Vision: Over-Alignment and Geometric Debiasing in Vision-Language Models
  Decoder-based VLMs over-align visual features to a universal text subspace, injecting linguistic bias; projecting out its top principal components reduces hallucinations on POPE, CHAIR, AMBER and improves long-form ca...
- When Language Overwrites Vision: Over-Alignment and Geometric Debiasing in Vision-Language Models
  Decoder-based VLMs hallucinate due to geometric over-alignment of visual embeddings with the text manifold in a universal dataset-agnostic subspace, mitigated by projecting out the linguistic bias.
- When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models
  LLM accuracy on controlled procedural arithmetic drops from 61% at 5 steps to 20% at 95 steps, with failures including skipped steps, premature answers, and hallucinated operations.
- The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation
  Knowledge conflicts in hypernetwork LLM adaptation stem from constant adapter margins losing to frequency-dependent pretrained margins; selective layer boosting and conflict-aware triggering raise deep-conflict accura...
- Harmful Intent as a Geometrically Recoverable Feature of LLM Residual Streams
  Harmful intent is geometrically recoverable as a linear direction or angular deviation in LLM residual streams, with high AUROC across 12 models, stable under alignment variants including abliterated ones, and transfe...
- Harmful Intent as a Geometrically Recoverable Feature of LLM Residual Streams
  Harmful intent is linearly separable in LLM residual streams across 12 models and multiple architectures, reaching mean AUROC 0.982 while showing protocol-dependent directions and strong generalization to held-out har...
- Representation-Guided Parameter-Efficient LLM Unlearning
  REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on forget-retain trade-off in LLM benchmarks.
- From Attribution to Action: A Human-Centered Application of Activation Steering
  Activation steering paired with attribution enables intervention-based debugging in vision models, as all 8 interviewed experts shifted to hypothesis testing, most trusted observed responses, and highlighted risks lik...
- Geometric Limits of Knowledge Distillation: A Minimum-Width Theorem via Superposition Theory
  Student networks are limited to d_S * g(α) features via superposition, creating a permanent importance-weighted loss floor in distillation that cannot be overcome by training.
- Negative Before Positive: Asymmetric Valence Processing in Large Language Models
  Negative valence localizes to early layers and positive valence to mid-to-late layers in LLMs, with the directions being causally steerable.
- Reasoning emerges from constrained inference manifolds in large language models
  Reasoning in LLMs emerges from inference dynamics forming constrained low-dimensional manifolds that preserve non-degenerate information volume, rather than from compression alone.
- From Insight to Action: A Novel Framework for Interpretability-Guided Data Selection in Large Language Models
  IGDS uses sparse autoencoders to find internal task features in LLMs and selects data that maximally activates them, yielding better math reasoning performance than full-dataset fine-tuning with only half the data.
- The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation
  Knowledge conflicts in hypernetwork LLM adaptation stem from constant adapter margins losing to frequency-dependent pretrained margins; selective layer boosting and conflict-aware triggering close the gap.
- Tracing Relational Knowledge Recall in Large Language Models
  Per-head attention contributions to the residual stream serve as strong linear features for classifying relational knowledge in LLMs, with probe accuracy correlating to relation specificity and signal distribution.
- Diagnosable ColBERT: Debugging Late-Interaction Retrieval Models Using a Learned Latent Space as Reference
  Diagnosable ColBERT aligns ColBERT embeddings to an expert-grounded clinical latent space to enable direct diagnosis of model misunderstandings and better training data curation.
- Singularity Formation: Synergy in Theoretical, Numerical and Machine Learning Approaches
  The work introduces a modulation-based analytical method for singularity proofs in singular PDEs and refines ML techniques like PINNs and KANs to identify blowup solutions, with application to the open 3D Keller-Segel...
- Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models
  Multimodal reasoning models hallucinate at high-entropy cognitive bifurcation points due to loss of visual semantic anchoring, and the V-STAR training paradigm with HVAR rewards and FRM reflection mitigates this by re...
- Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models
  Qwen-Scope provides open-source sparse autoencoders for Qwen models that function as practical interfaces for steering, evaluating, data workflows, and optimizing large language models.
- Neuroscience-Inspired Analyses of Visual Interestingness in Multimodal Transformers
  Human visual interestingness is linearly decodable from final-layer embeddings in Qwen3-VL-8B and becomes progressively more structured across vision and language layers without explicit supervision.
- Artificial Jagged Intelligence as Uneven Optimization Energy Allocation Capability Concentration, Redistribution, and Optimization Governance
  AJI frames jagged AI capabilities as lower bounds on performance dispersion arising from concentrated optimization energy allocation under anisotropic objectives, with theorems on tradeoffs and redistribution interventions.
- High-Dimensional Statistics: Reflections on Progress and Open Problems
  A survey synthesizing representative advances, common themes, and open problems in high-dimensional statistics while pointing to key entry-point works.