Toy Models of Superposition
Lean Theorem · Recognition: 2 theorem links
Pith reviewed 2026-05-11 22:39 UTC · model grok-4.3
The pith
Neural networks exhibit polysemanticity because they represent additional sparse features in superposition.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the toy model, when the number of features exceeds the number of available dimensions and the features are sufficiently sparse, the network learns to represent them in superposition, producing polysemantic neurons that each activate for multiple features.
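To make the setup concrete, the following is a minimal sketch of a toy model in this spirit, in PyTorch. The architecture (hidden code h = Wx, reconstruction ReLU(WᵀWx + b), importance-weighted squared error on sparse inputs) follows the setup summarized in this review; the specific numbers (20 features, 5 dimensions, sparsity 0.95, a geometric importance curve) are illustrative choices, not values from the paper.

```python
# Minimal sketch of a superposition toy model; hyperparameters are illustrative.
import torch

n_features, n_hidden, sparsity = 20, 5, 0.95
W = torch.nn.Parameter(0.1 * torch.randn(n_hidden, n_features))
b = torch.nn.Parameter(torch.zeros(n_features))
importance = 0.9 ** torch.arange(n_features, dtype=torch.float32)
opt = torch.optim.Adam([W, b], lr=1e-3)

for step in range(10_000):
    # Each feature is active with probability 1 - sparsity, value uniform in [0, 1].
    active = (torch.rand(1024, n_features) > sparsity).float()
    x = torch.rand(1024, n_features) * active
    x_hat = torch.relu(x @ W.T @ W + b)            # h = W x, then ReLU(W^T h + b)
    loss = (importance * (x - x_hat) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```

With more features than hidden dimensions and high input sparsity, the learned columns of W end up sharing directions, which is the superposition the claim refers to.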
What carries the argument
Superposition: the encoding of multiple sparse features as overlapping, non-orthogonal linear directions within the same neuron activations.
If this is right
- Superposition emerges above a critical ratio of features to neurons (a metric for detecting it is sketched after this list).
- The optimal feature directions align with the geometry of regular polytopes.
- Adversarial examples arise naturally from the overlapping representations.
- Mechanistic interpretability must account for distributed feature storage.
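One concrete handle on the first bullet is the per-feature dimensionality measure the paper uses to detect superposition, D_i = ||W_i||² / Σ_j (Ŵ_i · W_j)²: it equals 1 when feature i owns a dedicated direction and falls below 1 under superposition (e.g., 1/2 for an antipodal pair). A sketch of that computation, under the same (n_hidden, n_features) weight layout as above:

```python
import torch

def feature_dimensionality(W: torch.Tensor) -> torch.Tensor:
    """Per-feature dimensionality D_i for a trained (n_hidden, n_features) W.

    D_i = ||W_i||^2 / sum_j (What_i . W_j)^2; values near 1 mean a dedicated
    dimension, fractional values signal superposition, near 0 means ignored.
    """
    norms = W.norm(dim=0)                        # ||W_i||
    unit = W / norms.clamp(min=1e-8)             # unit directions What_i
    overlaps = (unit.T @ W) ** 2                 # (What_i . W_j)^2 for all i, j
    return norms ** 2 / overlaps.sum(dim=1).clamp(min=1e-8)
```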
Where Pith is reading between the lines
- This framework predicts that increasing model width reduces polysemanticity for a fixed number of features (a width-sweep sketch follows this list).
- One could search for polytope structures in the activation spaces of real language models.
- It implies that pruning or editing individual neurons may affect multiple unrelated concepts simultaneously.
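The width prediction in the first bullet is cheap to probe in the toy setting. Below is a hedged sketch of such an experiment (our design, with illustrative settings): train the toy model at several hidden widths and count how many features get represented at all.

```python
import torch

def train_toy(n_features=20, n_hidden=5, sparsity=0.95, steps=5_000):
    """Train the reconstruction toy model; returns the learned weight matrix."""
    W = torch.nn.Parameter(0.1 * torch.randn(n_hidden, n_features))
    b = torch.nn.Parameter(torch.zeros(n_features))
    opt = torch.optim.Adam([W, b], lr=1e-3)
    for _ in range(steps):
        active = (torch.rand(512, n_features) > sparsity).float()
        x = torch.rand(512, n_features) * active
        loss = ((x - torch.relu(x @ W.T @ W + b)) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return W.detach()

for m in (2, 5, 10, 20, 40):
    W = train_toy(n_hidden=m)
    represented = int((W.norm(dim=0) > 0.5).sum())   # crude norm threshold
    print(f"width {m:2d}: {represented} of 20 features represented")
```

If the prediction holds, the represented count exceeds the width at small widths (superposition) and settles toward one feature per dimension as width grows past the feature count.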
Load-bearing premise
The representational dynamics in these small toy models with engineered sparse features reflect the key pressures driving feature representation in large trained networks on natural data.
What would settle it
Train a larger network on real data, extract the neuron weights, and check whether the feature directions form the vertices of a uniform polytope or exhibit the predicted phase change in superposition.
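As a first pass at the polytope half of this test, one could inspect the pairwise geometry of the learned feature directions. The diagnostic below is our own sketch, not a procedure from the paper: if the represented features sit at the vertices of a uniform polytope, their pairwise cosines should cluster around a small set of values (a pentagon in two dimensions, for instance, gives cosines near cos 72° ≈ 0.31 and cos 144° ≈ -0.81).

```python
import torch

def pairwise_cosines(W: torch.Tensor, norm_threshold: float = 0.5) -> torch.Tensor:
    """Off-diagonal cosines between represented feature directions.

    W: (n_hidden, n_features) trained weight matrix; features with small
    ||W_i|| are treated as unrepresented and dropped.
    """
    norms = W.norm(dim=0)
    kept = W[:, norms > norm_threshold]
    unit = kept / kept.norm(dim=0)
    cos = unit.T @ unit
    mask = ~torch.eye(cos.shape[0], dtype=torch.bool)
    return cos[mask]
```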
Original abstract
Neural networks often pack many unrelated concepts into a single neuron - a puzzling phenomenon known as 'polysemanticity' which makes interpretability much more challenging. This paper provides a toy model where polysemanticity can be fully understood, arising as a result of models storing additional sparse features in "superposition." We demonstrate the existence of a phase change, a surprising connection to the geometry of uniform polytopes, and evidence of a link to adversarial examples. We also discuss potential implications for mechanistic interpretability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a one-hidden-layer ReLU network trained to reconstruct a set of sparse features from their linear combination. It demonstrates that polysemanticity arises when the number of features exceeds the hidden dimension under sufficient sparsity, identifies a phase transition in this regime, establishes a connection between the learned representations and the geometry of uniform polytopes, and provides simulation evidence linking superposition to increased adversarial vulnerability. Implications for mechanistic interpretability are discussed.
Significance. If the observed dynamics generalize, the work supplies a fully specified, simulatable mechanism for polysemanticity that directly follows from the L2 reconstruction loss and the sparsity of the input features. Strengths include the explicit toy architecture with no free parameters beyond the model definition, reproducible forward simulations of the phase transition and polytope geometry, and the absence of circularity in the reported behaviors. This provides a concrete foundation for further analysis in interpretability research.
minor comments (2)
- [Abstract] The abstract states that the model shows 'evidence of a link to adversarial examples,' but the precise quantitative strength of this link (e.g., the magnitude of the interference term) could be stated more explicitly to match the simulation results.
- [Toy Model section] The loss and the input-sparsity notation could be introduced with an explicitly numbered equation at first use, to improve readability for readers unfamiliar with the setup (one candidate form is given below).
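For concreteness, one candidate form of the requested equation, assembled from the setup as summarized in this review (the symbols I_i for feature importance and S for sparsity are our notation, not necessarily the paper's):

```latex
\mathcal{L} \;=\; \mathbb{E}_{x}\!\left[\sum_{i} I_i\,\bigl(x_i - \mathrm{ReLU}(W^{\top} W x + b)_i\bigr)^{2}\right],
\qquad x_i \neq 0 \ \text{with probability } 1 - S .
```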
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the manuscript, accurate summary of our contributions, and recommendation to accept. We are pleased that the work is viewed as providing a concrete, simulatable mechanism for polysemanticity.
Circularity Check
No significant circularity; derivation self-contained in explicit toy model
full rationale
The paper defines a one-hidden-layer ReLU network with an explicit L2 reconstruction loss, trains it on synthetically generated sparse features, and reports emergent behaviors (phase transition, polytope geometry, interference) directly from those simulations. No step fits a parameter on a data subset and then labels a related quantity a 'prediction'; all results follow from forward simulation of the stated architecture and objective. No load-bearing self-citation or definitional equivalence is present; the chain reduces to the model's own equations and training dynamics.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: input features are sparse and statistically independent.
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.HierarchyEmergence (uniform scale ladder + additive closure forces φ): hierarchy_emergence_forces_phi echoes "toy models are simple ReLU networks... superposition organizes features into geometric structures such as digons, triangles, pentagons, and tetrahedrons."
Forward citations
Cited by 51 Pith papers
- WriteSAE: Sparse Autoencoders for Recurrent State
  WriteSAE is the first sparse autoencoder that factors decoder atoms into the native d_k x d_v cache write shape of recurrent models and supplies a closed-form per-token logit shift for atom substitution.
- Crafting Reversible SFT Behaviors in Large Language Models
  LCDD creates sparse carriers for SFT behaviors that SFT-Eraser can reverse, with ablations showing the sparse structure enables causal control.
- KAN: Kolmogorov-Arnold Networks
  KANs with learnable univariate spline activations on edges achieve better accuracy than MLPs with fewer parameters, faster scaling, and direct visualization for scientific discovery.
- The Linear Representation Hypothesis and the Geometry of Large Language Models
  Linear representations of high-level concepts in LLMs are formalized via counterfactuals in input and output spaces, unified under a causal inner product that enables consistent probing and steering.
- The Mechanism of Weak-to-Strong Generalization: Feature Elicitation from Latent Knowledge
  In two-layer networks, weak-to-strong training elicits the target feature direction from pre-trained subspaces and preserves correlated off-target features, unlike standard fine-tuning.
- SLIM: Sparse Latent Steering for Interpretable and Property-Directed LLM-Based Molecular Editing
  SLIM decomposes LLM hidden states via sparse autoencoders with learnable gates to enable precise, interpretable steering of molecular properties, yielding up to 42.4-point gains on the MolEditRL benchmark.
- SMIXAE: Towards Unsupervised Manifold Discovery in Language Models
  SMIXAE is a new mixture-of-autoencoders architecture that learns multidimensional manifolds directly from transformer activations, recovering known structures and identifying novel ones in Gemma 2 2B and 9B models.
- From Mechanistic to Compositional Interpretability
  Compositional interpretability defines explanations as commuting syntactic-semantic mapping pairs grounded in compositionality and minimum description length, with compressive refinement and a parsimony theorem guaran...
- What Cohort INRs Encode and Where to Freeze Them
  Optimal INR freeze depth matches highest weight stable rank layer; SAEs reveal SIREN atoms are localized while FFMLP atoms trace cohort contours with causal impact on PSNR.
- Sharp Capacity Thresholds in Linear Associative Memory: From Winner-Take-All to Listwise Retrieval
  Winner-take-all linear memory capacity scales as d² ~ n log n due to extreme values; listwise retrieval via Tail-Average Margin yields d² ~ n with exact asymptotic theory.
- Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
  Manifold steering along activation geometry induces behavioral trajectories matching the natural manifold of outputs, while linear steering produces off-manifold unnatural behaviors.
- Bucketing the Good Apples: A Method for Diagnosing and Improving Causal Abstraction
  A four-step recipe partitions the input space using interchange intervention behavior to diagnose where causal abstractions hold and to guide improvements, demonstrated by recovering a full hypothesis from scratch in ...
- Adjoint Inversion Reveals Holographic Superposition and Destructive Interference in CNN Classifiers
  CNN classifiers work by holographic superposition and destructive interference in pixel space rather than selecting cleaned features, as proven by a new adjoint inversion framework that also yields a covariance-volume...
- Sparsity as a Key: Unlocking New Insights from Latent Structures for Out-of-Distribution Detection
  Sparse autoencoders on ViT class tokens reveal stable Class Activation Profiles for in-distribution data, enabling OOD detection via divergence from core energy profiles.
- Are LLM Uncertainty and Correctness Encoded by the Same Features? A Functional Dissociation via Sparse Autoencoders
  Uncertainty and correctness in LLMs are encoded by distinct feature populations, with suppression of confounded features improving accuracy and reducing entropy.
- Cell-Based Representation of Relational Binding in Language Models
  Large language models encode relational bindings via a cell-based representation: a low-dimensional linear subspace in which each cell corresponds to an entity-relation index pair and attributes are retrieved from the...
- Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control
  Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.
- Psychological Steering of Large Language Models
  Mean-difference residual stream injections outperform personality prompting for OCEAN trait steering in most LLMs, with hybrids performing best and showing approximate linearity but non-human trait covariances.
- Can Cross-Layer Transcoders Replace Vision Transformer Activations? An Interpretable Perspective on Vision
  Cross-Layer Transcoders decompose ViT activations into sparse, depth-aware layer contributions that maintain zero-shot accuracy and enable faithful attribution of the final representation.
- Beyond Semantics: Disentangling Information Scope in Sparse Autoencoders for CLIP
  The paper proposes information scope as a new interpretability axis for SAE features in CLIP and introduces the Contextual Dependency Score to separate local from global scope features, showing they influence model pr...
- Scaling and evaluating sparse autoencoders
  K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.
- Eliciting Latent Predictions from Transformers with the Tuned Lens
  Training per-layer affine probes on frozen transformers yields more reliable latent predictions than the logit lens and enables detection of malicious inputs from prediction trajectories.
- Domain Restriction via Multi SAE Layer Transitions
  Multi-layer SAE transitions capture domain-specific signatures that distinguish OOD texts in Gemma-2 models.
- Toward Stable Value Alignment: Introducing Independent Modules for Consistent Value Guidance
  SVGT adds independent value modules and Bridge Tokens to LLMs to maintain consistent value guidance, cutting harmful outputs by over 70% in tests while preserving fluency.
- Towards Understanding Continual Factual Knowledge Acquisition of Language Models: From Theory to Algorithm
  Theoretical analysis of continual factual knowledge acquisition shows data replay stabilizes pretrained knowledge by shifting convergence dynamics while regularization only slows forgetting, leading to the STOC method...
- Decomposing and Steering Functional Metacognition in Large Language Models
  LLMs have linearly decodable functional metacognitive states that causally modulate reasoning when steered via activation interventions.
- Architecture, Not Scale: Circuit Localization in Large Language Models
  Grouped query attention produces more concentrated and stable circuits than multi-head attention across tasks and scales in Pythia and Qwen2.5 models, with a phase transition in factual recall circuits.
- A Single Neuron Is Sufficient to Bypass Safety Alignment in Large Language Models
  Suppressing one refusal neuron or amplifying one concept neuron bypasses safety alignment in LLMs from 1.7B to 70B parameters without training or prompt engineering.
- Reason to Play: Behavioral and Brain Alignment Between Frontier LRMs and Human Game Learners
  Frontier LRMs match human game-learning behavior and predict fMRI signals an order of magnitude better than RL or Bayesian agents because of their in-context game-state representations.
- The Granularity Axis: A Micro-to-Macro Latent Direction for Social Roles in Language Models
  LLMs organize prompted social roles along a dominant, stable, and causally steerable granularity axis in representation space that runs from micro to macro levels.
- When Language Overwrites Vision: Over-Alignment and Geometric Debiasing in Vision-Language Models
  Decoder-based VLMs over-align visual features to a universal text subspace, injecting linguistic bias; projecting out its top principal components reduces hallucinations on POPE, CHAIR, AMBER and improves long-form ca...
- When Language Overwrites Vision: Over-Alignment and Geometric Debiasing in Vision-Language Models
  Decoder-based VLMs hallucinate due to geometric over-alignment of visual embeddings with the text manifold in a universal dataset-agnostic subspace, mitigated by projecting out the linguistic bias.
- When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models
  LLM accuracy on controlled procedural arithmetic drops from 61% at 5 steps to 20% at 95 steps, with failures including skipped steps, premature answers, and hallucinated operations.
- The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation
  Knowledge conflicts in hypernetwork LLM adaptation stem from constant adapter margins losing to frequency-dependent pretrained margins; selective layer boosting and conflict-aware triggering raise deep-conflict accura...
- Harmful Intent as a Geometrically Recoverable Feature of LLM Residual Streams
  Harmful intent is geometrically recoverable as a linear direction or angular deviation in LLM residual streams, with high AUROC across 12 models, stable under alignment variants including abliterated ones, and transfe...
- Harmful Intent as a Geometrically Recoverable Feature of LLM Residual Streams
  Harmful intent is linearly separable in LLM residual streams across 12 models and multiple architectures, reaching mean AUROC 0.982 while showing protocol-dependent directions and strong generalization to held-out har...
- Representation-Guided Parameter-Efficient LLM Unlearning
  REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on forget-retain trade-off in LLM benchmarks.
- From Attribution to Action: A Human-Centered Application of Activation Steering
  Activation steering paired with attribution enables intervention-based debugging in vision models, as all 8 interviewed experts shifted to hypothesis testing, most trusted observed responses, and highlighted risks lik...
- Geometric Limits of Knowledge Distillation: A Minimum-Width Theorem via Superposition Theory
  Student networks are limited to d_S * g(α) features via superposition, creating a permanent importance-weighted loss floor in distillation that cannot be overcome by training.
- Negative Before Positive: Asymmetric Valence Processing in Large Language Models
  Negative valence localizes to early layers and positive valence to mid-to-late layers in LLMs, with the directions being causally steerable.
- Reasoning emerges from constrained inference manifolds in large language models
  Reasoning in LLMs emerges from inference dynamics forming constrained low-dimensional manifolds that preserve non-degenerate information volume, rather than from compression alone.
- From Insight to Action: A Novel Framework for Interpretability-Guided Data Selection in Large Language Models
  IGDS uses sparse autoencoders to find internal task features in LLMs and selects data that maximally activates them, yielding better math reasoning performance than full-dataset fine-tuning with only half the data.
- The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation
  Knowledge conflicts in hypernetwork LLM adaptation stem from constant adapter margins losing to frequency-dependent pretrained margins; selective layer boosting and conflict-aware triggering close the gap.
- Tracing Relational Knowledge Recall in Large Language Models
  Per-head attention contributions to the residual stream serve as strong linear features for classifying relational knowledge in LLMs, with probe accuracy correlating to relation specificity and signal distribution.
- Diagnosable ColBERT: Debugging Late-Interaction Retrieval Models Using a Learned Latent Space as Reference
  Diagnosable ColBERT aligns ColBERT embeddings to an expert-grounded clinical latent space to enable direct diagnosis of model misunderstandings and better training data curation.
- Singularity Formation: Synergy in Theoretical, Numerical and Machine Learning Approaches
  The work introduces a modulation-based analytical method for singularity proofs in singular PDEs and refines ML techniques like PINNs and KANs to identify blowup solutions, with application to the open 3D Keller-Segel...
- Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models
  Multimodal reasoning models hallucinate at high-entropy cognitive bifurcation points due to loss of visual semantic anchoring, and the V-STAR training paradigm with HVAR rewards and FRM reflection mitigates this by re...
- Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models
  Qwen-Scope provides open-source sparse autoencoders for Qwen models that function as practical interfaces for steering, evaluating, data workflows, and optimizing large language models.
- Neuroscience-Inspired Analyses of Visual Interestingness in Multimodal Transformers
  Human visual interestingness is linearly decodable from final-layer embeddings in Qwen3-VL-8B and becomes progressively more structured across vision and language layers without explicit supervision.
- Artificial Jagged Intelligence as Uneven Optimization Energy Allocation Capability Concentration, Redistribution, and Optimization Governance
  AJI frames jagged AI capabilities as lower bounds on performance dispersion arising from concentrated optimization energy allocation under anisotropic objectives, with theorems on tradeoffs and redistribution interventions.
- High-Dimensional Statistics: Reflections on Progress and Open Problems
  A survey synthesizing representative advances, common themes, and open problems in high-dimensional statistics while pointing to key entry-point works.