pith. machine review for the scientific record.

arxiv: 2605.01148 · v1 · submitted 2026-05-01 · 💻 cs.AI · cs.CL

Recognition: unknown

Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts

Atticus Geiger, Can Rager, Daniel Wurgaft, Ekdeep Singh Lubana, Jack Merullo, Owen Lewis, Raphaël Sarfati, Sheridan Feucht, Tal Haklay, Thomas Fel, Thomas McGrath, Usha Bhalla

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 18:42 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords mechanistic interpretability · large language models · fourier features · cyclic reasoning · base-10 addition · mlp neurons · arithmetic in llms
0 comments

The pith

Llama-3.1-8B uses base-10 addition to reason about cyclic concepts like months.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates how Llama-3.1-8B handles reasoning over cyclic concepts, such as determining the month six months after August. Although the model's internal representations for these concepts form circular structures, the computation does not rely on modular addition tailored to the cycle's length. Instead, the model performs standard base-10 addition on the inputs and then maps the resulting number back to the cyclic concept. This approach uses Fourier features whose periods align with base-10 addition, such as 2, 5, and 10, and these features are shared across different cyclic tasks. A small set of 28 neurons in the MLP at layer 18 carries out this addition and is reused across all such tasks.
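Read as an algorithm, the claimed mechanism is a shared two-step routine. The Python sketch below is our paraphrase of that claim, not the authors' code; the task names and cycle lengths follow the paper's examples:

```python
# Hypothesized two-step computation: a generic base-10 sum, then a
# concept-specific remapping back into cyclic space.
CYCLES = {"months": 12, "weekdays": 7, "hours": 24}
MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

def cyclic_answer(offset: int, start: int, task: str) -> int:
    pre_modulo_sum = offset + start          # step 1: six + August -> 6 + 8 = 14
    cycle = CYCLES[task]
    return (pre_modulo_sum - 1) % cycle + 1  # step 2: 14 -> 2 (1-indexed wraparound)

print(MONTHS[cyclic_answer(6, 8, "months") - 1])  # February
```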

Core claim

Llama-3.1-8B reasons over cyclic concepts by computing the sum of its two inputs using base-10 addition and then mapping this sum back to cyclic concept space. It achieves the addition using task-agnostic Fourier features whose periods respect base-10 structure (e.g., 2, 5, and 10) rather than the cyclic period (e.g., 12 for months). The model reuses a generic addition mechanism across tasks that operates independently of concept-specific geometry. A sparse set of 28 MLP neurons, roughly 0.2 percent of the MLP at layer 18, can be partitioned into disjoint clusters, each computing the sum for a Fourier feature with a different period.

What carries the argument

Task-agnostic Fourier features with base-10 periods, implemented by a sparse set of 28 reusable MLP neurons, which perform the addition step before the result is mapped back to cyclic space.
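To see why base-10 periods can carry the addition, note that on each period-T plane addition is a phase rotation, and several planes together pin down the pre-modulo sum. A minimal sketch (our illustration, not the paper's code), using the periods {2, 5, 10, 20, 50, 100} that appear in the appendix's neuron groups (Figures 49-54):

```python
import numpy as np

PERIODS = [2, 5, 10, 20, 50, 100]  # base-10-respecting periods from the paper

def encode(n: int) -> dict:
    """One 2-D Fourier feature (a point on a circle) per period T."""
    return {T: np.array([np.cos(2 * np.pi * n / T),
                         np.sin(2 * np.pi * n / T)]) for T in PERIODS}

def add_features(fa: dict, fb: dict) -> dict:
    """Angle addition on each plane: rotating a's feature by b's phase
    yields the feature for a+b; no task-specific modulus is involved."""
    out = {}
    for T in PERIODS:
        (ca, sa), (cb, sb) = fa[T], fb[T]
        out[T] = np.array([ca * cb - sa * sb, ca * sb + sa * cb])
    return out

def decode(f: dict, max_sum: int = 100) -> int:
    """Read off the integer whose encoding matches every plane at once
    (unique below lcm(PERIODS) = 100)."""
    scores = [sum(f[T] @ encode(n)[T] for T in PERIODS) for n in range(max_sum)]
    return int(np.argmax(scores))

print(decode(add_features(encode(6), encode(8))))  # -> 14 (six months after August)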

Load-bearing premise

The identified sparse set of 28 MLP neurons and the Fourier features are causally responsible for the base-10 addition, rather than being correlated side effects of the analysis pipeline.

What would settle it

The claim would be undermined if ablating the 28 neurons or disrupting the identified Fourier features failed to impair the model's ability to correctly answer cyclic addition questions, or if the features did not show base-10 periods in causal tests.
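A hedged sketch of the necessity half of that test, assuming a HuggingFace-style Llama-3.1-8B loaded as `model`; the neuron list and module path are placeholders, not values taken from the paper:

```python
import torch

LAYER = 18
NEURONS: list[int] = []  # fill with the 28 identified layer-18 neuron indices

def zero_ablate(module, args):
    # down_proj's input is the gated activation act(gate_proj(x)) * up_proj(x),
    # one column per MLP neuron; zeroing columns removes those neurons' writes.
    (acts,) = args
    acts = acts.clone()
    acts[..., NEURONS] = 0.0
    return (acts,)

handle = model.model.layers[LAYER].mlp.down_proj.register_forward_pre_hook(zero_ablate)
# ... evaluate cyclic prompts ("What month is six months after August?") with
# the hook active, then remove it and re-evaluate; the claim predicts a
# selective accuracy drop (with base-10-style errors) only under ablation.
handle.remove()
```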

Figures

Figures reproduced from arXiv: 2605.01148 by Atticus Geiger, Can Rager, Daniel Wurgaft, Ekdeep Singh Lubana, Jack Merullo, Owen Lewis, Raphaël Sarfati, Sheridan Feucht, Tal Haklay, Thomas Fel, Thomas McGrath, Usha Bhalla.

Figure 1. Llama-3.1-8B calculates six months after August with a standard addition mechanism that is used for numbers, months, weekdays, and hours. (a) Cyclic concepts are represented with circular geometry at the input token position (Engels et al., 2025). (b) The model computes addition in a base-10 Fourier number space (Kantamneni & Tegmark, 2025; Fu et al., 2026) using the same neurons for all tasks. We discover… view at source ↗
Figure 2. DAS results provide strong evidence that input concept and offset information are combined… view at source ↗
Figure 3. Probes reveal that circular structure (Engels et al., 2025) is not present in layer 18 at the final token, where the input concept and offset are combined. (a) We train circular probes for (i) the input concept at the input concept token position and (ii) the output concept at the final token position. Circular structure for (i) is consistently recovered across layers; see (b) for an example of the weekday… view at source ↗
Figure 4. Patching from addition to cyclic tasks within the shared subspace of both tasks shows that base-10 numerical representations at layer 18 are converted into cyclic concepts (e.g., 6+8=14→February). Patching at layer 18 consistently causes the model to output the predicted concept. This patch does not cause the model to output the source number token (red line), except for hours, which also uses number toke… view at source ↗
Figure 5. Patching from months → addition within the shared subspace of both tasks shows that Llama-3.1-8B represents cyclic concepts using base-10 numerical representations at layer 18 (e.g., patching from six months after August into addition prompts causes the model to output 14, suggesting that the model computed 6+8 as an intermediate step). We observe a surprising +100 echo, where the expected sum is sometimes… view at source ↗
Figure 6. Applying Fourier probes trained on the addition task to layer 18 activations reveals circular structures with base-10 periodicities across all tasks; here, we show T ∈ {2, 5, 10}. Although hours, months, and weekdays have their own natural periods (e.g., 24, 12, and 7), the model represents intermediate sums in these tasks using a base-10 system. … view at source ↗
Figure 7. Steering with Fourier probes at layer 18 shows that Fourier features identified for the… view at source ↗
Figure 8. Addition neurons are sparse and can be grouped by the Fourier periodicities found in… view at source ↗
Figure 9. Addition neuron activations organized by input concept and offset, separated into gate and… (panel titles: Cosine Probe T = 2, 5, 10, 20) view at source ↗
Figure 10. The model computes the sum 18+4=22 on multiple orthogonal planes, each encoding a different modulo. We visualize all addition neurons for periods T ∈ {2, 5, 10, 20} projected onto their respective Fourier planes: arrows indicate each neuron's down projection row d_i scaled by its activation for the prompt four hours after 18:00. The gray dotted line indicates the sum of these vectors, and gray stars indicat… view at source ↗
Figure 11. Parity neurons (T = 2) sharpen model predictions. (a) Full activations (gate · up) for both parity neurons on the months task. These neurons write in opposite directions for alternating prompts. (b) Zero-ablating just these two neurons shifts probability towards neighboring outputs with the wrong parity, "blurring" the model's predictions. … view at source ↗
Figure 12. Model performance for cyclic tasks. Llama-3.1-8B output probabilities for all prompts in each dataset are displayed, aggregated by pre-modulo sum. Cells for correct answers are outlined in black, and accuracy for each pre-modulo value is displayed beside each row. When the pre-modulo sum is less than or equal to the cycle length for a given concept (e.g., January + four=5=May), model accuracy is perfect, … view at source ↗
Figure 13. We test whether Llama-3.1-8B can solve prompts of the form… view at source ↗
Figure 14. We test whether Llama-3.1-8B can solve prompts of the form… view at source ↗
Figure 15. Residual stream patching results for the… view at source ↗
Figure 16. Residual stream patching results for the… view at source ↗
Figure 17. Residual stream patching results for the… view at source ↗
Figure 18. DAS results for every dimension across layers, alternating between sublayer (post attention)… view at source ↗
Figure 20. Alignment between weekday/number tokens for Llama-3.1-8B with (a) token embeddings… view at source ↗
Figure 21. Accuracy when patching from cyclic tasks into addition: this intervention "exposes" a sum computed in the forward pass of a cyclic prompt (e.g., three months after November=3+11=14). (a) Patching from months→addition. More than 60% of the time, this intervention exposes the pre-modulo sum, but sometimes the highest probability is that sum + 100. See… view at source ↗
Figure 22. [Duplicate of…] view at source ↗
Figure 23. Patching from months→addition at layers 16, 18, and 20. Note that patching is most consistent at layer 18; this plot is the same as… view at source ↗
Figure 24. Patching from hours→addition at layers 16, 18, and 20. Surprisingly, patching is also quite effective at layer 16 for this task, implying that the input representations for hours transfer well to addition; this may be because hours uses literal number tokens. See… view at source ↗
Figure 25. Patching from weekdays→addition at layers 16, 18, and 20. Note that patching is most consistent at layer 18, and that we observe a strange +4 offset for weekdays: see App. E for discussion. See… view at source ↗
Figure 26. Patching from addition→months at layers 16, 18, and 20. Note that patching works best at layer 18, and that performance begins to break down as the sum increases, closely matching errors in a clean forward pass in… view at source ↗
Figure 27. Patching from addition→weekdays at layers 16, 18, and 20. Note that patching works best at layer 18, and that performance begins to break down as the sum increases, closely matching errors in a clean forward pass in… view at source ↗
Figure 28. Patching from addition→hours at layers 16, 18, and 20. See… view at source ↗
Figure 29. Projection of the layer 18 residual activations at the final token position onto the Fourier… view at source ↗
Figure 30. Projection of the layer 18 residual activations at the final token position onto the Fourier… view at source ↗
Figure 31. Projection of the layer 18 residual activations at the final token position for the hours task… view at source ↗
Figure 32. Projection of the layer 18 residual activations at the final token position for the hours task… view at source ↗
Figure 33. Projection of the layer 18 residual activations at the final token position for the months… view at source ↗
Figure 34. Projection of the layer 18 residual activations at the final token position for the months… view at source ↗
Figure 35. Projection of the layer 18 residual activations at the final token position for the weekdays… view at source ↗
Figure 36. Projection of the layer 18 residual activations at the final token position for the weekdays… view at source ↗
Figure 37. Average output probabilities after steering with Fourier probes at the output of layer 18… view at source ↗
Figure 38. Output probabilities after steering each prompt with Fourier probes at the output of layer… view at source ↗
Figure 39. Overlap between addition Fourier probes and DAS output concept subspaces at layer… view at source ↗
Figure 40. R² scores for Fourier probes across layers and for each T ∈ {2, …, 150}. For each T, we train sine and cosine probes and report the average R². view at source ↗
Figure 41. Cosine similarity scores for Fourier probes at layer 18. All probe directions are orthogonal,… view at source ↗
Figure 42. Distribution of write scores for all layer 18 MLP neurons (Section 5.1): this score measures the proportion of a neuron's down projection row d_i that is within the best DAS output subspace at layer 18. We choose a threshold τ = 0.4 by eye based on the addition task, and find that neurons in all other tasks are a subset of these 28 addition neurons. view at source ↗
Figure 43. Llama-3.1-8B errors on the addition task. We show errors for a clean model run (95% accuracy), as well as errors when all neurons at the L18 MLP are zero-ablated except for our 28 addition neurons N_add (86% accuracy). Most errors come from higher number ranges, suggesting that N_add excludes some neurons with larger periods. view at source ↗
Figure 44. Addition neuron activations N_add across all examples for all tasks: addition, hours, months, and weekdays. We observe the same periodic structure across sums for all four tasks, although it is more difficult to see for tasks with smaller output ranges. Neurons that are also within the set for a respective task are starred (e.g., n1712 is starred in all plots, so it is used for all tasks). … view at source ↗
Figure 45. Simple hierarchical clustering of addition neurons with… view at source ↗
Figure 46. n12728 activations across prompts for all four tasks, organized by input variables. This is a period 5 neuron. We also show read/write scores for g_i, u_i, d_i with input and output spaces. For all cyclic tasks, we can see that this neuron's gate vector g_12728 has much higher read scores from the input subspace (high first bar, horizontal stripes), whereas its up vector u_12728 reads more heavily from the… view at source ↗
Figure 47. n1712 activations across prompts for all four tasks, organized by input variables. This is a parity neuron. We also show read/write scores for g_i, u_i, d_i with input and output spaces. We can see that this neuron's gate vector g_1712 reads equally from both input subspaces (checkered pattern across examples, and the first two bars are similar heights), whereas its up vector u_1712 is mostly negative across… view at source ↗
Figure 48. n8409 activations across prompts for all four tasks, organized by input variables. This is a period 10 neuron. We also show read/write scores for g_i, u_i, d_i with input and output spaces. We can see that this neuron's gate vector g_8409 reads equally from both input subspaces (checkered pattern across examples, and the first two bars are similar heights), as does its up vector u_8409. view at source ↗
Figure 49. All period 2 neurons, activations for the hours and months tasks. view at source ↗
Figure 50. All period 5 neurons, activations for the hours task. view at source ↗
Figure 51. All period 10 neurons, activations for the hours task. view at source ↗
Figure 52. All period 20 neurons, activations for the hours task. view at source ↗
Figure 53. All period 50 neurons, activations for the hours task. view at source ↗
Figure 54. All period 100 neurons, activations for the hours task. view at source ↗
Figure 55. Down projection rows d_i for all addition neurons, projected onto the Fourier plane that corresponds to their activation period from… view at source ↗
Figure 56. Period 5 neuron behavior when scrubbing across the input concept for the… view at source ↗
Figure 57. Neuron outputs for the hours prompt four hours after 18:00 = 22:00 projected onto the Fourier plane for each period. T ∈ {2, 5, 10, 20} activate strongly in the correct directions, while T ∈ {50, 100} activations are close to zero for this prompt. view at source ↗
read the original abstract

Does structure in representations imply structure in computation? We study how Llama-3.1-8B reasons over cyclic concepts (e.g., "what month is six months after August?"). Even though Llama-3.1-8B's representations for these concepts are circularly structured, we find that instead of directly computing modular addition in the period of the cyclic concept (e.g., 12 for months), the model re-uses a generic addition mechanism across tasks that operates independently of concept-specific geometry. First, it computes the sum of its two inputs using base-10 addition (six + August = 14). Then, it maps this sum back to cyclic concept space (14 → February). We show that Llama-3.1-8B uses task-agnostic Fourier features to compute these sums; in fact, these features have periods that respect standard base-10 addition, e.g., 2, 5, and 10, rather than the cyclic concept period (e.g., 12 for months). Furthermore, we identify a sparse set of 28 MLP neurons re-used across all tasks (approximately 0.2% of the MLP at layer 18) that can be partitioned into disjoint clusters, each computing the sum for a Fourier feature with a different period. Our work highlights how an interplay between causal abstraction and feature geometry can deepen our mechanistic understanding of LMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript claims that Llama-3.1-8B reasons over cyclic concepts (e.g., months) by first computing the sum of inputs via a generic base-10 addition mechanism implemented with task-agnostic Fourier features whose periods are 2, 5, and 10 (rather than the cyclic period such as 12), followed by a remapping step back into cyclic space. It further identifies a sparse set of 28 MLP neurons (0.2% of the layer-18 MLP) that are reused across tasks and can be partitioned into disjoint clusters each responsible for one Fourier component of the sum.

Significance. If the causal responsibility of the identified neurons and Fourier features is established, the work would demonstrate reuse of general arithmetic circuits for domain-specific cyclic reasoning in LLMs, providing a concrete case study of how feature geometry (Fourier periods) interacts with computational mechanisms identified via causal abstraction. This advances mechanistic interpretability by showing that representation structure need not dictate the underlying algorithm.

major comments (3)
  1. [§4.3] §4.3 (Neuron Identification): The 28 MLP neurons are identified via activation clustering and correlation with Fourier components, but the manuscript reports no causal interventions such as selective ablation, activation patching, or counterfactual editing that would show these neurons are necessary and sufficient for the base-10 addition step (e.g., producing base-10 carry errors like 6+8=13 while preserving the subsequent cyclic mapping). Without such tests the features could be downstream correlates of numeric training data rather than the operative circuit.
  2. [§3.2] §3.2 (Fourier Feature Extraction): The claim that the extracted features have periods strictly respecting base-10 addition (2, 5, 10) rather than cyclic periods is supported only by correlational analysis of representations. No quantitative model comparison (e.g., R² or reconstruction error for base-10 vs. modular-12 bases) or control for embedding-geometry confounds is provided to establish that the periods are computed by the model rather than induced by the analysis pipeline; a minimal sketch of such a comparison follows this list.
  3. [Methods] Methods section: The abstract references experiments using causal abstraction and neuron identification, yet the manuscript lacks full details on sample sizes, statistical thresholds, exact clustering procedure, and controls for multiple comparisons. This prevents evaluation of whether the reported sparsity (28 neurons) and period selectivity are robust or sensitive to analysis choices.
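The comparison requested in major comment 2 could take the following minimal form. This is a sketch under stated assumptions, not the paper's method: `acts` (an n_prompts × d_model array of layer-18 activations) and `sums` (each prompt's pre-modulo sum) are hypothetical precomputed inputs, and a real version would need held-out evaluation and geometry controls.

```python
import numpy as np

def fourier_targets(sums: np.ndarray, T: float) -> np.ndarray:
    # Target features for a period-T probe: a point on the period-T circle.
    theta = 2 * np.pi * sums / T
    return np.stack([np.cos(theta), np.sin(theta)], axis=1)

def probe_r2(acts: np.ndarray, sums: np.ndarray, T: float) -> float:
    # Ordinary least-squares linear probe from activations to the targets.
    Y = fourier_targets(sums, T)
    W, *_ = np.linalg.lstsq(acts, Y, rcond=None)
    return 1.0 - ((Y - acts @ W) ** 2).sum() / ((Y - Y.mean(0)) ** 2).sum()

for T in (2, 5, 10, 12):  # base-10 periods vs. the months cycle
    print(f"T={T}: R^2 = {probe_r2(acts, sums, T):.3f}")
```

If the paper's claim holds, the base-10 periods should yield systematically higher R² than the concept-matched period 12 on cyclic-task activations.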
minor comments (2)
  1. [Figure 4] Figure 4 (neuron clusters): axis labels and color legends are insufficiently described, making it difficult to verify the claimed disjoint partitioning into period-specific clusters.
  2. [§3.1] The notation for Fourier periods (e.g., “period 2 feature”) is introduced without an explicit equation linking it to the standard Fourier basis used in the analysis; a candidate definition is sketched below.
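One candidate equation for minor comment 2, stated as our assumption consistent with the standard Fourier basis rather than the manuscript's own notation:

```latex
% A "period-T feature" of an integer n as its position on the period-T circle,
\[
  f_T(n) = \bigl(\cos(2\pi n/T),\; \sin(2\pi n/T)\bigr), \qquad T \in \{2, 5, 10\},
\]
% on which addition acts by rotation:
\[
  f_T(a+b) = R\!\left(\tfrac{2\pi b}{T}\right) f_T(a),
  \qquad
  R(\varphi) = \begin{pmatrix} \cos\varphi & -\sin\varphi \\ \sin\varphi & \cos\varphi \end{pmatrix}.
\]
```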

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which identifies key opportunities to strengthen the causal claims and methodological transparency of our work. We address each major comment below and will revise the manuscript accordingly to incorporate additional experiments and details.

read point-by-point responses
  1. Referee: [§4.3] §4.3 (Neuron Identification): The 28 MLP neurons are identified via activation clustering and correlation with Fourier components, but the manuscript reports no causal interventions such as selective ablation, activation patching, or counterfactual editing that would show these neurons are necessary and sufficient for the base-10 addition step (e.g., producing base-10 carry errors like 6+8=13 while preserving the subsequent cyclic mapping). Without such tests the features could be downstream correlates of numeric training data rather than the operative circuit.

    Authors: We appreciate the referee's observation that our neuron identification in §4.3 relies on activation clustering and correlation with Fourier components. While these methods reveal a sparse, task-agnostic set of neurons reused across cyclic tasks, we agree that explicit causal interventions would provide stronger evidence of necessity and sufficiency for the base-10 addition mechanism. In the revised manuscript, we will add selective ablation results demonstrating that removing these 28 neurons increases base-10 carry errors (e.g., 6+8 producing 13) while leaving the subsequent cyclic remapping intact. We will also include activation patching experiments to establish sufficiency; a schematic of such a patching test is sketched after this list. These will be reported with quantitative metrics in an expanded §4.3. revision: yes

  2. Referee: [§3.2] §3.2 (Fourier Feature Extraction): The claim that the extracted features have periods strictly respecting base-10 addition (2, 5, 10) rather than cyclic periods is supported only by correlational analysis of representations; no quantitative model comparison (e.g., R² or reconstruction error for base-10 vs. modular-12 bases) or control for embedding geometry confounds is provided to establish that the periods are computed by the model rather than induced by the analysis pipeline.

    Authors: We acknowledge that the evidence for base-10 period selectivity in §3.2 is currently correlational. To address this, the revised version will include quantitative model comparisons, reporting R² values and reconstruction errors when fitting base-10 Fourier bases (periods 2, 5, 10) versus modular bases matched to each cyclic concept period (e.g., 12 for months). We will also add controls for embedding geometry confounds by repeating the Fourier analysis on permuted or randomly projected embeddings. These comparisons will be added to §3.2 to demonstrate that the observed periods reflect the model's internal computation. revision: yes

  3. Referee: [Methods] Methods section: The abstract references experiments using causal abstraction and neuron identification, yet the manuscript lacks full details on sample sizes, statistical thresholds, exact clustering procedure, and controls for multiple comparisons. This prevents evaluation of whether the reported sparsity (28 neurons) and period selectivity are robust or sensitive to analysis choices.

    Authors: We agree that the Methods section requires greater detail to support evaluation of the reported findings. In the revision, we will expand the Methods to specify all sample sizes used in the experiments, the exact statistical thresholds (including correlation cutoffs and any significance criteria), the precise clustering procedure (algorithm, parameters such as number of clusters, and initialization method), and the approach to multiple-comparison correction. These additions will allow readers to assess the robustness of the 28-neuron sparsity and the selectivity to periods 2, 5, and 10. revision: yes
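As a concrete reading of the sufficiency test promised in response 1, an activation-patching experiment could be sketched as follows. Everything here is hedged: `model`, `source_inputs`, `target_inputs`, and the neuron list are placeholders under a HuggingFace Llama layout, not the authors' actual setup.

```python
import torch

LAYER = 18
NEURONS: list[int] = []  # the 28 identified layer-18 neuron indices
down_proj = model.model.layers[LAYER].mlp.down_proj
cache = {}

def capture(module, args):
    # Record the gated activations at the final token of the source prompt.
    cache["acts"] = args[0][:, -1, NEURONS].detach()

def patch(module, args):
    # Overwrite the same columns at the final token of the target prompt.
    (acts,) = args
    acts = acts.clone()
    acts[:, -1, NEURONS] = cache["acts"]
    return (acts,)

h = down_proj.register_forward_pre_hook(capture)
model(**source_inputs)        # donor: a plain addition prompt, e.g. "6 + 8 ="
h.remove()

h = down_proj.register_forward_pre_hook(patch)
out = model(**target_inputs)  # recipient: "What month is six months after August?"
h.remove()
# If the neurons carry the sum, out.logits at the final position should favor
# the donor-implied answer rather than the target prompt's own.
```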

Circularity Check

0 steps flagged

No circularity: empirical analysis of representations and interventions

full rationale

The paper's central claims rest on direct empirical inspection of Llama-3.1-8B activations, identification of Fourier features with periods 2/5/10, and clustering of 28 MLP neurons across tasks. These are observational and interventional findings (representation geometry, neuron reuse) rather than any derivation, prediction, or first-principles result that reduces to fitted parameters or self-referential definitions by construction. No equations, ansatzes, or uniqueness theorems are invoked; no self-citations appear as load-bearing premises. The analysis pipeline produces new observations about the model's computation without tautological renaming or statistical forcing of the reported patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical observation that representations are circular yet computation proceeds via base-10 mechanisms; no explicit free parameters, axioms, or invented entities are stated in the abstract.

pith-pipeline@v0.9.0 · 5601 in / 1120 out tokens · 34625 ms · 2026-05-09T18:42:48.832957+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

165 extracted references · 32 canonical work pages · 4 internal anchors

  1. [1] Sequential Group Composition: A Window into the Mechanics of Deep Learning. arXiv preprint arXiv:2602.03655.
  2. [2] Bi-Orthogonal Factor Decomposition for Vision Transformers. arXiv preprint arXiv:2601.05328.
  3. [3] Pareto frontiers in deep feature learning: Data, compute, width, and luck. Advances in Neural Information Processing Systems.
  4. [4] SGD learning on neural networks: leap complexity and saddle-to-saddle dynamics. The Thirty Sixth Annual Conference on Learning Theory, 2023.
  5. [5] Alternating gradient flows: A theory of feature learning in two-layer neural networks. arXiv preprint arXiv:2506.06489.
  6. [6] What Happens During the Loss Plateau? Understanding Abrupt Learning in Transformers. arXiv preprint arXiv:2506.13688.
  7. [7] Kyle O’Brien, Stephen Casper, Quentin Anthony, Tomek Korbak, Robert Kirk, Xander Davies, Ishan Mishra, Geoffrey Irving, Yarin Gal, and Stella Biderman. Representation shattering in transformers: A synthetic study with knowledge editing. arXiv preprint arXiv:2410.17194.
  8. [8] On the emergence of linear analogies in word embeddings. arXiv preprint arXiv:2505.18651.
  9. [9] Michael Pearce, Elana Simon, Michael Byun, and Daniel Balsam. Goodfire.
  10. [10] Wes Gurnee, Emmanuel Ameisen, Isaac Kauvar, Julius Tarng, Adam Pearce, Chris Olah, and Joshua Batson. Transformer Circuits Thread.
  11. [11] Neural Manifold Geometry Encodes Feature Fields. NeurIPS Workshop on Symmetry and Geometry in Neural Representations (NeurReps).
  12. [12] Grid-World Representations in Transformers Reflect Predictive Geometry. arXiv preprint arXiv:2603.16689.
  13. [13] Uncovering hidden geometry in transformers via disentangling position and context. arXiv preprint arXiv:2310.04861, 2023.
  14. [14] Constrained Belief Updating and Geometric Structures in Transformer Representations. NeurIPS 2024 Workshop on Symmetry and Geometry in Neural Representations.
  15. [15] Transformers represent belief state geometry in their residual stream. Advances in Neural Information Processing Systems.
  16. [16] Abrupt learning in transformers: A case study on matrix completion. Advances in Neural Information Processing Systems.
  17. [17] Kernelized Concept Erasure. 2023.
  18. [18] Linear Adversarial Concept Erasure. 2022.
  19. [20] Nora Belrose and David Schneider. Advances in Neural Information Processing Systems 36 (NeurIPS 2023), New Orleans, LA, USA, December 10-16, 2023.
  20. [21] Shauli Ravfogel, Yoav Goldberg, and Ryan Cotterell. Log-Linear Guardedness and its Implications. 2023. doi:10.18653/V1/2023.ACL-LONG.523.
  21. [22] Understanding Neural Networks through Representation Erasure. 2017.
  22. [23] Nick Cammarata, Shan Carter, Gabriel Goh, Chris Olah, Michael Petrov, Ludwig Schubert, Chelsea Voss, Ben Egan, and Swee Kiat Lim. Distill.
  23. [24] Dissecting Recall of Factual Associations in Auto-Regressive Language Models. 2023.
  24. [25] Universality and individuality in neural dynamics across large populations of recurrent networks. Advances in Neural Information Processing Systems.
  25. [26] Neural representational geometry underlies few-shot concept learning. Proceedings of the National Academy of Sciences, 2022.
  26. [27] A mathematical theory of semantic development in deep neural networks. Proceedings of the National Academy of Sciences, 2019.
  27. [28] Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics.
  28. [29] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is All you Need.
  29. [30] Alessandro Stolfo, Yonatan Belinkov, and Mrinmaya Sachan. A Mechanistic Interpretation of Arithmetic Reasoning in Language Models using Causal Mediation Analysis. 2023. doi:10.18653/V1/2023.EMNLP-MAIN.435.
  30. [31] Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations. ArXiv.
  31. [32] Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart Shieber. Investigating Gender Bias in Language Models Using Causal Mediation Analysis.
  32. [33] Aaron Mueller, Jannik Brinkmann, Millicent Li, Samuel Marks, Koyena Pal, Nikhil Prakash, Can Rager, Aruna Sankaranarayanan, Arnab Sen Sharma, Jiuding Sun, Eric Todd, David Bau, and Yonatan Belinkov. Computational Linguistics, 2026. doi:10.1162/COLI.a.572.
  33. [34] Atticus Geiger, Duligur Ibeling, Amir Zur, Maheep Chaudhary, Sonakshi Chauhan, Jing Huang, Aryaman Arora, Zhengxuan Wu, Noah Goodman, Christopher Potts, and Thomas Icard. Journal of Machine Learning Research.
  34. [35] The Shape of Beliefs: Geometry, Dynamics, and Interventions along Representation Manifolds of Language Models' Posteriors. 2026.
  35. [36] Not All Language Model Features Are One-Dimensionally Linear. 2025.
  36. [37] The Origins of Representation Manifolds in Large Language Models. 2025.
  37. [38] Pre-trained Large Language Models Use Fourier Features to Compute Addition. 2024.
  38. [39] Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics. 2025.
  39. [40] Language Models Use Trigonometry to Do Addition. 2025.
  40. [41] Progress measures for grokking via mechanistic interpretability. 2023.
  41. [42] The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks. 2023.
  42. [43] Atticus Geiger, Hanson Lu, Thomas Icard, and Christopher Potts. Proceedings of the 35th International Conference on Neural Information Processing Systems, 2021.
  43. [44] AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders. 2025.
  44. [45] Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability. 2025.
  45. [46] Combining Causal Models for More Accurate Abstractions of Neural Networks. 2025.
  46. [47] Correlations in the Data Lead to Semantically Rich Feature Geometry Under Superposition. The Fourteenth International Conference on Learning Representations.
  47. [48] Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization. 2025.
  48. [49] Enhancing Automated Interpretability with Output-Centric Feature Descriptions. 2025.
  49. [50] How Causal Abstraction Underpins Computational Explanation. 2025.
  50. [51] How Do Transformers Learn Variable Binding in Symbolic Programs? 2025.
  51. [52] HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks. 2025.
  52. [53] HyperSteer: Activation Steering at Scale with Hypernetworks. 2025.
  53. [54] MIB: A Mechanistic Interpretability Benchmark. 2025.
  54. [55] Xiaoyan Bai, Itamar Pres, Yuntian Deng, Chenhao Tan, Stuart M. Shieber, and Fernanda B. Vi. Why Can't Transformers Learn Multiplication? Reverse-Engineering Reveals Long-Range Dependency Pitfalls. 2025. doi:10.48550/ARXIV.2510.00184.
  55. [56] Mixing Mechanisms: How Language Models Retrieve Bound Entities In-Context. 2025.
  56. [57] Open Problems in Mechanistic Interpretability. 2025.
  57. [58] A Reply to Makelov et al. (2023)'s Interpretability Illusions Arguments. 2024.
  58. [59] Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small. 2024.
  59. [60] Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations. 2024.
  60. [61] Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching. 2024.
  61. [62] Language Models Linearly Represent Sentiment. 2024.
  62. [63] pyvene: A Library for Understanding and Improving PyTorch Models via Interventions. 2024.
  63. [64] RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations. 2024.
  64. [65] Recurrent Neural Networks Learn to Store and Generate Sequences using Non-Linear Representations. 2024.
  65. [66] ReFT: Representation Finetuning for Language Models. 2024.
  66. [67] Updating CLIP to Prefer Descriptions Over Captions. 2024.
  67. [68] A Semantics for Causing, Enabling, and Preventing Verbs Using Structural Causal Models. 2023.
  68. [69] Causal Abstraction with Soft Interventions. 2023.
  69. [70] Amit Arnold Levy and Mor Geva. Language Models Encode Numbers Using Digit Representations in Base 10. 2025. doi:10.18653/V1/2025.NAACL-SHORT.33.
  70. [71] Priors in Time: Missing Inductive Biases for Language Model Interpretability. 2025.
  71. [72] From Flat to Hierarchical: Extracting Sparse Representations with Matching Pursuit. 2025.
  72. [73] MIB: A Mechanistic Interpretability Benchmark. 2025.
  73. [74] LLMs Process Lists With General Filter Heads. The Fourteenth International Conference on Learning Representations (ICLR 2026). arXiv:2510.26784.
  74. [75] From Directions to Regions: Decomposing Activations in Language Models via Local Geometry. 2026.
  75. [76] FoNE: Precise Single-Token Number Embeddings via Fourier Features. arXiv preprint arXiv:2502.09741.
  76. [77] Lukas Braun, Erin Grant, and Andrew M. Saxe. Not all solutions are created equal: An analytical dissociation of functional and representational similarity in deep linear neural networks. 2025.
  77. [78] Vector Arithmetic in Concept and Token Subspaces. Second Mechanistic Interpretability Workshop at NeurIPS.
  78. [79] Feature Learning beyond the Lazy-Rich Dichotomy: Insights from Representational Geometry. 2025.
  79. [80] Hyunmo Kang, Abdulkadir Canatar, and SueYeon Chung. CoRR, 2025. doi:10.48550/ARXIV.2502.19648.
  80. [81] Language Models use Lookbacks to Track Beliefs. 2025.
Showing first 80 references.