Repeated-Token Counting Reveals a Dissociation Between Representations and Outputs
Pith reviewed 2026-05-12 04:15 UTC · model grok-4.3
The pith
Linear probes show that LLMs track repeated-token counts correctly in their internal representations, but a late MLP block overwrites them before output.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Linear probes on the residual stream decode the correct count with near-perfect accuracy at every post-embedding layer, across all model depths. This holds even at the exact layers where the wrong answer crystallizes and the model outputs an incorrect count. A format-triggered multi-layer perceptron (MLP) block overwrites the correctly encoded count with a fixed wrong answer at roughly 88-93% network depth for repeated word-tokens in space-separated list format.
What carries the argument
A format-triggered multi-layer perceptron (MLP) block that activates on repeated word-tokens in space-separated lists and overwrites the correct internal count with a fixed incorrect value at late network depth.
If this is right
- Interventions that target or bypass the late MLP block could restore correct counting outputs.
- Comma-separated formats suppress the overwriting effect in larger models but not smaller ones.
- The same dissociation between accurate representations and incorrect outputs occurs at consistent relative depths across Llama-3.2 and Qwen2.5 models.
- Attention patterns show no collapse on repeated tokens, eliminating attention-based explanations for the failure.
Where Pith is reading between the lines
- Analogous late overwriting blocks may underlie other LLM failures where internal probes reveal correct knowledge but outputs remain wrong.
- Targeted editing of specific MLP blocks at consistent relative depths could improve reliability on simple arithmetic tasks without full retraining.
- The persistence of the overwriting prior across model scales suggests that pure scaling will not automatically resolve routing errors of this kind.
Load-bearing premise
Linear probes faithfully extract the model's true internal count without probe-induced artifacts, and the identified MLP block is causally responsible for overwriting the count rather than merely correlated with the output error.
What would settle it
An intervention that ablates or edits only the MLP block at 88-93% depth and measures whether output counts become accurate for repeated word tokens while other behaviors remain unchanged.
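A minimal sketch of what such an ablation could look like, assuming a Hugging Face Llama-style model; the layer index, prompt, and hook placement are illustrative guesses, not the paper's reported setup:

```python
# Sketch: zero-ablate the MLP block at ~90% depth and compare output counts.
# Model choice, layer index, and prompt are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-3.2-1B"  # one of the model families studied
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

n_layers = model.config.num_hidden_layers
target = int(round(0.90 * n_layers))  # ~88-93% relative depth (illustrative)

def ablate_mlp(module, inputs, output):
    # Returning a value from a forward hook replaces the module's output,
    # removing the MLP's additive contribution to the residual stream.
    return torch.zeros_like(output)

prompt = "Count the repetitions: cat cat cat cat cat. Answer:"
ids = tok(prompt, return_tensors="pt")

with torch.no_grad():
    baseline = model.generate(**ids, max_new_tokens=3)

handle = model.model.layers[target].mlp.register_forward_hook(ablate_mlp)
with torch.no_grad():
    ablated = model.generate(**ids, max_new_tokens=3)
handle.remove()

n = ids["input_ids"].shape[1]
print("baseline:", tok.decode(baseline[0, n:]))
print("ablated :", tok.decode(ablated[0, n:]))
```

If routing rather than representation is at fault, the ablated run should recover the correct count on repeated-word lists while digit lists and unrelated prompts remain unchanged; a mean-ablation or activation-patching variant would control for the bluntness of zeroing.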
Original abstract
Large language models fail at counting repeated tokens despite strong performance on broader reasoning benchmarks. These failures are commonly attributed to limitations in internal count tracking. We show this attribution is wrong. Linear probes on the residual stream decode the correct count with near-perfect accuracy at every post-embedding layer, across all model depths. This holds even at the exact layers where the wrong answer crystallizes while the model simultaneously outputs an incorrect count. Attention patterns show no evidence of collapse over repeated tokens, and tokenization artifacts account for none of the failure. Instead, a format-triggered multi-layer perceptron (MLP) block overwrites the correctly encoded count with a fixed wrong answer at roughly 88-93% network depth. This prior fires for repeated word-tokens in space-separated list format and is absent for repeated digit-tokens. It is suppressed by comma-separated delimiters in larger models but persists in smaller ones. The finding holds across Llama-3.2 (1B and 3B) and Qwen2.5 (1.5B, 3B and 7B) at consistent relative depth. Counting failure is a failure of routing, not of representation, and the two require different interventions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that failures of large language models to correctly count repeated tokens in lists are not due to inadequate internal count representations. Instead, linear probes trained on the residual stream can decode the correct count with near-perfect accuracy from every post-embedding layer, even at depths where the model's output has crystallized to the incorrect count. The authors identify a format-triggered MLP block at roughly 88-93% network depth that overwrites the correct count with a fixed incorrect value, specific to space-separated word token lists. This mechanism is absent for digit tokens and modulated by delimiters in larger models. The findings are replicated across Llama-3.2 (1B, 3B) and Qwen2.5 (1.5B, 3B, 7B) models, with attention patterns and tokenization ruled out as causes. The conclusion is that counting failure is a routing issue rather than a representation issue.
Significance. If the central claims are substantiated, particularly the causal role of the identified MLP block, this paper would make a notable contribution to mechanistic interpretability in NLP. It provides evidence for a dissociation between what is represented in the model's activations and what is output, highlighting that strong internal encodings can be overridden by late-stage computations triggered by input format. The cross-model consistency at similar relative depths suggests a general phenomenon. This could have implications for designing interventions that target specific layers or formats to improve counting and similar tasks. The empirical approach using linear probes to uncover hidden information is a positive aspect, though the absence of causal interventions limits the strength of the overwriting attribution.
major comments (2)
- [Identification of the MLP block (around 88-93% depth)] The paper attributes the overwriting of the correctly encoded count to a specific MLP block based on its location coinciding with the crystallization of the wrong output and its format specificity. However, this evidence remains correlational. No activation patching, ablation, or other causal interventions are reported to isolate the block's contribution to the output error. As a result, it is unclear whether this MLP is the active agent performing the overwrite or merely correlated with downstream effects. This is central to the dissociation claim and requires strengthening; a sketch of one possible intervention follows these comments.
- [Linear probe experiments] The near-perfect accuracy of linear probes in decoding the correct count is a key pillar of the argument that representations are intact. However, the manuscript lacks details on probe training procedures, including the training data used, any regularization, control tasks, or statistical tests for significance. Without these, it is difficult to rule out that the high accuracy is due to probe overfitting or artifacts rather than faithful extraction of the model's internal state. This affects the soundness of the representation claim.
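One way to operationalize the requested causal test, sketched under stated assumptions: capture the suspect MLP block's output on a clean comma-separated prompt and patch it into the corrupted space-separated run. The model choice, layer index, and prompts below are illustrative, not taken from the paper.

```python
# Sketch: activation-patch the ~90%-depth MLP output from a clean
# (comma-separated) run into a corrupted (space-separated) run.
# Model choice, layer index, and prompts are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-1.5B"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()
layer = model.model.layers[int(0.90 * model.config.num_hidden_layers)]

clean = tok("How many words? cat, cat, cat, cat, cat. Answer:", return_tensors="pt")
corrupt = tok("How many words? cat cat cat cat cat. Answer:", return_tensors="pt")

cache = {}
def save(mod, args, out):
    cache["mlp"] = out.detach()

def patch(mod, args, out):
    # Overwrite the final-position MLP output with the clean run's value.
    out = out.clone()
    out[:, -1, :] = cache["mlp"][:, -1, :]
    return out

h = layer.mlp.register_forward_hook(save)
with torch.no_grad():
    model(**clean)
h.remove()

h = layer.mlp.register_forward_hook(patch)
with torch.no_grad():
    logits = model(**corrupt).logits[:, -1, :]
h.remove()
print("patched next token:", tok.decode(logits.argmax(-1)))
```

If the block merely correlates with the error, the patched run should still be wrong; if it performs the overwrite, the patch should restore the correct count.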
minor comments (2)
- [Abstract] There is a typographical error in the abstract: '88--93,% network depth', where a stray comma appears between the number and the percent sign.
- The manuscript would benefit from clearer notation or a dedicated section defining how the 'count' target is constructed for probing and how input formats are exactly tokenized and presented.
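To make this concrete, a minimal sketch of one way the inputs and probe targets could be constructed; the word pool, templates, and count range are illustrative assumptions, not the paper's protocol:

```python
# Sketch: build count-labeled prompts in the two list formats at issue.
# The word pool, count range, and template are illustrative assumptions.
import random

WORDS = ["cat", "dog", "tree", "book"]

def make_example(sep: str, max_count: int = 10) -> tuple[str, int]:
    word = random.choice(WORDS)
    count = random.randint(2, max_count)  # probe target: the true repeat count
    listing = sep.join([word] * count)
    prompt = f"How many times does '{word}' appear? {listing} Answer:"
    return prompt, count

space_prompt, label = make_example(sep=" ")   # format said to trigger the prior
comma_prompt, _ = make_example(sep=", ")      # format said to suppress it in larger models
print(label, "|", space_prompt)
```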
Simulated Author's Rebuttal
We thank the referee for their constructive review and for recognizing the potential contribution of this work to mechanistic interpretability. We address each major comment below and indicate the revisions planned for the manuscript.
Point-by-point responses
-
Referee: [Identification of the MLP block (around 88-93% depth)] The paper attributes the overwriting of the correctly encoded count to a specific MLP block based on its location coinciding with the crystallization of the wrong output and its format specificity. However, this remains correlational. No activation patching, ablation, or other causal interventions are reported to isolate the block's contribution to the output error. As a result, it is unclear whether this MLP is the active agent performing the overwrite or merely correlated with downstream effects. This is central to the dissociation claim and requires strengthening.
Authors: We agree that the attribution of the overwriting role to the identified MLP block is based on correlational evidence, specifically the precise alignment between the block's depth and the crystallization of the incorrect output, the block's format-specific triggering (present for space-separated word lists but absent for digits and modulated by delimiters), and its consistency across the five models tested. No activation patching or ablation experiments were conducted. This limits the direct causal claim. In the revised manuscript we will add an explicit limitations paragraph acknowledging the correlational nature of the MLP identification and will include additional supporting analyses such as direct inspection of the MLP's contribution to the residual stream at that depth. We maintain that the overall dissociation between intact representations (via probes) and incorrect outputs is robustly evidenced even without these interventions. revision: partial
-
Referee: [Linear probe experiments] The near-perfect accuracy of linear probes in decoding the correct count is a key pillar of the argument that representations are intact. However, the manuscript lacks details on probe training procedures, including the training data used, any regularization, control tasks, or statistical tests for significance. Without these, it is difficult to rule out that the high accuracy is due to probe overfitting or artifacts rather than faithful extraction of the model's internal state. This affects the soundness of the representation claim.
Authors: We acknowledge the omission of detailed probe methodology. In the revised manuscript we will add a dedicated methods subsection and appendix that fully specifies the linear probe procedures: probes were trained as logistic regression classifiers on residual-stream activations using held-out lists with balanced count distributions; L2 regularization was applied with cross-validated strength; control tasks included probes trained on label-permuted data and on non-count features; statistical significance was evaluated via 5-fold cross-validation, bootstrap confidence intervals, and comparison against random-feature baselines. These additions will confirm that the reported near-perfect accuracies reflect genuine extraction of count information rather than overfitting or artifacts. revision: yes
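A minimal sketch of the protocol described above, assuming activations have already been dumped to disk; the file names and hyperparameters are hypothetical:

```python
# Sketch: cross-validated logistic-regression probe on residual-stream
# activations, plus a label-permutation control. The .npy files are
# hypothetical dumps: X has shape [n_examples, d_model] (one layer, one
# position); y holds the true repeat counts as class labels.
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = np.load("resid_layer_k.npy")   # hypothetical activation dump
y = np.load("counts.npy")          # hypothetical count labels

probe = LogisticRegressionCV(Cs=10, cv=5, max_iter=2000)  # L2 penalty by default
acc = cross_val_score(probe, X, y, cv=5).mean()

# Control: a probe trained on permuted labels should fall to chance,
# ruling out accuracy that reflects probe capacity rather than the model.
ctrl = cross_val_score(probe, X, rng.permutation(y), cv=5).mean()
print(f"probe accuracy: {acc:.3f}   permuted-label control: {ctrl:.3f}")
```

Reporting such a permutation control alongside a random-feature baseline would address the overfitting concern directly.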
Circularity Check
No circularity: purely empirical probe-based dissociation analysis
full rationale
The paper's claims rest on linear probing experiments, attention pattern observations, and cross-model replication across Llama and Qwen variants. No derivation chain, equations, or self-referential definitions exist that reduce outputs to inputs by construction. Probe accuracies and MLP timing identifications are direct measurements, not fitted predictions renamed as results. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing premises. The work is self-contained against external benchmarks via replication and does not exhibit any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- (domain assumption) Linear probes on residual streams can decode linearly represented information, such as token counts, if it is present.
invented entities (1)
- format-triggered MLP block (no independent evidence)
Reference graph
Works this paper leans on
-
[1]
Burns, C., Ye, H., Klein, D., and Steinhardt, J. Discovering latent knowledge in language models without supervision. arXiv preprint arXiv:2212.03827, 2022.
-
[2]
Datta, A., Marreddy, M., Mehler, A., Zhao, Z., and Mamidi, R. From early encoding to late suppression: Interpreting LLMs on character counting tasks. arXiv preprint arXiv:2604.00778.
-
[3]
Geva, M., Schuster, R., Berant, J., and Levy, O. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 5484–5495, 2021.
-
[4]
Geva, M., Caciularu, A., Wang, K., and Goldberg, Y. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 30–45, 2022.
-
[5]
Goldowsky-Dill, N., MacLeod, C., Sato, L., and Arora, A. Localizing model behavior with path patching. arXiv preprint arXiv:2304.05969, 2023.
-
[6]
Hasani, H., Banayeeanzade, M., Nafisi, A., Mohammadian, S., Askari, F., Bagherian, M., Izadi, A., and Baghshah, M. S. Mechanistic interpretability of large-scale counting in LLMs through a system-2 strategy. arXiv preprint arXiv:2601.02989, 2026.
-
[7]
Marks, S. and Tegmark, M. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. arXiv preprint arXiv:2310.06824, 2023.
-
[8]
nostalgebraist. Interpreting GPT: the logit lens, 2020. URL https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens.
Press, O., Zhang, M., Min, S., Schmidt, L., Smith, N. A., and Lewis, M. Measuring and narrowing the compositionality gap in language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 5687–5711, 2023.
-
[9]
Stolfo, A., Belinkov, Y., and Sachan, M. A mechanistic interpretation of arithmetic reasoning in language models using causal mediation analysis. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 7035–7052, 2023.
-
[10]
Tigges, C., Hollinsworth, O. J., Geiger, A., and Nanda, N. Linear representations of sentiment in large language models. arXiv preprint arXiv:2310.15154, 2023.
-
[11]
Vig, J., Gehrmann, S., Belinkov, Y., Qian, S., Nevo, D., Sakenis, S., Huang, J., Singer, Y., and Shieber, S. Causal mediation analysis for interpreting neural NLP: The case of gender bias. arXiv preprint arXiv:2004.12265, 2020.
-
[12]
Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. Efficient streaming language models with attention sinks. In International Conference on Learning Representations, pp. 21875–21895, 2024.
-
[13]
Qwen Team. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024. URL https://arxiv.org/abs/2412.15115.
-
[14]
Zhang, X., Cao, J., and You, C. Counting ability of large language models and impact of tokenization. arXiv preprint arXiv:2410.19730, 2024.