pith. machine review for the scientific record.

arxiv: 2605.09239 · v1 · submitted 2026-05-10 · 💻 cs.CL · cs.LG

Recognition: no theorem link

Repeated-Token Counting Reveals a Dissociation Between Representations and Outputs

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 04:15 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords repeated token counting · linear probes · MLP overwrite · internal representations · LLM failures · routing vs representation · residual stream

The pith

Linear probes show LLMs track repeated-token counts correctly internally, but a late MLP block overwrites them before output.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models fail at counting repeated tokens in lists even though they succeed on many other tasks. Linear probes recover the correct count with near-perfect accuracy from the residual stream at every layer after the embedding, including layers where the model has already begun to emit the wrong final answer. The error originates in a format-specific MLP block that activates on space-separated repeated words and replaces the accurate count with a fixed incorrect value at roughly 88 to 93 percent of network depth. The same overwriting is absent for repeated digit tokens and is suppressed by comma delimiters in larger models. This pattern across several model families indicates that the failure is one of routing the representation to the output rather than of forming the count representation itself.

Core claim

Linear probes on the residual stream decode the correct count with near-perfect accuracy at every post-embedding layer, across all model depths. This holds even at the exact layers where the wrong answer crystallizes while the model simultaneously outputs an incorrect count. A format-triggered multi-layer perceptron (MLP) block overwrites the correctly encoded count with a fixed wrong answer at roughly 88–93% network depth for repeated word-tokens in space-separated list format.
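The probing setup this claim rests on can be sketched in a few lines. The snippet below is a toy illustration, not the paper's code: synthetic activations with a planted count direction stand in for real residual-stream activations, and the probe is an L2-regularized linear regression scored by cross-validated R², matching the per-layer R² > 0.99 figures the paper reports.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Toy stand-in for residual-stream activations at one layer: the true
# repeat count is linearly embedded along a single direction plus noise.
# In the real experiment, `acts` would come from the model's hidden
# states on repeated-token lists.
n_samples, d_model = 500, 64
counts = rng.integers(1, 20, size=n_samples).astype(float)
count_dir = rng.normal(size=d_model)            # hypothetical count direction
acts = np.outer(counts, count_dir) + 0.1 * rng.normal(size=(n_samples, d_model))

# Linear probe: ridge regression from activations to the true count,
# scored by 5-fold cross-validated R^2.
r2 = cross_val_score(Ridge(alpha=1.0), acts, counts,
                     cv=5, scoring="r2").mean()
print(f"probe R^2 = {r2:.3f}")
```

If the count were not linearly represented at a layer, the cross-validated R² would collapse toward zero, which is exactly the contrast the paper reports at the embedding layer.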

What carries the argument

A format-triggered multi-layer perceptron (MLP) block that activates on repeated word-tokens in space-separated lists and overwrites the correct internal count with a fixed incorrect value at late network depth.

If this is right

  • Interventions that target or bypass the late MLP block could restore correct counting outputs.
  • Comma-separated formats suppress the overwriting effect in larger models but not smaller ones.
  • The same dissociation between accurate representations and incorrect outputs occurs at consistent relative depths across Llama-3.2 and Qwen2.5 models.
  • Attention patterns show no collapse on repeated tokens, eliminating attention-based explanations for the failure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Analogous late overwriting blocks may underlie other LLM failures where internal probes reveal correct knowledge but outputs remain wrong.
  • Targeted editing of specific MLP blocks at consistent relative depths could improve reliability on simple arithmetic tasks without full retraining.
  • The persistence of the overwriting prior across model scales suggests that pure scaling will not automatically resolve routing errors of this kind.

Load-bearing premise

Linear probes faithfully extract the model's true internal count without probe-induced artifacts, and the identified MLP block is causally responsible for overwriting the count rather than merely correlated with the output error.

What would settle it

An intervention that ablates or edits only the MLP block at 88-93% depth and measures whether output counts become accurate for repeated word tokens while other behaviors remain unchanged.

Figures

Figures reproduced from arXiv: 2605.09239 by Sohan Venkatesh.

Figure 1. Illustration of the routing failure: the correct count is linearly encoded in the residual stream at every layer, yet a format-triggered MLP block overwrites it with a fixed wrong answer. The correct count persists in later layers but does not reach the output.
Figure 2. Model output vs. correct count across sequence lengths for repeated-token counting. The dashed diagonal denotes the correct answer. Llama-1B collapses to a dominant attractor state of “8” while Llama-3B collapses to “14” after n=10. Qwen-1.5B remains correct through n=8 before drifting to “10”. Qwen-3B and Qwen-7B remain mostly accurate with only late-sequence degradation.
Figure 3. Probe R² per layer for Llama-3.2-1B. Linear probes achieve R² > 0.99 from L01 onward for the repeated-token condition. The embedding layer produces chance-level predictions (R² = −0.21), confirming the count signal emerges from transformer computation rather than token embeddings. The same experiment on Llama-3.2-3B yields R² > 0.99 from L01 throughout all 28 layers for both conditions. On Qwen2.5-1.5B, R…
Figure 4. Primary MLP writer depth across all three models, showing the 79–93% depth regularity. Each marker shows where the wrong answer is first stably written. Llama-3B has two writers (L22 initiates, L26 reinstates after erasure at L23).
Figure 5. The banana token is over-attended (ratio > 1.5) in 10 of 36 layers, receiving 2–4× the attention of individual apple tokens. [Axes: layer index vs. banana/apple attention ratio.]
read the original abstract

Large language models fail at counting repeated tokens despite strong performance on broader reasoning benchmarks. These failures are commonly attributed to limitations in internal count tracking. We show this attribution is wrong. Linear probes on the residual stream decode the correct count with near-perfect accuracy at every post-embedding layer, across all model depths. This holds even at the exact layers where the wrong answer crystallizes while the model simultaneously outputs an incorrect count. Attention patterns show no evidence of collapse over repeated tokens and tokenization artifacts account for none of the failure. Instead, a format-triggered multi-layer perceptron (MLP) block overwrites the correctly-encoded count with a fixed wrong answer at roughly 88--93,% network depth. This prior fires for repeated word-tokens in space-separated list format and is absent for repeated digit-tokens. It is suppressed by comma-separated delimiters in larger models but persists in smaller ones. The finding holds across Llama-3.2 (1B and 3B) and Qwen2.5 (1.5B, 3B and 7B) at consistent relative depth. Counting failure is a failure of routing not of representation and the two require different interventions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that failures of large language models to correctly count repeated tokens in lists are not due to inadequate internal count representations. Instead, linear probes trained on the residual stream can decode the correct count with near-perfect accuracy from every post-embedding layer, even at depths where the model's output has crystallized to the incorrect count. The authors identify a format-triggered MLP block at roughly 88-93% network depth that overwrites the correct count with a fixed incorrect value, specific to space-separated word token lists. This mechanism is absent for digit tokens and modulated by delimiters in larger models. The findings are replicated across Llama-3.2 (1B, 3B) and Qwen2.5 (1.5B, 3B, 7B) models, with attention patterns and tokenization ruled out as causes. The conclusion is that counting failure is a routing issue rather than a representation issue.

Significance. If the central claims are substantiated, particularly the causal role of the identified MLP block, this paper would make a notable contribution to mechanistic interpretability in NLP. It provides evidence for a dissociation between what is represented in the model's activations and what is output, highlighting that strong internal encodings can be overridden by late-stage computations triggered by input format. The cross-model consistency at similar relative depths suggests a general phenomenon. This could have implications for designing interventions that target specific layers or formats to improve counting and similar tasks. The empirical approach using linear probes to uncover hidden information is a positive aspect, though the absence of causal interventions limits the strength of the overwriting attribution.

major comments (2)
  1. [Identification of the MLP block (around 88-93% depth)] The paper attributes the overwriting of the correctly encoded count to a specific MLP block based on its location coinciding with the crystallization of the wrong output and its format specificity. However, this remains correlational. No activation patching, ablation, or other causal interventions are reported to isolate the block's contribution to the output error. As a result, it is unclear whether this MLP is the active agent performing the overwrite or merely correlated with downstream effects. This is central to the dissociation claim and requires strengthening.
  2. [Linear probe experiments] The near-perfect accuracy of linear probes in decoding the correct count is a key pillar of the argument that representations are intact. However, the manuscript lacks details on probe training procedures, including the training data used, any regularization, control tasks, or statistical tests for significance. Without these, it is difficult to rule out that the high accuracy is due to probe overfitting or artifacts rather than faithful extraction of the model's internal state. This affects the soundness of the representation claim.
minor comments (2)
  1. [Abstract] There is a typographical error in the abstract: '88--93,% network depth' contains a stray comma between the number and the percent sign.
  2. The manuscript would benefit from clearer notation or a dedicated section defining how the 'count' target is constructed for probing and how input formats are exactly tokenized and presented.
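The activation patching asked for in major comment 1 has a simple shape, sketched here on a toy model: cache the suspect block's output from a run where it is inactive (digit lists), splice it into the word-list run, and see whether the output flips to the correct count. The toy block, the two format labels, and the fixed answer 8 are all illustrative assumptions, not the paper's code.

```python
import numpy as np

D = 4
count_dir = np.zeros(D)
count_dir[0] = 1.0                              # toy count direction

def late_mlp(fmt, stream):
    # Format-triggered overwrite: active only for space-separated words.
    if fmt == "words":
        return (8.0 - stream[0]) * count_dir    # replace count with fixed 8
    return np.zeros(D)                          # inactive for digit lists

def forward(fmt, true_count, patched_mlp_out=None):
    stream = true_count * count_dir             # correct count written early
    delta = late_mlp(fmt, stream) if patched_mlp_out is None else patched_mlp_out
    return (stream + delta)[0]

# Clean run: digit format, the block writes nothing; cache its output.
clean_out = late_mlp("digits", 13 * count_dir)

assert forward("words", 13) == 8.0              # corrupted run: wrong answer
assert forward("words", 13, patched_mlp_out=clean_out) == 13.0  # patched run
```

If patching the clean (inactive) output restores the correct count while leaving digit-list behavior untouched, the block's causal role is established; mere correlation would not survive this swap.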

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive review and for recognizing the potential contribution of this work to mechanistic interpretability. We address each major comment below and indicate the revisions planned for the manuscript.

read point-by-point responses
  1. Referee: [Identification of the MLP block (around 88-93% depth)] The paper attributes the overwriting of the correctly encoded count to a specific MLP block based on its location coinciding with the crystallization of the wrong output and its format specificity. However, this remains correlational. No activation patching, ablation, or other causal interventions are reported to isolate the block's contribution to the output error. As a result, it is unclear whether this MLP is the active agent performing the overwrite or merely correlated with downstream effects. This is central to the dissociation claim and requires strengthening.

    Authors: We agree that the attribution of the overwriting role to the identified MLP block is based on correlational evidence, specifically the precise alignment between the block's depth and the crystallization of the incorrect output, the block's format-specific triggering (present for space-separated word lists but absent for digits and modulated by delimiters), and its consistency across the five models tested. No activation patching or ablation experiments were conducted. This limits the direct causal claim. In the revised manuscript we will add an explicit limitations paragraph acknowledging the correlational nature of the MLP identification and will include additional supporting analyses such as direct inspection of the MLP's contribution to the residual stream at that depth. We maintain that the overall dissociation between intact representations (via probes) and incorrect outputs is robustly evidenced even without these interventions. revision: partial

  2. Referee: [Linear probe experiments] The near-perfect accuracy of linear probes in decoding the correct count is a key pillar of the argument that representations are intact. However, the manuscript lacks details on probe training procedures, including the training data used, any regularization, control tasks, or statistical tests for significance. Without these, it is difficult to rule out that the high accuracy is due to probe overfitting or artifacts rather than faithful extraction of the model's internal state. This affects the soundness of the representation claim.

    Authors: We acknowledge the omission of detailed probe methodology. In the revised manuscript we will add a dedicated methods subsection and appendix that fully specifies the linear probe procedures: probes were trained as logistic regression classifiers on residual-stream activations using held-out lists with balanced count distributions; L2 regularization was applied with cross-validated strength; control tasks included probes trained on label-permuted data and on non-count features; statistical significance was evaluated via 5-fold cross-validation, bootstrap confidence intervals, and comparison against random-feature baselines. These additions will confirm that the reported near-perfect accuracies reflect genuine extraction of count information rather than overfitting or artifacts. revision: yes
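The protocol described in this response can be mirrored in a few lines. The snippet below is a synthetic illustration of the stated recipe (L2-regularized logistic probe, cross-validated accuracy, label-permutation control); the data, dimensions, and class counts are invented, not taken from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)

# Synthetic stand-in for residual-stream activations with the repeat
# count (1..5) linearly embedded along one direction plus noise.
n_samples, d_model = 600, 32
counts = rng.integers(1, 6, size=n_samples)
count_dir = rng.normal(size=d_model)
acts = counts[:, None] * count_dir + 0.2 * rng.normal(size=(n_samples, d_model))

probe = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)

# Cross-validated accuracy of the probe on the true labels.
real_acc = cross_val_score(probe, acts, counts, cv=5).mean()

# Control: the same probe on permuted labels should drop to chance,
# ruling out accuracy that comes from probe capacity alone.
perm_acc = cross_val_score(probe, acts, rng.permutation(counts), cv=5).mean()
print(f"real {real_acc:.2f} vs permuted {perm_acc:.2f}")
```

The gap between the real and permuted scores is what licenses the claim that the probe reads genuine count information rather than fitting noise.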

Circularity Check

0 steps flagged

No circularity: purely empirical probe-based dissociation analysis

full rationale

The paper's claims rest on linear probing experiments, attention pattern observations, and cross-model replication across Llama and Qwen variants. No derivation chain, equations, or self-referential definitions exist that reduce outputs to inputs by construction. Probe accuracies and MLP timing identifications are direct measurements, not fitted predictions renamed as results. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing premises. The work is self-contained against external benchmarks via replication and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the domain assumption that linear probes recover true internal representations and on the post-hoc identification of the MLP as the overwrite site. No explicit free parameters are stated in the abstract.

axioms (1)
  • domain assumption Linear probes on residual streams can decode linearly represented information such as token counts if it is present.
    Standard assumption in mechanistic interpretability work; invoked implicitly when claiming probes decode the correct count.
invented entities (1)
  • format-triggered MLP block · no independent evidence
    purpose: Overwrites correctly encoded count with fixed wrong answer at late layers for space-separated word lists
    Identified from probe accuracy drop and layer-wise analysis; no independent falsifiable prediction provided in abstract.

pith-pipeline@v0.9.0 · 5503 in / 1275 out tokens · 54991 ms · 2026-05-12T04:15:09.501747+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 3 internal anchors

  1. Burns, C., Ye, H., Klein, D., and Steinhardt, J. Discovering latent knowledge in language models without supervision. arXiv preprint arXiv:2212.03827, 2022.
  2. Datta, A., Marreddy, M., Mehler, A., Zhao, Z., and Mamidi, R. From early encoding to late suppression: Interpreting LLMs on character counting tasks. arXiv preprint arXiv:2604.00778.
  3. Geva, M., Schuster, R., Berant, J., and Levy, O. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 5484–5495.
  4. Geva, M., Caciularu, A., Wang, K., and Goldberg, Y. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 30–45.
  5. Goldowsky-Dill, N., MacLeod, C., Sato, L., and Arora, A. Localizing model behavior with path patching. arXiv preprint arXiv:2304.05969, 2023.
  6. Hasani, H., Banayeeanzade, M., Nafisi, A., Mohammadian, S., Askari, F., Bagherian, M., Izadi, A., and Baghshah, M. S. Mechanistic interpretability of large-scale counting in LLMs through a system-2 strategy. arXiv preprint arXiv:2601.02989.
  7. Marks, S. and Tegmark, M. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. arXiv preprint arXiv:2310.06824.
  8. Interpreting GPT: the logit lens. URL https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens. Press, O., Zhang, M., Min, S., Schmidt, L., Smith, N. A., and Lewis, M. Measuring and narrowing the compositionality gap in language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 5687–5711.
  9. Stolfo, A., Belinkov, Y., and Sachan, M. A mechanistic interpretation of arithmetic reasoning in language models using causal mediation analysis. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 7035–7052.
  10. Tigges, C., Hollinsworth, O. J., Geiger, A., and Nanda, N. Linear representations of sentiment in large language models. arXiv preprint arXiv:2310.15154.
  11. Vig, J., Gehrmann, S., Belinkov, Y., Qian, S., Nevo, D., Sakenis, S., Huang, J., Singer, Y., and Shieber, S. Causal mediation analysis for interpreting neural NLP: The case of gender bias. arXiv preprint arXiv:2004.12265, 2020.
  12. Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. Efficient streaming language models with attention sinks. In International Conference on Learning Representations, 2024, pp. 21875–21895.
  13. Qwen2.5 Technical Report. URL https://arxiv.org/abs/2412.15115. Zhang, X., Cao, J., and You, C. Counting ability of large language models and impact of tokenization. arXiv preprint arXiv:2410.19730.