Repeated-Token Counting Reveals a Dissociation Between Representations and Outputs
Pith reviewed 2026-05-12 04:15 UTC · model grok-4.3
The pith
Linear probes show that LLMs track repeated-token counts correctly in their internal representations, but a late MLP block overwrites them before output.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Linear probes on the residual stream decode the correct count with near-perfect accuracy at every post-embedding layer, across all model depths. This holds even at the exact layers where the wrong answer crystallizes and the model outputs an incorrect count. A format-triggered multi-layer perceptron (MLP) block overwrites the correctly encoded count with a fixed wrong answer at roughly 88-93% network depth for repeated word-tokens in space-separated list format.
What carries the argument
A format-triggered multi-layer perceptron (MLP) block that activates on repeated word-tokens in space-separated lists and overwrites the correct internal count with a fixed incorrect value at late network depth.
If this is right
- Interventions that target or bypass the late MLP block could restore correct counting outputs.
- Comma-separated formats suppress the overwriting effect in larger models but not smaller ones.
- The same dissociation between accurate representations and incorrect outputs occurs at consistent relative depths across Llama-3.2 and Qwen2.5 models.
- Attention patterns show no collapse on repeated tokens, eliminating attention-based explanations for the failure.
Where Pith is reading between the lines
- Analogous late overwriting blocks may underlie other LLM failures where internal probes reveal correct knowledge but outputs remain wrong.
- Targeted editing of specific MLP blocks at consistent relative depths could improve reliability on simple arithmetic tasks without full retraining.
- The persistence of the overwriting prior across model scales suggests that pure scaling will not automatically resolve routing errors of this kind.
Load-bearing premise
Linear probes faithfully extract the model's true internal count without probe-induced artifacts, and the identified MLP block is causally responsible for overwriting the count rather than merely correlated with the output error.
What would settle it
An intervention that ablates or edits only the MLP block at 88-93% depth and measures whether output counts become accurate for repeated word tokens while other behaviors remain unchanged.
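A minimal sketch of what such an ablation could look like, assuming a Hugging Face Llama-style model; the layer index, prompt, and hook placement are illustrative guesses, not the paper's reported setup:

```python
# Sketch: zero-ablate the MLP block at ~90% depth and compare output counts.
# Model choice, layer index, and prompt are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-3.2-1B"  # one of the model families studied
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

n_layers = model.config.num_hidden_layers
target = int(round(0.90 * n_layers))  # ~88-93% relative depth (illustrative)

def ablate_mlp(module, inputs, output):
    # Returning a value from a forward hook replaces the module's output,
    # removing the MLP's additive contribution to the residual stream.
    return torch.zeros_like(output)

prompt = "Count the repetitions: cat cat cat cat cat. Answer:"
ids = tok(prompt, return_tensors="pt")

with torch.no_grad():
    baseline = model.generate(**ids, max_new_tokens=3)

handle = model.model.layers[target].mlp.register_forward_hook(ablate_mlp)
with torch.no_grad():
    ablated = model.generate(**ids, max_new_tokens=3)
handle.remove()

n = ids["input_ids"].shape[1]
print("baseline:", tok.decode(baseline[0, n:]))
print("ablated :", tok.decode(ablated[0, n:]))
```

If routing rather than representation is at fault, the ablated run should recover the correct count on repeated-word lists while digit lists and unrelated prompts remain unchanged; a mean-ablation or activation-patching variant would control for the bluntness of zeroing.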
Original abstract
Large language models fail at counting repeated tokens despite strong performance on broader reasoning benchmarks. These failures are commonly attributed to limitations in internal count tracking. We show this attribution is wrong. Linear probes on the residual stream decode the correct count with near-perfect accuracy at every post-embedding layer, across all model depths. This holds even at the exact layers where the wrong answer crystallizes while the model simultaneously outputs an incorrect count. Attention patterns show no evidence of collapse over repeated tokens, and tokenization artifacts account for none of the failure. Instead, a format-triggered multi-layer perceptron (MLP) block overwrites the correctly encoded count with a fixed wrong answer at roughly 88-93% network depth. This prior fires for repeated word-tokens in space-separated list format and is absent for repeated digit-tokens. It is suppressed by comma-separated delimiters in larger models but persists in smaller ones. The finding holds across Llama-3.2 (1B and 3B) and Qwen2.5 (1.5B, 3B and 7B) at consistent relative depth. Counting failure is a failure of routing, not of representation, and the two require different interventions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that failures of large language models to correctly count repeated tokens in lists are not due to inadequate internal count representations. Instead, linear probes trained on the residual stream can decode the correct count with near-perfect accuracy from every post-embedding layer, even at depths where the model's output has crystallized to the incorrect count. The authors identify a format-triggered MLP block at roughly 88-93% network depth that overwrites the correct count with a fixed incorrect value, specific to space-separated word token lists. This mechanism is absent for digit tokens and modulated by delimiters in larger models. The findings are replicated across Llama-3.2 (1B, 3B) and Qwen2.5 (1.5B, 3B, 7B) models, with attention patterns and tokenization ruled out as causes. The conclusion is that counting failure is a routing issue rather than a representation issue.
Significance. If the central claims are substantiated, particularly the causal role of the identified MLP block, this paper would make a notable contribution to mechanistic interpretability in NLP. It provides evidence for a dissociation between what is represented in the model's activations and what is output, highlighting that strong internal encodings can be overridden by late-stage computations triggered by input format. The cross-model consistency at similar relative depths suggests a general phenomenon. This could have implications for designing interventions that target specific layers or formats to improve counting and similar tasks. The empirical approach using linear probes to uncover hidden information is a positive aspect, though the absence of causal interventions limits the strength of the overwriting attribution.
major comments (2)
- [Identification of the MLP block (around 88-93% depth)] The paper attributes the overwriting of the correctly encoded count to a specific MLP block based on its location coinciding with the crystallization of the wrong output and its format specificity. However, this evidence remains correlational. No activation patching, ablation, or other causal interventions are reported to isolate the block's contribution to the output error. As a result, it is unclear whether this MLP is the active agent performing the overwrite or merely correlated with downstream effects. This is central to the dissociation claim and requires strengthening; a sketch of one possible intervention follows these comments.
- [Linear probe experiments] The near-perfect accuracy of linear probes in decoding the correct count is a key pillar of the argument that representations are intact. However, the manuscript lacks details on probe training procedures, including the training data used, any regularization, control tasks, or statistical tests for significance. Without these, it is difficult to rule out that the high accuracy is due to probe overfitting or artifacts rather than faithful extraction of the model's internal state. This affects the soundness of the representation claim.
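One way to operationalize the requested causal test, sketched under stated assumptions: capture the suspect MLP block's output on a clean comma-separated prompt and patch it into the corrupted space-separated run. The model choice, layer index, and prompts below are illustrative, not taken from the paper.

```python
# Sketch: activation-patch the ~90%-depth MLP output from a clean
# (comma-separated) run into a corrupted (space-separated) run.
# Model choice, layer index, and prompts are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-1.5B"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()
layer = model.model.layers[int(0.90 * model.config.num_hidden_layers)]

clean = tok("How many words? cat, cat, cat, cat, cat. Answer:", return_tensors="pt")
corrupt = tok("How many words? cat cat cat cat cat. Answer:", return_tensors="pt")

cache = {}
def save(mod, args, out):
    cache["mlp"] = out.detach()

def patch(mod, args, out):
    # Overwrite the final-position MLP output with the clean run's value.
    out = out.clone()
    out[:, -1, :] = cache["mlp"][:, -1, :]
    return out

h = layer.mlp.register_forward_hook(save)
with torch.no_grad():
    model(**clean)
h.remove()

h = layer.mlp.register_forward_hook(patch)
with torch.no_grad():
    logits = model(**corrupt).logits[:, -1, :]
h.remove()
print("patched next token:", tok.decode(logits.argmax(-1)))
```

If the block merely correlates with the error, the patched run should still be wrong; if it performs the overwrite, the patch should restore the correct count.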
minor comments (2)
- [Abstract] There is a typographical error in the abstract: '88--93,% network depth', where a stray comma appears between the number and the percent sign.
- The manuscript would benefit from clearer notation or a dedicated section defining how the 'count' target is constructed for probing and how input formats are exactly tokenized and presented.
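To make this concrete, a minimal sketch of one way the inputs and probe targets could be constructed; the word pool, templates, and count range are illustrative assumptions, not the paper's protocol:

```python
# Sketch: build count-labeled prompts in the two list formats at issue.
# The word pool, count range, and template are illustrative assumptions.
import random

WORDS = ["cat", "dog", "tree", "book"]

def make_example(sep: str, max_count: int = 10) -> tuple[str, int]:
    word = random.choice(WORDS)
    count = random.randint(2, max_count)  # probe target: the true repeat count
    listing = sep.join([word] * count)
    prompt = f"How many times does '{word}' appear? {listing} Answer:"
    return prompt, count

space_prompt, label = make_example(sep=" ")   # format said to trigger the prior
comma_prompt, _ = make_example(sep=", ")      # format said to suppress it in larger models
print(label, "|", space_prompt)
```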
Simulated Author's Rebuttal
We thank the referee for their constructive review and for recognizing the potential contribution of this work to mechanistic interpretability. We address each major comment below and indicate the revisions planned for the manuscript.
Point-by-point responses
-
Referee: [Identification of the MLP block (around 88-93% depth)] The paper attributes the overwriting of the correctly encoded count to a specific MLP block based on its location coinciding with the crystallization of the wrong output and its format specificity. However, this remains correlational. No activation patching, ablation, or other causal interventions are reported to isolate the block's contribution to the output error. As a result, it is unclear whether this MLP is the active agent performing the overwrite or merely correlated with downstream effects. This is central to the dissociation claim and requires strengthening.
Authors: We agree that the attribution of the overwriting role to the identified MLP block is based on correlational evidence, specifically the precise alignment between the block's depth and the crystallization of the incorrect output, the block's format-specific triggering (present for space-separated word lists but absent for digits and modulated by delimiters), and its consistency across the five models tested. No activation patching or ablation experiments were conducted. This limits the direct causal claim. In the revised manuscript we will add an explicit limitations paragraph acknowledging the correlational nature of the MLP identification and will include additional supporting analyses such as direct inspection of the MLP's contribution to the residual stream at that depth. We maintain that the overall dissociation between intact representations (via probes) and incorrect outputs is robustly evidenced even without these interventions. revision: partial
-
Referee: [Linear probe experiments] The near-perfect accuracy of linear probes in decoding the correct count is a key pillar of the argument that representations are intact. However, the manuscript lacks details on probe training procedures, including the training data used, any regularization, control tasks, or statistical tests for significance. Without these, it is difficult to rule out that the high accuracy is due to probe overfitting or artifacts rather than faithful extraction of the model's internal state. This affects the soundness of the representation claim.
Authors: We acknowledge the omission of detailed probe methodology. In the revised manuscript we will add a dedicated methods subsection and appendix that fully specifies the linear probe procedures: probes were trained as logistic regression classifiers on residual-stream activations using held-out lists with balanced count distributions; L2 regularization was applied with cross-validated strength; control tasks included probes trained on label-permuted data and on non-count features; statistical significance was evaluated via 5-fold cross-validation, bootstrap confidence intervals, and comparison against random-feature baselines. These additions will confirm that the reported near-perfect accuracies reflect genuine extraction of count information rather than overfitting or artifacts. revision: yes
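A minimal sketch of the protocol described above, assuming activations have already been dumped to disk; the file names and hyperparameters are hypothetical:

```python
# Sketch: cross-validated logistic-regression probe on residual-stream
# activations, plus a label-permutation control. The .npy files are
# hypothetical dumps: X has shape [n_examples, d_model] (one layer, one
# position); y holds the true repeat counts as class labels.
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = np.load("resid_layer_k.npy")   # hypothetical activation dump
y = np.load("counts.npy")          # hypothetical count labels

probe = LogisticRegressionCV(Cs=10, cv=5, max_iter=2000)  # L2 penalty by default
acc = cross_val_score(probe, X, y, cv=5).mean()

# Control: a probe trained on permuted labels should fall to chance,
# ruling out accuracy that reflects probe capacity rather than the model.
ctrl = cross_val_score(probe, X, rng.permutation(y), cv=5).mean()
print(f"probe accuracy: {acc:.3f}   permuted-label control: {ctrl:.3f}")
```

Reporting such a permutation control alongside a random-feature baseline would address the overfitting concern directly.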
Circularity Check
No circularity: purely empirical probe-based dissociation analysis
full rationale
The paper's claims rest on linear probing experiments, attention pattern observations, and cross-model replication across Llama and Qwen variants. No derivation chain, equations, or self-referential definitions exist that reduce outputs to inputs by construction. Probe accuracies and MLP timing identifications are direct measurements, not fitted predictions renamed as results. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing premises. The work is self-contained against external benchmarks via replication and does not exhibit any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- (domain assumption) Linear probes on residual streams can decode linearly represented information, such as token counts, if it is present.
invented entities (1)
- format-triggered MLP block (no independent evidence)
Reference graph
Works this paper leans on
-
[1]
Burns, C., Ye, H., Klein, D., and Steinhardt, J. Discovering latent knowledge in language models without supervision. arXiv preprint arXiv:2212.03827, 2022.
-
[2]
Datta, A., Marreddy, M., Mehler, A., Zhao, Z., and Mamidi, R. From early encoding to late suppression: Interpreting LLMs on character counting tasks. arXiv preprint arXiv:2604.00778.
-
[3]
Geva, M., Schuster, R., Berant, J., and Levy, O. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 5484–5495, 2021.
-
[4]
Geva, M., Caciularu, A., Wang, K., and Goldberg, Y. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 30–45, 2022.
-
[5]
Goldowsky-Dill, N., MacLeod, C., Sato, L., and Arora, A. Localizing model behavior with path patching. arXiv preprint arXiv:2304.05969, 2023.
-
[6]
Hasani, H., Banayeeanzade, M., Nafisi, A., Mohammadian, S., Askari, F., Bagherian, M., Izadi, A., and Baghshah, M. S. Mechanistic interpretability of large-scale counting in LLMs through a system-2 strategy. arXiv preprint arXiv:2601.02989, 2026.
-
[7]
Marks, S. and Tegmark, M. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. arXiv preprint arXiv:2310.06824, 2023.
-
[8]
nostalgebraist. Interpreting GPT: the logit lens, 2020. URL https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens.
Press, O., Zhang, M., Min, S., Schmidt, L., Smith, N. A., and Lewis, M. Measuring and narrowing the compositionality gap in language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 5687–5711, 2023.
-
[9]
Stolfo, A., Belinkov, Y., and Sachan, M. A mechanistic interpretation of arithmetic reasoning in language models using causal mediation analysis. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 7035–7052, 2023.
-
[10]
Tigges, C., Hollinsworth, O. J., Geiger, A., and Nanda, N. Linear representations of sentiment in large language models. arXiv preprint arXiv:2310.15154, 2023.
-
[11]
Vig, J., Gehrmann, S., Belinkov, Y., Qian, S., Nevo, D., Sakenis, S., Huang, J., Singer, Y., and Shieber, S. Causal mediation analysis for interpreting neural NLP: The case of gender bias. arXiv preprint arXiv:2004.12265, 2020.
-
[12]
Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. Efficient streaming language models with attention sinks. In International Conference on Learning Representations, pp. 21875–21895, 2024.
-
[13]
Qwen Team. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024. URL https://arxiv.org/abs/2412.15115.
-
[14]
Zhang, X., Cao, J., and You, C. Counting ability of large language models and impact of tokenization. arXiv preprint arXiv:2410.19730, 2024.