
arxiv: 2606.08562 · v1 · pith:SB6SWXTSnew · submitted 2026-06-07 · 💻 cs.CL

Inside the LLM Word Factory

Pith reviewed 2026-06-27 18:20 UTC · model grok-4.3

classification 💻 cs.CL
keywords detokenization · activation patching · subword tokens · transformer layers · attention mechanism · MLP composition · positional encoding · Llama models

The pith

Transformer models detokenize subwords via a two-stage attention-then-MLP process in early layers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Language models receive text as subword fragments but must recover word-level representations to handle semantics. The paper traces this reconciliation with activation patching in paired experiments that isolate individual model components. In Llama2-7B it localizes the process to layer 1, where attention relays a token-specific signal from nonfinal subwords and the MLP composes that signal with the local embedding. The same two-stage structure appears in twelve models across eight families, though the number of layers required varies with the type of positional encoding used. Early-layer activations alone support a probe that detects successful detokenization at 0.94-0.97 AUROC.

Core claim

In Llama2-7B, English detokenization is a two-stage process at Layer 1. Attention transmits a token-specific signal from nonfinal subwords, using sequential relays if necessary, while the MLP composes it with the local embedding. This two-stage structure generalizes to twelve models from eight families. The depth over which it takes place depends on the flavor of positional encoding: RoPE-based models detokenize over 1 to 5 layers, while learned-absolute models take 5 to 10. A probe for determining the success of the detokenization process based on early-layer activations alone performs at 0.94-0.97 AUROC depending on the amount of context.

What carries the argument

The two-stage detokenization process at early layers in which attention transmits signals from nonfinal subwords and the MLP composes them with the local embedding.
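
To make the two-stage picture concrete, here is a toy sketch of one attention-then-MLP update at the final subword position. It is not the paper's code and not Llama2-7B's architecture: the single attention head, the random weights, and the names `two_stage_last_position` and `params` are all assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def two_stage_last_position(embeddings, params):
    """Toy attention-then-MLP update at the final subword position.

    embeddings: (T, D) array, one row per subword of a single word.
    params: dict of hypothetical weights (w_q, w_k, w_v: D x D; w_in: D x H; w_out: H x D).
    """
    x = np.asarray(embeddings, dtype=float)
    last = x[-1]
    d = x.shape[1]
    # Stage 1: attention at the last position reads every position up to itself.
    query = last @ params["w_q"]
    keys = x @ params["w_k"]
    values = x @ params["w_v"]
    weights = softmax(keys @ query / np.sqrt(d))
    resid_mid = last + weights @ values
    # Stage 2: the MLP composes the relayed signal with the local embedding.
    hidden = np.maximum(resid_mid @ params["w_in"], 0.0)
    resid_post = resid_mid + hidden @ params["w_out"]
    return resid_post, weights

rng = np.random.default_rng(0)
D, H = 8, 16
params = {
    "w_q": rng.normal(scale=0.3, size=(D, D)),
    "w_k": rng.normal(scale=0.3, size=(D, D)),
    "w_v": rng.normal(scale=0.3, size=(D, D)),
    "w_in": rng.normal(scale=0.3, size=(D, H)),
    "w_out": rng.normal(scale=0.3, size=(H, D)),
}
word = rng.normal(size=(2, D))       # a word split into two subwords (synthetic)
out, attn = two_stage_last_position(word, params)

other = word.copy()
other[0] = rng.normal(size=D)        # swap only the nonfinal subword
out_other, _ = two_stage_last_position(other, params)
print("attention weights:", attn)
print("change at final position:", np.linalg.norm(out - out_other))
```

Swapping only the nonfinal subword changes the output at the final position, which is the sense in which attention carries a token-specific signal in this toy.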

If this is right

  • Detokenization can be localized and studied by isolating attention signal transmission and MLP composition through activation patching.
  • The two-stage structure holds across twelve models from eight families, indicating a shared solution to subword aggregation.
  • The number of layers required for detokenization ranges from 1-5 in RoPE models to 5-10 in learned-absolute models.
  • Early-layer activations contain enough information for a probe that identifies successful detokenization at 0.94-0.97 AUROC.
  • Attention can use sequential relays to carry the token-specific signal across multiple nonfinal subwords when needed.
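
The probe result can be illustrated with a small sketch. The Figure 6 caption names class-mean-difference probes, so the code below fits one on synthetic activations and scores it with AUROC. The data, dimensions, and function names are assumptions; nothing here reproduces the paper's 0.94-0.97 figures.

```python
import numpy as np

def fit_mean_difference_probe(acts, labels):
    """Return the probe direction: mean of positives minus mean of negatives."""
    acts = np.asarray(acts, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    return acts[labels].mean(axis=0) - acts[~labels].mean(axis=0)

def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))

# Synthetic stand-in for early-layer activations: positives are shifted by +1 per dimension.
rng = np.random.default_rng(0)
D, N = 16, 200
labels = np.arange(2 * N) % 2 == 1
acts = rng.normal(size=(2 * N, D)) + labels[:, None] * 1.0

train, test = slice(0, N), slice(N, 2 * N)
direction = fit_mean_difference_probe(acts[train], labels[train])
held_out_auroc = auroc(acts[test] @ direction, labels[test])
print(f"held-out AUROC on synthetic data: {held_out_auroc:.3f}")
```

The pairwise AUROC used here is quadratic in the number of examples, which is acceptable for a sketch but would be replaced by a rank-based computation on larger sets.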

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If detokenization completes so early, later layers can treat tokens as already word-level units when performing higher-level reasoning.
  • The probe based on early activations could monitor tokenization quality in deployed systems without requiring a full model forward pass.
  • The difference in layer depth between RoPE and learned-absolute encodings suggests that positional encoding choices shape how quickly models learn to aggregate subwords.
  • Applying the same patching method to non-English text could test whether the two-stage mechanism is language-specific or universal.
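
For readers unfamiliar with the RoPE versus learned-absolute distinction, the sketch below shows the defining property of rotary embeddings: after rotation, the query-key dot product depends only on the relative offset between positions. It uses an interleaved-pair layout and base 10000 as assumptions, and it says nothing about why detokenization depth differs between the two encoding types.

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Rotate consecutive pairs of dimensions of a 1-D vector by position-dependent angles."""
    x = np.asarray(x, dtype=float)
    d = x.shape[-1]
    if d % 2 != 0:
        raise ValueError("dimension must be even")
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per pair of dimensions
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    even, odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = even * cos - odd * sin
    out[1::2] = even * sin + odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Same relative offset (2), different absolute positions.
near = rope(q, 5) @ rope(k, 3)
far = rope(q, 12) @ rope(k, 10)
print(near, far)
```

A learned-absolute scheme instead adds a trained vector per position to the token embedding, so no such invariance holds by construction.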

Load-bearing premise

The controlled paired experiments with activation patching isolate the causal contribution of attention and MLP components at the identified layers without residual confounding from later layers or unpatched pathways.
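
A minimal sketch of the patching idea, assuming a toy residual network rather than any model from the paper: the output of one component from a clean run is substituted into a corrupted run, and the final state is compared with the clean result. The weights, the cosine metric, and the names `run`, `WEIGHTS`, and `cosine` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_LAYERS = 8, 4
# Hypothetical toy weights; each "component" writes an update into the residual stream.
WEIGHTS = [rng.normal(scale=0.5, size=(D, D)) for _ in range(N_LAYERS)]

def run(x, patch=None):
    """Run the toy residual model.

    patch: optional (layer_index, vector); the component output at that layer
    is replaced by the vector before it is added to the residual stream.
    Returns the final residual stream and the list of component outputs.
    """
    h = np.array(x, dtype=float)
    outputs = []
    for i, w in enumerate(WEIGHTS):
        out = np.tanh(w @ h)
        if patch is not None and patch[0] == i:
            out = np.array(patch[1], dtype=float)
        outputs.append(out)
        h = h + out
    return h, outputs

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

clean_in, corrupt_in = rng.normal(size=D), rng.normal(size=D)
clean_out, clean_cache = run(clean_in)
corrupt_out, corrupt_cache = run(corrupt_in)

# Patch each component's clean output into the corrupted run, one layer at a time.
for layer in range(N_LAYERS):
    patched_out, _ = run(corrupt_in, patch=(layer, clean_cache[layer]))
    print(f"layer {layer}: cosine to clean output = {cosine(patched_out, clean_out):.3f}")
```

Even in this toy, the patched run still carries the corrupted input forward through the residual stream, so later components see a mix of clean and corrupted information. That is the kind of unpatched pathway the premise above has to rule out.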

What would settle it

Running the paired activation-patching experiments on Llama2-7B and finding that patching the layer-1 attention and MLP leaves detokenization accuracy unchanged would falsify the localization claim.
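
The figure captions on this page report results as a percentage of the canonicity gap closed and mention an 80% gap-closed depth. The exact formula is not given here, so the sketch below assumes a common normalization: the patched metric's distance from the corrupted baseline, divided by the clean-minus-corrupted gap. Function names and the 0-based layer indexing are assumptions and may not match the paper's layer numbering.

```python
def gap_closed_percent(patched, corrupted, clean):
    """Percentage of the clean-vs-corrupted metric gap recovered by a patch (assumed definition)."""
    gap = clean - corrupted
    if gap == 0:
        raise ValueError("clean and corrupted metrics are identical; the gap is undefined")
    return 100.0 * (patched - corrupted) / gap

def first_layer_reaching(per_layer_percent, threshold=80.0):
    """Index of the first layer whose gap-closed percentage meets the threshold, or None."""
    for layer, value in enumerate(per_layer_percent):
        if value >= threshold:
            return layer
    return None

# Made-up numbers for illustration only.
print(gap_closed_percent(patched=0.6, corrupted=0.2, clean=1.0))
print(first_layer_reaching([10.0, 55.0, 85.0, 99.0]))
```

Under this definition, the falsifying outcome described above would show up as a gap-closed value near 0% for the layer-1 attention and MLP patches.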

Figures

Figures reproduced from arXiv: 2606.08562 by Benzi Busigin, Yuval Pinter.

Figure 1: Experimental design and activation patching. (a) Each word the tokenizer renders as a single token is run twice: as the canonical single token and as an artificial split. Canonicity compares the layer-30 residual streams at the canonical position and the split’s last subword position. (b) LST pairs share the second token; FST pairs share the first. Each pair has one successful and one failed split, with ca…
Figure 2: Activation patching at the second token. Values show the percentage of the canonicity …
Figure 4: Canonicity at the last position after corrupt…
Figure 5: Canonicity retained at the last subword po…
Figure 6: Layerwise probe AUROC on Llama2-7B (k=2). Class-mean-difference probes fit on isolated activations (Isolated), in-context activations (In-context), or fit on isolated and applied to in-context activations of held-out words (Isolated→In-context). Vertical line: l* = L2, the gap-closed-80% depth (see …
Figure 7: Per-layer Spearman correlation between cosine similarity to the canonical representation and three …
Figure 8: Next-token KL divergence between split-input …
Figure 9: Per-position relay across word lengths. Canon…
Figure 10: Localizing the two-stage mechanism across twelve architectures. Red curve (left y-axis): percentage of …
Figure 11: Percentage of canonicity_src gap closed per layer via resid_post patching at the final subword position, for words of length k ∈ {2, 3, 4} tokens. One panel per model. First-layer indices at which each curve reaches 80% gap-closed are listed in …
Figure 12: Layerwise probe AUROC for all 12 architectures at …
Original abstract

Transformer language models process input provided as subword fragments, but natural language semantics usually rely on word-level concepts. Detokenization is the process where models reconcile these two facts, aggregating subwords into word-level representations through their computation. Prior work has found that this takes place mostly in early-to-middle layers, but so far the exact mechanics of the process have not been pinned down. We venture deep into detokenization using activation patching in controlled paired experiments that isolate the contribution of different model components, localizing English detokenization in Llama2-7B to a two-stage process at Layer 1. Attention transmits a token-specific signal from nonfinal subwords, using sequential relays if necessary, while the MLP composes it with the local embedding. This two-stage structure generalizes to twelve models from eight families, but the depth over which it takes place depends on the flavor of positional encoding: RoPE-based models detokenize over 1 to 5 layers, while learned-absolute models take 5 to 10. Finally, we provide a probe for determining the success of the detokenization process based on early-layer activations alone, performing at 0.94-0.97 AUROC depending on the amount of context.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that detokenization in LLMs—aggregating subword tokens into word-level representations—occurs via a two-stage process localized to Layer 1 in Llama2-7B: attention heads transmit a token-specific signal from nonfinal subwords (with sequential relays), while the MLP composes this with the local embedding. Activation patching in controlled paired experiments is used to isolate these contributions. The two-stage structure is reported to generalize to twelve models across eight families, with the number of layers involved depending on positional encoding (RoPE: 1-5 layers; learned absolute: 5-10 layers). A probe based on early-layer activations achieves 0.94-0.97 AUROC for detecting successful detokenization.

Significance. If the causal localization via patching holds without residual confounding, the work would offer a concrete mechanistic account of an early-layer computation that bridges subword tokenization and word-level semantics, extending prior observations about early-to-middle layer involvement. The cross-model generalization and the early-layer probe represent potential strengths for interpretability research.

major comments (2)
  1. [Experimental setup and results on Llama2-7B (implied in abstract description of controlled paired experiments)] The central localization claim rests on activation patching isolating the causal role of Layer 1 attention and MLP. However, in a residual architecture, later layers can receive the original subword embeddings directly via the residual stream and potentially reconstruct or route around the patched signal. The manuscript does not report experiments that (a) patch all downstream pathways, (b) test whether the detokenization metric can be recovered by any combination of layers >1, or (c) compare against patching only layers >1. Without these controls, the strict localization to Layer 1 (and the narrow early band in other models) is not causally secured.
  2. [Generalization experiments across models] The generalization claim across twelve models and the dependence on positional encoding type (RoPE vs. learned absolute) is load-bearing for the broader contribution. The abstract provides no quantitative breakdown of per-model layer ranges, error bars, or controls for model size/family confounds that would allow evaluation of whether the reported depth differences are robust or post-hoc.
minor comments (2)
  1. The probe AUROC values (0.94-0.97) are reported as depending on context amount, but no table or figure details the exact context lengths, baseline comparisons, or how context is defined in the probe construction.
  2. Notation for 'nonfinal subwords' and 'sequential relays' should be defined more explicitly with reference to specific token positions or attention patterns in the methods section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the strength of our causal claims and generalization results. We respond to each major point below.

Point-by-point responses
  1. Referee: [Experimental setup and results on Llama2-7B (implied in abstract description of controlled paired experiments)] The central localization claim rests on activation patching isolating the causal role of Layer 1 attention and MLP. However, in a residual architecture, later layers can receive the original subword embeddings directly via the residual stream and potentially reconstruct or route around the patched signal. The manuscript does not report experiments that (a) patch all downstream pathways, (b) test whether the detokenization metric can be recovered by any combination of layers >1, or (c) compare against patching only layers >1. Without these controls, the strict localization to Layer 1 (and the narrow early band in other models) is not causally secured.

    Authors: The paired clean/corrupted activation patching design isolates the contribution of Layer 1 by showing that targeted interventions there reliably alter the final detokenization metric while the residual stream from earlier layers remains intact. Nevertheless, the referee's suggested controls would further rule out reconstruction by later layers. We will add experiments that (i) patch only layers >1 and (ii) compare against full downstream patching, reporting whether detokenization can be recovered without the Layer 1 stage. These results and an expanded discussion of residual pathways will appear in the revised manuscript. revision: yes

  2. Referee: [Generalization experiments across models] The generalization claim across twelve models and the dependence on positional encoding type (RoPE vs. learned absolute) is load-bearing for the broader contribution. The abstract provides no quantitative breakdown of per-model layer ranges, error bars, or controls for model size/family confounds that would allow evaluation of whether the reported depth differences are robust or post-hoc.

    Authors: Section 4 of the full manuscript already contains a per-model table listing the exact layer ranges for all twelve models, with results grouped by positional-encoding type and accompanied by standard-error bars from repeated runs. To address potential size/family confounds we selected models spanning 7B–13B parameters across eight families while holding the RoPE vs. absolute distinction as the primary variable; we will add an explicit paragraph discussing these design choices and any remaining limitations. A concise summary table will also be moved into the abstract for the revision. revision: partial

Circularity Check

0 steps flagged

No circularity: localization rests on external activation-patching interventions

Full rationale

The paper's central claim is that English detokenization occurs via a two-stage attention-then-MLP process localized to Layer 1 in Llama2-7B and to early layers in other models. That claim is obtained through controlled paired activation-patching experiments that measure causal contributions. No equations, fitted parameters, self-definitional constructs, or load-bearing self-citations appear in the provided text that would reduce this localization result to an input by construction. The result rests on interventional measurements rather than on internal derivations or renamings, so a circularity score of 0 is the appropriate finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, axioms, or invented entities; all entries are therefore unknown.


