Defragmenting Language Models: An Interpretability-based Approach for Vocabulary Expansion
Pith reviewed 2026-05-10 08:12 UTC · model grok-4.3
The pith
Interpretability-based methods for vocabulary expansion offer a superior performance-token efficiency trade-off for non-Latin script languages.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Models exhibit subword detokenization, progressively combining fragmented tokens into larger units through their layers. This pattern informs both the selection of new vocabulary items and the initialization of their embeddings: interpretability-based selection offers a better performance-token efficiency trade-off than frequency-based baselines, and interpretability-based initialization yields gains of around 20 points on non-Latin script languages. The proposed FragMend method leverages this analysis to push the expansion's efficiency ceiling further.
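For context, the conventional initialization baseline that such methods compete against is heuristic; one common variant averages the embeddings of the subword pieces a new token replaces. A minimal sketch (illustrative only; `mean_init` and the toy vectors are our assumptions, not the paper's FragMend method):

```python
def mean_init(subword_embeddings):
    """Initialize a new vocabulary item's embedding as the mean of the
    embeddings of the subword pieces it replaces (a common heuristic
    baseline, not the paper's FragMend method)."""
    dim = len(subword_embeddings[0])
    n = len(subword_embeddings)
    return [sum(vec[i] for vec in subword_embeddings) / n for i in range(dim)]

# Toy example: a new token replacing three subword pieces in a 4-d space.
pieces = [
    [1.0, 0.0, 0.0, 2.0],
    [0.0, 1.0, 0.0, 2.0],
    [0.0, 0.0, 1.0, 2.0],
]
new_embedding = mean_init(pieces)  # each of the first three dims averages to 1/3
```

Interpretability-based initialization, as reviewed here, replaces this surface-level average with signals read out of the model's internal merging behavior.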
What carries the argument
Subword detokenization: the phenomenon in which language models merge fragmented subword tokens into larger subwords across successive layers. This layer-wise merging pattern guides both the choice of new vocabulary items and the initialization of their embeddings.
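One way to quantify this pattern (a sketch under the assumption of logit-lens-style layer readouts; the paper's exact protocol may differ): record, at each layer, which vocabulary item the representation above a word's final subword is nearest to, then find the first layer at which it matches the full word. Prevalence is the fraction of fragmented words that merge at some layer.

```python
def merge_layer(layerwise_nearest, target_word):
    """First layer index at which the representation over a word's final
    subword decodes to the full word, or None if it never merges.
    `layerwise_nearest` is assumed to come from a logit-lens-style readout."""
    for layer, decoded in enumerate(layerwise_nearest):
        if decoded == target_word:
            return layer
    return None

def detokenization_prevalence(readouts):
    """Fraction of fragmented words that merge at some layer.
    `readouts` maps each word to its per-layer nearest-token decodings."""
    merged = sum(1 for word, layers in readouts.items()
                 if merge_layer(layers, word) is not None)
    return merged / len(readouts)

# Toy readouts for two fragmented words across four layers.
readouts = {
    "मुंबई": ["मुं", "मुं", "मुंबई", "मुंबई"],  # merges at layer 2
    "δράση": ["δρ", "δρά", "δράσ", "δράσ"],   # never merges
}
print(detokenization_prevalence(readouts))  # → 0.5
```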
If this is right
- Interpretability-based item selection provides a better performance-token efficiency trade-off than frequency-based methods.
- Interpretability-based embedding initialization yields large performance gains for non-Latin script languages.
- The FragMend method further improves the efficiency ceiling of such expansions.
- These approaches address token over-fragmentation in modern open-weight LLMs.
Where Pith is reading between the lines
- Applying similar interpretability analysis could help in expanding vocabularies for other underrepresented languages or domains.
- Reducing token counts this way might lower the overall computational cost of deploying multilingual models.
- Future work could test if these initialization strategies transfer across different model architectures without retraining.
Load-bearing premise
The subword detokenization pattern is stable and general enough across models to guide embedding initialization without introducing new fragmentation or instability.
What would settle it
Observing that FragMend-initialized models show no improvement, or worse token efficiency, on a held-out set of non-Latin script languages compared to frequency-based selection with baseline initialization would falsify the central claim.
Original abstract
All languages are equal; when it comes to tokenization, some are more equal than others. Tokens are the hidden currency that dictate the cost and latency of access to contemporary LLMs. However, many languages written in non-Latin scripts observe a poor exchange rate: LLMs take several multiples of tokens to encode the same information in many languages as they do for English. Our analysis reveals that this issue, known as 'token over-fragmentation', persists in modern open-weight LLMs. The standard remedy is vocabulary expansion that adds target language items missing from the model's vocabulary. In this work, we comprehensively study and advance interpretability-based vocabulary expansion, a new research direction. We focus on two core decisions in the vocabulary expansion process: What items should we add? and How should we initialize their corresponding input and output embeddings? First, we question the conventional use of frequency-based methods to choose candidate vocabulary items to add (a decision long treated as settled), and show that interpretability-based methods offer a superior performance-token efficiency trade-off. Next, we strengthen the case for interpretability-based embedding initialization by showing large gains (~20 pts) over baseline initialization methods for several languages written in non-Latin scripts. We identify the phenomenon of "subword detokenization" where models progressively merge fragmented subword tokens into larger subwords across layers. Grounded in our analysis of this phenomenon, we propose FragMend to further push the efficiency ceiling of interpretability-based expansion. We validate the effectiveness of FragMend through comparison against strong baselines and we present extensive analysis of its design choices.
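The "exchange rate" the abstract describes is commonly measured as tokenizer fertility (tokens per word) and its premium relative to English. A minimal sketch with hypothetical token and word counts (the paper's exact metric may differ):

```python
def fertility(num_tokens, num_words):
    """Average tokens emitted per word; higher means more fragmentation."""
    return num_tokens / num_words

def token_premium(lang_tokens, lang_words, en_tokens, en_words):
    """How many times more tokens per word a language needs than English
    for comparable content; values well above 1 indicate over-fragmentation."""
    return fertility(lang_tokens, lang_words) / fertility(en_tokens, en_words)

# Toy counts for a parallel sentence pair (hypothetical tokenizer output).
print(token_premium(lang_tokens=36, lang_words=9, en_tokens=13, en_words=10))  # ≈ 3.08
```

A premium near 3 means roughly triple the cost and latency for the same content, which is the "poor exchange rate" the paper targets.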
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that token over-fragmentation in non-Latin script languages persists in modern LLMs and can be addressed via interpretability-based vocabulary expansion. It shows that interpretability signals outperform frequency-based methods for item selection in the performance-token trade-off, identifies the 'subword detokenization' phenomenon (progressive merging of subwords across layers), proposes FragMend for embedding initialization grounded in this analysis, and reports ~20pt gains over baselines for several non-Latin languages along with design-choice analysis.
Significance. If the empirical claims hold, the work offers a concrete advance in multilingual LLM efficiency by shifting from surface-frequency heuristics to model-internal signals for both item selection and initialization. The identification of subword detokenization as a reusable phenomenon and the FragMend proposal are potentially high-impact for reducing token costs in underrepresented languages. The manuscript includes direct comparisons to strong baselines and extensive design analysis, which strengthen its contribution.
major comments (3)
- The central claim that interpretability-based item selection and FragMend initialization yield a superior performance-token efficiency trade-off rests on the generality of subword detokenization, yet the analysis provides no quantitative bounds on its prevalence, layer-wise consistency, variance across checkpoints, or sensitivity to tokenizer choice; without these, it is unclear whether the observed pattern is stable enough to guide initialization without reintroducing fragmentation or instability.
- The reported ~20pt gains for non-Latin scripts are presented without error bars, statistical significance tests, or detailed dataset splits and language coverage; this omission directly affects verifiability of the superiority claim over baseline initialization methods.
- In the FragMend validation, the assumption that merging-based initialization avoids new fragmentation is not tested via ablations on cross-layer stability or post-expansion training dynamics, leaving the efficiency-ceiling improvement vulnerable to the weakest assumption identified in the work.
minor comments (2)
- The abstract refers to 'strong baselines' and 'extensive analysis of its design choices' without naming the baselines or summarizing the key ablation outcomes, which reduces immediate clarity for readers.
- Notation for the performance-token efficiency trade-off metric is introduced without an explicit equation or definition in the early sections, requiring readers to infer it from later experimental descriptions.
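The undefined trade-off metric flagged above admits at least one explicit construction; a plausible reading (our assumption, not the paper's definition) is a Pareto comparison over (fertility, task accuracy) pairs, where one method is superior if it is undominated:

```python
def pareto_front(points):
    """Return the points not dominated by any other, where a point
    (fertility, accuracy) is dominated if another has fertility <= it
    and accuracy >= it, with at least one strict inequality.
    Lower fertility and higher accuracy are both better."""
    front = []
    for i, (f_i, a_i) in enumerate(points):
        dominated = any(
            (f_j <= f_i and a_j >= a_i) and (f_j < f_i or a_j > a_i)
            for j, (f_j, a_j) in enumerate(points) if j != i
        )
        if not dominated:
            front.append((f_i, a_i))
    return front

# Hypothetical methods: (tokens per word, downstream accuracy).
methods = [(3.2, 0.61), (2.1, 0.63), (2.4, 0.58), (1.9, 0.60)]
print(pareto_front(methods))  # (2.1, 0.63) and (1.9, 0.60) survive
```

Under this reading, "superior trade-off" means the interpretability-based points dominate (or extend) the front traced by the frequency-based ones.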
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We address each major comment below, agreeing where revisions are needed to strengthen the empirical claims and providing clarifications on the existing analysis.
Point-by-point responses
-
Referee: The central claim that interpretability-based item selection and FragMend initialization yield a superior performance-token efficiency trade-off rests on the generality of subword detokenization, yet the analysis provides no quantitative bounds on its prevalence, layer-wise consistency, variance across checkpoints, or sensitivity to tokenizer choice; without these, it is unclear whether the observed pattern is stable enough to guide initialization without reintroducing fragmentation or instability.
Authors: We agree that additional quantitative characterization would strengthen the foundation for FragMend. The manuscript already demonstrates the phenomenon through layer-wise merging patterns and cross-model examples, but we will expand the analysis in revision to include explicit prevalence metrics (e.g., percentage of subwords showing progressive detokenization), layer-wise consistency scores, variance across multiple checkpoints, and sensitivity tests with alternative tokenizers. These additions will directly address the stability concerns.
Revision: yes
-
Referee: The reported ~20pt gains for non-Latin scripts are presented without error bars, statistical significance tests, or detailed dataset splits and language coverage; this omission directly affects verifiability of the superiority claim over baseline initialization methods.
Authors: We acknowledge that the current presentation lacks the statistical details needed for full verifiability. The gains are measured on standard multilingual benchmarks with consistent language sets, but in the revised manuscript we will report error bars from multiple runs, conduct statistical significance tests (e.g., paired t-tests), and provide explicit details on dataset splits, exact language coverage, and evaluation protocols to allow direct replication and comparison.
Revision: yes
-
Referee: In the FragMend validation, the assumption that merging-based initialization avoids new fragmentation is not tested via ablations on cross-layer stability or post-expansion training dynamics, leaving the efficiency-ceiling improvement vulnerable to the weakest assumption identified in the work.
Authors: We recognize the value of targeted ablations for this assumption. While the manuscript validates FragMend through direct performance comparisons and design-choice sweeps against strong baselines, we will add new experiments in revision that track cross-layer token stability post-initialization and monitor fragmentation metrics during continued training. These ablations will test whether the merging-based approach preserves or improves stability over time.
Revision: yes
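The significance testing the authors promise in response to the second comment could, for per-language paired scores, use a sign-flip permutation test, which avoids the normality assumption of a paired t-test. A self-contained sketch with hypothetical scores (the authors' actual choice of test is unspecified):

```python
import random

def paired_permutation_test(scores_a, scores_b, n_resamples=10000, seed=0):
    """Two-sided sign-flip permutation test on paired per-language scores.
    Returns an estimated p-value for the null that A and B are exchangeable."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Randomly flip the sign of each paired difference under the null.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / len(diffs)) >= observed:
            hits += 1
    return hits / n_resamples

# Hypothetical per-language accuracies for two initialization methods.
frag = [0.62, 0.58, 0.71, 0.66, 0.59, 0.64, 0.68, 0.61]
base = [0.41, 0.39, 0.52, 0.45, 0.40, 0.43, 0.49, 0.42]
p = paired_permutation_test(frag, base)  # small p: the gap is unlikely by chance
```

With only a handful of languages per script family, an exact or resampled permutation test is a defensible default alongside the paired t-tests the authors mention.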
Circularity Check
No significant circularity; empirical comparisons stand independently
Full rationale
The paper's core contributions are empirical: comparative evaluations of item selection and embedding initialization methods, plus observation of layer-wise subword merging patterns used to motivate FragMend. No load-bearing equations, fitted parameters, or self-citations reduce the reported ~20pt gains or efficiency claims to quantities defined by the method itself. The subword detokenization phenomenon is presented as an analysis result drawn from model internals, not a self-referential definition or ansatz smuggled via prior work. The derivation chain remains self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
invented entities (1)
- subword detokenization: no independent evidence