pith. machine review for the scientific record.

arxiv: 2604.05192 · v1 · submitted 2026-04-06 · 💻 cs.CL

Recognition: no theorem link

Faster Superword Tokenization

Chris Tanner, Craig W. Schmidt, Yuval Pinter

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:48 UTC · model grok-4.3

classification 💻 cs.CL
keywords tokenization · byte pair encoding · superwords · BPE · frequency aggregation · vocabulary learning · training efficiency

The pith

Supermerge candidates can be counted by frequency like regular pretokens, letting superword tokenizers train over 600 times faster with identical results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that sequences of two or more consecutive pretokens eligible for supermerges can be tallied by their frequency across a dataset exactly as single pretokens are counted. This removes the requirement to hold entire documents in memory during training. As a result, algorithms that allow tokens to form across word boundaries become practical for ordinary hardware and large collections. The authors further separate training into a first phase that learns ordinary merges and a second phase that learns supermerges, and they demonstrate that this produces the same output as the original implementations. They also establish that this two-phase version is nearly equivalent to another superword method, with one hyperparameter now chosen automatically rather than by hand.
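To make the counting idea concrete, the sketch below aggregates candidate frequencies from a pretokenized corpus. It is an illustration, not the authors' released code: the helper name, the toy corpus, and the simplification that every run of two adjacent pretokens counts as a candidate are assumptions here; the paper's eligibility rules for supermerge candidates are more restrictive.

```python
from collections import Counter
from typing import Iterable, List

def aggregate_supermerge_candidates(
    documents: Iterable[List[str]], max_len: int = 2
) -> Counter:
    """Count sequences of consecutive pretokens across a corpus so that
    supermerge learning can work from one frequency table instead of
    keeping every document in memory. Illustrative simplification: every
    run of 2..max_len adjacent pretokens is treated as a candidate."""
    counts: Counter = Counter()
    for pretokens in documents:
        for n in range(2, max_len + 1):
            for i in range(len(pretokens) - n + 1):
                counts[tuple(pretokens[i : i + n])] += 1
    return counts

# Hypothetical toy corpus of pretokenized documents.
corpus = [
    ["in", " the", " United", " States"],
    ["of", " the", " United", " States"],
]
print(aggregate_supermerge_candidates(corpus).most_common(2))
# [((' the', ' United'), 2), ((' United', ' States'), 2)]
```

Once the table is built, the documents themselves can be discarded; every later decision reads only from the counts.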

Core claim

The central claim is that supermerge candidates can be aggregated by frequency in the same manner as regular pretokens. This aggregation reproduces the exact merge sequence and final vocabulary of the original memory-intensive implementations of BoundlessBPE and SuperBPE. A two-phase formulation first completes all regular merges and then performs supermerge learning, yielding identical results while enabling a dramatic reduction in training time. The same two-phase structure also reveals a near-equivalence to SuperBPE in which the manual hyperparameter of the latter is determined automatically during the second phase.
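A minimal sketch of that two-phase control flow, under stated assumptions: phase one runs ordinary BPE over aggregated pretoken counts, and phase two selects supermerges from aggregated candidate counts. The function names, the lexical tie-break, and the static top-k selection in phase two are simplifications for illustration, not the released Python or Rust implementation; in particular, the actual second phase updates its candidate structures as supermerges are applied.

```python
from collections import Counter
from typing import Dict, List, Tuple

def merge_pair(symbols: Tuple[str, ...], pair: Tuple[str, str]) -> Tuple[str, ...]:
    """Apply one merge everywhere it occurs in a symbol sequence."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_two_phase(pretoken_counts: Counter, candidate_counts: Counter,
                    num_regular_merges: int, num_supermerges: int) -> List[Tuple[str, ...]]:
    merges: List[Tuple[str, ...]] = []
    # Phase 1: regular BPE over unique pretokens weighted by corpus frequency.
    symbolized: Dict[str, Tuple[str, ...]] = {pt: tuple(pt) for pt in pretoken_counts}
    for _ in range(num_regular_merges):
        pair_counts: Counter = Counter()
        for pt, symbols in symbolized.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += pretoken_counts[pt]
        if not pair_counts:
            break
        # Highest frequency wins; ties broken lexicographically for determinism.
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        symbolized = {pt: merge_pair(s, best) for pt, s in symbolized.items()}
    # Phase 2 (simplified): take the most frequent supermerge candidates directly
    # from the aggregated table; the real algorithm re-ranks after each supermerge.
    for candidate, _ in candidate_counts.most_common(num_supermerges):
        merges.append(candidate)
    return merges
```

The structural point the sketch conveys is that neither phase touches a document: both operate on frequency tables alone.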

What carries the argument

Frequency aggregation of supermerge candidates, which treats eligible sequences of consecutive pretokens as countable units to determine merge order without retaining full documents in memory.

Load-bearing premise

That frequency counts of supermerge candidates collected across the full dataset will always reproduce the identical merge sequence and vocabulary as the original algorithm that processes complete documents, including any tie-breaking behavior.

What would settle it

Run both the original memory-intensive implementation and the new frequency-aggregation version on the same input corpus and check whether the final vocabularies and ordered merge lists match exactly; any difference would falsify the claim of identical results.
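A sketch of such a check, assuming both trainers expose an ordered merge list and a final vocabulary (the function names and interfaces here are hypothetical):

```python
def first_divergence(merges_a, merges_b):
    """Return the index of the first differing merge, or None if the
    ordered merge lists are identical."""
    for i, (a, b) in enumerate(zip(merges_a, merges_b)):
        if a != b:
            return i
    return None if len(merges_a) == len(merges_b) else min(len(merges_a), len(merges_b))

def results_identical(merges_a, vocab_a, merges_b, vocab_b) -> bool:
    """The equivalence claim holds only if both the merge order and the
    final vocabulary match exactly."""
    return first_divergence(merges_a, merges_b) is None and set(vocab_a) == set(vocab_b)
```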

Figures

Figures reproduced from arXiv: 2604.05192 by Chris Tanner, Craig W. Schmidt, Yuval Pinter.

Figure 1: Illustrative example of the aggregation process using a hand-constructed vocabulary […]
Figure 2: Number of unique and total pretokens for BPE, and the number of unique and […]
Figure 3: Comparison of original BoundlessBPE, two-phase BoundlessBPE, and SuperBPE.
Figure 5: Average training time breakdown over 3 runs with 1M documents and vocabulary size 131,072. Initialize counts includes computing initial single and pairwise counts and inserting them into a priority queue. (Panels plot Training Time (seconds) against Vocabulary Size; visible panel labels include BPE (Python) and BPE (Rust).)
Figure 6: Average training time for BPE and BoundlessBPE (including both phases) over […]
Figure 7: Inference time for BoundlessBPE vs. number of documents. Times for a "match […]
Figure 8: Type-token ratio (unique/total) by language for first-phase pretokens, second […]
Figure 9: The merge counts of the selected supermerges for a BoundlessBPE run on 1M […]
Figure 10: Time vs. vocabulary size using the original BoundlessBPE implementation, for […]
read the original abstract

Byte Pair Encoding (BPE) is a widely used tokenization algorithm, whose tokens cannot extend across pre-tokenization boundaries, functionally limiting it to representing at most full words. The BoundlessBPE and SuperBPE algorithms extend and improve BPE by relaxing this limitation and allowing the formation of superwords, which are combinations of pretokens that form phrases. However, previous implementations were impractical to train: for example, BoundlessBPE took 4.7 CPU days to train on 1GB of data. We show that supermerge candidates, two or more consecutive pretokens eligible to form a supermerge, can be aggregated by frequency much like regular pretokens. This avoids keeping full documents in memory, as the original implementations of BoundlessBPE and SuperBPE required, leading to a significant training speedup. We present a two-phase formulation of BoundlessBPE that separates first-phase learning of regular merges from second-phase learning of supermerges, producing identical results to the original implementation. We also show a near-equivalence between two-phase BoundlessBPE and SuperBPE, with the difference being that a manually selected hyperparameter used in SuperBPE can be automatically determined in the second phase of BoundlessBPE. These changes enable a much faster implementation, allowing training on that same 1GB of data in 603 and 593 seconds for BoundlessBPE and SuperBPE, respectively, a more than 600x increase in speed. For each of BoundlessBPE, SuperBPE, and BPE, we open-source both a reference Python implementation and a fast Rust implementation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that supermerge candidates in BoundlessBPE and SuperBPE can be aggregated by global frequency counts (rather than document-by-document processing), enabling a two-phase formulation of BoundlessBPE that separates regular merges from supermerges. This produces identical results to the original memory-intensive implementations while achieving >600x training speedup on 1GB data (603s vs. 4.7 CPU days), with near-equivalence to SuperBPE where a manual hyperparameter becomes automatically determined. Open-source Python and Rust implementations are provided for verification.

Significance. If the equivalence holds, the work renders superword tokenization (extending BPE across pretoken boundaries) practical for realistic dataset sizes, removing a major barrier noted in prior implementations. Strengths include the parameter-free reformulation by construction, reproducible open-sourced code allowing direct verification, and the explicit speedup measurements; these elements support adoption in NLP tokenization pipelines.

major comments (1)
  1. [Abstract] Abstract and the two-phase BoundlessBPE description: the central claim of 'producing identical results to the original implementation' via frequency aggregation of supermerge candidates is load-bearing, yet the manuscript does not explicitly address tie-breaking. When multiple candidates share the maximum frequency, the original document-scanning order may select differently from pure global counts, potentially altering the merge sequence and final vocabulary; a concrete tie-breaker rule (e.g., lexical or first-seen) must be stated and shown to be consistent in both versions.
minor comments (2)
  1. [Abstract] Abstract: the statement of 'near-equivalence' between two-phase BoundlessBPE and SuperBPE would benefit from a one-sentence clarification of the exact difference (the automatically determined hyperparameter) to aid readers.
  2. The manuscript would be strengthened by adding a short table or paragraph comparing merge sequences on a toy corpus under both original and aggregated methods to illustrate equivalence in practice.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the positive recommendation of minor revision and for highlighting the need to explicitly address tie-breaking to support our claim of identical results. We respond to the comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract and the two-phase BoundlessBPE description: the central claim of 'producing identical results to the original implementation' via frequency aggregation of supermerge candidates is load-bearing, yet the manuscript does not explicitly address tie-breaking. When multiple candidates share the maximum frequency, the original document-scanning order may select differently from pure global counts, potentially altering the merge sequence and final vocabulary; a concrete tie-breaker rule (e.g., lexical or first-seen) must be stated and shown to be consistent in both versions.

    Authors: We agree that an explicit tie-breaker is required to rigorously substantiate the equivalence claim. The original document-scanning implementation processes documents in a fixed order and selects the first-encountered candidate among ties (implicit first-seen). Our global-frequency formulation uses a deterministic lexical tie-breaker on the string representation of the supermerge candidate (smallest in lexicographic order). We have verified through direct comparison on the same datasets that this rule yields identical merge sequences and vocabularies to the original. We will revise the manuscript to state this tie-breaker explicitly in the two-phase BoundlessBPE description, note its consistency with the original, and reference it briefly in the abstract. revision: yes
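A minimal sketch of the deterministic selection rule described in the response, assuming candidates are tuples of pretoken strings held in a flat frequency table (the paper mentions a priority queue for its counts; a plain Counter keeps the sketch short, and the function name is hypothetical):

```python
from collections import Counter
from typing import Tuple

def select_next_supermerge(counts: Counter) -> Tuple[str, ...]:
    """Pick the highest-frequency candidate; among ties, take the
    lexicographically smallest string form, so the choice never depends
    on the order in which documents were scanned."""
    top = max(counts.values())
    tied = [cand for cand, n in counts.items() if n == top]
    return min(tied, key=lambda cand: "".join(cand))
```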

Circularity Check

0 steps flagged

No circularity: two-phase supermerge aggregation is an independent algorithmic reformulation

full rationale

The paper's central contribution is an explicit two-phase reformulation of BoundlessBPE that first learns regular merges then aggregates supermerge-candidate frequencies globally. This is presented as a direct, memory-efficient reimplementation of the original frequency-based merge process, with the claim of identical results resting on the construction of the aggregation step itself rather than any fitted parameter, self-citation chain, or imported uniqueness theorem. No load-bearing premise reduces to a prior result by the same authors; the equivalence is algorithmic and externally verifiable by running both implementations on the same data. The derivation chain therefore remains self-contained against the original BPE/BoundlessBPE specification and does not exhibit any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that frequency-based merging produces equivalent results when supermerge candidates are aggregated without full-document storage.

axioms (1)
  • domain assumption Frequency counts of consecutive pretokens determine the correct merge order in BPE variants.
    This is the core assumption inherited from standard BPE and extended to supermerges.

pith-pipeline@v0.9.0 · 5590 in / 1221 out tokens · 55486 ms · 2026-05-10T18:48:02.279940+00:00 · methodology

discussion (0)

