Tokenization with Split Trees

Adam Wiemerslage; Chris Tanner; Craig W. Schmidt; Michael Krumdick; Seth Ebner; Varshini Reddy; Yuval Pinter

arxiv: 2605.22705 · v2 · pith:VTOLHSYYnew · submitted 2026-05-21 · 💻 cs.CL

Tokenization with Split Trees

Craig W. Schmidt , Michael Krumdick , Adam Wiemerslage , Seth Ebner , Varshini Reddy , Yuval Pinter , Chris Tanner This is my paper

Pith reviewed 2026-05-22 05:43 UTC · model grok-4.3

classification 💻 cs.CL

keywords subword tokenizationsplit treesinteger programminglanguage modelingvocabulary selectioncompressionBPERenyi efficiency

0 comments

The pith

ToaST reduces English token counts by more than 11% versus BPE at vocabularies of 40960 and larger while raising 1.5B model CORE scores.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ToaST, a tokenization method that first builds a full binary split tree for each pretoken from precomputed byte n-gram counts. It then chooses a vocabulary by solving an integer program that minimizes the number of tokens emitted when inference walks each tree and stops at the first in-vocabulary node. The linear-programming relaxation of this program stays near-integral, so good solutions are found quickly. Experiments on English data show the resulting tokenizers use over 11% fewer tokens than BPE, WordPiece, or UnigramLM at large vocabulary sizes and also improve Renyi efficiency by using fewer single-byte tokens. When 1.5B-parameter language models are trained with these tokenizers, ToaST records the highest CORE score among the tested methods.

Core claim

ToaST greedily splits each pretoken into a full binary tree using precomputed byte n-gram counts independent of any vocabulary. Given a vocabulary, inference recursively descends each split tree and emits the first in-vocabulary node reached on each path. Vocabulary selection is formulated as an integer program that minimizes the total token count over all split trees under this inference procedure, and the LP relaxation is near-integral in practice, yielding provably near-optimal vocabularies.

What carries the argument

The split tree, a full binary tree built greedily on byte n-grams for each pretoken, which supports recursive first-in-vocabulary emission and serves as the objective for the integer program that selects the vocabulary.

If this is right

Token counts drop by more than 11% on English text at vocabulary sizes of 40,960 and above, extending effective context length.
1.5B-parameter language models reach the highest CORE score and outperform baselines by 2.6% to 7.6% with significance in two of three comparisons.
Common single-byte tokens appear less often, producing a substantial gain in Renyi efficiency.
The LP relaxation remains near-integral, so the same optimization procedure scales to practical vocabulary sizes with quadratic training time in the number of split trees.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same split-tree construction could be tried on non-English text by replacing the byte n-gram statistics with those of the target language.
Because training time grows quadratically with the number of pretokens, sampling a representative subset of the corpus may be needed before applying ToaST to very large datasets.
The reduction in token count may also improve performance on sequence-to-sequence tasks such as machine translation that are sensitive to sequence length.

Load-bearing premise

That the greedy binary splitting of pretokens using precomputed byte n-gram counts, followed by recursive descent to the first in-vocabulary node, produces a compression objective whose integer-program solution yields vocabularies that are near-optimal in actual downstream use.

What would settle it

Measuring token counts and downstream CORE scores for a 1.5B model trained with a ToaST vocabulary chosen at size 40960 on a fresh English corpus and checking whether the reported gains versus BPE still appear.

Figures

Figures reproduced from arXiv: 2605.22705 by Adam Wiemerslage, Chris Tanner, Craig W. Schmidt, Michael Krumdick, Seth Ebner, Varshini Reddy, Yuval Pinter.

**Figure 1.** Figure 1: Example split tree for ␣Kentucky. context window. The relationship between compression and downstream task performance is less clear. Some studies (Gallé, 2019; Rust et al., 2021; Goldman et al., 2024) find a correlation, while others (Schmidt et al., 2024; Ali et al., 2024) argue compression alone does not explain tokenization quality. However, these practical benefits are reason enough to optimize com… view at source ↗

**Figure 2.** Figure 2: Example tokenization, with white tokens not [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: A node (blue) appears in the tokenization [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Total training time as a function of the number [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Validation data bytes per token for ToaST and [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 8.** Figure 8: Validation Rényi efficiency (α = 2.5) for ToaST and several baselines as a function of vocabulary size. Higher is better, indicating a more uniform token distribution. The four ToaST series are very similar to each other, as are BPE and WordPiece. tokens than all baselines with a ~14–19x reduction at a vocabulary size of 65,536. ToaST also produces substantially more Root tokens than the baselines. These … view at source ↗

**Figure 9.** Figure 9: Example split hierarchy of pretokens ≻ morphemes ≻ characters ≻ bytes for crème brûlée. use the multi-byte character level, as it requires no additional external data, although its effect on English is minimal since English text is almost entirely single-byte characters. Pretoken n-grams for superword construction and gold morpheme splits could be added to support the other levels. The vocabulary construc… view at source ↗

**Figure 10.** Figure 10: Cumulative total time to set up and solve the [PITH_FULL_IMAGE:figures/full_fig_p013_10.png] view at source ↗

**Figure 11.** Figure 11: Individual timing of each resolve step, with [PITH_FULL_IMAGE:figures/full_fig_p014_11.png] view at source ↗

**Figure 14.** Figure 14: Token categories for WordPiece, using the [PITH_FULL_IMAGE:figures/full_fig_p014_14.png] view at source ↗

**Figure 12.** Figure 12: Validation bytes per token of ToaST, zoomed [PITH_FULL_IMAGE:figures/full_fig_p014_12.png] view at source ↗

**Figure 15.** Figure 15: Token categories for UnigramLM, using the [PITH_FULL_IMAGE:figures/full_fig_p015_15.png] view at source ↗

**Figure 16.** Figure 16: Zipf plot of the validation token frequency [PITH_FULL_IMAGE:figures/full_fig_p015_16.png] view at source ↗

**Figure 18.** Figure 18: Venn diagram of overlap in the tokens used [PITH_FULL_IMAGE:figures/full_fig_p015_18.png] view at source ↗

read the original abstract

We introduce Tokenization with Split Trees (ToaST), a subword tokenization method that directly optimizes compression under a new recursive inference procedure. ToaST greedily splits each pretoken into a full binary tree using precomputed byte n-gram counts, independent of any vocabulary. Given a vocabulary, inference recursively descends each split tree and emits the first in-vocabulary node reached on each path. Vocabulary selection is formulated as an Integer Program (IP) that minimizes the total token count over all split trees under this inference procedure. The Linear Programming (LP) relaxation is near-integral in practice, yielding provably near-optimal vocabularies, with training time empirically scaling quadratically in the number of split trees. On English text, ToaST reduces token counts by more than 11% compared to BPE, WordPiece, and UnigramLM at vocabulary sizes of 40,960 and above, reducing the number of inference tokens for models using this tokenizer, thus extending the effective context length. ToaST also uses common single-byte tokens less frequently than these baselines, leading to a substantial improvement in Renyi efficiency. In experiments training 1.5B parameter language models, ToaST achieves the highest CORE score, outperforming baselines by 2.6%--7.6%, with significance for two of three, and scoring best on 13 of 22 individual tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ToaST gives a clean new way to optimize vocabularies by building n-gram split trees first and then solving an IP directly on the recursive inference cost, but the LM gains rest on an unclear training protocol that could mix in extra data exposure.

read the letter

The core idea here is to precompute full binary split trees for every pretoken using byte n-gram counts, then define inference as walking down each tree until you hit the first vocabulary node. Vocabulary selection is turned into an explicit integer program whose objective is exactly the token count produced by that procedure. That separation and the direct objective are the genuinely new pieces compared with BPE-style or Unigram approaches in the citations.

Referee Report

2 major / 1 minor

Summary. The paper introduces Tokenization with Split Trees (ToaST), a subword tokenization method that greedily builds full binary split trees for pretokens from precomputed byte n-gram counts, then selects a vocabulary by solving an integer program minimizing total token count under recursive descent inference to the first in-vocabulary node. It claims the LP relaxation is near-integral and yields provably near-optimal vocabularies, reports >11% token reduction versus BPE/WordPiece/UnigramLM on English text at vocab sizes >=40960, reduced single-byte token usage, and superior CORE scores (outperforming baselines by 2.6%-7.6%) when training 1.5B-parameter LMs.

Significance. If the near-integrality of the LP solution and the downstream gains hold, the explicit IP formulation of the compression objective under the new inference rule would be a clear strength, offering a more direct optimization path than heuristic methods like BPE. The reported quadratic scaling of training time and the falsifiable token-reduction predictions are also positive features. However, the significance is tempered by the absence of supporting analysis for the LP claim and potential confounds in the LM experiments.

major comments (2)

[Abstract] Abstract: The statement that the LP relaxation is near-integral and yields provably near-optimal vocabularies is presented without derivation, error analysis, or verification that the reported 11% token reduction is robust to changes in pretokenization or domain.
[LM training experiments] LM training experiments: The description of the 1.5B-parameter LM training does not specify whether a fixed token budget or fixed optimizer steps was used; given ToaST's >11% token reduction, this leaves open whether the 2.6%-7.6% CORE gains are attributable to the tokenizer or to processing more raw text.

minor comments (1)

[Introduction] The interaction between the greedy binary splitting procedure and the IP objective could be clarified with a small illustrative example early in the manuscript.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment below and indicate where revisions will be made to the manuscript.

read point-by-point responses

Referee: [Abstract] Abstract: The statement that the LP relaxation is near-integral and yields provably near-optimal vocabularies is presented without derivation, error analysis, or verification that the reported 11% token reduction is robust to changes in pretokenization or domain.

Authors: The manuscript formulates vocabulary selection as an integer program minimizing total token count under the recursive inference rule and reports that the LP relaxation is near-integral based on empirical solutions across vocabulary sizes. We agree the abstract states the near-optimality claim without a self-contained derivation or error analysis. We will revise the abstract to qualify the claim as empirically supported and expand the methods or appendix section with additional details on the observed integrality gaps. For robustness, our primary results use standard English pretokenization; we will add a brief discussion of sensitivity to alternative pretokenizers and note that the approach is domain-agnostic, with plans for broader verification. revision: yes
Referee: [LM training experiments] LM training experiments: The description of the 1.5B-parameter LM training does not specify whether a fixed token budget or fixed optimizer steps was used; given ToaST's >11% token reduction, this leaves open whether the 2.6%-7.6% CORE gains are attributable to the tokenizer or to processing more raw text.

Authors: We thank the referee for identifying this ambiguity. The 1.5B-parameter models were trained for a fixed number of optimizer steps across all tokenizers. This choice means ToaST's lower token count per sequence allows more raw text to be processed within the same step budget. We will revise the experimental description to state this explicitly and discuss the contribution of increased data exposure to the reported CORE improvements. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation: IP optimizes explicit objective with external baseline comparisons

full rationale

The paper defines split trees independently via byte n-gram counts, then formulates vocabulary selection as an IP minimizing total token count under the recursive descent inference rule. Reported token-count reductions (>11% vs BPE/WordPiece/UnigramLM) are direct empirical measurements against those external methods' own vocabularies and inference procedures, not a re-reporting of the IP objective on the same data. Downstream 1.5B LM CORE scores are presented as experimental results without any reduction to a fitted parameter or self-citation chain. No uniqueness theorems, ansatzes, or renamings from prior author work are invoked as load-bearing steps. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The method depends on the domain assumption that byte n-gram counts computed once on a corpus are sufficient to guide optimal splits, and on the mathematical claim that the LP relaxation of the token-count integer program is near-integral; no free parameters or new entities are introduced in the abstract.

axioms (1)

domain assumption Byte n-gram counts computed independently of any vocabulary can be used to construct greedy full binary split trees that support the subsequent recursive inference procedure.
Invoked in the description of how each pretoken is split.

pith-pipeline@v0.9.0 · 5795 in / 1403 out tokens · 36345 ms · 2026-05-22T05:43:51.391591+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Vocabulary selection is formulated as an Integer Program (IP) that minimizes the total token count over all split trees under this inference procedure.
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean bare_distinguishability_of_absolute_floor unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

ToaST greedily splits each pretoken into a full binary tree using precomputed byte n-gram counts, independent of any vocabulary.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.