pith. machine review for the scientific record.

arxiv: 2604.16037 · v1 · submitted 2026-04-17 · 💻 cs.CL

Recognition: unknown

Stochasticity in Tokenisation Improves Robustness

Anya Sims, Arno Solin, Franz Pernkopf, Martin Trapp, Rui Li, Sofiane Ennadir, Sophie Steger

Authors on Pith no claims yet

Pith reviewed 2026-05-10 08:47 UTC · model grok-4.3

classification 💻 cs.CL
keywords stochastic tokenisation · model robustness · adversarial perturbations · tokenisation attacks · LLM pre-training · fine-tuning · language model robustness

The pith

Pre-training and fine-tuning LLMs with uniformly sampled stochastic tokenisations improves robustness to random and adversarial perturbations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether exposing language models to random tokenisations during training makes them less brittle when inputs are later tokenised in unexpected ways. It reports that models trained this way keep their accuracy on standard tasks while suffering smaller drops when tokenisation is perturbed randomly or adversarially. A canonically trained Llama-1b model loses 29.8 percent accuracy on non-canonical tokenisations, but the stochastic approach avoids most of that loss. The gains hold across pre-training, fine-tuning, multiple datasets, and model sizes, and they come at no extra inference cost.

Core claim

The authors establish that uniformly sampling stochastic tokenisations throughout pre-training and supervised fine-tuning produces models whose representations are less sensitive to tokenisation changes. This yields higher accuracy under both random perturbations and adversarial attacks on tokenisation compared with deterministic canonical training, while standard-task performance stays the same and inference speed is unchanged.

What carries the argument

Uniformly sampled stochastic tokenisation during training, which forces the model to process multiple possible token sequences for the same text and thereby builds tolerance to tokenisation variation.
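To make the mechanism concrete, here is a minimal sketch of uniform sampling over the set of valid segmentations, the property the paper attributes to its uniform sampler: a dynamic-programming pass counts the segmentations of a string under a vocabulary, and a backward pass weighted by those counts draws each segmentation with equal probability. This is an illustration under assumed simplifications (a character-level toy vocabulary, hypothetical function names), not the paper's implementation.

```python
import random


def count_segmentations(text, vocab):
    """DP table: ways[i] = number of ways to segment text[:i] into vocab tokens."""
    ways = [0] * (len(text) + 1)
    ways[0] = 1  # empty prefix has exactly one (empty) segmentation
    for i in range(1, len(text) + 1):
        for j in range(i):
            if text[j:i] in vocab:
                ways[i] += ways[j]
    return ways


def sample_uniform_tokenisation(text, vocab, rng=random):
    """Draw one segmentation uniformly at random from all valid segmentations."""
    ways = count_segmentations(text, vocab)
    if ways[-1] == 0:
        raise ValueError("text cannot be segmented with this vocabulary")
    tokens, i = [], len(text)
    while i > 0:
        # The number of segmentations whose last token is text[j:i] equals
        # ways[j], so weighting split points by ways[j] yields a uniform draw.
        candidates = [j for j in range(i) if text[j:i] in vocab and ways[j] > 0]
        j = rng.choices(candidates, weights=[ways[j] for j in candidates])[0]
        tokens.append(text[j:i])
        i = j
    return list(reversed(tokens))
```

With `vocab = {"a", "b", "ab"}`, the string `"ab"` has two segmentations (`["a", "b"]` and `["ab"]`), and the sampler returns each with probability 1/2, in contrast to merge-based samplers whose distribution over segmentations is biased.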

If this is right

  • Models trained with stochastic tokenisation resist adversarial attacks that alter input tokenisation.
  • Accuracy on standard tasks remains comparable to deterministic training.
  • Inference cost stays identical to baseline models.
  • The robustness benefit appears in both pre-training and fine-tuning across varied architectures and datasets.
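To illustrate operationally what a "perturbed tokenisation" evaluation involves, here is a toy harness under stated assumptions: greedy longest-match stands in for canonical BPE encoding, and the perturbation randomly splits tokens into two in-vocabulary halves at a rate controlled by a parameter. Function names and the vocabulary are hypothetical, not taken from the paper's code.

```python
import random


def canonical_tokenise(text, vocab):
    """Greedy longest-match tokenisation (a stand-in for canonical BPE encoding)."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary token covers position {i}")
    return tokens


def perturb_tokenisation(tokens, vocab, alpha, rng=random):
    """Randomly split tokens into two in-vocab halves; alpha sets the split rate."""
    out = []
    for tok in tokens:
        splits = [k for k in range(1, len(tok))
                  if tok[:k] in vocab and tok[k:] in vocab]
        if splits and rng.random() < alpha:
            k = rng.choice(splits)
            out.extend([tok[:k], tok[k:]])  # same string, non-canonical tokens
        else:
            out.append(tok)
    return out
```

A robustness evaluation in this spirit feeds the model both the canonical sequence and perturbed variants of the same text and compares task accuracy; the paper's claim is that the gap between the two shrinks under stochastic training.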

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The method could be paired with other robustness techniques such as adversarial training to compound gains.
  • It may also increase tolerance to non-tokenisation input variations like spelling changes or dialect shifts.
  • Practical tests on larger models and noisy real-world data would show whether the robustness translates to deployment settings.

Load-bearing premise

That sampling different tokenisations during training does not introduce biases that impair learning of the core task, and that evaluating on non-canonical tokenisations isolates robustness effects without confounding factors.

What would settle it

A replication in which stochastic training produces lower accuracy on clean canonical evaluations or shows no robustness gain on perturbed tokenisations compared with a canonically trained baseline would disprove the central claim.

Figures

Figures reproduced from arXiv: 2604.16037 by Anya Sims, Arno Solin, Franz Pernkopf, Martin Trapp, Rui Li, Sofiane Ennadir, Sophie Steger.

Figure 1
Figure 1. Effect of stochastic tokenisation during testing. As the level of stochasticity increases (increasing number of splits), the accuracy of Llama-1b trained on LANGUAGE GAME with canonical tokenisation (CANON) drops sharply, while the same model fine-tuned with stochastic tokenisation (STOK, STOK-UNI, or UNI-K) remains robust to perturbations during testing. view at source ↗
Figure 3
Figure 3. Average test accuracy for canonically pre-trained Llama-1b model and stochastic tokenisation during fine-tuning on LANGUAGE GAME. Stochastic fine-tuning with STOCHASTOK consistently improves robustness to random perturbations during testing without affecting performance on canonical test data. view at source ↗
Figure 4
Figure 4. Histogram of segmentation probabilities for two sampling schemes, STOCHASTOK and STOCHASTOK-UNI. STOCHASTOK induces a biased distribution over a subset of all valid segmentations with limited support. STOCHASTOK-UNI (ours), on the other hand, generates samples with equal probability and has full support over the set of non-canonical tokenisations. view at source ↗
Figure 5
Figure 5. Accuracy (%) under canonical and adversarial tokenisation for Llama-1b trained on LANGUAGE GAME under various fine-tuning strategies. view at source ↗
Figure 7
Figure 7. Histogram comparison of segmentation probabilities for two sampling schemes (UNIFORM vs. STOCHASTOK). view at source ↗
Figure 8
Figure 8. Histogram comparison of segmentation probabilities for two sampling schemes (UNIFORM vs. STOCHASTOK). view at source ↗
Figure 9
Figure 9. Average test accuracy of Tiny-LLM on LANGUAGE GAME under stochastic tokenisation during fine-tuning, for different levels of stochasticity during pre-training. Stochastic fine-tuning improves robustness to random perturbations during testing across pre-training settings. view at source ↗
Figure 10
Figure 10. Average test accuracy of Tiny-LLM on CUTE under stochastic tokenisation during fine-tuning, for different levels of stochasticity during pre-training. Stochastic fine-tuning improves robustness to random perturbations during testing across pre-training settings. view at source ↗
Figure 11
Figure 11. Effect of perturbation strength αpre, αfine, and different training schemes on adversarial robustness for the LANGUAGE GAME data set. view at source ↗
Figure 12
Figure 12. Effect of perturbation strength αpre, αfine, and different training schemes on adversarial robustness for the CUTE data set. view at source ↗
Figure 13
Figure 13. Average accuracy of GPT2-xl for stochastic tokenisation during evaluation and fine-tuning on LANGUAGE GAME and CUTE. Stochastic fine-tuning improves robustness to random perturbations during evaluation. view at source ↗
Figure 14
Figure 14. Accuracy of GPT2-xl under canonical and adversarial tokenisation on LANGUAGE GAME and CUTE. Stochastic fine-tuning consistently improves robustness to perturbations during evaluation, with highest gains for methods that address support and sampling bias (STOCHASTOK-UNI, UNIFORM-K). view at source ↗
Figure 15
Figure 15. Adversarial accuracy versus αfine for STOCHASTOK and STOCHASTOK-UNI (2-way and higher-order merges) under different attack settings (dk = 2 and dk = 4; v0 = τ(x) and v0 ∼ Unif(TV(x))) on the LANGUAGE GAME and CUTE data sets. view at source ↗
Figure 16
Figure 16. Histogram comparison of segmentation probabilities for two sampling schemes (UNIFORM vs. BPE-DROPOUT with pdrop = 0.5). view at source ↗
Figure 17
Figure 17. Average accuracy of Llama-1b fine-tuned with BPE-DROPOUT and evaluated on LANGUAGE GAME under canonical and STOCHASTOK tokenisation. BPE-DROPOUT improves robustness to non-canonical tokenisations. However, choosing an appropriate dropout rate pdrop is less intuitive, since the resulting number of splits depends on the underlying input text and its merge structure. view at source ↗
Figure 18
Figure 18. Accuracy of Llama-1b under canonical and adversarial tokenisation on LANGUAGE GAME and CUTE for stochastic fine-tuning with BPE-DROPOUT, STOCHASTOK-UNI, and UNIFORM-K. All stochastic fine-tuning methods improve robustness compared to canonical fine-tuning. Note that αfine and pdrop cannot be compared directly as they induce different numbers of splits. view at source ↗
Figure 19
Figure 19. Average perturbation robustness of Qwen-0.6b fine-tuned on LANGUAGE GAME. view at source ↗
Figure 20
Figure 20. Canonical and adversarial perturbation robustness of Qwen-0.6b fine-tuned on LANGUAGE GAME. view at source ↗
Figure 21
Figure 21. Average perturbation robustness of Gemma-1b (vocabulary size of 262k) fine-tuned and evaluated with STOCHASTOK on LANGUAGE GAME. view at source ↗
Figure 22
Figure 22. Canonical fine-tuning does not affect the distances of alternative tokenisations compared to the zero-shot Llama-1b. Stochastic tokenisation (STOCHASTOK, STOCHASTOK-UNI, UNIFORM-K) reduces distances in deeper layers. view at source ↗
read the original abstract

The widespread adoption of large language models (LLMs) has increased concerns about their robustness. Vulnerabilities in perturbations of tokenisation of the input indicate that models trained with a deterministic canonical tokenisation can be brittle to adversarial attacks. Recent studies suggest that stochastic tokenisation can deliver internal representations that are less sensitive to perturbations. In this paper, we analyse how stochastic tokenisations affect robustness to adversarial attacks and random perturbations. We systematically study this over a range of learning regimes (pre-training, supervised fine-tuning, and in-context learning), data sets, and model architectures. We show that pre-training and fine-tuning with uniformly sampled stochastic tokenisations improve robustness to random and adversarial perturbations. Evaluating on uniformly sampled non-canonical tokenisations reduces the accuracy of a canonically trained Llama-1b model by 29.8%. We find that training with stochastic tokenisation preserves accuracy without increasing inference cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Circularity Check

0 steps flagged

Empirical study with no circular derivations or self-referential predictions

full rationale

The paper is an empirical investigation of stochastic tokenisation effects on robustness, reporting experimental results across pre-training, fine-tuning, and in-context learning regimes. No mathematical derivations, first-principles predictions, or equations are claimed that could reduce to fitted inputs or self-definitions by construction. The central claims rest on accuracy measurements under canonical vs. stochastic tokenisations, with no load-bearing self-citations or ansatzes invoked as uniqueness theorems. This is a standard non-circular empirical paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is empirical and relies on standard machine learning assumptions, such as models generalising from training data and tokenisation affecting representations.

pith-pipeline@v0.9.0 · 5462 in / 950 out tokens · 57439 ms · 2026-05-10T08:47:45.153392+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and pith papers without signing in.

Reference graph

Works this paper leans on

24 extracted references · 9 canonical work pages · 5 internal anchors

  1. [1]

    GPT-4 Technical Report

    Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774,

  2. [2]

    TokSuite: Measuring the Impact of Tokenizer Choice on Language Model Behavior

    Altıntaş, G. S., Ehghaghi, M., Lester, B., Liu, F., Zhao, W., Ciccone, M., and Raffel, C. TokSuite: Measuring the impact of tokenizer choice on language model behavior. arXiv preprint arXiv:2512.20757,

  3. [3]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457,

  4. [4]

    Distributional properties of subword regularization

    Cognetta, M., Zouhar, V., and Okazaki, N. Distributional properties of subword regularization. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 10753–10763,

  5. [5]

    Attacks, defenses and evaluations for LLM conversation safety: A survey

    Dong, Z., Zhou, Z., Yang, C., Shao, J., and Qiao, Y. Attacks, defenses and evaluations for LLM conversation safety: A survey. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 6734–6747,

  6. [6]

    CUTE: Measuring LLMs’ understanding of their tokens

    Edman, L., Schmid, H., and Fraser, A. CUTE: Measuring LLMs’ understanding of their tokens. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 3017–3026,

  7. [7]

    Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs

    Feucht, S., Atkinson, D., Wallace, B. C., and Bau, D. Token erasure as a footprint of implicit vocabulary items in LLMs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 9727–9739,

  8. [8]

    Where is the Signal in Tokenization Space?

    Geh, R., Zhang, H., Ahmed, K., Wang, B., and Van den Broeck, G. Where is the signal in tokenization space? In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 3966–3979,

  9. [9]

    The Llama 3 Herd of Models

    Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783,

  10. [10]

    Super Tiny Language Models

    Hillier, D., Guertler, L., Tan, C., Agrawal, P., Ruirui, C., and Cheng, B. Super tiny language models. arXiv preprint arXiv:2405.14159,

  11. [11]

    Spelling-out is Not Straightforward: LLMs’ Capability of Tokenization from Token to Characters

    Hiraoka, T. and Inui, K. Spelling-out is not straightforward: LLMs’ capability of tokenization from token to characters. arXiv preprint arXiv:2506.10641,

  12. [12]

    Models In a Spelling Bee: Language Models Implicitly Learn the Character Composition of Tokens

    Itzhak, I. and Levy, O. Models in a spelling bee: Language models implicitly learn the character composition of tokens. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 5061–5068,

  13. [13]

    What Do Tokens Know About Their Characters and How Do They Know It?

    Kaushal, A. and Mahowald, K. What do tokens know about their characters and how do they know it? In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2487–2507,

  14. [14]

    Red teaming language models with language models

    Perez, E., Huang, S., Song, F., Cai, T., Ring, R., Aslanides, J., Glaese, A., McAleese, N., and Irving, G. Red teaming language models with language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 3419–3448,

  15. [15]

    CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge

    Talmor, A., Herzig, J., Lourie, N., and Berant, J. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4149–4158,

  16. [16]

    Gemini: A Family of Highly Capable Multimodal Models

    Team, G., Anil, R., Borgeaud, S., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., Millican, K., et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805,

  17. [17]

    Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

    Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144,

  18. [18]

    Broken Tokens? Your Language Model Can Secretly Handle Non-Canonical Tokenizations

    Zheng, B. S., Liu, A., Ahia, O., Hayase, J., Choi, Y., and Smith, N. A. Broken tokens? Your language model can secretly handle non-canonical tokenizations. arXiv preprint arXiv:2506.19004,

  19. [19]

    Training Details (code availability)

    The code for the experiments is available at: https://github.com/stegsoph/stochastic-tokenisation-robustness. For LLMs trained from scratch, the authors use the architecture proposed in Hillier et al. (2024) and used in Sims et al. (2026). The architectural details are summarised in Table

  20. [20]

    Table 5. Pre-training setup for Tiny-LLM on OpenWebText: feed-forward network SwiGLU, dimension 1320; normalization RMSNorm (no bias); positional encoding rotary positional embeddings (RoPE); context length 512; tokeniser type GPT; vocabulary size 50,257; input/output embeddings tied. Pre-training (Tiny-LLM) training objective: autoregressive langua...

  21. [21]

    CUTE data set for training and evaluation

    CUTE tests the Character-level Understanding of Tokens of models. The original data set was intended for zero-shot evaluation, providing only a test split. Thus, the authors follow Sims et al. (2026) in generating a custom CUTE data set for training and evaluating the smaller-scale LLMs, with questions for seven subword tasks (contains letter, delete letter, ins...

  22. [22]

    contains multiple-choice questions that test knowledge about the physical world with two answer options. CANONICAL: •When- boiling- butter- ,- when- it- ’s- ready- ,- you- can- Pour- it- into- a- jar- •To- permanently- attach- metal- legs- to- a- chair- ,- you- can- Weld- the- metal- together- to- get- it- to- stay- firmly- in- place- •how- do- you- inden...

  23. [23]

    The difference H(s1) − H(s2) has at most k non-zero columns, each with ℓ2-norm √2. We can consequently formulate the following: ∥H(s1) − H(s2)∥ ≤ √(2k). (19) And consequently, we can also derive the following: ∥E(s1) − E(s2)∥ = ∥W(H(s1) − H(s2))∥ ≤ ∥W∥ ∥H(s1) − H(s2)∥ ≤ √2 ∥W∥ √k. (20) Now, let’s consider the second part, which relates to the self-attention block. Let ze (correspondin...

  24. [24]

    Importantly, during stochastic pre-training, the model is trained to predict non-canonical token sequences

    Increasing αpre slightly worsens the fit to the canonical distribution, but improves robustness to non-canonical tokenisations. Importantly, during stochastic pre-training, the model is trained to predict non-canonical token sequences. This can reduce accuracy in standard MCQ evaluations, where we compare the log-likelihood of canonically tokenised answer...