Effective Context in Transformers: An Analysis of Fragmentation and Tokenization
Pith reviewed 2026-05-14 20:36 UTC · model grok-4.3
The pith
Fragmentation into smaller units can strictly raise the minimal log-loss achievable by any finite-context transformer on Markov sources.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Fragmentation is a lossless recoding that replaces each source symbol by several smaller units; it can strictly increase the optimal finite-context log-loss on Markov sources. Greedy tokenization groups symbols into larger units and can make a token window behave like a longer source-context window, yielding a loss guarantee controlled by spanning reliability and compression rate. Together these establish a finite-context information-theoretic account of representation choices.
What carries the argument
Fragmentation: a lossless recoding that replaces each source symbol by several smaller units, thereby increasing the number of steps needed to cover the same source history.
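To make this concrete, the sketch below (a minimal illustration, not the paper's construction) empirically estimates the best window-limited log-loss on a binary order-1 Markov source and on a fragmented recoding that writes each symbol as two identical units. The source parameters, the doubling recoding, and the plug-in entropy estimator are assumptions of this sketch.

```python
import numpy as np

# Minimal illustrative sketch (not the paper's construction): compare the
# best achievable window-limited log-loss, per source symbol, on a binary
# order-1 Markov source versus a fragmented recoding of the same data in
# which every source symbol is written as two identical units.

rng = np.random.default_rng(0)

def sample_markov(n, p_stay=0.9):
    """Binary order-1 Markov chain with P(X_t = X_{t-1}) = p_stay."""
    flips = (rng.random(n) >= p_stay).astype(np.int64)
    flips[0] = rng.integers(2)          # random start state
    return np.cumsum(flips) % 2

def window_log_loss(seq, k):
    """Plug-in estimate of H(next | previous k steps) in bits: the log-loss
    of the best predictor that sees only the window contents."""
    ctx = np.zeros(len(seq) - k, dtype=np.int64)
    for i in range(k):                  # encode each length-k context as an int
        ctx = ctx * 2 + seq[i:len(seq) - k + i]
    nxt = seq[k:]
    counts = np.zeros((2 ** k, 2))
    np.add.at(counts, (ctx, nxt), 1)    # tally next-symbol counts per context
    totals = counts.sum(axis=1, keepdims=True)
    p = counts / np.where(totals > 0, totals, 1)
    mask = counts > 0
    return -(counts[mask] * np.log2(p[mask])).sum() / len(nxt)

x = sample_markov(2_000_000)
u = np.repeat(x, 2)                     # fragmentation: one symbol -> two units

k = 4                                   # 4 source symbols of context, versus
loss_source = window_log_loss(x, k)     # 8 units covering the same history;
loss_frag = 2 * window_log_loss(u, 2 * k)   # two unit predictions per symbol

print(f"source loss, window {k} symbols:    {loss_source:.3f} bits/symbol")
print(f"fragmented loss, window {2*k} units: {loss_frag:.3f} bits/symbol")
```

With these settings the fragmented representation typically comes out worse per source symbol, even though the unit window covers the same span of source history, because all-equal unit windows leave the fragment phase ambiguous; the size of the gap is illustrative and depends on the chosen parameters.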
If this is right
- Byte- and character-level models incur an intrinsic penalty relative to subword models that cannot be removed merely by enlarging the context window.
- A tokenizer can be diagnosed by measuring the fraction of source history reliably covered by its fixed token windows (a small coverage sketch follows this list).
- The loss guarantee for tokenization improves when the tokenizer both compresses and ensures that token boundaries align with source transitions.
- Representation choice and context length become coupled design decisions rather than independent ones.
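As a companion to the diagnostic mentioned above, here is a hedged sketch of how window coverage might be measured. The greedy longest-match tokenizer, the toy vocabulary, and the "spanning reliability" name are assumptions of this sketch, not the paper's definitions.

```python
# Hedged sketch of a window-coverage diagnostic: for a fixed token window,
# what fraction of windows covers at least a given amount of source history?
# The greedy longest-match tokenizer and toy vocabulary are illustrative only.

def greedy_tokenize(text, vocab):
    """Greedy longest-match tokenization; returns per-token character lengths."""
    lengths, i = [], 0
    max_len = max(len(v) for v in vocab)
    while i < len(text):
        for l in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + l] in vocab or l == 1:   # single chars always fall back
                lengths.append(l)
                i += l
                break
    return lengths

def spanning_reliability(text, vocab, window_tokens, needed_chars):
    """Fraction of token windows of size `window_tokens` that cover at least
    `needed_chars` source characters."""
    lengths = greedy_tokenize(text, vocab)
    sums = [sum(lengths[t - window_tokens:t])
            for t in range(window_tokens, len(lengths) + 1)]
    return sum(s >= needed_chars for s in sums) / max(1, len(sums))

vocab = {"the", "cat", "sat", "on", "mat", " ", "th", "at"}   # toy subword vocab
text = "the cat sat on the mat " * 200

# Here a 4-token window covers at least 8 characters roughly two thirds of the
# time; a character-level window of 4 would always cover exactly 4 characters.
print(spanning_reliability(text, vocab, window_tokens=4, needed_chars=8))
```

With a real tokenizer one would replace `greedy_tokenize` with the tokenizer's own encoding and keep only the per-token character lengths, which is what would let the diagnostic compare BPE, WordPiece, and learned tokenizers without retraining any model.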
Where Pith is reading between the lines
- The strict increase may weaken or disappear once long-range dependencies or non-stationarity dominate, suggesting a direct test on natural language corpora.
- Hybrid schemes that fragment only where necessary while tokenizing elsewhere could be evaluated by measuring the resulting effective source-context length.
- The diagnostic for tokenizers could be used to compare BPE, WordPiece, and learned tokenizers on the same source without retraining models.
Load-bearing premise
The data are generated by Markov sources of finite order.
What would settle it
A concrete Markov source on which some fragmentation strictly decreases the optimal finite-context log-loss, or an experiment on real text where enlarging a character-level context window closes the entire gap to a subword model.
Original abstract
Transformers predict over a representation of a sequence. The same data can be written as bytes, characters, or subword tokens, and these representations may be lossless. Yet, under a fixed context window, they need not expose the same information to the model. This raises a basic question: how does the choice of representation change what a finite-context predictor can achieve? We study this question on Markov sources and uncover two complementary phenomena. First, we observe that moving to smaller representation units can hurt prediction even when the context window is enlarged to cover the relevant source history. To explain this, we introduce fragmentation: a lossless recoding that replaces each source symbol by several smaller units. We prove that fragmentation can strictly increase the optimal finite-context log-loss, showing that the gap is not merely an optimization or capacity issue, but can be intrinsic to the representation. This gives a theoretical account of the finite-context gap observed in byte- and character-level models such as ByT5 and CANINE relative to subword-tokenized models. Second, we study the opposite direction: greedy tokenization -- BPE, WordPiece, and related methods -- which groups source symbols into larger units. We show that tokenization can make a short token window behave like a longer source-context window, and we give a loss guarantee describing when this is achievable. The guarantee depends on how reliably token windows span the needed source history, together with the compression rate of the tokenizer. This also yields a simple diagnostic for real tokenizers: measuring how much source context a fixed token window reliably contains. Together, the two directions establish a finite-context information-theoretic framework for reasoning about representation choices in Transformers.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that for data generated by stationary Markov sources, a lossless fragmentation recoding to smaller units can strictly increase the infimum log-loss achievable by any finite-context predictor, providing an intrinsic representational explanation for observed gaps between byte/character-level models (e.g., ByT5) and subword-tokenized models. In the opposite direction, it derives a loss guarantee showing when greedy tokenization (BPE, WordPiece) allows a short token window to behave like a longer source-context window, depending on reliable spanning of source history and compression rate, together with a diagnostic for real tokenizers.
Significance. If the results hold, the work supplies a clean information-theoretic framework that separates representational effects from optimization or capacity limitations in finite-context models. The explicit construction for the strict-increase result under fragmentation and the parameter-light loss guarantee for tokenization are concrete strengths; the diagnostic for measuring source-context coverage in token windows is immediately usable.
major comments (1)
- [Abstract and Markov-source setup, likely §2–3] The strict-increase proof and the tokenization loss guarantee are derived only for stationary finite-order Markov sources. The manuscript invokes these results to give 'a theoretical account' of finite-context gaps in real Transformers trained on natural language, yet, as the skeptic note flags, long-range dependencies and non-stationarity are excluded by construction; a concrete discussion of whether the separation survives in that regime (or a counter-example) is load-bearing for the claimed explanatory power.
minor comments (1)
- [Early sections / notation] The definitions of fragmentation and of the optimal finite-context log-loss would be clearer if introduced with explicit notation or a short equation block before the main theorems.
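For concreteness, one way such a notation block might read is sketched below; the symbols are assumptions of this review, not the paper's own notation, and fragmentation is taken to be fixed-length for simplicity.

```latex
% Assumed notation (not the paper's): optimal finite-context log-loss and a
% fixed-length fragmentation, for a stationary source (X_t) over alphabet X.
\[
  \mathcal{L}_k \;=\; H\!\left(X_0 \,\middle|\, X_{-k},\dots,X_{-1}\right),
  \qquad
  \mathcal{L}_k = \mathcal{L}_m \ \text{ for all } k \ge m
  \ \text{ if the source is Markov of order } m .
\]
\[
  \phi \colon \mathcal{X} \to \mathcal{U}^{\ell}
  \ \text{ (lossless; one source symbol becomes $\ell$ units)},
  \qquad
  \mathcal{L}^{\phi}_{k'} \;=\; H\!\left(U_0 \,\middle|\, U_{-k'},\dots,U_{-1}\right).
\]
```

One reading of the strict-increase claim then compares the per-source-symbol quantities $\ell\,\mathcal{L}^{\phi}_{\ell k}$ and $\mathcal{L}_k$, so that both windows cover the same span of source history.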
Simulated Author's Rebuttal
We thank the referee for the positive evaluation and for highlighting the scope of our theoretical results. We will revise the manuscript to include a dedicated discussion addressing the applicability of the Markov-source analysis to natural language, while preserving the rigor of the existing proofs.
Point-by-point responses
- Referee: [Abstract and Markov-source setup, likely §2–3] The strict-increase proof and the tokenization loss guarantee are derived only for stationary finite-order Markov sources. The manuscript invokes these results to give 'a theoretical account' of finite-context gaps in real Transformers trained on natural language, yet, as the skeptic note flags, long-range dependencies and non-stationarity are excluded by construction; a concrete discussion of whether the separation survives in that regime (or a counter-example) is load-bearing for the claimed explanatory power.
Authors: We agree that an explicit discussion of scope is required for the claimed explanatory power. The strict-increase result for fragmentation is proved for stationary finite-order Markov sources because this setting permits a clean information-theoretic construction showing that the penalty is intrinsic to the representation rather than to optimization or capacity. This already supplies a lower bound: if fragmentation hurts even when all relevant history fits inside the window, the effect cannot be weaker for sources with additional long-range dependencies. Non-stationarities would only increase the amount of history that must be reliably spanned, making the fragmentation penalty at least as large. We will add a new subsection (likely §4.3) that (i) states the Markov assumption explicitly as a modeling choice, (ii) sketches why the separation is expected to survive for non-stationary sources, and (iii) notes that highly structured counter-examples (e.g., deterministic periodic sources) lie outside the stationary Markov regime but do not invalidate the positive result for the broad class of sources that exhibit local dependencies. The tokenization guarantee will be similarly qualified. This revision will be made without altering the theorems themselves.
Revision planned: yes
Circularity Check
No circularity: self-contained proofs on Markov sources
Full rationale
The paper's central claims are established by explicit constructions and counting arguments for stationary Markov sources of finite order. Fragmentation is defined as a lossless recoding, and the strict increase in optimal finite-context log-loss is shown by exhibiting a source where relevant history is lost under any fixed window in the new alphabet; this follows directly from the Markov property equating the optimal predictor to a function of finite past symbols. The tokenization loss guarantee is derived from measurable properties of how token windows span source history plus compression rate, without any fitted parameters or redefinitions. No load-bearing step reduces to a self-citation, ansatz smuggled via prior work, or renaming of a known result. The derivations remain independent of the target claims and are falsifiable under the stated assumptions.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the data are generated by a stationary Markov source of finite order.
invented entities (1)
- Fragmentation: no independent evidence