pith. machine review for the scientific record.

arxiv: 2605.00768 · v1 · submitted 2026-05-01 · 💻 cs.CL

Recognition: unknown

Characterizing the Expressivity of Local Attention in Transformers

Jiaoda Li, Ryan Cotterell

Pith reviewed 2026-05-09 18:57 UTC · model grok-4.3

classification 💻 cs.CL
keywords local attention · transformer expressivity · global attention · linear temporal logic · regular languages · sequence modeling · hybrid attention

The pith

Local attention introduces a second temporal operator in transformers, strictly enlarging the class of recognizable regular languages beyond what global attention alone achieves.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that fixed-precision transformers with only global attention match a fragment of linear temporal logic limited to a single past operator. Adding local attention brings in a second temporal operator, allowing the model to recognize a strictly larger set of regular languages. Global and local attention turn out to be complementary: each can express patterns the other cannot, and their combination produces the most powerful fragment overall. This supplies a formal reason why local attention sometimes improves model quality even when efficiency is not the main concern. Experiments on formal language recognition tasks and natural language modeling back up the theoretical distinctions.

Core claim

Fixed-precision transformers with global attention correspond to a fragment of linear temporal logic containing a single past operator. Adding local attention introduces a second temporal operator, strictly enlarging the class of recognizable regular languages. Global and local attention are expressively complementary: neither subsumes the other, and combining them yields the richest fragment.
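
Rendered as a containment statement, and assuming the operator names that appear later in the simulated rebuttal (P for the unbounded past operator tied to global attention, Y for the bounded "yesterday" operator tied to local attention; the paper's exact symbols and fragment definitions may differ), the claim reads roughly:

```latex
% Illustrative notation only; fragment names and operator symbols are assumptions.
% LTL[P]    -- global-only transformers (one unbounded past operator)
% LTL[Y]    -- local attention alone    (bounded look-back)
% LTL[P,Y]  -- hybrid models            (richest fragment)
\mathcal{L}(\mathrm{LTL}[\mathbf{P}]) \subsetneq \mathcal{L}(\mathrm{LTL}[\mathbf{P},\mathbf{Y}]),
\qquad
\mathcal{L}(\mathrm{LTL}[\mathbf{P}]) \not\subseteq \mathcal{L}(\mathrm{LTL}[\mathbf{Y}]),
\qquad
\mathcal{L}(\mathrm{LTL}[\mathbf{Y}]) \not\subseteq \mathcal{L}(\mathrm{LTL}[\mathbf{P}]).
```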

What carries the argument

The established correspondence between fixed-precision transformer attention and fragments of linear temporal logic, with local attention specifically adding a distinct second past operator.

If this is right

  • Hybrid global-local transformers recognize a strictly larger class of regular languages than global-only models.
  • Neither global attention nor local attention can express every pattern the other can handle.
  • The combination of global and local attention produces the richest fragment of recognizable languages.
  • Local attention supplies expressive power that cannot be recovered simply by increasing the capacity of a global-only model.
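
To make the structural contrast in these bullets concrete, here is a minimal sketch of the three attention patterns as boolean masks. The mask construction is standard; the window size and the alternating layer mix are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def global_mask(n: int) -> np.ndarray:
    """Causal mask: position i may attend to every position j <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

def local_mask(n: int, w: int) -> np.ndarray:
    """Sliding-window mask: position i may attend only to j in [i - w + 1, i]."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - w)

# A hybrid model simply uses different masks in different layers or heads;
# alternating global and local layers is one illustrative choice.
n, w = 8, 3
masks_per_layer = [global_mask(n) if layer % 2 == 0 else local_mask(n, w)
                   for layer in range(4)]
```

No finite w makes the local mask subsume the causal one, and, per the paper's claim, fixed-precision global attention cannot recover the exact bounded look-back either; that asymmetry is the complementarity these bullets describe.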

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Transformer designs for tasks with specific sequence patterns could deliberately mix global and local attention to match the required expressivity without excess computation.
  • The complementarity result suggests testing whether certain long-range dependency problems in practice are better solved by local windows than by global attention.
  • The logic-based characterization offers a route to prove upper bounds on what particular attention variants can achieve on regular-language benchmarks.

Load-bearing premise

The logical correspondence shown for global attention extends directly to local attention without extra restrictions that would confine the result to impractical models.

What would settle it

A concrete regular language that any global-local hybrid transformer fails to recognize even though the two-operator logic predicts it should be recognizable, or an experiment in which hybrid models show no advantage on languages that require both operators.
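
One way to run such a test is to probe models against a ground-truth membership function for a language that, under the two-operator reading, needs both an unbounded-past check and a bounded look-back. The specific language below is our illustration, not one the paper names.

```python
def in_language(s: str) -> bool:
    """Membership test for an illustrative regular language over {a, b}:
    the string must contain at least one 'b' anywhere (unbounded past)
    AND its second-to-last symbol must be 'a' (bounded look-back of 2)."""
    return len(s) >= 2 and "b" in s and s[-2] == "a"

# Sanity checks on the two conditions.
assert in_language("bab")        # has 'b'; second-to-last symbol is 'a'
assert not in_language("aaa")    # fails the unbounded-past condition
assert not in_language("abb")    # fails the bounded look-back condition
```

A falsifying outcome in the sense above would be a hybrid model that, despite adequate capacity, systematically fails such languages even though the two-operator logic places them inside its fragment.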

Figures

Figures reproduced from arXiv: 2605.00768 by Jiaoda Li, Ryan Cotterell.

Figure 1: Forbidden configuration in the minimal DFAs of …
Figure 2: Minimal DFAs. Nodes represent states and arrows …
Figure 3: Heatmap of longest perfect lengths (maximum over runs) across formal languages, attention patterns, and positional …
Figure 4: Perplexity on WikiText-2 for local, hybrid, and global attention patterns under different positional encodings. Curves …
Original abstract

The transformer is the most popular neural architecture for language modeling. The cornerstone of the transformer is its global attention mechanism, which lets the model aggregate information from all preceding tokens before generating the next token. One common variant of attention is called local attention, which restricts each token to aggregating information from a bounded window of predecessors, reducing the quadratic cost of global attention to linear. Although this restriction is usually motivated by efficiency, it has also been found to improve model quality, a phenomenon that has so far lacked a satisfactory explanation. We provide a formal account of this phenomenon in terms of recognizer expressivity. It has been shown that fixed-precision transformers with global attention correspond to a fragment of linear temporal logic containing a single past operator. We additionally prove that adding local attention introduces a second temporal operator, strictly enlarging the class of recognizable regular languages. Moreover, global and local attention are expressively complementary: neither subsumes the other, and combining them yields the richest fragment. Experiments on formal language recognition and natural language modeling corroborate the theory, showing that hybrid global--local transformers outperform their global-only counterparts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that fixed-precision global-attention transformers correspond to a single-past-operator fragment of linear temporal logic. It proves that local attention introduces a second temporal operator, strictly enlarging the class of recognizable regular languages. Global and local attention are expressively complementary (neither subsumes the other), and their combination yields the richest fragment. These theoretical results are corroborated by experiments on formal language recognition and natural language modeling.

Significance. If the proofs hold, the work supplies a formal explanation for why local attention can improve transformer quality: it augments expressivity in a way that is complementary to global attention. The LTL correspondence and complementarity results are notable strengths, as is the absence of free parameters in the characterization. Experiments on both formal languages and LM tasks provide direct corroboration rather than post-hoc fitting.

major comments (2)
  1. [theoretical results on local attention expressivity] The central proof that local attention adds an independent second temporal operator (abstract and theoretical results section): the formalization of the bounded window w together with fixed-precision arithmetic must be shown not to permit simulation of the missing operator (e.g., a “since” or “until” fragment) via interaction with global context or to collapse the fragments through implicit restrictions on w relative to precision. Without an explicit argument or counter-example language that is recognized only after the local mechanism is added, the strict-enlargement and complementarity claims remain at risk.
  2. [experiments on formal languages] Experimental validation of the LTL fragments (experiments section): the formal-language tasks should report results across multiple window sizes w and precision levels, with explicit controls showing that performance jumps precisely when the second operator becomes available and that hybrid models reach the richest fragment. Current description leaves open whether the observed gains are due to the claimed expressivity increase or to other factors such as optimization dynamics.
minor comments (2)
  1. [introduction and background] Notation for the LTL fragments could be introduced earlier and used consistently when stating the global-only, local-only, and hybrid cases.
  2. [abstract] The abstract is concise but would benefit from naming the specific LTL operators involved (past operator for global, second operator for local) to make the expressivity claims immediately precise.
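
A hedged sketch of the controlled sweep that major comment 2 asks for: evaluate each attention pattern across window sizes and precision levels under identical hyperparameters. Here train_and_eval is a hypothetical placeholder for the paper's training-and-evaluation loop, and the grid values are illustrative assumptions.

```python
from itertools import product

# Illustrative grid; values are assumptions, not the paper's setup.
attention_patterns = ["global", "local", "hybrid"]
window_sizes = [1, 2, 4, 8]   # only meaningful for local and hybrid models
precision_bits = [4, 8, 16]   # fixed-precision levels to sweep

def train_and_eval(pattern: str, w: int, bits: int, language: str) -> float:
    """Hypothetical placeholder: train a recognizer with identical
    hyperparameters and return accuracy on held-out strings."""
    return 0.0  # stand-in; the real training/evaluation loop would go here

results = {}
for pattern, w, bits in product(attention_patterns, window_sizes, precision_bits):
    if pattern == "global" and w != window_sizes[0]:
        continue  # the window size is irrelevant for global-only models
    results[(pattern, w, bits)] = train_and_eval(pattern, w, bits, "L_two_ops")

# The prediction to check: accuracy jumps exactly where the second operator
# becomes expressible (e.g., at w >= 2), and only for local/hybrid patterns.
```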

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation for major revision. We address each of the major comments below and outline the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [theoretical results on local attention expressivity] The central proof that local attention adds an independent second temporal operator (abstract and theoretical results section): the formalization of the bounded window w together with fixed-precision arithmetic must be shown not to permit simulation of the missing operator (e.g., a “since” or “until” fragment) via interaction with global context or to collapse the fragments through implicit restrictions on w relative to precision. Without an explicit argument or counter-example language that is recognized only after the local mechanism is added, the strict-enlargement and complementarity claims remain at risk.

    Authors: We appreciate the referee's careful reading of the proof. The theoretical results section establishes the correspondence by showing that local attention with bounded window w and fixed precision can express the 'yesterday' operator (Y) in addition to the global 'previous' operator (P), yielding two past operators in LTL. To show strict enlargement, we exhibit a regular language (e.g., the language of strings whose second-to-last symbol has a given property, which requires looking back exactly two steps and is not capturable by P alone) that is not recognizable in single-past LTL but is in the two-operator fragment. Regarding simulation via global context: because the local window is strictly bounded, local attention cannot access arbitrary past positions the way an unbounded 'since' operator would, and fixed precision keeps the state space finite, preventing the fragments from collapsing. We acknowledge, however, that this argument could be made more explicit. We will add a dedicated paragraph or subsection to the theoretical results detailing why interaction with global attention does not allow simulation of the second operator, and we will state explicitly the counterexample language used to prove strict inclusion. revision: partial

  2. Referee: [experiments on formal languages] Experimental validation of the LTL fragments (experiments section): the formal-language tasks should report results across multiple window sizes w and precision levels, with explicit controls showing that performance jumps precisely when the second operator becomes available and that hybrid models reach the richest fragment. Current description leaves open whether the observed gains are due to the claimed expressivity increase or to other factors such as optimization dynamics.

    Authors: We agree that additional controls would strengthen the experimental corroboration. Our current experiments do vary window sizes for local attention and compare global, local, and hybrid models on formal language recognition tasks, showing improved performance for hybrids. To address the concern, we will expand the experiments section to include results for multiple precision levels (e.g., 4-bit, 8-bit, 16-bit) and a wider range of w values. We will also add an analysis identifying the point at which the second operator becomes expressible (w ≥ 2 for certain languages) and showing that performance jumps there, while controlling for model size and training dynamics by using identical hyperparameters. This will help isolate the expressivity effect from optimization factors. revision: yes
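
To make response 1's counterexample concrete, a minimal sketch (our reading of the rebuttal's example, not the paper's proof): the "second-to-last symbol" language is decided by inspecting only a bounded suffix, and a window of size 2 is exactly where the property becomes expressible.

```python
def second_to_last_is_a(s: str) -> bool:
    """The rebuttal's example language, as we read it: strings whose
    second-to-last symbol is 'a'. Requires a look-back of exactly 2."""
    return len(s) >= 2 and s[-2] == "a"

def window_recognizer(s: str, w: int) -> bool:
    """Decide membership using only the last w symbols of the string,
    mimicking a bounded local window at the final position."""
    suffix = s[-w:]
    return len(suffix) >= 2 and suffix[-2] == "a"

# With w = 1 the window recognizer rejects every string (the suffix is too
# short to inspect), so no window-1 check matches; w >= 2 is exact.
for s in ["ab", "aab", "ba", "b", "aa"]:
    assert window_recognizer(s, 2) == second_to_last_is_a(s)
```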

Circularity Check

0 steps flagged

No significant circularity; central claims rest on independent proofs extending a cited correspondence.

Full rationale

The paper cites a prior result establishing that fixed-precision global-attention transformers correspond to a single-past-operator fragment of LTL, then supplies new proofs that local attention adds a second operator and that the two mechanisms are complementary. These proofs are presented as self-contained mathematical arguments about recognizer expressivity over regular languages. No steps reduce by construction to fitted parameters, self-definitions, or load-bearing self-citations whose content is merely renamed or assumed. The derivation chain therefore remains independent of the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on a previously established correspondence between fixed-precision global-attention transformers and a fragment of linear temporal logic; the paper extends this with new proofs for local attention without introducing free parameters or new entities.

axioms (1)
  • domain assumption: Fixed-precision transformers with global attention correspond to a fragment of linear temporal logic containing a single past operator.
    Stated as 'it has been shown' in the abstract, so treated as a background assumption from prior literature.

pith-pipeline@v0.9.0 · 5483 in / 1290 out tokens · 51695 ms · 2026-05-09T18:57:48.171664+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

40 extracted references · 22 canonical work pages · 8 internal anchors

  1. [1]

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. https://arxiv.org/abs/1607.06450 Layer normalization . In NIPS Deep Learning Symposium

  2. [2]

    Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. http://arxiv.org/abs/1409.0473 Neural machine translation by jointly learning to align and translate . In The Third International Conference on Learning Representations

  3. [3]

    Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. https://arxiv.org/abs/2004.05150 Longformer: The long-document transformer . Computing Research Repository, arXiv:2004.05150. Version 2

  4. [4]

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, and 12 others. 2020. https://proceedings.neurips.cc/paper_fil...

  5. [5]

    Janusz Brzozowski, Baiyu Li, and David Liu. 2012. https://dl.acm.org/doi/abs/10.5555/3173440.3173443 Syntactic complexities of six classes of star-free languages . Journal of Automata, Languages and Combinatorics, 17(2):83–105

  6. [6]

    Alexandra Butoi, Ghazal Khalighinejad, Anej Svete, Josef Valvoda, Ryan Cotterell, and Brian DuSell. 2025. https://openreview.net/forum?id=aWLQTbfFgV Training neural networks as recognizers of formal languages . In The Thirteenth International Conference on Learning Representations

  7. [7]

    David Chiang, Peter Cholak, and Anand Pillay. 2023. https://proceedings.mlr.press/v202/chiang23a.html Tighter bounds on the expressivity of transformer encoders . In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 5544--5562. PMLR

  8. [8]

    Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. http://arxiv.org/abs/1904.10509 Generating long sequences with sparse transformers . Computing Research Repository, arXiv:1904.10509

  9. [9]

    Joëlle Cohen, Dominique Perrin, and Jean-Eric Pin. 1993. https://doi.org/10.1016/0022-0000(93)90005-H On the expressive power of temporal logic . Journal of Computer and System Sciences, 46(3):271--294

  10. [10]

    DeepSeek-AI. 2025. https://arxiv.org/abs/2501.12948 DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. Computing Research Repository, arXiv:2501.12948

  11. [11]

    Grégoire Delétang, Anian Ruoss, Jordi Grau-Moya, Tim Genewein, Li Kevin Wenliang, Elliot Catt, Chris Cundy, Marcus Hutter, Shane Legg, Joel Veness, and Pedro A Ortega. 2023. https://openreview.net/forum?id=WbxHAzkeQcn Neural networks and the Chomsky hierarchy. In The Eleventh International Conference on Learning Representations

  12. [12]

    Samuel Eilenberg. 1974. https://books.google.ch/books?id=CZtduwEACAAJ Automata, Languages, and Machines . Number pt. 2 in 59/B. Academic Press

  13. [13]

    Dov Gabbay, Amir Pnueli, Saharon Shelah, and Jonathan Stavi. 1980. https://doi.org/10.1145/567446.567462 On the temporal analysis of fairness . POPL '80, page 163–173, New York, NY, USA. Association for Computing Machinery

  14. [14]

    James A. Green. 1951. http://www.jstor.org/stable/1969317 On the structure of semigroups . Annals of Mathematics, 54(1):163--172

  15. [15]

    John Hewitt, Michael Hahn, Surya Ganguli, Percy Liang, and Christopher D. Manning. 2020. https://doi.org/10.18653/v1/2020.emnlp-main.156 RNNs can generate bounded hierarchical languages with optimal memory. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1978--2010, Online. Association for Computa...

  16. [16]

    IEEE. 2019. https://doi.org/10.1109/IEEESTD.2019.8766229 IEEE standard for floating-point arithmetic. IEEE Std 754-2019

  17. [17]

    Johan Anthony Wilem Kamp. 1968. https://www.proquest.com/docview/302320357 Tense Logic and the Theory of Linear Order . Ph.D. thesis, University of California, Los Angeles

  18. [18]

    Diederik P. Kingma and Jimmy Ba. 2014. https://arxiv.org/abs/1412.6980 Adam: A method for stochastic optimization . Computing Research Repository, arXiv:1412.6980. Version 9

  19. [19]

    Stephen C. Kleene. 1956. https://doi.org/doi:10.1515/9781400882618-002 Representation of Events in Nerve Nets and Finite Automata , pages 3--42. Princeton University Press, Princeton

  20. [20]

    Jiaoda Li and Ryan Cotterell. 2025. https://openreview.net/forum?id=29LwAgLFpj Characterizing the expressivity of fixed-precision transformer language models . In The Thirty-ninth Annual Conference on Neural Information Processing Systems

  21. [21]

    Zhiyuan Li, Hong Liu, Denny Zhou, and Tengyu Ma. 2024. https://openreview.net/forum?id=3EWTEy9MTM Chain of thought empowers transformers to solve inherently serial problems . In The Twelfth International Conference on Learning Representations

  22. [22]

    Ilya Loshchilov and Frank Hutter. 2019. https://openreview.net/forum?id=Bkg6RiCqY7 Decoupled weight decay regularization . In The Seventh International Conference on Learning Representations

  23. [23]

    Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. https://doi.org/10.18653/v1/D15-1166 Effective approaches to attention-based neural machine translation . In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412--1421, Lisbon, Portugal. Association for Computational Linguistics

  24. [24]

    R. McNaughton and S. Papert. 1971. https://mitpress.mit.edu/9780262130769/counter-free-automata/ Counter-free Automata . M.I.T. Press research monographs. M.I.T. Press

  25. [25]

    William Merrill and Ashish Sabharwal. 2024. https://openreview.net/forum?id=NjNGlPh8Wh The expressive power of transformers with chain of thought . In The Twelfth International Conference on Learning Representations

  26. [26]

    OpenAI. 2023. https://doi.org/10.48550/arXiv.2303.08774 GPT-4 technical report . Computing Research Repository, arXiv:2303.08774

  27. [27]

    Micha A. Perles, Michael O. Rabin, and Eli Shamir. 1963. https://api.semanticscholar.org/CorpusID:45448007 The theory of definite automata . IEEE Transactions on Electronic Computers, 12:233--243

  28. [28]

    Amir Pnueli. 1977. https://doi.org/10.1109/SFCS.1977.32 The temporal logic of programs . In 18th Annual Symposium on Foundations of Computer Science (sfcs 1977), pages 46--57

  29. [29]

    Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf Improving language understanding by generative pre-training . OpenAI technical report

  30. [30]

    Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf Language models are unsupervised multitask learners . OpenAI technical report

  31. [31]

    J \"u rgen Schmidhuber. 1992. https://ieeexplore.ieee.org/document/6796337 Learning to control fast-weight memories: An alternative to dynamic recurrent networks . In Neural Computation, volume 4, pages 131--139. MIT Press

  32. [32]

    Howard Straubing. 1994. https://doi.org/10.1007/978-1-4612-0289-9 Finite Automata, Formal Logic, and Circuit Complexity. Progress in Theoretical Computer Science. Birkhäuser Boston, Boston, MA

  33. [33]

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. 2024. https://doi.org/10.1016/j.neucom.2023.127063 Roformer: Enhanced transformer with rotary position embedding . Neurocomputing, 568:127063

  34. [34]

    OLMo Team, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Chang, Khyathi Chandu, Akshita Bhagia, Oyvind Tafjord, and 1 others. 2025. https://arxiv.org/abs/2501.00656 OLMo 2 : The best fully open language model to date . Computing Research Repository, arXiv:2501.00656

  35. [35]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc

  36. [36]

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H Chi, Quoc V Le, and Denny Zhou. 2022. https://openreview.net/forum?id=_VjQlMeSB_J Chain-of-thought prompting elicits reasoning in large language models . In Advances in Neural Information Processing Systems, volume 35

  37. [37]

    Andy Yang, Michaël Cadilhac, and David Chiang. 2025. https://openreview.net/forum?id=jPduiyxyfw Knee-deep in C-RASP: A transformer depth hierarchy. In The Thirty-ninth Annual Conference on Neural Information Processing Systems

  38. [38]

    Andy Yang, David Chiang, and Dana Angluin. 2024. https://openreview.net/forum?id=FBMsBdH0yz Masked hard-attention transformers recognize exactly the star-free languages . In The Thirty-eighth Annual Conference on Neural Information Processing Systems

  39. [39]

    Andy Yang, Lena Strobl, David Chiang, and Dana Angluin. 2026. https://aclanthology.org/2026.tacl-1.8/ Simulating hard attention using soft attention . Transactions of the Association for Computational Linguistics, 14

  40. [40]

    Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. 2020. https://proceedings.neurips.cc/paper_files/paper/2020/file/c8512d142a2d849725f31a9a7a361ab9-Paper.pdf Big bird: Transformers for longer sequences . In Advances in Neural Information Pr...