Characterizing the Expressivity of Local Attention in Transformers

Jiaoda Li; Ryan Cotterell

arxiv: 2605.00768 · v3 · pith:UL5I747Mnew · submitted 2026-05-01 · 💻 cs.CL

Characterizing the Expressivity of Local Attention in Transformers

Jiaoda Li , Ryan Cotterell This is my paper

Pith reviewed 2026-07-01 07:51 UTC · model grok-4.3

classification 💻 cs.CL

keywords transformerslocal attentionexpressivityregular languageslinear temporal logicattention mechanisms

0 comments

The pith

Adding local attention to fixed-precision transformers introduces a second temporal operator and strictly enlarges the class of recognizable regular languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that global attention in fixed-precision transformers matches a fragment of linear temporal logic with only one past operator. Introducing local attention adds a second temporal operator, which strictly increases the set of regular languages the model can recognize. Global and local attention are complementary: each can recognize languages the other cannot, and their combination covers the largest fragment. This formal distinction provides an expressivity-based explanation for why local attention sometimes improves model performance even though it restricts the attention window. Experiments on formal languages and natural language modeling support the theoretical separation.

Core claim

Fixed-precision transformers with global attention correspond to a fragment of linear temporal logic containing a single past operator. Adding local attention introduces a second temporal operator, strictly enlarging the class of recognizable regular languages. Global and local attention are expressively complementary: neither subsumes the other, and combining them yields the richest fragment.

What carries the argument

The mapping from fixed-precision transformer attention mechanisms to fragments of linear temporal logic, where global attention supplies one past operator and local attention supplies a second.

If this is right

Hybrid global-local transformers recognize strictly more regular languages than global-only transformers.
Local-attention-only transformers recognize some languages that global-attention-only transformers cannot.
Global-attention-only transformers recognize some languages that local-attention-only transformers cannot.
The richest fragment of regular languages is obtained only when both attention types are present.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The result suggests testing whether languages requiring the second operator appear more frequently in natural language data that benefits from local attention.
One could check if the enlarged fragment corresponds to specific syntactic constructions that local windows capture better than full context.
The complementarity implies that architectures restricted to only one attention type will have inherent limits on the temporal patterns they can model exactly.

Load-bearing premise

The prior result that fixed-precision transformers with global attention correspond exactly to a fragment of linear temporal logic containing a single past operator holds under the paper's definitions of fixed precision and attention variants.

What would settle it

A concrete regular language that global-attention transformers can recognize but hybrid global-local transformers cannot, or vice versa, under the paper's fixed-precision model.

Figures

Figures reproduced from arXiv: 2605.00768 by Jiaoda Li, Ryan Cotterell.

**Figure 1.** Figure 1: Forbidden configuration in the minimal DFAs of view at source ↗

**Figure 2.** Figure 2: Minimal DFAs. Nodes represent states and arrows view at source ↗

**Figure 3.** Figure 3: Heatmap of longest perfect lengths (maximum over runs) across formal languages, attention patterns, and positional view at source ↗

**Figure 4.** Figure 4: Perplexity on WikiText-2 for local, hybrid, and global attention patterns under different positional encodings. Curves view at source ↗

read the original abstract

The transformer is the most popular neural architecture for language modeling. The cornerstone of the transformer is its global attention mechanism, which lets the model aggregate information from all preceding tokens before generating the next token. One common variant of attention is called local attention, which restricts each token to aggregating information from a bounded window of predecessors, reducing the quadratic cost of global attention to linear. Although this restriction is usually motivated by efficiency, it has also been found to improve model quality, a phenomenon that has so far lacked a satisfactory explanation. We provide a formal account of this phenomenon in terms of recognizer expressivity. It has been shown that fixed-precision transformers with global attention correspond to a fragment of linear temporal logic containing a single past operator. We additionally prove that adding local attention introduces a second temporal operator, strictly enlarging the class of recognizable regular languages. Moreover, global and local attention are expressively complementary: neither subsumes the other, and combining them yields the richest fragment. Experiments on formal language recognition and natural language modeling corroborate the theory, showing that hybrid global--local transformers outperform their global-only counterparts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Local attention adds a second temporal operator that global attention lacks, so the two are complementary rather than one subsuming the other.

read the letter

The main point is that the paper proves local attention strictly enlarges the class of regular languages recognizable by fixed-precision transformers. Global attention matches a fragment of linear temporal logic with one past operator; local attention adds a distinct second operator, and neither fragment contains the other. Hybrid attention gets the richest fragment. This supplies a formal reason why local attention can improve quality beyond just speed.

The work does two things cleanly. It gives an expressivity account that matches observed behavior in practice, and it includes experiments on both formal language recognition and natural language modeling that line up with the theory. The complementarity result is the genuinely new piece relative to earlier global-attention characterizations.

The soft spot is the base case. The argument rests on the cited prior result that global attention equals the single-past fragment. The abstract treats this as given without showing an independent check under the paper's own definitions of precision, masking, and state. If those definitions diverge even modestly, the enlargement claim does not go through. The experiments are described only at a high level, so it is unclear how tightly they test the logical distinction versus other factors.

This is for readers who care about formal accounts of attention mechanisms or hybrid architectures. Anyone working on theoretical CS or NLP expressivity questions will find the logic useful. It is solid enough to merit a serious referee, though the dependence on the prior correspondence needs explicit verification in review.

Referee Report

1 major / 2 minor

Summary. The paper claims that fixed-precision transformers with global attention correspond exactly to a fragment of linear temporal logic containing a single past operator. It proves that adding local attention introduces a second temporal operator, strictly enlarging the class of recognizable regular languages. Global and local attention are expressively complementary (neither subsumes the other), and their combination yields the richest fragment. Experiments on formal language recognition tasks and natural language modeling are said to corroborate the theory.

Significance. If the central correspondence and enlargement results hold, the work supplies a formal account of why local attention can improve model quality beyond efficiency gains, by increasing the recognizable regular languages. The explicit proofs of the new local-attention result and the complementarity claim, together with the experimental corroboration, constitute clear strengths. The framing in terms of LTL fragments is a useful lens for attention mechanism design.

major comments (1)

[§2] §2 (or the section invoking the global-attention base case): The claim that fixed-precision global attention 'corresponds to a fragment of linear temporal logic containing a single past operator' is introduced solely by citation ('it has been shown') without a self-contained re-derivation or explicit verification that the paper's definitions of fixed precision, state representation, and attention masking align with those of the cited prior work. This base case is load-bearing for the strict-enlargement argument that local attention adds a second operator.

minor comments (2)

[Abstract] Abstract: 'fixed-precision' is used without a brief parenthetical definition or pointer to its formalization later in the paper.
[Experiments] The experimental section would benefit from an explicit statement of the local window size(s) used in the hybrid models and from reporting variance across random seeds for the language-modeling perplexity numbers.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the positive assessment of the work's significance and for the constructive comment. We address it point by point below.

read point-by-point responses

Referee: [§2] §2 (or the section invoking the global-attention base case): The claim that fixed-precision global attention 'corresponds to a fragment of linear temporal logic containing a single past operator' is introduced solely by citation ('it has been shown') without a self-contained re-derivation or explicit verification that the paper's definitions of fixed precision, state representation, and attention masking align with those of the cited prior work. This base case is load-bearing for the strict-enlargement argument that local attention adds a second operator.

Authors: We agree that the base case is load-bearing and that sole reliance on citation leaves the alignment between our definitions (fixed precision, state representation, attention masking) and the cited prior work implicit. In the revised manuscript we will add a concise, self-contained re-derivation of the global-attention correspondence as a new subsection or appendix. This will explicitly verify that our setup yields precisely the single-past-operator LTL fragment, thereby making the strict-enlargement argument fully internal to the paper. revision: yes

Circularity Check

0 steps flagged

No significant circularity; central claim rests on cited external prior result plus independent new proof

full rationale

The paper cites an external prior result ('It has been shown that fixed-precision transformers with global attention correspond to a fragment of linear temporal logic containing a single past operator') as the base case and then states its own additional proof ('We additionally prove that adding local attention introduces a second temporal operator'). No equations or definitions in the provided text reduce the new local-attention result to the global case by construction, nor is there any fitted parameter renamed as prediction, self-definitional loop, or ansatz smuggled via self-citation. The cited base is treated as independent support, and the enlargement claim is presented as a fresh derivation. This is the normal case of a self-contained argument against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the unexamined prior correspondence between global attention and single-operator LTL, plus standard definitions of regular languages and local attention windows. No free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption Fixed-precision transformers with global attention correspond to a fragment of linear temporal logic containing a single past operator.
Invoked in the abstract as 'it has been shown' without re-derivation.

pith-pipeline@v0.9.1-grok · 5714 in / 1217 out tokens · 26757 ms · 2026-07-01T07:51:09.061060+00:00 · methodology

Review history (3 revisions) →

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Revisiting Padded Transformer Expressivity: Which Architectural Choices Matter and Which Don't
cs.LG 2026-05 unverdicted novelty 7.0

Padded transformers with constant or growing precision are equivalent to L-uniform AC^0 or TC^0, and with looping reach FO-uniform AC^d or TC^d, robust to width and attention mechanism.

Reference graph

Works this paper leans on

43 extracted references · 22 canonical work pages · cited by 1 Pith paper · 8 internal anchors

[1]

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. https://arxiv.org/abs/1607.06450 Layer normalization . In NIPS Deep Learning Symposium

work page internal anchor Pith review Pith/arXiv arXiv 2016
[2]

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. http://arxiv.org/abs/1409.0473 Neural machine translation by jointly learning to align and translate . In The Third International Conference on Learning Representations

work page internal anchor Pith review Pith/arXiv arXiv 2015
[3]

Longformer: The Long-Document Transformer

Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. https://arxiv.org/abs/2004.05150 Longformer: The long-document transformer . Computing Research Repository, arXiv:2004.05150. Version 2

work page internal anchor Pith review Pith/arXiv arXiv 2020
[4]

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, and 12 others. 2020. https://proceedings.neurips.cc/paper_fil...

2020
[5]

Janusz Brzozowski, Baiyu Li, and David Liu. 2012. https://dl.acm.org/doi/abs/10.5555/3173440.3173443 Syntactic complexities of six classes of star-free languages . Journal of Automata, Languages and Combinatorics, 17(2):83–105

work page doi:10.5555/3173440.3173443 2012
[6]

Alexandra Butoi, Ghazal Khalighinejad, Anej Svete, Josef Valvoda, Ryan Cotterell, and Brian DuSell. 2025. https://openreview.net/forum?id=aWLQTbfFgV Training neural networks as recognizers of formal languages . In The Thirteenth International Conference on Learning Representations

2025
[7]

David Chiang, Peter Cholak, and Anand Pillay. 2023. https://proceedings.mlr.press/v202/chiang23a.html Tighter bounds on the expressivity of transformer encoders . In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 5544--5562. PMLR

2023
[8]

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. http://arxiv.org/abs/1904.10509 Generating long sequences with sparse transformers . Computing Research Repository, arXiv:1904.10509

work page internal anchor Pith review Pith/arXiv arXiv 2019
[9]

Joëlle Cohen, Dominique Perrin, and Jean-Eric Pin. 1993. https://doi.org/10.1016/0022-0000(93)90005-H On the expressive power of temporal logic . Journal of Computer and System Sciences, 46(3):271--294

work page doi:10.1016/0022-0000(93)90005-h 1993
[10]

DeepSeek-AI . 2025. https://arxiv.org/abs/2501.12948 DeepSeek-R1 : Incentivizing reasoning capability in LLMs via reinforcement learning . Computing Research Repository, arXiv:2501.12948

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

Gr\'egoire Del\'etang, Anian Ruoss, Jordi Grau-Moya, Tim Genewein, Li Kevin Wenliang, Elliot Catt, Chris Cundy, Marcus Hutter, Shane Legg, Joel Veness, and Pedro A Ortega. 2023. https://openreview.net/forum?id=WbxHAzkeQcn Neural networks and the chomsky hierarchy . In The Eleventh International Conference on Learning Representations

2023
[12]

Samuel Eilenberg. 1974. https://books.google.ch/books?id=CZtduwEACAAJ Automata, Languages, and Machines . Number pt. 2 in 59/B. Academic Press

1974
[13]

Dov Gabbay, Amir Pnueli, Saharon Shelah, and Jonathan Stavi. 1980. https://doi.org/10.1145/567446.567462 On the temporal analysis of fairness . POPL '80, page 163–173, New York, NY, USA. Association for Computing Machinery

work page doi:10.1145/567446.567462 1980
[14]

James A. Green. 1951. http://www.jstor.org/stable/1969317 On the structure of semigroups . Annals of Mathematics, 54(1):163--172

work page arXiv 1951
[15]

John Hewitt, Michael Hahn, Surya Ganguli, Percy Liang, and Christopher D. Manning. 2020. https://doi.org/10.18653/v1/2020.emnlp-main.156 RNN s can generate bounded hierarchical languages with optimal memory . In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1978--2010, Online. Association for Computa...

work page doi:10.18653/v1/2020.emnlp-main.156 2020
[16]

IEEE . 2019. https://doi.org/10.1109/IEEESTD.2019.8766229 IEEE standard for floating-point arithmetic . IEEE Std 754-2019

work page doi:10.1109/ieeestd.2019.8766229 2019
[17]

Johan Anthony Wilem Kamp. 1968. https://www.proquest.com/docview/302320357 Tense Logic and the Theory of Linear Order . Ph.D. thesis, University of California, Los Angeles

work page arXiv 1968
[18]

Adam: A Method for Stochastic Optimization

Diederik P. Kingma and Jimmy Ba. 2014. https://arxiv.org/abs/1412.6980 Adam: A method for stochastic optimization . Computing Research Repository, arXiv:1412.6980. Version 9

work page internal anchor Pith review Pith/arXiv arXiv 2014
[19]

Stephen C. Kleene. 1956. https://doi.org/doi:10.1515/9781400882618-002 Representation of Events in Nerve Nets and Finite Automata , pages 3--42. Princeton University Press, Princeton

work page doi:10.1515/9781400882618-002 1956
[20]

Jiaoda Li and Ryan Cotterell. 2025. https://openreview.net/forum?id=29LwAgLFpj Characterizing the expressivity of fixed-precision transformer language models . In The Thirty-ninth Annual Conference on Neural Information Processing Systems

2025
[21]

Zhiyuan Li, Hong Liu, Denny Zhou, and Tengyu Ma. 2024. https://openreview.net/forum?id=3EWTEy9MTM Chain of thought empowers transformers to solve inherently serial problems . In The Twelfth International Conference on Learning Representations

2024
[22]

Ilya Loshchilov and Frank Hutter. 2019. https://openreview.net/forum?id=Bkg6RiCqY7 Decoupled weight decay regularization . In The Seventh International Conference on Learning Representations

2019
[23]

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. https://doi.org/10.18653/v1/D15-1166 Effective approaches to attention-based neural machine translation . In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412--1421, Lisbon, Portugal. Association for Computational Linguistics

work page doi:10.18653/v1/d15-1166 2015
[24]

Robert McNaughton and Seymour Papert. 1971. https://mitpress.mit.edu/9780262130769/counter-free-automata/ Counter-free Automata . M.I.T. Press research monographs. M.I.T. Press

work page arXiv 1971
[25]

William Merrill and Ashish Sabharwal. 2024. https://openreview.net/forum?id=NjNGlPh8Wh The expressive power of transformers with chain of thought . In The Twelfth International Conference on Learning Representations

2024
[26]

OpenAI. 2023. https://doi.org/10.48550/arXiv.2303.08774 GPT-4 technical report . Computing Research Repository, arXiv:2303.08774

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2303.08774 2023
[27]

Perles, Michael O

Micha A. Perles, Michael O. Rabin, and Eli Shamir. 1963. https://api.semanticscholar.org/CorpusID:45448007 The theory of definite automata . IEEE Transactions on Electronic Computers, 12:233--243

1963
[28]

Amir Pnueli. 1977. https://doi.org/10.1109/SFCS.1977.32 The temporal logic of programs . In 18th Annual Symposium on Foundations of Computer Science (sfcs 1977), pages 46--57

work page doi:10.1109/sfcs.1977.32 1977
[29]

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf Improving language understanding by generative pre-training . OpenAI technical report

2018
[30]

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf Language models are unsupervised multitask learners . OpenAI technical report

2019
[31]

J \"u rgen Schmidhuber. 1992. https://ieeexplore.ieee.org/document/6796337 Learning to control fast-weight memories: An alternative to dynamic recurrent networks . In Neural Computation, volume 4, pages 131--139. MIT Press

work page arXiv 1992
[32]

Howard Straubing. 1994. https://doi.org/10.1007/978-1-4612-0289-9 Finite Automata, Formal Logic, and Circuit Complexity . Progress in Theoretical Computer Science. Birkh \"a user Boston, Boston, MA

work page doi:10.1007/978-1-4612-0289-9 1994
[33]

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. 2024. https://doi.org/10.1016/j.neucom.2023.127063 Roformer: Enhanced transformer with rotary position embedding . Neurocomputing, 568:127063

work page doi:10.1016/j.neucom.2023.127063 2024
[34]

OLMo Team, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Chang, Khyathi Chandu, Akshita Bhagia, Oyvind Tafjord, and 1 others. 2025. https://arxiv.org/abs/2501.00656 OLMo 2 : The best fully open language model to date . Computing Research Repository, arXiv:2501.00656

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, ukasz Kaiser, and Illia Polosukhin. 2017. https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf Attention is all you need . In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc

2017
[36]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H Chi, Quoc V Le, and Denny Zhou. 2022. https://openreview.net/forum?id=_VjQlMeSB_J Chain-of-thought prompting elicits reasoning in large language models . In Advances in Neural Information Processing Systems, volume 35

2022
[37]

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, and 3 others. 2020. https://www.aclweb.org/anthology/2020.emnlp-demos.6 Transformers...

2020
[38]

Andy Yang, Micha \"e l Cadilhac, and David Chiang. 2025. https://openreview.net/forum?id=jPduiyxyfw Knee-deep in c- RASP : A transformer depth hierarchy . In The Thirty-ninth Annual Conference on Neural Information Processing Systems

2025
[39]

Andy Yang, David Chiang, and Dana Angluin. 2024. https://openreview.net/forum?id=FBMsBdH0yz Masked hard-attention transformers recognize exactly the star-free languages . In The Thirty-eighth Annual Conference on Neural Information Processing Systems

2024
[40]

Andy Yang, Lena Strobl, David Chiang, and Dana Angluin. 2026. https://aclanthology.org/2026.tacl-1.8/ Simulating hard attention using soft attention . Transactions of the Association for Computational Linguistics, 14

2026
[41]

Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. 2020. https://proceedings.neurips.cc/paper_files/paper/2020/file/c8512d142a2d849725f31a9a7a361ab9-Paper.pdf Big bird: Transformers for longer sequences . In Advances in Neural Information Pr...

2020
[42]

online" 'onlinestring :=

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint eprinttype howpublished institution journal key month note number organization pages publisher school series title type volume year doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRING...
[43]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

[1] [1]

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. https://arxiv.org/abs/1607.06450 Layer normalization . In NIPS Deep Learning Symposium

work page internal anchor Pith review Pith/arXiv arXiv 2016

[2] [2]

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. http://arxiv.org/abs/1409.0473 Neural machine translation by jointly learning to align and translate . In The Third International Conference on Learning Representations

work page internal anchor Pith review Pith/arXiv arXiv 2015

[3] [3]

Longformer: The Long-Document Transformer

Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. https://arxiv.org/abs/2004.05150 Longformer: The long-document transformer . Computing Research Repository, arXiv:2004.05150. Version 2

work page internal anchor Pith review Pith/arXiv arXiv 2020

[4] [4]

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, and 12 others. 2020. https://proceedings.neurips.cc/paper_fil...

2020

[5] [5]

Janusz Brzozowski, Baiyu Li, and David Liu. 2012. https://dl.acm.org/doi/abs/10.5555/3173440.3173443 Syntactic complexities of six classes of star-free languages . Journal of Automata, Languages and Combinatorics, 17(2):83–105

work page doi:10.5555/3173440.3173443 2012

[6] [6]

Alexandra Butoi, Ghazal Khalighinejad, Anej Svete, Josef Valvoda, Ryan Cotterell, and Brian DuSell. 2025. https://openreview.net/forum?id=aWLQTbfFgV Training neural networks as recognizers of formal languages . In The Thirteenth International Conference on Learning Representations

2025

[7] [7]

David Chiang, Peter Cholak, and Anand Pillay. 2023. https://proceedings.mlr.press/v202/chiang23a.html Tighter bounds on the expressivity of transformer encoders . In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 5544--5562. PMLR

2023

[8] [8]

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. http://arxiv.org/abs/1904.10509 Generating long sequences with sparse transformers . Computing Research Repository, arXiv:1904.10509

work page internal anchor Pith review Pith/arXiv arXiv 2019

[9] [9]

Joëlle Cohen, Dominique Perrin, and Jean-Eric Pin. 1993. https://doi.org/10.1016/0022-0000(93)90005-H On the expressive power of temporal logic . Journal of Computer and System Sciences, 46(3):271--294

work page doi:10.1016/0022-0000(93)90005-h 1993

[10] [10]

DeepSeek-AI . 2025. https://arxiv.org/abs/2501.12948 DeepSeek-R1 : Incentivizing reasoning capability in LLMs via reinforcement learning . Computing Research Repository, arXiv:2501.12948

work page internal anchor Pith review Pith/arXiv arXiv 2025

[11] [11]

Gr\'egoire Del\'etang, Anian Ruoss, Jordi Grau-Moya, Tim Genewein, Li Kevin Wenliang, Elliot Catt, Chris Cundy, Marcus Hutter, Shane Legg, Joel Veness, and Pedro A Ortega. 2023. https://openreview.net/forum?id=WbxHAzkeQcn Neural networks and the chomsky hierarchy . In The Eleventh International Conference on Learning Representations

2023

[12] [12]

Samuel Eilenberg. 1974. https://books.google.ch/books?id=CZtduwEACAAJ Automata, Languages, and Machines . Number pt. 2 in 59/B. Academic Press

1974

[13] [13]

Dov Gabbay, Amir Pnueli, Saharon Shelah, and Jonathan Stavi. 1980. https://doi.org/10.1145/567446.567462 On the temporal analysis of fairness . POPL '80, page 163–173, New York, NY, USA. Association for Computing Machinery

work page doi:10.1145/567446.567462 1980

[14] [14]

James A. Green. 1951. http://www.jstor.org/stable/1969317 On the structure of semigroups . Annals of Mathematics, 54(1):163--172

work page arXiv 1951

[15] [15]

John Hewitt, Michael Hahn, Surya Ganguli, Percy Liang, and Christopher D. Manning. 2020. https://doi.org/10.18653/v1/2020.emnlp-main.156 RNN s can generate bounded hierarchical languages with optimal memory . In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1978--2010, Online. Association for Computa...

work page doi:10.18653/v1/2020.emnlp-main.156 2020

[16] [16]

IEEE . 2019. https://doi.org/10.1109/IEEESTD.2019.8766229 IEEE standard for floating-point arithmetic . IEEE Std 754-2019

work page doi:10.1109/ieeestd.2019.8766229 2019

[17] [17]

Johan Anthony Wilem Kamp. 1968. https://www.proquest.com/docview/302320357 Tense Logic and the Theory of Linear Order . Ph.D. thesis, University of California, Los Angeles

work page arXiv 1968

[18] [18]

Adam: A Method for Stochastic Optimization

Diederik P. Kingma and Jimmy Ba. 2014. https://arxiv.org/abs/1412.6980 Adam: A method for stochastic optimization . Computing Research Repository, arXiv:1412.6980. Version 9

work page internal anchor Pith review Pith/arXiv arXiv 2014

[19] [19]

Stephen C. Kleene. 1956. https://doi.org/doi:10.1515/9781400882618-002 Representation of Events in Nerve Nets and Finite Automata , pages 3--42. Princeton University Press, Princeton

work page doi:10.1515/9781400882618-002 1956

[20] [20]

Jiaoda Li and Ryan Cotterell. 2025. https://openreview.net/forum?id=29LwAgLFpj Characterizing the expressivity of fixed-precision transformer language models . In The Thirty-ninth Annual Conference on Neural Information Processing Systems

2025

[21] [21]

Zhiyuan Li, Hong Liu, Denny Zhou, and Tengyu Ma. 2024. https://openreview.net/forum?id=3EWTEy9MTM Chain of thought empowers transformers to solve inherently serial problems . In The Twelfth International Conference on Learning Representations

2024

[22] [22]

Ilya Loshchilov and Frank Hutter. 2019. https://openreview.net/forum?id=Bkg6RiCqY7 Decoupled weight decay regularization . In The Seventh International Conference on Learning Representations

2019

[23] [23]

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. https://doi.org/10.18653/v1/D15-1166 Effective approaches to attention-based neural machine translation . In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412--1421, Lisbon, Portugal. Association for Computational Linguistics

work page doi:10.18653/v1/d15-1166 2015

[24] [24]

Robert McNaughton and Seymour Papert. 1971. https://mitpress.mit.edu/9780262130769/counter-free-automata/ Counter-free Automata . M.I.T. Press research monographs. M.I.T. Press

work page arXiv 1971

[25] [25]

William Merrill and Ashish Sabharwal. 2024. https://openreview.net/forum?id=NjNGlPh8Wh The expressive power of transformers with chain of thought . In The Twelfth International Conference on Learning Representations

2024

[26] [26]

OpenAI. 2023. https://doi.org/10.48550/arXiv.2303.08774 GPT-4 technical report . Computing Research Repository, arXiv:2303.08774

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2303.08774 2023

[27] [27]

Perles, Michael O

Micha A. Perles, Michael O. Rabin, and Eli Shamir. 1963. https://api.semanticscholar.org/CorpusID:45448007 The theory of definite automata . IEEE Transactions on Electronic Computers, 12:233--243

1963

[28] [28]

Amir Pnueli. 1977. https://doi.org/10.1109/SFCS.1977.32 The temporal logic of programs . In 18th Annual Symposium on Foundations of Computer Science (sfcs 1977), pages 46--57

work page doi:10.1109/sfcs.1977.32 1977

[29] [29]

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf Improving language understanding by generative pre-training . OpenAI technical report

2018

[30] [30]

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf Language models are unsupervised multitask learners . OpenAI technical report

2019

[31] [31]

J \"u rgen Schmidhuber. 1992. https://ieeexplore.ieee.org/document/6796337 Learning to control fast-weight memories: An alternative to dynamic recurrent networks . In Neural Computation, volume 4, pages 131--139. MIT Press

work page arXiv 1992

[32] [32]

Howard Straubing. 1994. https://doi.org/10.1007/978-1-4612-0289-9 Finite Automata, Formal Logic, and Circuit Complexity . Progress in Theoretical Computer Science. Birkh \"a user Boston, Boston, MA

work page doi:10.1007/978-1-4612-0289-9 1994

[33] [33]

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. 2024. https://doi.org/10.1016/j.neucom.2023.127063 Roformer: Enhanced transformer with rotary position embedding . Neurocomputing, 568:127063

work page doi:10.1016/j.neucom.2023.127063 2024

[34] [34]

OLMo Team, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Chang, Khyathi Chandu, Akshita Bhagia, Oyvind Tafjord, and 1 others. 2025. https://arxiv.org/abs/2501.00656 OLMo 2 : The best fully open language model to date . Computing Research Repository, arXiv:2501.00656

work page internal anchor Pith review Pith/arXiv arXiv 2025

[35] [35]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, ukasz Kaiser, and Illia Polosukhin. 2017. https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf Attention is all you need . In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc

2017

[36] [36]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H Chi, Quoc V Le, and Denny Zhou. 2022. https://openreview.net/forum?id=_VjQlMeSB_J Chain-of-thought prompting elicits reasoning in large language models . In Advances in Neural Information Processing Systems, volume 35

2022

[37] [37]

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, and 3 others. 2020. https://www.aclweb.org/anthology/2020.emnlp-demos.6 Transformers...

2020

[38] [38]

Andy Yang, Micha \"e l Cadilhac, and David Chiang. 2025. https://openreview.net/forum?id=jPduiyxyfw Knee-deep in c- RASP : A transformer depth hierarchy . In The Thirty-ninth Annual Conference on Neural Information Processing Systems

2025

[39] [39]

Andy Yang, David Chiang, and Dana Angluin. 2024. https://openreview.net/forum?id=FBMsBdH0yz Masked hard-attention transformers recognize exactly the star-free languages . In The Thirty-eighth Annual Conference on Neural Information Processing Systems

2024

[40] [40]

Andy Yang, Lena Strobl, David Chiang, and Dana Angluin. 2026. https://aclanthology.org/2026.tacl-1.8/ Simulating hard attention using soft attention . Transactions of the Association for Computational Linguistics, 14

2026

[41] [41]

Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. 2020. https://proceedings.neurips.cc/paper_files/paper/2020/file/c8512d142a2d849725f31a9a7a361ab9-Paper.pdf Big bird: Transformers for longer sequences . In Advances in Neural Information Pr...

2020

[42] [42]

online" 'onlinestring :=

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint eprinttype howpublished institution journal key month note number organization pages publisher school series title type volume year doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRING...

[43] [43]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...