Rethinking the Role of Positional Encoding: Sliding-Window Transformers without PE Remain Turing Complete

Qian Li; Shang-Hua Teng; Xinyu Mao

arxiv: 2606.01532 · v2 · pith:UCMEJF3Qnew · submitted 2026-06-01 · 💻 cs.LG · cs.CC

Rethinking the Role of Positional Encoding: Sliding-Window Transformers without PE Remain Turing Complete

Qian Li , Xinyu Mao , Shang-Hua Teng This is my paper

Pith reviewed 2026-06-28 15:46 UTC · model grok-4.3

classification 💻 cs.LG cs.CC

keywords positional encodingsliding windowTuring completenesstransformersHIST modelPost machinespermutation symmetryautoregressive model

0 comments

The pith

Sliding-window transformers without positional encoding remain Turing complete.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper challenges the assumption that positional encodings are required for transformers to process ordered sequences during long-form reasoning with sliding context windows. It introduces the HIST model as an abstract autoregressive system whose state updates depend solely on a fixed-size internal memory and the histogram of token counts inside the current window. The authors prove the HIST model is Turing complete by showing that successive histograms uniquely identify the token that just exited the window, which allows simulation of Post machines. They then build an explicit sliding-window transformer over a finite token alphabet that reproduces the HIST updates without any positional encoding. This shows the sliding action itself supplies enough order information to support arbitrary computation.

Core claim

A sliding-window transformer over a constant-size token alphabet, without positional encoding, can simulate the HIST model, which is Turing complete because the evolution of token-count histograms inside the window reveals the exact token that has departed and thereby permits simulation of Post machines.

What carries the argument

The HIST model, an abstract autoregressive model whose updates depend only on constant-size internal state and the token-count histogram within the current window.

If this is right

Transformers achieve Turing completeness in the sliding-window regime without relying on positional encodings.
The window-sliding operation alone breaks the permutation invariance of attention sufficiently for universal computation.
Universal computation holds for any transformer whose attention is restricted to a fixed-size context and whose token alphabet is finite.
Order information can be recovered implicitly from changes in token histograms over successive steps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Long-context models that already use sliding windows may achieve reasoning capabilities without adding explicit positional encodings.
Similar histogram-tracking mechanisms could be examined in other architectures to see whether they substitute for positional signals in tasks beyond pure computation.
The result raises the question of how much explicit position information is truly needed once context windows move.

Load-bearing premise

The evolution of token counts inside the sliding window is sufficient to recover the exact token that has just left the window.

What would settle it

A concrete sequence of tokens for which two different departed tokens produce identical subsequent histogram trajectories inside the window, so that the departed token cannot be uniquely recovered.

Figures

Figures reproduced from arXiv: 2606.01532 by Qian Li, Shang-Hua Teng, Xinyu Mao.

read the original abstract

Positional encoding (PE) is widely viewed as necessary for transformers to process ordered sequences: without them, the next-token map appears permutation-invariant in its context tokens. This intuition underlies all prior universality results, which rely on positional information to prove that transformers with chain-of-thought can perform arbitrary computation, i.e., they are Turing complete. We revisit this belief in the regime most relevant to long-form reasoning, where generation proceeds through a finite sliding context window. Our opening perception is that the window mechanism itself (mildly) breaks the permutation symmetry. To distill and precisely capture the degree of this added expressiveness, we introduce an abstract autoregressive model, the HIST model, in which each update depends only on constant-size internal state and the token-count histogram within the current window. We prove that this HIST model is Turing complete by showing that the evolution of the window can reveal the token that has just left the window, which suffices to simulate Turing-complete Post machines. We then construct a sliding-window transformer over a constant-size token alphabet, without PE, and show that it can simulate the HIST model. Our result demonstrates that positional encodings are not indispensable for transformers to perform universal computation: The window sliding itself already breaks permutation symmetry and captures sufficient positional information.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Sliding-window transformers without PE can be Turing complete via histogram recovery of exited tokens, but the uniqueness of that recovery needs verification.

read the letter

The punchline for this paper is that it shows how the sliding window in transformers can provide enough positional information through token count histograms to achieve Turing completeness without any positional encoding.

What the paper does is introduce the HIST model as an abstraction where each step depends on the histogram of tokens in the current window and some internal state. They prove this model is Turing complete by demonstrating that the sequence of histograms allows recovery of the token that has just left the window, which is enough to simulate a Post machine. Then they construct a specific sliding-window transformer over a constant-size alphabet that can emulate the HIST model without using positional encodings.

This is new because previous results on the Turing completeness of transformers with chain-of-thought all depended on positional encodings to break the symmetry of the context. The window mechanism alone is shown to be sufficient here.

The construction is explicit and reduces to a standard Turing-complete model, which is a strength. There is no circularity or fitting involved.

The main soft spot is whether the histogram evolution truly uniquely determines the departed token in all cases. The argument rests on this recovery being possible without ambiguity. If for some window states and token sequences two different departing tokens lead to the same future histogram path, then the simulation would fail to distinguish them and the reduction to Post machines would not hold. The paper asserts this works, but the strength of the result depends on how thoroughly they prove the injectivity of this recovery map.

This work is for people studying the theoretical capabilities of transformer architectures, particularly in the context of long-form reasoning and minimal designs for universal computation. A reader focused on alternatives to positional encodings in attention models would find it relevant.

It deserves a serious referee because the result challenges a common assumption and provides a constructive proof that can be checked.

I would recommend sending this to peer review.

Referee Report

2 major / 2 minor

Summary. The paper claims that sliding-window transformers without positional encodings (PE) remain Turing complete. It introduces the HIST model, an abstract autoregressive model whose updates depend only on a constant-size internal state and the token-count histogram in the current window. The authors prove HIST is Turing complete by arguing that the evolution of these histograms reveals the exact token that has just left the window, enabling simulation of Turing-complete Post machines. They then construct a sliding-window transformer over a constant-size alphabet, without PE, that simulates the HIST model, concluding that the window-sliding mechanism itself suffices to break permutation symmetry and capture enough positional information for universal computation.

Significance. If the constructions and the histogram-to-token recovery map hold, the result would show that PE is not required for Turing completeness in the practically relevant sliding-window regime. This would be a notable clarification of the role of positional information, as it isolates the contribution of the window mechanism itself rather than relying on explicit position encodings or chain-of-thought with positional markers. The explicit simulation route (HIST to Post machines, transformer to HIST) is a strength if the details are complete.

major comments (2)

[Proof of HIST Turing completeness (Post-machine simulation)] The central reduction for HIST Turing completeness rests on the claim that the sequence of token-count histograms uniquely determines the departed token for every reachable window state and every possible exit token (required to simulate the Post-machine tape head). No explicit inductive argument, construction of the recovery map, or enumeration of edge cases for non-injective trajectories is visible in the provided abstract or sketch; if two distinct exit tokens can produce identical histogram sequences under the update rule, the simulation fails. This is the load-bearing step for the Post-machine reduction.
[Transformer simulation of HIST model] The transformer-to-HIST simulation is asserted via a construction over a constant-size alphabet, but the manuscript must supply the precise attention and feed-forward layers that recover the histogram state without PE. If the construction relies on any implicit positional cue or fails for certain window sizes or alphabets, the universality claim does not transfer.

minor comments (2)

[HIST model definition] Clarify the precise definition of the HIST internal state and the window-update rule; the abstract states it depends on 'constant-size internal state' but does not specify its dimension or initialization.
[Introduction] The abstract mentions 'mildly breaks the permutation symmetry'; a short formal statement of the symmetry group preserved by the window mechanism versus full permutation invariance would help readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the thoughtful review and constructive comments on our work. The two major comments raise important points about the completeness of the proofs and constructions, which we address below. We will make revisions to provide more explicit details as suggested.

read point-by-point responses

Referee: [Proof of HIST Turing completeness (Post-machine simulation)] The central reduction for HIST Turing completeness rests on the claim that the sequence of token-count histograms uniquely determines the departed token for every reachable window state and every possible exit token (required to simulate the Post-machine tape head). No explicit inductive argument, construction of the recovery map, or enumeration of edge cases for non-injective trajectories is visible in the provided abstract or sketch; if two distinct exit tokens can produce identical histogram sequences under the update rule, the simulation fails. This is the load-bearing step for the Post-machine reduction.

Authors: We thank the referee for identifying this critical aspect of the proof. While the manuscript provides a sketch of the argument that the histogram evolution reveals the departed token (see the proof of Theorem 3.2), we agree that an explicit inductive argument and the recovery map construction would strengthen the presentation. In the revised manuscript, we will include a detailed inductive proof showing uniqueness for reachable states, construct the recovery function explicitly, and enumerate why non-injective cases do not arise in the simulation of Post machines. This addresses the concern that the simulation could fail if distinct exit tokens produce identical histograms. revision: yes
Referee: [Transformer simulation of HIST model] The transformer-to-HIST simulation is asserted via a construction over a constant-size alphabet, but the manuscript must supply the precise attention and feed-forward layers that recover the histogram state without PE. If the construction relies on any implicit positional cue or fails for certain window sizes or alphabets, the universality claim does not transfer.

Authors: We appreciate the referee's call for precision in the construction. The manuscript outlines how a constant-size alphabet transformer can simulate the HIST model using attention to compute token counts in the window and feed-forward layers to update the state (Section 4). However, to ensure clarity and verifiability, we will expand this section with the precise mathematical definitions of the attention weights and feed-forward functions that recover the histogram without relying on positional encodings. We confirm that the construction holds for arbitrary window sizes and the chosen alphabet, as it depends only on the histogram aggregation which is position-agnostic beyond the window boundary. revision: yes

Circularity Check

0 steps flagged

No circularity; explicit constructions establish Turing completeness independently

full rationale

The paper defines the HIST model as an abstract autoregressive process depending only on constant-size state and window histograms, then proves its Turing completeness via an explicit reduction to Post machines that relies on showing (by construction) that histogram trajectories uniquely identify departed tokens. It next gives an explicit transformer construction (over fixed alphabet, no PE) that simulates HIST. Neither step reduces a result to a fitted parameter, self-referential definition, or load-bearing self-citation; both are direct mathematical constructions whose validity can be checked against the stated assumptions without circular reduction to the target claim. The derivation is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the standard definition of Turing completeness via Post machines and on the newly introduced HIST abstraction. No numerical parameters are fitted; the only invented entity is the HIST model itself.

axioms (1)

standard math Post machines are Turing complete
Invoked to conclude that HIST is Turing complete once it can simulate a Post machine.

invented entities (1)

HIST model no independent evidence
purpose: Abstract autoregressive model whose state consists of constant-size internal memory plus token-count histogram inside the current window
Introduced to isolate the expressiveness contributed by the sliding window alone

pith-pipeline@v0.9.1-grok · 5761 in / 1414 out tokens · 31424 ms · 2026-06-28T15:46:34.933811+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

105 extracted references · 42 canonical work pages · 15 internal anchors

[1]

The Thirty Seventh Annual Conference on Learning Theory , pages=

Undetectable watermarks for language models , author=. The Thirty Seventh Annual Conference on Learning Theory , pages=. 2024 , organization=

2024
[2]

arXiv preprint arXiv:2406.02633 , year=

Edit Distance Robust Watermarks for Language Models , author=. arXiv preprint arXiv:2406.02633 , year=

work page arXiv
[3]

arXiv preprint arXiv:2310.18491 , year=

Publicly-detectable watermarking for language models , author=. arXiv preprint arXiv:2310.18491 , year=

work page arXiv
[4]

Proceedings of the 57th Annual ACM Symposium on Theory of Computing , pages=

Ideal pseudorandom codes , author=. Proceedings of the 57th Annual ACM Symposium on Theory of Computing , pages=
[5]

arXiv preprint arXiv:2409.07580 , year=

New constructions of pseudorandom codes , author=. arXiv preprint arXiv:2409.07580 , year=

work page arXiv
[6]

Annual International Cryptology Conference , pages=

Pseudorandom error-correcting codes , author=. Annual International Cryptology Conference , pages=. 2024 , organization=

2024
[7]

arXiv preprint arXiv:2402.12875 , volume=

Chain of thought empowers transformers to solve inherently serial problems , author=. arXiv preprint arXiv:2402.12875 , volume=

work page arXiv
[8]

arXiv preprint arXiv:2412.02975 , year=

Theoretical limitations of multi-layer transformer , author=. arXiv preprint arXiv:2412.02975 , year=

work page arXiv
[9]

Advances in neural information processing systems , volume=

Attention is all you need , author=. Advances in neural information processing systems , volume=
[10]

Advances in neural information processing systems , volume=

Chain-of-thought prompting elicits reasoning in large language models , author=. Advances in neural information processing systems , volume=
[11]

Neurocomputing , volume=

Roformer: Enhanced transformer with rotary position embedding , author=. Neurocomputing , volume=. 2024 , publisher=

2024
[12]

The Linear Representation Hypothesis and the Geometry of Large Language Models

The linear representation hypothesis and the geometry of large language models , author=. arXiv preprint arXiv:2311.03658 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[13]

arXiv preprint arXiv:2406.14197 , year=

On the representational capacity of neural language models with chain-of-thought reasoning , author=. arXiv preprint arXiv:2406.14197 , year=

work page arXiv
[14]

Submitted to ICLR 2026 , year=

Softmax Transformers are Turing-Complete , author=. Submitted to ICLR 2026 , year=

2026
[15]

arXiv preprint arXiv:2310.07923 , year=

The expressive power of transformers with chain of thought , author=. arXiv preprint arXiv:2310.07923 , year=

work page arXiv
[16]

arXiv preprint arXiv:2411.01992 , year=

Ask, and it shall be given: On the Turing completeness of prompting , author=. arXiv preprint arXiv:2411.01992 , year=

work page arXiv
[17]

Journal of Machine Learning Research , volume=

Attention is turing-complete , author=. Journal of Machine Learning Research , volume=
[18]

Unique Hard Attention: A Tale of Two Sides , url =

Jerad, Selim and Svete, Anej and Li, Jiaoda and Cotterell, Ryan. Unique Hard Attention: A Tale of Two Sides. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2025. doi:10.18653/v1/2025.acl-short.76

work page doi:10.18653/v1/2025.acl-short.76 2025
[19]

arXiv preprint arXiv:2502.07266 , year=

When more is less: Understanding chain-of-thought length in llms , author=. arXiv preprint arXiv:2502.07266 , year=

work page arXiv
[20]

arXiv preprint arXiv:2502.18080 , year=

Towards thinking-optimal scaling of test-time compute for llm reasoning , author=. arXiv preprint arXiv:2502.18080 , year=

work page arXiv
[21]

arXiv preprint arXiv:2401.04925 , year=

The impact of reasoning step length on large language models , author=. arXiv preprint arXiv:2401.04925 , year=

work page arXiv
[22]

Journal of the ACM (JACM) , volume=

Relations among complexity measures , author=. Journal of the ACM (JACM) , volume=. 1979 , publisher=

1979
[23]

Gemini: A Family of Highly Capable Multimodal Models

Gemini: a family of highly capable multimodal models , author=. arXiv preprint arXiv:2312.11805 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[24]

OpenAI o1-mini: Advancing Cost-Efficient Reasoning , year =
[25]

Learning to Reason with LLMs , year =
[26]

GPT-4 Technical Report

GPT-4 Technical Report , author =. arXiv preprint arXiv:2303.08774 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[27]

and Jauhri, A

Dubey, A. and Jauhri, A. and Pandey, A. and Kadian, A. and Al-Dahle, A. and Letman, A. and Mathur, A. and Schelten, A. and Yang, A. and Fan, A. and others , journal =. The. 2024 , archivePrefix =

2024
[28]

2024 , howpublished =

The. 2024 , howpublished =

2024
[29]

2024 , howpublished =

Introducing. 2024 , howpublished =

2024
[30]

2025 , howpublished =

DeepSeek-R1: Incentivizing Reasoning Capability in. 2025 , howpublished =

2025
[31]

2024 , date =

Hessam Akhlaghpour , title =. 2024 , date =

2024
[32]

arXiv preprint arXiv:2502.06773 , year =

On the Emergence of Thinking in LLMs I: Searching for the Right Intuition , author =. arXiv preprint arXiv:2502.06773 , year =

work page arXiv
[33]

Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models

Towards large reasoning models: A survey of reinforced reasoning with large language models , author=. arXiv preprint arXiv:2501.09686 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[34]

The journal of symbolic logic , volume=

Finite combinatory processes—formulation1 , author=. The journal of symbolic logic , volume=. 1936 , publisher=

1936
[35]

Advances in neural information processing systems , volume=

Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting , author=. Advances in neural information processing systems , volume=
[36]

arXiv preprint arXiv:2503.03588 , year=

PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention , author=. arXiv preprint arXiv:2503.03588 , year=

work page arXiv
[37]

The Twelfth International Conference on Learning Representations,

William Merrill and Ashish Sabharwal , title =. The Twelfth International Conference on Learning Representations,
[38]

CoRR , volume =

Ruizhong Qiu and Zhe Xu and Wenxuan Bao and Hanghang Tong , title =. CoRR , volume =
[39]

Attention is Turing-Complete , journal =

Jorge P. Attention is Turing-Complete , journal =. 2021 , url =

2021
[40]

The Twelfth International Conference on Learning Representations,

Zhiyuan Li and Hong Liu and Denny Zhou and Tengyu Ma , title =. The Twelfth International Conference on Learning Representations,
[41]

CoRR , volume =

William Merrill and Ashish Sabharwal , title =. CoRR , volume =
[42]

OpenAI blog , volume=

Language models are unsupervised multitask learners , author=. OpenAI blog , volume=
[43]

Transactions of the Association for Computational Linguistics , volume=

The parallelism tradeoff: Limitations of log-precision transformers , author=. Transactions of the Association for Computational Linguistics , volume=
[44]

Proceedings of the 24th Conference on Computational Natural Language Learning , pages=

On the Computational Power of Transformers and Its Implications in Sequence Modeling , author=. Proceedings of the 24th Conference on Computational Natural Language Learning , pages=
[45]

arXiv preprint arXiv:2503.14337 , year=

PENCIL: Long thoughts with short memory , author=. arXiv preprint arXiv:2503.14337 , year=

work page arXiv
[46]

arXiv preprint arXiv:2405.16674 , year=

Limits of deep learning: Sequence modeling through the lens of complexity theory , author=. arXiv preprint arXiv:2405.16674 , year=

work page arXiv
[47]

2005 , publisher=

Universal artificial intelligence: Sequential decisions based on algorithmic probability , author=. 2005 , publisher=

2005
[48]

1994 , publisher=

Computability, complexity, and languages: fundamentals of theoretical computer science , author=. 1994 , publisher=

1994
[49]

Evaluating & Ranking GPT-5 Reasoning Ability , author=
[50]

GPT-5 New Params and Tools , author=
[51]

2024 , month =

Learning to Reason with. 2024 , month =

2024
[52]

Advanced version of Gemini with Deep Think officially achieves gold-medal standard at the

Luong, Thang and Lockhart, Edward , year =. Advanced version of Gemini with Deep Think officially achieves gold-medal standard at the
[53]

The Llama 3 Herd of Models

The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[54]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning , author=. arXiv preprint arXiv:2501.12948 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[55]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Swe-bench: Can language models resolve real-world github issues? , author=. arXiv preprint arXiv:2310.06770 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[56]

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

Gpqa: A graduate-level google-proof q&a benchmark , author=. arXiv preprint arXiv:2311.12022 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[57]

Measuring Mathematical Problem Solving With the MATH Dataset

Measuring mathematical problem solving with the math dataset , author=. arXiv preprint arXiv:2103.03874 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[58]

Problemy peredachi informatsii , volume=

Universal sequential search problems , author=. Problemy peredachi informatsii , volume=
[59]

Information and Control , volume=

Randomness conservation inequalities; information and independence in mathematical theories , author=. Information and Control , volume=. 1984 , publisher=

1984
[60]

Machine Learning , volume=

Optimal ordered problem solver , author=. Machine Learning , volume=. 2004 , publisher=

2004
[61]

Theoretical Computer Science , volume=

PSPACE-completeness of sliding-block puzzles and other problems through the nondeterministic constraint logic model of computation , author=. Theoretical Computer Science , volume=. 2005 , publisher=

2005
[62]

Journal of the ACM (JACM) , volume=

The complexity of propositional linear temporal logics , author=. Journal of the ACM (JACM) , volume=. 1985 , publisher=

1985
[63]

Journal of Computer and System Sciences , volume=

On the complexity of k-SAT , author=. Journal of Computer and System Sciences , volume=. 2001 , publisher=

2001
[64]

2009 , publisher=

Computational complexity: a modern approach , author=. 2009 , publisher=

2009
[65]

2024 , note =

Meta AI, Open Source, Limits of LLMs, AGI & the Future of AI , author =. 2024 , note =

2024
[66]

Micah Goldblum and Marc Anton Finzi and Keefer Rowan and Andrew Gordon Wilson , title =
[67]

Advances in Neural Information Processing Systems , volume=

Towards revealing the mystery behind chain of thought: a theoretical perspective , author=. Advances in Neural Information Processing Systems , volume=
[68]

arXiv preprint arXiv:2502.17779 , year=

Simulating Time With Square-Root Space , author=. arXiv preprint arXiv:2502.17779 , year=

work page arXiv
[69]

Ramamohan Paturi , title =
[70]

arXiv preprint arXiv:2506.12027 , year=

Constant Bit-size Transformers Are Turing Complete , author=. arXiv preprint arXiv:2506.12027 , year=

work page arXiv
[71]

Transactions of the Association for Computational Linguistics , volume=

What formal languages can transformers express? a survey , author=. Transactions of the Association for Computational Linguistics , volume=
[72]

On the Power of Several Queues , journal =

Martin H. On the Power of Several Queues , journal =
[73]

SIAM Journal on Computing , volume=

The power of the queue , author=. SIAM Journal on Computing , volume=. 1992 , publisher=

1992
[74]

Some Remarks on Lower Bounds for Queue Machines (Preliminary Report)

Some Remarks on Lower Bounds for Queue Machines (Preliminary Report) , author=. arXiv preprint arXiv:1310.6398 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[75]

Advances in Neural Information Processing Systems , volume=

Masked hard-attention transformers recognize exactly the star-free languages , author=. Advances in Neural Information Processing Systems , volume=
[76]

Advances in neural information processing systems , volume=

A logic for expressing log-precision transformers , author=. Advances in neural information processing systems , volume=
[77]

Self-Attention Networks Can Process Bounded Hierarchical Languages , author=. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) , pages=
[78]

International Conference on Machine Learning , pages=

Tighter bounds on the expressivity of transformer encoders , author=. International Conference on Machine Learning , pages=. 2023 , organization=

2023
[79]

The Twelfth International Conference on Learning Representations , year=

Logical Languages Accepted by Transformer Encoders with Hard Attention , author=. The Twelfth International Conference on Learning Representations , year=
[80]

Michael Hahn , title =. Trans. Assoc. Comput. Linguistics , volume =

Showing first 80 references.

[1] [1]

The Thirty Seventh Annual Conference on Learning Theory , pages=

Undetectable watermarks for language models , author=. The Thirty Seventh Annual Conference on Learning Theory , pages=. 2024 , organization=

2024

[2] [2]

arXiv preprint arXiv:2406.02633 , year=

Edit Distance Robust Watermarks for Language Models , author=. arXiv preprint arXiv:2406.02633 , year=

work page arXiv

[3] [3]

arXiv preprint arXiv:2310.18491 , year=

Publicly-detectable watermarking for language models , author=. arXiv preprint arXiv:2310.18491 , year=

work page arXiv

[4] [4]

Proceedings of the 57th Annual ACM Symposium on Theory of Computing , pages=

Ideal pseudorandom codes , author=. Proceedings of the 57th Annual ACM Symposium on Theory of Computing , pages=

[5] [5]

arXiv preprint arXiv:2409.07580 , year=

New constructions of pseudorandom codes , author=. arXiv preprint arXiv:2409.07580 , year=

work page arXiv

[6] [6]

Annual International Cryptology Conference , pages=

Pseudorandom error-correcting codes , author=. Annual International Cryptology Conference , pages=. 2024 , organization=

2024

[7] [7]

arXiv preprint arXiv:2402.12875 , volume=

Chain of thought empowers transformers to solve inherently serial problems , author=. arXiv preprint arXiv:2402.12875 , volume=

work page arXiv

[8] [8]

arXiv preprint arXiv:2412.02975 , year=

Theoretical limitations of multi-layer transformer , author=. arXiv preprint arXiv:2412.02975 , year=

work page arXiv

[9] [9]

Advances in neural information processing systems , volume=

Attention is all you need , author=. Advances in neural information processing systems , volume=

[10] [10]

Advances in neural information processing systems , volume=

Chain-of-thought prompting elicits reasoning in large language models , author=. Advances in neural information processing systems , volume=

[11] [11]

Neurocomputing , volume=

Roformer: Enhanced transformer with rotary position embedding , author=. Neurocomputing , volume=. 2024 , publisher=

2024

[12] [12]

The Linear Representation Hypothesis and the Geometry of Large Language Models

The linear representation hypothesis and the geometry of large language models , author=. arXiv preprint arXiv:2311.03658 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

arXiv preprint arXiv:2406.14197 , year=

On the representational capacity of neural language models with chain-of-thought reasoning , author=. arXiv preprint arXiv:2406.14197 , year=

work page arXiv

[14] [14]

Submitted to ICLR 2026 , year=

Softmax Transformers are Turing-Complete , author=. Submitted to ICLR 2026 , year=

2026

[15] [15]

arXiv preprint arXiv:2310.07923 , year=

The expressive power of transformers with chain of thought , author=. arXiv preprint arXiv:2310.07923 , year=

work page arXiv

[16] [16]

arXiv preprint arXiv:2411.01992 , year=

Ask, and it shall be given: On the Turing completeness of prompting , author=. arXiv preprint arXiv:2411.01992 , year=

work page arXiv

[17] [17]

Journal of Machine Learning Research , volume=

Attention is turing-complete , author=. Journal of Machine Learning Research , volume=

[18] [18]

Unique Hard Attention: A Tale of Two Sides , url =

Jerad, Selim and Svete, Anej and Li, Jiaoda and Cotterell, Ryan. Unique Hard Attention: A Tale of Two Sides. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2025. doi:10.18653/v1/2025.acl-short.76

work page doi:10.18653/v1/2025.acl-short.76 2025

[19] [19]

arXiv preprint arXiv:2502.07266 , year=

When more is less: Understanding chain-of-thought length in llms , author=. arXiv preprint arXiv:2502.07266 , year=

work page arXiv

[20] [20]

arXiv preprint arXiv:2502.18080 , year=

Towards thinking-optimal scaling of test-time compute for llm reasoning , author=. arXiv preprint arXiv:2502.18080 , year=

work page arXiv

[21] [21]

arXiv preprint arXiv:2401.04925 , year=

The impact of reasoning step length on large language models , author=. arXiv preprint arXiv:2401.04925 , year=

work page arXiv

[22] [22]

Journal of the ACM (JACM) , volume=

Relations among complexity measures , author=. Journal of the ACM (JACM) , volume=. 1979 , publisher=

1979

[23] [23]

Gemini: A Family of Highly Capable Multimodal Models

Gemini: a family of highly capable multimodal models , author=. arXiv preprint arXiv:2312.11805 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[24] [24]

OpenAI o1-mini: Advancing Cost-Efficient Reasoning , year =

[25] [25]

Learning to Reason with LLMs , year =

[26] [26]

GPT-4 Technical Report

GPT-4 Technical Report , author =. arXiv preprint arXiv:2303.08774 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[27] [27]

and Jauhri, A

Dubey, A. and Jauhri, A. and Pandey, A. and Kadian, A. and Al-Dahle, A. and Letman, A. and Mathur, A. and Schelten, A. and Yang, A. and Fan, A. and others , journal =. The. 2024 , archivePrefix =

2024

[28] [28]

2024 , howpublished =

The. 2024 , howpublished =

2024

[29] [29]

2024 , howpublished =

Introducing. 2024 , howpublished =

2024

[30] [30]

2025 , howpublished =

DeepSeek-R1: Incentivizing Reasoning Capability in. 2025 , howpublished =

2025

[31] [31]

2024 , date =

Hessam Akhlaghpour , title =. 2024 , date =

2024

[32] [32]

arXiv preprint arXiv:2502.06773 , year =

On the Emergence of Thinking in LLMs I: Searching for the Right Intuition , author =. arXiv preprint arXiv:2502.06773 , year =

work page arXiv

[33] [33]

Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models

Towards large reasoning models: A survey of reinforced reasoning with large language models , author=. arXiv preprint arXiv:2501.09686 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[34] [34]

The journal of symbolic logic , volume=

Finite combinatory processes—formulation1 , author=. The journal of symbolic logic , volume=. 1936 , publisher=

1936

[35] [35]

Advances in neural information processing systems , volume=

Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting , author=. Advances in neural information processing systems , volume=

[36] [36]

arXiv preprint arXiv:2503.03588 , year=

PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention , author=. arXiv preprint arXiv:2503.03588 , year=

work page arXiv

[37] [37]

The Twelfth International Conference on Learning Representations,

William Merrill and Ashish Sabharwal , title =. The Twelfth International Conference on Learning Representations,

[38] [38]

CoRR , volume =

Ruizhong Qiu and Zhe Xu and Wenxuan Bao and Hanghang Tong , title =. CoRR , volume =

[39] [39]

Attention is Turing-Complete , journal =

Jorge P. Attention is Turing-Complete , journal =. 2021 , url =

2021

[40] [40]

The Twelfth International Conference on Learning Representations,

Zhiyuan Li and Hong Liu and Denny Zhou and Tengyu Ma , title =. The Twelfth International Conference on Learning Representations,

[41] [41]

CoRR , volume =

William Merrill and Ashish Sabharwal , title =. CoRR , volume =

[42] [42]

OpenAI blog , volume=

Language models are unsupervised multitask learners , author=. OpenAI blog , volume=

[43] [43]

Transactions of the Association for Computational Linguistics , volume=

The parallelism tradeoff: Limitations of log-precision transformers , author=. Transactions of the Association for Computational Linguistics , volume=

[44] [44]

Proceedings of the 24th Conference on Computational Natural Language Learning , pages=

On the Computational Power of Transformers and Its Implications in Sequence Modeling , author=. Proceedings of the 24th Conference on Computational Natural Language Learning , pages=

[45] [45]

arXiv preprint arXiv:2503.14337 , year=

PENCIL: Long thoughts with short memory , author=. arXiv preprint arXiv:2503.14337 , year=

work page arXiv

[46] [46]

arXiv preprint arXiv:2405.16674 , year=

Limits of deep learning: Sequence modeling through the lens of complexity theory , author=. arXiv preprint arXiv:2405.16674 , year=

work page arXiv

[47] [47]

2005 , publisher=

Universal artificial intelligence: Sequential decisions based on algorithmic probability , author=. 2005 , publisher=

2005

[48] [48]

1994 , publisher=

Computability, complexity, and languages: fundamentals of theoretical computer science , author=. 1994 , publisher=

1994

[49] [49]

Evaluating & Ranking GPT-5 Reasoning Ability , author=

[50] [50]

GPT-5 New Params and Tools , author=

[51] [51]

2024 , month =

Learning to Reason with. 2024 , month =

2024

[52] [52]

Advanced version of Gemini with Deep Think officially achieves gold-medal standard at the

Luong, Thang and Lockhart, Edward , year =. Advanced version of Gemini with Deep Think officially achieves gold-medal standard at the

[53] [53]

The Llama 3 Herd of Models

The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[54] [54]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning , author=. arXiv preprint arXiv:2501.12948 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[55] [55]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Swe-bench: Can language models resolve real-world github issues? , author=. arXiv preprint arXiv:2310.06770 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[56] [56]

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

Gpqa: A graduate-level google-proof q&a benchmark , author=. arXiv preprint arXiv:2311.12022 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[57] [57]

Measuring Mathematical Problem Solving With the MATH Dataset

Measuring mathematical problem solving with the math dataset , author=. arXiv preprint arXiv:2103.03874 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[58] [58]

Problemy peredachi informatsii , volume=

Universal sequential search problems , author=. Problemy peredachi informatsii , volume=

[59] [59]

Information and Control , volume=

Randomness conservation inequalities; information and independence in mathematical theories , author=. Information and Control , volume=. 1984 , publisher=

1984

[60] [60]

Machine Learning , volume=

Optimal ordered problem solver , author=. Machine Learning , volume=. 2004 , publisher=

2004

[61] [61]

Theoretical Computer Science , volume=

PSPACE-completeness of sliding-block puzzles and other problems through the nondeterministic constraint logic model of computation , author=. Theoretical Computer Science , volume=. 2005 , publisher=

2005

[62] [62]

Journal of the ACM (JACM) , volume=

The complexity of propositional linear temporal logics , author=. Journal of the ACM (JACM) , volume=. 1985 , publisher=

1985

[63] [63]

Journal of Computer and System Sciences , volume=

On the complexity of k-SAT , author=. Journal of Computer and System Sciences , volume=. 2001 , publisher=

2001

[64] [64]

2009 , publisher=

Computational complexity: a modern approach , author=. 2009 , publisher=

2009

[65] [65]

2024 , note =

Meta AI, Open Source, Limits of LLMs, AGI & the Future of AI , author =. 2024 , note =

2024

[66] [66]

Micah Goldblum and Marc Anton Finzi and Keefer Rowan and Andrew Gordon Wilson , title =

[67] [67]

Advances in Neural Information Processing Systems , volume=

Towards revealing the mystery behind chain of thought: a theoretical perspective , author=. Advances in Neural Information Processing Systems , volume=

[68] [68]

arXiv preprint arXiv:2502.17779 , year=

Simulating Time With Square-Root Space , author=. arXiv preprint arXiv:2502.17779 , year=

work page arXiv

[69] [69]

Ramamohan Paturi , title =

[70] [70]

arXiv preprint arXiv:2506.12027 , year=

Constant Bit-size Transformers Are Turing Complete , author=. arXiv preprint arXiv:2506.12027 , year=

work page arXiv

[71] [71]

Transactions of the Association for Computational Linguistics , volume=

What formal languages can transformers express? a survey , author=. Transactions of the Association for Computational Linguistics , volume=

[72] [72]

On the Power of Several Queues , journal =

Martin H. On the Power of Several Queues , journal =

[73] [73]

SIAM Journal on Computing , volume=

The power of the queue , author=. SIAM Journal on Computing , volume=. 1992 , publisher=

1992

[74] [74]

Some Remarks on Lower Bounds for Queue Machines (Preliminary Report)

Some Remarks on Lower Bounds for Queue Machines (Preliminary Report) , author=. arXiv preprint arXiv:1310.6398 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[75] [75]

Advances in Neural Information Processing Systems , volume=

Masked hard-attention transformers recognize exactly the star-free languages , author=. Advances in Neural Information Processing Systems , volume=

[76] [76]

Advances in neural information processing systems , volume=

A logic for expressing log-precision transformers , author=. Advances in neural information processing systems , volume=

[77] [77]

Self-Attention Networks Can Process Bounded Hierarchical Languages , author=. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) , pages=

[78] [78]

International Conference on Machine Learning , pages=

Tighter bounds on the expressivity of transformer encoders , author=. International Conference on Machine Learning , pages=. 2023 , organization=

2023

[79] [79]

The Twelfth International Conference on Learning Representations , year=

Logical Languages Accepted by Transformer Encoders with Hard Attention , author=. The Twelfth International Conference on Learning Representations , year=

[80] [80]

Michael Hahn , title =. Trans. Assoc. Comput. Linguistics , volume =