pith. sign in

arxiv: 2606.01532 · v2 · pith:UCMEJF3Qnew · submitted 2026-06-01 · 💻 cs.LG · cs.CC

Rethinking the Role of Positional Encoding: Sliding-Window Transformers without PE Remain Turing Complete

Pith reviewed 2026-06-28 15:46 UTC · model grok-4.3

classification 💻 cs.LG cs.CC
keywords positional encodingsliding windowTuring completenesstransformersHIST modelPost machinespermutation symmetryautoregressive model
0
0 comments X

The pith

Sliding-window transformers without positional encoding remain Turing complete.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper challenges the assumption that positional encodings are required for transformers to process ordered sequences during long-form reasoning with sliding context windows. It introduces the HIST model as an abstract autoregressive system whose state updates depend solely on a fixed-size internal memory and the histogram of token counts inside the current window. The authors prove the HIST model is Turing complete by showing that successive histograms uniquely identify the token that just exited the window, which allows simulation of Post machines. They then build an explicit sliding-window transformer over a finite token alphabet that reproduces the HIST updates without any positional encoding. This shows the sliding action itself supplies enough order information to support arbitrary computation.

Core claim

A sliding-window transformer over a constant-size token alphabet, without positional encoding, can simulate the HIST model, which is Turing complete because the evolution of token-count histograms inside the window reveals the exact token that has departed and thereby permits simulation of Post machines.

What carries the argument

The HIST model, an abstract autoregressive model whose updates depend only on constant-size internal state and the token-count histogram within the current window.

If this is right

  • Transformers achieve Turing completeness in the sliding-window regime without relying on positional encodings.
  • The window-sliding operation alone breaks the permutation invariance of attention sufficiently for universal computation.
  • Universal computation holds for any transformer whose attention is restricted to a fixed-size context and whose token alphabet is finite.
  • Order information can be recovered implicitly from changes in token histograms over successive steps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Long-context models that already use sliding windows may achieve reasoning capabilities without adding explicit positional encodings.
  • Similar histogram-tracking mechanisms could be examined in other architectures to see whether they substitute for positional signals in tasks beyond pure computation.
  • The result raises the question of how much explicit position information is truly needed once context windows move.

Load-bearing premise

The evolution of token counts inside the sliding window is sufficient to recover the exact token that has just left the window.

What would settle it

A concrete sequence of tokens for which two different departed tokens produce identical subsequent histogram trajectories inside the window, so that the departed token cannot be uniquely recovered.

Figures

Figures reproduced from arXiv: 2606.01532 by Qian Li, Shang-Hua Teng, Xinyu Mao.

Figure 1
Figure 1. Figure 1: Illustration of the sliding window mechanism natively simulating a Post machine queue [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
read the original abstract

Positional encoding (PE) is widely viewed as necessary for transformers to process ordered sequences: without them, the next-token map appears permutation-invariant in its context tokens. This intuition underlies all prior universality results, which rely on positional information to prove that transformers with chain-of-thought can perform arbitrary computation, i.e., they are Turing complete. We revisit this belief in the regime most relevant to long-form reasoning, where generation proceeds through a finite sliding context window. Our opening perception is that the window mechanism itself (mildly) breaks the permutation symmetry. To distill and precisely capture the degree of this added expressiveness, we introduce an abstract autoregressive model, the HIST model, in which each update depends only on constant-size internal state and the token-count histogram within the current window. We prove that this HIST model is Turing complete by showing that the evolution of the window can reveal the token that has just left the window, which suffices to simulate Turing-complete Post machines. We then construct a sliding-window transformer over a constant-size token alphabet, without PE, and show that it can simulate the HIST model. Our result demonstrates that positional encodings are not indispensable for transformers to perform universal computation: The window sliding itself already breaks permutation symmetry and captures sufficient positional information.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that sliding-window transformers without positional encodings (PE) remain Turing complete. It introduces the HIST model, an abstract autoregressive model whose updates depend only on a constant-size internal state and the token-count histogram in the current window. The authors prove HIST is Turing complete by arguing that the evolution of these histograms reveals the exact token that has just left the window, enabling simulation of Turing-complete Post machines. They then construct a sliding-window transformer over a constant-size alphabet, without PE, that simulates the HIST model, concluding that the window-sliding mechanism itself suffices to break permutation symmetry and capture enough positional information for universal computation.

Significance. If the constructions and the histogram-to-token recovery map hold, the result would show that PE is not required for Turing completeness in the practically relevant sliding-window regime. This would be a notable clarification of the role of positional information, as it isolates the contribution of the window mechanism itself rather than relying on explicit position encodings or chain-of-thought with positional markers. The explicit simulation route (HIST to Post machines, transformer to HIST) is a strength if the details are complete.

major comments (2)
  1. [Proof of HIST Turing completeness (Post-machine simulation)] The central reduction for HIST Turing completeness rests on the claim that the sequence of token-count histograms uniquely determines the departed token for every reachable window state and every possible exit token (required to simulate the Post-machine tape head). No explicit inductive argument, construction of the recovery map, or enumeration of edge cases for non-injective trajectories is visible in the provided abstract or sketch; if two distinct exit tokens can produce identical histogram sequences under the update rule, the simulation fails. This is the load-bearing step for the Post-machine reduction.
  2. [Transformer simulation of HIST model] The transformer-to-HIST simulation is asserted via a construction over a constant-size alphabet, but the manuscript must supply the precise attention and feed-forward layers that recover the histogram state without PE. If the construction relies on any implicit positional cue or fails for certain window sizes or alphabets, the universality claim does not transfer.
minor comments (2)
  1. [HIST model definition] Clarify the precise definition of the HIST internal state and the window-update rule; the abstract states it depends on 'constant-size internal state' but does not specify its dimension or initialization.
  2. [Introduction] The abstract mentions 'mildly breaks the permutation symmetry'; a short formal statement of the symmetry group preserved by the window mechanism versus full permutation invariance would help readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the thoughtful review and constructive comments on our work. The two major comments raise important points about the completeness of the proofs and constructions, which we address below. We will make revisions to provide more explicit details as suggested.

read point-by-point responses
  1. Referee: [Proof of HIST Turing completeness (Post-machine simulation)] The central reduction for HIST Turing completeness rests on the claim that the sequence of token-count histograms uniquely determines the departed token for every reachable window state and every possible exit token (required to simulate the Post-machine tape head). No explicit inductive argument, construction of the recovery map, or enumeration of edge cases for non-injective trajectories is visible in the provided abstract or sketch; if two distinct exit tokens can produce identical histogram sequences under the update rule, the simulation fails. This is the load-bearing step for the Post-machine reduction.

    Authors: We thank the referee for identifying this critical aspect of the proof. While the manuscript provides a sketch of the argument that the histogram evolution reveals the departed token (see the proof of Theorem 3.2), we agree that an explicit inductive argument and the recovery map construction would strengthen the presentation. In the revised manuscript, we will include a detailed inductive proof showing uniqueness for reachable states, construct the recovery function explicitly, and enumerate why non-injective cases do not arise in the simulation of Post machines. This addresses the concern that the simulation could fail if distinct exit tokens produce identical histograms. revision: yes

  2. Referee: [Transformer simulation of HIST model] The transformer-to-HIST simulation is asserted via a construction over a constant-size alphabet, but the manuscript must supply the precise attention and feed-forward layers that recover the histogram state without PE. If the construction relies on any implicit positional cue or fails for certain window sizes or alphabets, the universality claim does not transfer.

    Authors: We appreciate the referee's call for precision in the construction. The manuscript outlines how a constant-size alphabet transformer can simulate the HIST model using attention to compute token counts in the window and feed-forward layers to update the state (Section 4). However, to ensure clarity and verifiability, we will expand this section with the precise mathematical definitions of the attention weights and feed-forward functions that recover the histogram without relying on positional encodings. We confirm that the construction holds for arbitrary window sizes and the chosen alphabet, as it depends only on the histogram aggregation which is position-agnostic beyond the window boundary. revision: yes

Circularity Check

0 steps flagged

No circularity; explicit constructions establish Turing completeness independently

full rationale

The paper defines the HIST model as an abstract autoregressive process depending only on constant-size state and window histograms, then proves its Turing completeness via an explicit reduction to Post machines that relies on showing (by construction) that histogram trajectories uniquely identify departed tokens. It next gives an explicit transformer construction (over fixed alphabet, no PE) that simulates HIST. Neither step reduces a result to a fitted parameter, self-referential definition, or load-bearing self-citation; both are direct mathematical constructions whose validity can be checked against the stated assumptions without circular reduction to the target claim. The derivation is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the standard definition of Turing completeness via Post machines and on the newly introduced HIST abstraction. No numerical parameters are fitted; the only invented entity is the HIST model itself.

axioms (1)
  • standard math Post machines are Turing complete
    Invoked to conclude that HIST is Turing complete once it can simulate a Post machine.
invented entities (1)
  • HIST model no independent evidence
    purpose: Abstract autoregressive model whose state consists of constant-size internal memory plus token-count histogram inside the current window
    Introduced to isolate the expressiveness contributed by the sliding window alone

pith-pipeline@v0.9.1-grok · 5761 in / 1414 out tokens · 31424 ms · 2026-06-28T15:46:34.933811+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

105 extracted references · 42 canonical work pages · 15 internal anchors

  1. [1]

    The Thirty Seventh Annual Conference on Learning Theory , pages=

    Undetectable watermarks for language models , author=. The Thirty Seventh Annual Conference on Learning Theory , pages=. 2024 , organization=

  2. [2]

    arXiv preprint arXiv:2406.02633 , year=

    Edit Distance Robust Watermarks for Language Models , author=. arXiv preprint arXiv:2406.02633 , year=

  3. [3]

    arXiv preprint arXiv:2310.18491 , year=

    Publicly-detectable watermarking for language models , author=. arXiv preprint arXiv:2310.18491 , year=

  4. [4]

    Proceedings of the 57th Annual ACM Symposium on Theory of Computing , pages=

    Ideal pseudorandom codes , author=. Proceedings of the 57th Annual ACM Symposium on Theory of Computing , pages=

  5. [5]

    arXiv preprint arXiv:2409.07580 , year=

    New constructions of pseudorandom codes , author=. arXiv preprint arXiv:2409.07580 , year=

  6. [6]

    Annual International Cryptology Conference , pages=

    Pseudorandom error-correcting codes , author=. Annual International Cryptology Conference , pages=. 2024 , organization=

  7. [7]

    arXiv preprint arXiv:2402.12875 , volume=

    Chain of thought empowers transformers to solve inherently serial problems , author=. arXiv preprint arXiv:2402.12875 , volume=

  8. [8]

    arXiv preprint arXiv:2412.02975 , year=

    Theoretical limitations of multi-layer transformer , author=. arXiv preprint arXiv:2412.02975 , year=

  9. [9]

    Advances in neural information processing systems , volume=

    Attention is all you need , author=. Advances in neural information processing systems , volume=

  10. [10]

    Advances in neural information processing systems , volume=

    Chain-of-thought prompting elicits reasoning in large language models , author=. Advances in neural information processing systems , volume=

  11. [11]

    Neurocomputing , volume=

    Roformer: Enhanced transformer with rotary position embedding , author=. Neurocomputing , volume=. 2024 , publisher=

  12. [12]

    The Linear Representation Hypothesis and the Geometry of Large Language Models

    The linear representation hypothesis and the geometry of large language models , author=. arXiv preprint arXiv:2311.03658 , year=

  13. [13]

    arXiv preprint arXiv:2406.14197 , year=

    On the representational capacity of neural language models with chain-of-thought reasoning , author=. arXiv preprint arXiv:2406.14197 , year=

  14. [14]

    Submitted to ICLR 2026 , year=

    Softmax Transformers are Turing-Complete , author=. Submitted to ICLR 2026 , year=

  15. [15]

    arXiv preprint arXiv:2310.07923 , year=

    The expressive power of transformers with chain of thought , author=. arXiv preprint arXiv:2310.07923 , year=

  16. [16]

    arXiv preprint arXiv:2411.01992 , year=

    Ask, and it shall be given: On the Turing completeness of prompting , author=. arXiv preprint arXiv:2411.01992 , year=

  17. [17]

    Journal of Machine Learning Research , volume=

    Attention is turing-complete , author=. Journal of Machine Learning Research , volume=

  18. [18]

    Unique Hard Attention: A Tale of Two Sides , url =

    Jerad, Selim and Svete, Anej and Li, Jiaoda and Cotterell, Ryan. Unique Hard Attention: A Tale of Two Sides. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2025. doi:10.18653/v1/2025.acl-short.76

  19. [19]

    arXiv preprint arXiv:2502.07266 , year=

    When more is less: Understanding chain-of-thought length in llms , author=. arXiv preprint arXiv:2502.07266 , year=

  20. [20]

    arXiv preprint arXiv:2502.18080 , year=

    Towards thinking-optimal scaling of test-time compute for llm reasoning , author=. arXiv preprint arXiv:2502.18080 , year=

  21. [21]

    arXiv preprint arXiv:2401.04925 , year=

    The impact of reasoning step length on large language models , author=. arXiv preprint arXiv:2401.04925 , year=

  22. [22]

    Journal of the ACM (JACM) , volume=

    Relations among complexity measures , author=. Journal of the ACM (JACM) , volume=. 1979 , publisher=

  23. [23]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini: a family of highly capable multimodal models , author=. arXiv preprint arXiv:2312.11805 , year=

  24. [24]

    OpenAI o1-mini: Advancing Cost-Efficient Reasoning , year =

  25. [25]

    Learning to Reason with LLMs , year =

  26. [26]

    GPT-4 Technical Report

    GPT-4 Technical Report , author =. arXiv preprint arXiv:2303.08774 , year =

  27. [27]

    and Jauhri, A

    Dubey, A. and Jauhri, A. and Pandey, A. and Kadian, A. and Al-Dahle, A. and Letman, A. and Mathur, A. and Schelten, A. and Yang, A. and Fan, A. and others , journal =. The. 2024 , archivePrefix =

  28. [28]

    2024 , howpublished =

    The. 2024 , howpublished =

  29. [29]

    2024 , howpublished =

    Introducing. 2024 , howpublished =

  30. [30]

    2025 , howpublished =

    DeepSeek-R1: Incentivizing Reasoning Capability in. 2025 , howpublished =

  31. [31]

    2024 , date =

    Hessam Akhlaghpour , title =. 2024 , date =

  32. [32]

    arXiv preprint arXiv:2502.06773 , year =

    On the Emergence of Thinking in LLMs I: Searching for the Right Intuition , author =. arXiv preprint arXiv:2502.06773 , year =

  33. [33]

    Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models

    Towards large reasoning models: A survey of reinforced reasoning with large language models , author=. arXiv preprint arXiv:2501.09686 , year=

  34. [34]

    The journal of symbolic logic , volume=

    Finite combinatory processes—formulation1 , author=. The journal of symbolic logic , volume=. 1936 , publisher=

  35. [35]

    Advances in neural information processing systems , volume=

    Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting , author=. Advances in neural information processing systems , volume=

  36. [36]

    arXiv preprint arXiv:2503.03588 , year=

    PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention , author=. arXiv preprint arXiv:2503.03588 , year=

  37. [37]

    The Twelfth International Conference on Learning Representations,

    William Merrill and Ashish Sabharwal , title =. The Twelfth International Conference on Learning Representations,

  38. [38]

    CoRR , volume =

    Ruizhong Qiu and Zhe Xu and Wenxuan Bao and Hanghang Tong , title =. CoRR , volume =

  39. [39]

    Attention is Turing-Complete , journal =

    Jorge P. Attention is Turing-Complete , journal =. 2021 , url =

  40. [40]

    The Twelfth International Conference on Learning Representations,

    Zhiyuan Li and Hong Liu and Denny Zhou and Tengyu Ma , title =. The Twelfth International Conference on Learning Representations,

  41. [41]

    CoRR , volume =

    William Merrill and Ashish Sabharwal , title =. CoRR , volume =

  42. [42]

    OpenAI blog , volume=

    Language models are unsupervised multitask learners , author=. OpenAI blog , volume=

  43. [43]

    Transactions of the Association for Computational Linguistics , volume=

    The parallelism tradeoff: Limitations of log-precision transformers , author=. Transactions of the Association for Computational Linguistics , volume=

  44. [44]

    Proceedings of the 24th Conference on Computational Natural Language Learning , pages=

    On the Computational Power of Transformers and Its Implications in Sequence Modeling , author=. Proceedings of the 24th Conference on Computational Natural Language Learning , pages=

  45. [45]

    arXiv preprint arXiv:2503.14337 , year=

    PENCIL: Long thoughts with short memory , author=. arXiv preprint arXiv:2503.14337 , year=

  46. [46]

    arXiv preprint arXiv:2405.16674 , year=

    Limits of deep learning: Sequence modeling through the lens of complexity theory , author=. arXiv preprint arXiv:2405.16674 , year=

  47. [47]

    2005 , publisher=

    Universal artificial intelligence: Sequential decisions based on algorithmic probability , author=. 2005 , publisher=

  48. [48]

    1994 , publisher=

    Computability, complexity, and languages: fundamentals of theoretical computer science , author=. 1994 , publisher=

  49. [49]

    Evaluating & Ranking GPT-5 Reasoning Ability , author=

  50. [50]

    GPT-5 New Params and Tools , author=

  51. [51]

    2024 , month =

    Learning to Reason with. 2024 , month =

  52. [52]

    Advanced version of Gemini with Deep Think officially achieves gold-medal standard at the

    Luong, Thang and Lockhart, Edward , year =. Advanced version of Gemini with Deep Think officially achieves gold-medal standard at the

  53. [53]

    The Llama 3 Herd of Models

    The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=

  54. [54]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning , author=. arXiv preprint arXiv:2501.12948 , year=

  55. [55]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Swe-bench: Can language models resolve real-world github issues? , author=. arXiv preprint arXiv:2310.06770 , year=

  56. [56]

    GPQA: A Graduate-Level Google-Proof Q&A Benchmark

    Gpqa: A graduate-level google-proof q&a benchmark , author=. arXiv preprint arXiv:2311.12022 , year=

  57. [57]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Measuring mathematical problem solving with the math dataset , author=. arXiv preprint arXiv:2103.03874 , year=

  58. [58]

    Problemy peredachi informatsii , volume=

    Universal sequential search problems , author=. Problemy peredachi informatsii , volume=

  59. [59]

    Information and Control , volume=

    Randomness conservation inequalities; information and independence in mathematical theories , author=. Information and Control , volume=. 1984 , publisher=

  60. [60]

    Machine Learning , volume=

    Optimal ordered problem solver , author=. Machine Learning , volume=. 2004 , publisher=

  61. [61]

    Theoretical Computer Science , volume=

    PSPACE-completeness of sliding-block puzzles and other problems through the nondeterministic constraint logic model of computation , author=. Theoretical Computer Science , volume=. 2005 , publisher=

  62. [62]

    Journal of the ACM (JACM) , volume=

    The complexity of propositional linear temporal logics , author=. Journal of the ACM (JACM) , volume=. 1985 , publisher=

  63. [63]

    Journal of Computer and System Sciences , volume=

    On the complexity of k-SAT , author=. Journal of Computer and System Sciences , volume=. 2001 , publisher=

  64. [64]

    2009 , publisher=

    Computational complexity: a modern approach , author=. 2009 , publisher=

  65. [65]

    2024 , note =

    Meta AI, Open Source, Limits of LLMs, AGI & the Future of AI , author =. 2024 , note =

  66. [66]

    Micah Goldblum and Marc Anton Finzi and Keefer Rowan and Andrew Gordon Wilson , title =

  67. [67]

    Advances in Neural Information Processing Systems , volume=

    Towards revealing the mystery behind chain of thought: a theoretical perspective , author=. Advances in Neural Information Processing Systems , volume=

  68. [68]

    arXiv preprint arXiv:2502.17779 , year=

    Simulating Time With Square-Root Space , author=. arXiv preprint arXiv:2502.17779 , year=

  69. [69]

    Ramamohan Paturi , title =

  70. [70]

    arXiv preprint arXiv:2506.12027 , year=

    Constant Bit-size Transformers Are Turing Complete , author=. arXiv preprint arXiv:2506.12027 , year=

  71. [71]

    Transactions of the Association for Computational Linguistics , volume=

    What formal languages can transformers express? a survey , author=. Transactions of the Association for Computational Linguistics , volume=

  72. [72]

    On the Power of Several Queues , journal =

    Martin H. On the Power of Several Queues , journal =

  73. [73]

    SIAM Journal on Computing , volume=

    The power of the queue , author=. SIAM Journal on Computing , volume=. 1992 , publisher=

  74. [74]

    Some Remarks on Lower Bounds for Queue Machines (Preliminary Report)

    Some Remarks on Lower Bounds for Queue Machines (Preliminary Report) , author=. arXiv preprint arXiv:1310.6398 , year=

  75. [75]

    Advances in Neural Information Processing Systems , volume=

    Masked hard-attention transformers recognize exactly the star-free languages , author=. Advances in Neural Information Processing Systems , volume=

  76. [76]

    Advances in neural information processing systems , volume=

    A logic for expressing log-precision transformers , author=. Advances in neural information processing systems , volume=

  77. [77]

    Self-Attention Networks Can Process Bounded Hierarchical Languages , author=. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) , pages=

  78. [78]

    International Conference on Machine Learning , pages=

    Tighter bounds on the expressivity of transformer encoders , author=. International Conference on Machine Learning , pages=. 2023 , organization=

  79. [79]

    The Twelfth International Conference on Learning Representations , year=

    Logical Languages Accepted by Transformer Encoders with Hard Attention , author=. The Twelfth International Conference on Learning Representations , year=

  80. [80]

    Michael Hahn , title =. Trans. Assoc. Comput. Linguistics , volume =

Showing first 80 references.