pith. machine review for the scientific record.

arxiv: 2605.07705 · v1 · submitted 2026-05-08 · 💻 cs.LO · cs.AI

Recognition: 2 theorem links · Lean Theorem

Cross-Attention and Encoder-Decoder Transformers: A Logical Characterization

Authors on Pith · no claims yet

Pith reviewed 2026-05-11 02:05 UTC · model grok-4.3

classification 💻 cs.LO cs.AI
keywords encoder-decoder transformers · temporal logic · cross-attention · logical characterization · distributed automata · soft attention · autoregressive models · masking

The pith

Encoder-decoder transformers match a temporal logic with a counting global modality over the encoder and a past modality over the decoder.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a logical characterization of encoder-decoder transformers, the architecture underlying many large language models and cross-attention systems. It introduces a temporal logic that starts from propositional logic and adds a counting global modality to track the encoder input sequence together with a past modality to handle the decoder input. A sympathetic reader cares because this gives an exact description of what computations these models perform, without relying on their internal floating-point arithmetic or soft-attention weights. The same transformers are also shown to correspond to a form of distributed automata, and the characterizations remain valid under changes such as different masking patterns.

Core claim

Encoder-decoder transformers over text, in the setting of floating-point numbers and soft attention, are exactly characterized by a temporal logic that extends propositional logic with a counting global modality over the encoder input and a past modality over the decoder input. The same class of transformers is also characterized by a type of distributed automata. These equivalences hold even when architectural details such as masking are varied, and they extend to the autoregressive case.

What carries the argument

A temporal logic with a counting global modality over the encoder input sequence and a past modality over the decoder input sequence, which together capture the cross-attention mechanism between the two parts of the model.
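As a rough orientation, the grammar of such a logic plausibly looks like the sketch below. The operator names, the threshold subscript, and the reading given in the comments are editorial assumptions (loosely matched to the ⟨G⟩ and ⟨P⟩ notation visible in the paper's appendix fragments), not the paper's exact definitions.

    % Assumed grammar sketch for a counting/past temporal logic over an
    % encoder-decoder pair; operator names and semantics are illustrative only.
    \[
      \varphi \;::=\; p
        \;\mid\; \neg\varphi
        \;\mid\; \varphi \wedge \varphi
        \;\mid\; \langle G \rangle_{\ge k}\,\varphi
        \;\mid\; \langle P \rangle\,\varphi
    \]
    % Assumed reading: \(\langle G \rangle_{\ge k}\varphi\) holds when at least
    % k encoder positions satisfy \(\varphi\); \(\langle P \rangle\varphi\) holds
    % when some strictly earlier decoder position satisfies \(\varphi\).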

If this is right

  • Any property expressible in the new logic can be decided for a given transformer by checking the corresponding formula.
  • Transformers with altered masking rules remain within the same logical class.
  • The automata characterization supplies an alternative computational model that can simulate the same cross-attention behavior.
  • The results apply directly to autoregressive generation settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The logic may serve as a specification language for verifying safety properties of transformer-based systems before deployment.
  • Connections between the counting modality and automata might allow complexity bounds on what encoder-decoder models can compute in polynomial time.
  • Designers could use the logic to generate candidate architectures that satisfy desired input-output behaviors.

Load-bearing premise

The continuous and probabilistic behavior of soft attention can be faithfully reproduced by the discrete counting and past modalities without any loss of expressive power.
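To make the discrete half of this premise concrete, the following is a minimal editorial sketch of what the two modalities compute under the assumed semantics above: counting occurrences over the encoder sequence, and looking back over the decoder sequence. The formula encoding and the restriction of the counted formula to an atomic proposition are simplifications for illustration, not the paper's construction.

    # Minimal sketch (editorial, assumed semantics): evaluate a formula with a
    # counting global modality over the encoder and a past modality over the decoder.
    # enc and dec are lists of dicts mapping proposition names to booleans.

    def holds(phi, enc, dec, t):
        """Evaluate phi at decoder position t."""
        kind = phi[0]
        if kind == "prop":                 # atomic proposition at decoder position t
            return dec[t].get(phi[1], False)
        if kind == "not":
            return not holds(phi[1], enc, dec, t)
        if kind == "and":
            return holds(phi[1], enc, dec, t) and holds(phi[2], enc, dec, t)
        if kind == "count_ge":             # <G>_{>=k} p: at least k encoder positions satisfy p
            k, p = phi[1], phi[2]          # p kept atomic here for brevity
            return sum(pos.get(p, False) for pos in enc) >= k
        if kind == "past":                 # <P> psi: some earlier decoder position satisfies psi
            return any(holds(phi[1], enc, dec, s) for s in range(t))
        raise ValueError(f"unknown operator {kind!r}")

    # Example: at least two encoder positions carry "a", and some earlier
    # decoder position carried "b".
    enc = [{"a": True}, {"a": True}, {"a": False}]
    dec = [{"b": True}, {"b": False}]
    phi = ("and", ("count_ge", 2, "a"), ("past", ("prop", "b")))
    print(holds(phi, enc, dec, t=1))       # -> True

The question the premise raises is whether exact-count bookkeeping of this kind can stand in for real-valued attention weights without losing expressive power.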

What would settle it

An input-output pair on a small encoder-decoder transformer where the output token probabilities cannot be matched by any formula in the proposed logic or by any run of the corresponding distributed automata.

Figures

Figures reproduced from arXiv: 2605.07705 by Antti Kuusisto, Damian Heiman, Matias Selin, Miguel Moreno, Veeti Ahvonen.

Figure 1. The encoder–decoder transformer of [20]. In this paper, we do not consider positional encod…
original abstract

We give a novel logical characterization of encoder-decoder transformers, the foundational architecture for LLMs that also sees use in various settings that benefit from cross-attention. We study such transformers over text in the practical setting of floating-point numbers and soft-attention, characterizing them with a new temporal logic. This logic extends propositional logic with a counting global modality over the encoder input and a past modality over the decoder input. We also give an additional characterization of such transformers via a type of distributed automata, and show that our results are not limited to the specific choices in the architecture and can account for changes in, e.g., masking. Finally, we discuss encoder-decoder transformers in the autoregressive setting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims to provide a novel logical characterization of encoder-decoder transformers (with cross-attention, in the floating-point soft-attention setting) via a temporal logic extending propositional logic with a counting global modality over the encoder input and a past modality over the decoder input. It additionally characterizes such transformers via distributed automata, shows the results are robust to variations such as masking, and discusses the autoregressive setting.

Significance. If the claimed exact equivalences hold, the work would offer a valuable formal bridge between transformer architectures and logical/automata-theoretic methods, supporting verification and analysis of cross-attention mechanisms in LLMs. The automata characterization and explicit handling of masking changes are positive features that broaden applicability.

major comments (2)
  1. [§3 (logic definition) and §4 (equivalence proof)] The central equivalence (likely Theorem 1 or the main result in the section on logical characterization) asserts that every transformer computation, including dot-product attention scores, softmax normalization, and real-valued weighted sums, is expressible in the logic. However, the counting global modality only tracks cardinalities of positions satisfying formulas, and the past modality tracks temporal order; it is unclear from the definitions how these discrete modalities exactly encode the continuous, position-pair-dependent floating-point operations of soft cross-attention without an intermediate discretization that would break exact equivalence.
  2. [§5 (automata characterization)] The automata characterization (in the section following the logic) claims a precise match via distributed automata, but the transition rules must be shown to simulate the numerical attention matrix computation exactly; if the automata are defined only over discrete states derived from the logic, this risks the same gap with respect to floating-point values and masking.
minor comments (2)
  1. [Abstract and §1] The abstract and introduction could more explicitly state the precise class of encoder-decoder transformers considered (e.g., number of layers, head count, or restrictions on value embeddings) to clarify the scope of the characterization.
  2. [§3] Notation for the modalities (e.g., how the counting operator is written and how it interacts with propositional variables) should be introduced with a small concrete example of a simple cross-attention step to aid readability.
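For orientation only, the kind of "simple cross-attention step" the second minor comment asks to see illustrated is, in its generic form, scaled dot-product attention in the style of [20]: one decoder query scored against every encoder position, normalized by softmax, then used to mix encoder values. The sketch below shows that standard operation; it is not the paper's specific parametrization, precision model, or masking scheme.

    import numpy as np

    def cross_attention_step(query, enc_keys, enc_values, mask=None):
        """One generic soft cross-attention step: a single decoder query
        attends over the encoder sequence (scaled dot-product attention)."""
        d = query.shape[-1]
        scores = enc_keys @ query / np.sqrt(d)       # one score per encoder position
        if mask is not None:                         # masked-out positions get -inf
            scores = np.where(mask, scores, -np.inf)
        weights = np.exp(scores - scores.max())      # softmax over encoder positions
        weights = weights / weights.sum()
        return weights @ enc_values                  # weighted sum of encoder values

    # Toy instance: 3 encoder positions, model dimension 4.
    rng = np.random.default_rng(0)
    q, K, V = rng.normal(size=4), rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
    print(cross_attention_step(q, K, V))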

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and valuable feedback on our manuscript. The comments highlight important points regarding clarity in the logical and automata characterizations, which we will address through revisions to improve the exposition of the proofs and examples.

point-by-point responses
  1. Referee: [§3 (logic definition) and §4 (equivalence proof)] The central equivalence (likely Theorem 1 or the main result in the section on logical characterization) asserts that every transformer computation, including dot-product attention scores, softmax normalization, and real-valued weighted sums, is expressible in the logic. However, the counting global modality only tracks cardinalities of positions satisfying formulas, and the past modality tracks temporal order; it is unclear from the definitions how these discrete modalities exactly encode the continuous, position-pair-dependent floating-point operations of soft cross-attention without an intermediate discretization that would break exact equivalence.

    Authors: We appreciate this observation on the need for greater explicitness. The logic encodes the floating-point operations by using the counting modality to express exact cardinalities and comparisons that correspond to the results of dot-product computations and softmax (via formulas that define thresholds and weighted sums over positions satisfying prior subformulas). The equivalence in §4 is established by induction on transformer layers, where each real-valued operation is captured by a corresponding logical formula without requiring discretization of the values themselves. The past modality handles decoder dependencies in the same manner. To make this connection clearer, we will revise §3 and §4 to include a detailed worked example of encoding a cross-attention head, along with expanded proof steps showing how the modalities simulate the numerical matrix operations exactly. revision: yes

  2. Referee: [§5 (automata characterization)] The automata characterization (in the section following the logic) claims a precise match via distributed automata, but the transition rules must be shown to simulate the numerical attention matrix computation exactly; if the automata are defined only over discrete states derived from the logic, this risks the same gap with respect to floating-point values and masking.

    Authors: The distributed automata are constructed directly from the logical formulas, with states and transitions that preserve the exact numerical semantics of the attention computations (including floating-point values) through the same inductive encoding used in the logic. Masking is handled by restricting the transition rules to the relevant positions, as shown in the robustness results. We acknowledge that the current presentation of the automata transitions could be more detailed. In the revision, we will expand §5 with additional formal definitions and a proof sketch explicitly demonstrating the simulation of the attention matrix and masking variations. revision: yes
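As a generic picture of the automaton side (in the lineage of the distributed message-passing automata of [10], not the construction in §5), each position carries a state, and all positions update synchronously from their own state plus a bounded summary of the states elsewhere, such as counts capped at a threshold. The sketch below is an editorial illustration of that model class; the function names and the capping choice are assumptions.

    from collections import Counter

    def run_distributed_automaton(word, init, delta, rounds, cap=2):
        """Synchronous position-wise automaton: each position starts from its
        input symbol and repeatedly updates its state from (own state, capped
        counts of all current states). Illustrative of the model class only."""
        states = [init(sym) for sym in word]
        for _ in range(rounds):
            counts = {s: min(c, cap) for s, c in Counter(states).items()}
            states = [delta(s, counts) for s in states]   # simultaneous update
        return states

    # Toy instance: every position switches state once at least two positions
    # anywhere (by the capped count) currently carry state "a".
    init = lambda sym: sym
    delta = lambda s, counts: "two_a_seen" if counts.get("a", 0) >= 2 else s
    print(run_distributed_automaton(list("aba"), init, delta, rounds=1))
    # -> ['two_a_seen', 'two_a_seen', 'two_a_seen']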

Circularity Check

0 steps flagged

No significant circularity; independent formalisms shown equivalent to architecture

full rationale

The paper defines a new temporal logic (propositional logic plus counting global modality over encoder and past modality over decoder) and a class of distributed automata as standalone formalisms. It then proves these exactly characterize encoder-decoder transformers under floating-point soft-attention. No equation or definition reduces the claimed equivalence to a fitted parameter, self-referential construction, or self-citation chain. The derivation proceeds by explicit translation of transformer operations into the logic/automata and vice versa, remaining self-contained without importing load-bearing results from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim rests on extending propositional logic with two new modalities and showing equivalence to the transformer model under standard assumptions about attention; no numerical parameters are fitted and no new physical entities are postulated.

axioms (2)
  • standard math · Standard axioms and semantics of propositional logic
    The new logic is explicitly described as an extension of propositional logic.
  • domain assumption · Transformer computations use floating-point arithmetic and soft attention
    The characterization is given specifically in this practical setting.
invented entities (2)
  • Counting global modality · no independent evidence
    purpose: To express aggregate properties over the entire encoder input
    New operator introduced to capture encoder behavior in the logic.
  • Past modality over decoder input · no independent evidence
    purpose: To refer to previous decoder states
    New operator introduced to capture decoder behavior in the logic.

pith-pipeline@v0.9.0 · 5423 in / 1501 out tokens · 52624 ms · 2026-05-11T02:05:03.326828+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 2 internal anchors

  1. [1]

    Expressive Power of Graph Transformers via Logic

    Veeti Ahvonen, Maurice Funk, Damian Heiman, Antti Kuusisto, and Carsten Lutz. Expressive Power of Graph Transformers via Logic. Proceedings of the AAAI Conference on Artificial Intelligence, 40(24):19569–19579, Mar. 2026. URL: https://ojs.aaai.org/index.php/AAAI/article/view/39036, doi:10.1609/aaai.v40i24.39036

  2. [2]

    Expressive Power of Graph Transformers via Logic, 2026

    Veeti Ahvonen, Maurice Funk, Damian Heiman, Antti Kuusisto, and Carsten Lutz. Expressive Power of Graph Transformers via Logic, 2026. URL: https://arxiv.org/abs/2508.01067, arXiv:2508.01067

  3. [3]

    Logical Characterizations of Recurrent Graph Neural Networks with Reals and Floats

    Veeti Ahvonen, Damian Heiman, Antti Kuusisto, and Carsten Lutz. Logical Characterizations of Recurrent Graph Neural Networks with Reals and Floats. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 104205–104249. Curran Associates, Inc., 202...

  4. [4]

    Logical Languages Accepted by Transformer Encoders with Hard Attention

    Pablo Barceló, Alexander Kozachinskiy, Anthony W. Lin, and Vladimir Podolskii. Logical Languages Accepted by Transformer Encoders with Hard Attention. In B. Kim, Y. Yue, S. Chaudhuri, K. Fragkiadaki, M. Khan, and Y. Sun, editors, International Conference on Learning Representations, volume 2024, pages 22077–22087, 2024. URL: https://proceedings.iclr.cc/...

  5. [5]

    Overcoming a Theoretical Limitation of Self-Attention

    David Chiang and Peter Cholak. Overcoming a Theoretical Limitation of Self-Attention. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7654–7664, 2022

  6. [6]

    Tighter Bounds on the Expressivity of Transformer Encoders

    David Chiang, Peter Cholak, and Anand Pillay. Tighter Bounds on the Expressivity of Transformer Encoders. In Proceedings of the 40th International Conference on Machine Learning, ICML'23. JMLR.org, 2023

  7. [7]

    No Language Left Behind: Scaling Human-Centered Machine Translation

    Marta R. Costa-Jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, et al. No Language Left Behind: Scaling Human-Centered Machine Translation. arXiv preprint arXiv:2207.04672, 2022. URL: https://arxiv.org/abs/2207.04672, arXiv:2207.04672

  8. [8]

    The Accuracy of Floating Point Summation

    Nicholas J. Higham. The Accuracy of Floating Point Summation. SIAM J. Sci. Comput., 14(4):783–799, 1993. doi:10.1137/0914050

  9. [9]

    Softmax Transformers are Turing-Complete

    Hongjian Jiang, Michael Hahn, Georg Zetzsche, and Anthony Widjaja Lin. Softmax Transformers are Turing-Complete, 2025. URL: https://arxiv.org/abs/2511.20038, arXiv:2511.20038

  10. [10]

    Modal Logic and Distributed Message Passing Automata

    Antti Kuusisto. Modal Logic and Distributed Message Passing Automata. In Simona Ronchi Della Rocca, editor, Computer Science Logic 2013 (CSL 2013), volume 23 of Leibniz International Proceedings in Informatics (LIPIcs), pages 452–468, Dagstuhl, Germany, 2013. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. URL: https://drops.dagstuhl.de/entities/document/...

  11. [11]

    Characterizing the Expressivity of Fixed-precision Transformer Language Models

    Jiaoda Li and Ryan Cotterell. Characterizing the Expressivity of Fixed-precision Transformer Language Models. In D. Belgrave, C. Zhang, H. Lin, R. Pascanu, P. Koniusz, M. Ghassemi, and N. Chen, editors, Advances in Neural Information Processing Systems, volume 38, pages 159510–159554. Curran Associates, Inc., 2025. URL: https://proceedings.neurips.cc/paper...

  12. [12]

    Chain of Thought Empowers Transformers to Solve Inherently Serial Problems

    Zhiyuan Li, Hong Liu, Denny Zhou, and Tengyu Ma. Chain of Thought Empowers Transformers to Solve Inherently Serial Problems. In The Twelfth International Conference on Learning Representations, 2024

  13. [13]

    A Logic for Expressing Log-Precision Transformers

    William Merrill and Ashish Sabharwal. A Logic for Expressing Log-Precision Transformers. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 52453–52463. Curran Associates, Inc., 2023. URL: https://proceedings.neurips.cc/paper_files/paper/2023/file/a48e5877...

  14. [14]

    The Expressive Power of Transformers with Chain of Thought

    William Merrill and Ashish Sabharwal. The Expressive Power of Transformers with Chain of Thought. In B. Kim, Y. Yue, S. Chaudhuri, K. Fragkiadaki, M. Khan, and Y. Sun, editors, International Conference on Learning Representations, volume 2024, pages 7690–7706, 2024. URL: https://proceedings.iclr.cc/paper_files/paper/2024/file/1f59721c106ea80f61329903911...

  15. [15]

    Attention is Turing Complete

    Jorge Pérez, Pablo Barceló, and Javier Marinkovic. Attention is Turing Complete. Journal of Machine Learning Research, 22(75):1–35, 2021. URL: http://jmlr.org/papers/v22/20-302.html

  16. [16]

    Best "Ordering" for Floating-Point Addition

    Thomas G. Robertazzi and Stuart C. Schwartz. Best "Ordering" for Floating-Point Addition. ACM Trans. Math. Softw., 14(1):101–110, 1988. doi:10.1145/42288.42343

  17. [17]

    High-Resolution Image Synthesis with Latent Diffusion Models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022

  18. [18]

    What Formal Languages Can Transformers Express? A Survey

    Lena Strobl, William Merrill, Gail Weiss, David Chiang, and Dana Angluin. What Formal Languages Can Transformers Express? A Survey. Transactions of the Association for Computational Linguistics, 12:543–561, 2024. URL: http://dx.doi.org/10.1162/tacl_a_00663, doi:10.1162/tacl_a_00663

  19. [19]

    RoFormer: Enhanced Transformer with Rotary Position Embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced Transformer with Rotary Position Embedding. Neurocomputing, 568:127063, 2024

  20. [20]

    Attention Is All You Need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL...

  21. [21]

    Rounding Errors in Algebraic Processes

    James Hardy Wilkinson. Rounding Errors in Algebraic Processes. In Information Processing, Proceedings of the 1st International Conference on Information Processing, UNESCO, Paris 15-20 June 1959, pages 44–53. UNESCO (Paris), 1959

  22. [22]

    Counting Like Transformers: Compiling Temporal Counting Logic Into Softmax Transformers

    Andy Yang and David Chiang. Counting Like Transformers: Compiling Temporal Counting Logic Into Softmax Transformers. In First Conference on Language Modeling, 2024. URL: https://openreview.net/forum?id=FmhPg4UJ9K

  23. [23]

    Masked Hard-Attention Transformers Recognize Exactly the Star-Free Languages

    Andy Yang, David Chiang, and Dana Angluin. Masked Hard-Attention Transformers Recognize Exactly the Star-Free Languages. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 10202–10235. Curran Associates, Inc., 2024. URL: https://proceedings.neurips.cc/paper_fi...

  24. [24]

    Knee-Deep in C-RASP: A Transformer Depth Hierarchy

    Andy J. Yang, Michaël Cadilhac, and David Chiang. Knee-Deep in C-RASP: A Transformer Depth Hierarchy. In D. Belgrave, C. Zhang, H. Lin, R. Pascanu, P. Koniusz, M. Ghassemi, and N. Chen, editors, Advances in Neural Information Processing Systems, volume 38, pages 37184–37225. Curran Associates, Inc., 2025. URL: https://proceedings.neurips.cc/paper_files/pap...

  25. [25]

    Open-Sora: Democratizing Efficient Video Production for All

    Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-Sora: Democratizing Efficient Video Production for All. URL: https://arxiv.org/abs/2412.20404, arXiv:2412.20404
