Recognition: no theorem link
Lower bounds for one-layer transformers that compute parity
Pith reviewed 2026-05-13 07:21 UTC · model grok-4.3
The pith
No self-attention layer post-processed by a rational function can sign-represent parity unless the product of heads and degree grows linearly with input length.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
No self-attention layer post-processed by a rational function can sign-represent the parity function unless the product of the number of heads and the degree of the post-processing function grows linearly with the input length.
What carries the argument
A lower bound on sign-representation of parity, obtained by combining the algebraic structure of self-attention outputs with the degree of the rational post-processing function.
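Read formally, the claim has the following shape (notation ours, paraphrasing the abstract: h heads, a rational post-processor ρ of degree d, meaning the larger of its numerator and denominator degrees):

\[
\operatorname{sign}\Big(\rho\big(\mathrm{head}_1(x),\dots,\mathrm{head}_h(x)\big)\Big)=(-1)^{x_1+\cdots+x_n}\ \ \text{for all }x\in\{0,1\}^n
\;\;\Longrightarrow\;\;
h\cdot d\;\ge\;c\,n
\]

for some absolute constant c > 0.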
If this is right
- Single-layer attention models must let the product of head count and post-processing degree grow at least linearly with input size to handle parity.
- The same linear growth requirement applies to ReLU post-processing when the margin is taken into account via rational approximation.
- Architectures relying on one attention layer face inherent resource costs for functions that, like parity, depend on every input bit.
- Multi-layer or alternative mechanisms become necessary to avoid this linear blow-up for parity-like tasks.
Where Pith is reading between the lines
- The bound implies that depth in transformers may be required to compute parity without paying the linear cost in heads or degree.
- Similar techniques could yield lower bounds for other simple boolean functions that attention layers are expected to handle in practice.
- Designers of efficient attention variants might target reductions in effective degree or head requirements to bypass the bound.
Load-bearing premise
The model is limited to exactly one self-attention layer with rational post-processing and the standard sign-representation task, without extra layers or nonstandard architectural modifications.
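One plausible formalization of that premise, in our notation (the paper's exact definitions, in particular how positional information enters the scores, may differ):

\[
\mathrm{head}_k(x)\;=\;\sum_{j=1}^{n}\frac{\exp\!\big(s_k(x_{i_0},x_j)\big)}{\sum_{j'=1}^{n}\exp\!\big(s_k(x_{i_0},x_{j'})\big)}\,v_k(x_j),
\qquad
F(x)\;=\;\rho\big(\mathrm{head}_1(x),\dots,\mathrm{head}_h(x)\big),
\]

with the output read at a single query position i_0, score functions s_k and value maps v_k given by the learned projections, and ρ a rational function of degree d. Sign-representation then means that sign(F(x)) agrees with parity on every x in {0,1}^n.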
What would settle it
An explicit construction of a single self-attention layer with rational post-processing that sign-represents parity on every input while the product of heads and degree stays sublinear in the input length.
original abstract
This note shows that no self-attention layer post-processed by a rational function can sign-represent the parity function unless the product of the number of heads and the degree of the post-processing function grows linearly with the input length. Combining this lower bound with rational approximation of ReLU networks yields a margin-dependent extension for self-attention layers post-processed by ReLU networks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proves that no single self-attention layer followed by a rational post-processing function can sign-represent the parity function on n bits unless the product of the number of heads and the degree of the rational function grows linearly with n. It further derives a margin-dependent lower bound for ReLU post-processing by composing the rational bound with known rational approximation results for ReLU networks.
Significance. If the algebraic degree bound holds, the result supplies a clean, parameter-free limitation on the expressive power of one-layer attention models for parity, a function whose sign-representation requires degree n. The per-head degree argument (softmax of linear forms yields bounded-degree rationals, independent of n) is a technical strength that yields the Omega(n) product lower bound directly, with no fitted parameters or empirical inputs.
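To make the parenthetical concrete, here is a minimal sketch of the per-head degree bound, assuming each token carries a single input bit, scores depend only on the query and key bits, and the classifier reads one query position i_0 (the paper's setting may be more general). On the Boolean cube, any function of the two bits (x_{i_0}, x_j) is a multilinear polynomial of degree at most 2, so each term exp(s(x_{i_0}, x_j)) and each product exp(s(x_{i_0}, x_j)) v(x_j) expands exactly into such a polynomial. Hence

\[
\mathrm{head}(x)\;=\;\frac{\sum_{j}\exp\!\big(s(x_{i_0},x_j)\big)\,v(x_j)}{\sum_{j}\exp\!\big(s(x_{i_0},x_j)\big)}
\]

is a ratio of two polynomials of degree at most 2 in x, a bound that does not grow with n.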
minor comments (2)
- [§2] Model definition: explicitly state whether positional encodings are permitted and whether the value matrix may depend on the input length n, since these choices affect the degree bound.
- [final section] The extension to ReLU networks relies on a specific rational approximation rate; citing the precise theorem used for the approximation error would strengthen the margin-dependent claim (see the sketch below).
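On the second comment: if the rate in question is the classical Newman-type bound (we have not verified which theorem the paper invokes), the relevant facts are

\[
\inf_{\deg r\le k}\ \max_{t\in[-1,1]}\big|\,|t|-r(t)\,\big|\;\le\;3e^{-\sqrt{k}},
\qquad
\mathrm{relu}(t)\;=\;\tfrac{1}{2}\big(t+|t|\big),
\]

so accuracy ε costs rational degree only O(log^2(1/ε)). Telgarsky [6] propagates such approximants through ReLU networks, and presumably the margin enters because the approximation only has to preserve the sign of outputs that are bounded away from zero.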
Simulated Author's Rebuttal
We thank the referee for their positive summary of the manuscript and the recommendation for minor revision. We appreciate the recognition that the per-head degree argument yields a clean, parameter-free Omega(n) lower bound on the product of heads and rational degree for sign-representing parity.
Circularity Check
No significant circularity in the lower bound derivation
full rationale
The paper derives a lower bound via algebraic degree analysis: each attention head computes a rational function of bounded degree (from softmax over linear forms), which composes with an outer rational post-processor of degree d to yield total degree O(h*d). Parity requires degree n for sign-representation, forcing h*d = Omega(n). This is a direct proof from standard definitions of scaled dot-product attention and rational functions, with no fitted parameters, self-referential definitions, load-bearing self-citations, or ansatz smuggling. The derivation is self-contained and does not reduce any claim to its own inputs by construction.
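A compressed version of that chain, in our notation and with constants untracked: write each head as p_k/q_k with deg p_k, deg q_k ≤ c_0 (a constant) and the post-processor as ρ = P/Q with deg P, deg Q ≤ d. Clearing denominators,

\[
\rho\Big(\tfrac{p_1}{q_1},\dots,\tfrac{p_h}{q_h}\Big)\;=\;\frac{\tilde P(x)}{\tilde Q(x)},
\qquad
\deg\tilde P,\ \deg\tilde Q\;\le\;c_0\,h\,d .
\]

If sign(\tilde P/\tilde Q) equals parity on {0,1}^n and \tilde Q does not vanish there, then sign(\tilde P\tilde Q) equals parity as well, and since the polynomial threshold degree of parity is n,

\[
2\,c_0\,h\,d\;\ge\;\deg(\tilde P\,\tilde Q)\;\ge\;n
\quad\Longrightarrow\quad
h\,d\;=\;\Omega(n).
\]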
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Standard algebraic and analytic properties of self-attention layers and rational functions
Reference graph
Works this paper leans on
- [1] Craig Gotsman and Nathan Linial. Spectral properties of threshold functions. Combinatorica, 14(1):35--50, 1994.
- [2] Daniel M. Kane. The correct exponent for the Gotsman-Linial conjecture. In IEEE Conference on Computational Complexity, pages 56--64, 2013.
- [3] Alexander Kozachinskiy, Tomasz Steifer, and Przemysław Wałęga. Parity, sensitivity, and transformers. arXiv preprint arXiv:2602.05896, 2026.
- [4] Donald J. Newman. Rational approximation to |x|. Michigan Math. J., 11:11--14, 1964.
- [5] Ramamohan Paturi and Michael E. Saks. Approximating threshold circuits by rational functions. Information and Computation, 112(2):257--272, 1994.
- [6] Matus Telgarsky. Neural networks and rational functions. In International Conference on Machine Learning, pages 3387--3393, 2017.