Recognition: no theorem link
Lower bounds for one-layer transformers that compute parity
Pith reviewed 2026-05-13 07:21 UTC · model grok-4.3
The pith
No self-attention layer post-processed by a rational function can sign-represent parity unless the product of heads and degree grows linearly with input length.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
No self-attention layer post-processed by a rational function can sign-represent the parity function unless the product of the number of heads and the degree of the post-processing function grows linearly with the input length.
What carries the argument
A lower bound on sign-representation of parity, obtained by combining the algebraic structure of self-attention outputs with the degree of the rational post-processing function.
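Read formally, the claim has the following shape (notation ours, paraphrasing the abstract: h heads, a rational post-processor ρ of degree d, meaning the larger of its numerator and denominator degrees):

\[
\operatorname{sign}\Big(\rho\big(\mathrm{head}_1(x),\dots,\mathrm{head}_h(x)\big)\Big)=(-1)^{x_1+\cdots+x_n}\ \ \text{for all }x\in\{0,1\}^n
\;\;\Longrightarrow\;\;
h\cdot d\;\ge\;c\,n
\]

for some absolute constant c > 0.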
If this is right
- Single-layer attention models must let the product of head count and post-processing degree grow at least linearly with input size to handle parity.
- The same linear growth requirement applies to ReLU post-processing when the margin is taken into account via rational approximation.
- Architectures relying on one attention layer face inherent resource costs for functions that, like parity, depend on every input bit.
- Multi-layer or alternative mechanisms become necessary to avoid this linear blow-up for parity-like tasks.
Where Pith is reading between the lines
- The bound implies that depth in transformers may be required to compute parity without paying the linear cost in heads or degree.
- Similar techniques could yield lower bounds for other simple boolean functions that attention layers are expected to handle in practice.
- Designers of efficient attention variants might target reductions in effective degree or head requirements to bypass the bound.
Load-bearing premise
The model is limited to exactly one self-attention layer with rational post-processing and the standard sign-representation task, without extra layers or nonstandard architectural modifications.
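One plausible formalization of that premise, in our notation (the paper's exact definitions, in particular how positional information enters the scores, may differ):

\[
\mathrm{head}_k(x)\;=\;\sum_{j=1}^{n}\frac{\exp\!\big(s_k(x_{i_0},x_j)\big)}{\sum_{j'=1}^{n}\exp\!\big(s_k(x_{i_0},x_{j'})\big)}\,v_k(x_j),
\qquad
F(x)\;=\;\rho\big(\mathrm{head}_1(x),\dots,\mathrm{head}_h(x)\big),
\]

with the output read at a single query position i_0, score functions s_k and value maps v_k given by the learned projections, and ρ a rational function of degree d. Sign-representation then means that sign(F(x)) agrees with parity on every x in {0,1}^n.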
What would settle it
An explicit construction of a single self-attention layer with rational post-processing that sign-represents parity on every input while the product of heads and degree stays sublinear in the input length.
original abstract
This note shows that no self-attention layer post-processed by a rational function can sign-represent the parity function unless the product of the number of heads and the degree of the post-processing function grows linearly with the input length. Combining this lower bound with rational approximation of ReLU networks yields a margin-dependent extension for self-attention layers post-processed by ReLU networks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proves that no single self-attention layer followed by a rational post-processing function can sign-represent the parity function on n bits unless the product of the number of heads and the degree of the rational function grows linearly with n. It further derives a margin-dependent lower bound for ReLU post-processing by composing the rational bound with known rational approximation results for ReLU networks.
Significance. If the algebraic degree bound holds, the result supplies a clean, parameter-free limitation on the expressive power of one-layer attention models for parity, a function whose sign-representation requires degree n. The per-head degree argument (softmax of linear forms yields bounded-degree rationals, independent of n) is a technical strength that yields the Omega(n) product lower bound directly, with no fitted parameters or empirical inputs.
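To make the parenthetical concrete, here is a minimal sketch of the per-head degree bound, assuming each token carries a single input bit, scores depend only on the query and key bits, and the classifier reads one query position i_0 (the paper's setting may be more general). On the Boolean cube, any function of the two bits (x_{i_0}, x_j) is a multilinear polynomial of degree at most 2, so each term exp(s(x_{i_0}, x_j)) and each product exp(s(x_{i_0}, x_j)) v(x_j) expands exactly into such a polynomial. Hence

\[
\mathrm{head}(x)\;=\;\frac{\sum_{j}\exp\!\big(s(x_{i_0},x_j)\big)\,v(x_j)}{\sum_{j}\exp\!\big(s(x_{i_0},x_j)\big)}
\]

is a ratio of two polynomials of degree at most 2 in x, a bound that does not grow with n.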
minor comments (2)
- [§2] Model definition: explicitly state whether positional encodings are permitted and whether the value matrix may depend on the input length n, since these choices affect the degree bound.
- [final section] The extension to ReLU networks relies on a specific rational approximation rate; citing the precise theorem used for the approximation error would strengthen the margin-dependent claim (see the sketch below).
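On the second comment: if the rate in question is the classical Newman-type bound (we have not verified which theorem the paper invokes), the relevant facts are

\[
\inf_{\deg r\le k}\ \max_{t\in[-1,1]}\big|\,|t|-r(t)\,\big|\;\le\;3e^{-\sqrt{k}},
\qquad
\mathrm{relu}(t)\;=\;\tfrac{1}{2}\big(t+|t|\big),
\]

so accuracy ε costs rational degree only O(log^2(1/ε)). Telgarsky [6] propagates such approximants through ReLU networks, and presumably the margin enters because the approximation only has to preserve the sign of outputs that are bounded away from zero.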
Simulated Author's Rebuttal
We thank the referee for their positive summary of the manuscript and the recommendation for minor revision. We appreciate the recognition that the per-head degree argument yields a clean, parameter-free Omega(n) lower bound on the product of heads and rational degree for sign-representing parity.
Circularity Check
No significant circularity in the lower bound derivation
full rationale
The paper derives a lower bound via algebraic degree analysis: each attention head computes a rational function of bounded degree (from softmax over linear forms), which composes with an outer rational post-processor of degree d to yield total degree O(h*d). Parity requires degree n for sign-representation, forcing h*d = Omega(n). This is a direct proof from standard definitions of scaled dot-product attention and rational functions, with no fitted parameters, self-referential definitions, load-bearing self-citations, or ansatz smuggling. The derivation is self-contained and does not reduce any claim to its own inputs by construction.
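A compressed version of that chain, in our notation and with constants untracked: write each head as p_k/q_k with deg p_k, deg q_k ≤ c_0 (a constant) and the post-processor as ρ = P/Q with deg P, deg Q ≤ d. Clearing denominators,

\[
\rho\Big(\tfrac{p_1}{q_1},\dots,\tfrac{p_h}{q_h}\Big)\;=\;\frac{\tilde P(x)}{\tilde Q(x)},
\qquad
\deg\tilde P,\ \deg\tilde Q\;\le\;c_0\,h\,d .
\]

If sign(\tilde P/\tilde Q) equals parity on {0,1}^n and \tilde Q does not vanish there, then sign(\tilde P\tilde Q) equals parity as well, and since the polynomial threshold degree of parity is n,

\[
2\,c_0\,h\,d\;\ge\;\deg(\tilde P\,\tilde Q)\;\ge\;n
\quad\Longrightarrow\quad
h\,d\;=\;\Omega(n).
\]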
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Standard algebraic and analytic properties of self-attention layers and rational functions
Reference graph
Works this paper leans on
- [1] Craig Gotsman and Nathan Linial. Spectral properties of threshold functions. Combinatorica, 14(1):35--50, 1994.
- [2] Daniel M. Kane. The correct exponent for the Gotsman-Linial conjecture. In IEEE Conference on Computational Complexity, pages 56--64, 2013.
- [3] Alexander Kozachinskiy, Tomasz Steifer, and Przemysław Wałęga. Parity, sensitivity, and transformers. arXiv preprint arXiv:2602.05896, 2026.
- [4] Donald J. Newman. Rational approximation to |x|. Michigan Math. J., 11:11--14, 1964.
- [5] Ramamohan Paturi and Michael E. Saks. Approximating threshold circuits by rational functions. Information and Computation, 112(2):257--272, 1994.
- [6] Matus Telgarsky. Neural networks and rational functions. In International Conference on Machine Learning, pages 3387--3393, 2017.