pith. machine review for the scientific record.

arxiv: 2605.12171 · v1 · submitted 2026-05-12 · 💻 cs.LG

Recognition: no theorem link

Lower bounds for one-layer transformers that compute parity

Daniel Hsu

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 07:21 UTC · model grok-4.3

classification 💻 cs.LG
keywords transformers · self-attention · parity function · lower bounds · rational functions · ReLU networks · sign representation

The pith

No self-attention layer post-processed by a rational function can sign-represent parity unless the product of heads and degree grows linearly with input length.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a single self-attention layer followed by rational post-processing cannot sign-represent the parity function on n-bit inputs unless the number of heads times the degree of the rational function grows at least linearly in n. A reader would care because parity is a basic Boolean function whose computation routinely exposes fundamental limits of neural models, and the result quantifies exactly how a single attention layer falls short unless its resources scale with the input. Combined with rational approximation theorems for ReLU networks, the bound extends to a margin-dependent lower bound for ReLU post-processing. This clarifies why one-layer attention architectures need growing capacity to handle even simple high-frequency patterns like parity.

Core claim

No self-attention layer post-processed by a rational function can sign-represent the parity function unless the product of the number of heads and the degree of the post-processing function grows linearly with the input length.

What carries the argument

A lower bound on the sign-representation of parity, obtained by combining the algebraic structure of self-attention outputs with degree bounds for rational functions.
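The degree fact this argument leans on, that parity on n bits admits no sign-representation of degree below n, can be checked by brute force for small n. The sketch below phrases the question as an LP feasibility problem over ±1 inputs with a margin of 1; the encoding and normalization are our illustrative choices, not necessarily the paper's.

  # Brute-force check, via an LP feasibility problem, of the classical fact that
  # parity on n bits has no sign-representation of degree n-1.  The +/-1 input
  # encoding and the margin-1 normalization are illustrative choices.
  from itertools import combinations, product

  import numpy as np
  from scipy.optimize import linprog


  def parity_sign_rep_exists(n: int, degree: int) -> bool:
      """True iff some polynomial of the given degree sign-represents parity on {-1,+1}^n."""
      monomials = [m for k in range(degree + 1) for m in combinations(range(n), k)]
      rows = []
      for x in product([-1, 1], repeat=n):
          parity = int(np.prod(x))
          # Require parity(x) * p(x) >= 1 for every input; on a finite domain a
          # strict sign-representation can always be rescaled to margin 1.
          rows.append([parity * int(np.prod([x[i] for i in m])) for m in monomials])
      A = np.array(rows, dtype=float)
      res = linprog(c=np.zeros(len(monomials)), A_ub=-A, b_ub=-np.ones(len(A)),
                    bounds=(None, None), method="highs")
      return res.status == 0  # 0 = feasible, 2 = infeasible

  print("degree 3 enough for parity on 4 bits?", parity_sign_rep_exists(4, 3))  # expect False
  print("degree 4 enough for parity on 4 bits?", parity_sign_rep_exists(4, 4))  # expect True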

If this is right

  • Single-layer attention models must increase either the number of heads or the degree of post-processing linearly to handle parity as input size grows.
  • The same linear growth requirement applies to ReLU post-processing when the margin is taken into account via rational approximation.
  • Architectures relying on one attention layer face inherent resource costs for functions that, like parity, depend on every input bit.
  • Multi-layer or alternative mechanisms become necessary to avoid this linear blow-up for parity-like tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The bound implies that depth in transformers may be required to compute parity without paying the linear cost in heads or degree.
  • Similar techniques could yield lower bounds for other simple boolean functions that attention layers are expected to handle in practice.
  • Designers of efficient attention variants might target reductions in effective degree or head requirements to bypass the bound.

Load-bearing premise

The model is limited to exactly one self-attention layer with rational post-processing and the standard sign-representation task, without extra layers or nonstandard architectural modifications.

What would settle it

An explicit construction of a single self-attention layer with rational post-processing where the product of heads and degree remains sublinear in input length yet still sign-represents parity on every input vector.
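A minimal harness for that kind of check might look as follows; the modelling specifics (one head over n single-bit tokens, affine scores and values, a fixed low-degree rational post-processor) are assumptions made for illustration and do not come from the paper.

  # Hedged sketch of the falsification experiment: one softmax attention head over
  # n single-bit tokens, a fixed low-degree rational post-processor, and an
  # exhaustive sign check against parity.  All modelling specifics are stand-ins.
  from itertools import product

  import numpy as np


  def attention_head(bits, w_score=1.5, v0=-0.3, v1=0.7):
      """One softmax head whose score and value for token j are affine in bit j."""
      bits = np.asarray(bits, dtype=float)
      weights = np.exp(w_score * bits)
      weights /= weights.sum()
      values = v0 + (v1 - v0) * bits
      return float(weights @ values)


  def rational_post(z, num=(1.0, -0.8, 0.2), den=(1.0, 0.5)):
      """Rational post-processing P(z)/Q(z); Q stays positive on the head's output range."""
      return np.polyval(num, z) / np.polyval(den, z)


  def sign_represents_parity(n: int) -> bool:
      """Exhaustively test sign agreement with parity over all 2**n inputs."""
      for bits in product([0, 1], repeat=n):
          parity = 1 if sum(bits) % 2 else -1
          if parity * rational_post(attention_head(bits)) <= 0:
              return False
      return True


  print("toy model sign-represents parity on 6 bits?", sign_represents_parity(6))
  # Almost certainly False.  A genuine counterexample to the bound would need such a
  # check to pass while heads * degree stays sublinear in n.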

read the original abstract

This note shows that no self-attention layer post-processed by a rational function can sign-represent the parity function unless the product of the number of heads and the degree of the post-processing function grows linearly with the input length. Combining this lower bound with rational approximation of ReLU networks yields a margin-dependent extension for self-attention layers post-processed by ReLU networks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper proves that no single self-attention layer followed by a rational post-processing function can sign-represent the parity function on n bits unless the product of the number of heads and the degree of the rational function grows linearly with n. It further derives a margin-dependent lower bound for ReLU post-processing by composing the rational bound with known rational approximation results for ReLU networks.

Significance. If the algebraic degree bound holds, the result supplies a clean, parameter-free limitation on the expressive power of one-layer attention models for parity, a function whose sign-representation requires degree n. The per-head degree argument (softmax of linear forms yields bounded-degree rationals, independent of n) is a technical strength that directly yields the Omega(n) product lower bound without fitted parameters or empirical tuning.
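To make the parenthetical concrete, here is a small symbolic check under our own modelling assumption that each token carries a single Boolean bit and that both the pre-softmax score and the value are affine in that bit; in that setting the head's output is a ratio of degree-1 polynomials no matter how large n gets.

  # Symbolic check of the quoted per-head claim under an assumed modelling setup:
  # each token carries one Boolean bit, and both the pre-softmax score and the
  # value are affine in that bit, so exp(score_j) = a + b*x_j on x_j in {0, 1}.
  import sympy as sp

  n = 6
  x = sp.symbols(f"x1:{n + 1}")        # Boolean inputs x_1, ..., x_n
  a, b, c, d = sp.symbols("a b c d")

  weights = [a + b * xi for xi in x]   # exp of an affine score, rewritten for 0/1 inputs
  values = [c + d * xi for xi in x]

  num = sp.expand(sum(w * v for w, v in zip(weights, values)))
  den = sp.expand(sum(weights))
  for xi in x:                         # x_i**2 = x_i on Boolean inputs
      num = num.subs(xi**2, xi)

  print("numerator degree:", sp.Poly(num, *x).total_degree())    # 1, independent of n
  print("denominator degree:", sp.Poly(den, *x).total_degree())  # 1, independent of n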

minor comments (2)
  1. [§2] Model definition: explicitly state whether positional encodings are permitted and whether the value matrix is allowed to depend on the input length n, as these choices affect the degree bound.
  2. [final section] The extension to ReLU networks relies on a specific rational approximation rate; citing the precise theorem used for the approximation error would strengthen the margin-dependent claim (a numeric sketch of the relevant kind of approximation appears after this list).
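For context on the second comment, the sketch below evaluates a Newman-style rational approximation to |x| on [-1, 1], the classical kind of result a margin-dependent transfer to ReLU post-processing composes with; the construction follows the standard presentation of Newman's recipe as we recall it, and the exact theorem and constants the paper invokes may differ.

  # Numeric sketch of a Newman-style rational approximation to |x| on [-1, 1].
  # ReLU(x) = (x + |x|) / 2, so the same construction approximates ReLU with half
  # the error.  Constants here are not taken from the paper.
  import numpy as np


  def newman_abs(x, m=64):
      """Rational approximant built from p(t) = prod_k (t + xi**k), xi = exp(-1/sqrt(m))."""
      xi = np.exp(-1.0 / np.sqrt(m))
      roots = xi ** np.arange(m)

      def p(t):
          return np.prod(t[..., None] + roots, axis=-1)

      return x * (p(x) - p(-x)) / (p(x) + p(-x))


  grid = np.linspace(-1.0, 1.0, 20001)
  err = np.max(np.abs(np.abs(grid) - newman_abs(grid)))
  print(f"max error of the degree-64 rational approximant: {err:.2e}")
  # The error decays roughly like exp(-sqrt(m)), versus the ~1/m rate achievable by
  # degree-m polynomials; this kind of rate is what a margin-dependent transfer to
  # ReLU post-processing appears to rely on.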

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of the manuscript and the recommendation for minor revision. We appreciate the recognition that the per-head degree argument yields a clean, parameter-free Omega(n) lower bound on the product of heads and rational degree for sign-representing parity.

Circularity Check

0 steps flagged

No significant circularity in the lower bound derivation

full rationale

The paper derives a lower bound via algebraic degree analysis: each attention head computes a rational function of bounded degree (from softmax over linear forms), which composes with an outer rational post-processor of degree d to yield total degree O(h*d). Parity requires degree n for sign-representation, forcing h*d = Omega(n). This is a direct proof from standard definitions of scaled dot-product attention and rational functions, with no fitted parameters, self-referential definitions, load-bearing self-citations, or ansatz smuggling. The derivation is self-contained and does not reduce any claim to its own inputs by construction.
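The degree accounting above can be reproduced symbolically on a toy instance: two stand-in heads of per-head degree 1, summed and passed through an outer rational of degree 2, give a combined rational of total degree about h·d = 4, nowhere near the degree n a sign-representation of parity requires. The specific rational functions are illustrative and not taken from the paper.

  # Toy reproduction of the degree accounting: per-head degree 1, outer degree 2,
  # combined degree about h * d = 4 in the inputs.
  import sympy as sp

  x1, x2, x3, x4, z = sp.symbols("x1 x2 x3 x4 z")

  head1 = (1 + 2 * x1 + x2) / (3 + x1 + x2)      # stand-in head, degree 1
  head2 = (2 + x3 - x4) / (4 + x3 + x4)          # stand-in head, degree 1
  outer = (z**2 - z + 1) / (z**2 + 1)            # post-processing of degree d = 2

  composed = sp.cancel(outer.subs(z, head1 + head2))
  num, den = sp.fraction(composed)

  print("numerator degree:", sp.Poly(sp.expand(num), x1, x2, x3, x4).total_degree())    # expect 4
  print("denominator degree:", sp.Poly(sp.expand(den), x1, x2, x3, x4).total_degree())  # expect 4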

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claim rests on standard mathematical properties of self-attention and rational functions; no free parameters, new entities, or ad-hoc axioms are introduced in the abstract.

axioms (1)
  • standard math: Standard algebraic and analytic properties of self-attention layers and rational functions
    The lower bound derivation uses these background facts about attention and rational approximation.

pith-pipeline@v0.9.0 · 5333 in / 1135 out tokens · 55896 ms · 2026-05-13T07:21:08.952191+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

6 extracted references · 6 canonical work pages · 1 internal anchor

  1. [1]

    Spectral properties of threshold functions

    Craig Gotsman and Nathan Linial. Spectral properties of threshold functions. Combinatorica, 14(1):35--50, 1994

  2. [2]

    The correct exponent for the Gotsman-Linial conjecture

    Daniel M Kane. The correct exponent for the Gotsman-Linial conjecture. In IEEE Conference on Computational Complexity, pages 56--64, 2013

  3. [3]

    Parity, Sensitivity, and Transformers

    Alexander Kozachinskiy, Tomasz Steifer, and Przemysław Wałęga. Parity, sensitivity, and transformers. arXiv preprint arXiv:2602.05896, 2026

  4. [4]

    Rational approximation to |x|

    Donald J Newman. Rational approximation to |x|. Michigan Math. J., 11:11--14, 1964

  5. [5]

    Approximating threshold circuits by rational functions

    Ramamohan Paturi and Michael E Saks. Approximating threshold circuits by rational functions. Information and Computation, 112(2):257--272, 1994

  6. [6]

    Neural networks and rational functions

    Matus Telgarsky. Neural networks and rational functions. In International Conference on Machine Learning, pages 3387--3393, 2017