Parity, Sensitivity, and Transformers

Alexander Kozachinskiy , Tomasz Steifer , Przemys{\l}aw Wa{\l}\c{e}ga

Authors on Pith no claims yet

classification 💻 cs.LG cs.AI

keywords paritytransformernumberassumptionscomputetransformerscannotcausal

read the original abstract

Understanding what neural architectures can and cannot compute is a central challenge in the theory of AI. One of the fundamental problems in this context is the PARITY task, which asks whether the number of 1s in a binary input sequence is even or odd. PARITY is one of the central tasks studied in the theory of computation, yet it remains surprisingly unclear under which conditions transformers can or cannot solve it. In this paper, we show that the minimal number of layers a transformer needs to compute PARITY is two. In particular, we solve the open problem asking whether a one-layer transformer can compute PARITY. We answer it negatively by showing that average sensitivity of a one-layer transformer grows slower than that of PARITY. Furthermore, we show a new construction for transformer that computes PARITY, which improves on the existing constructions by removing a number of impractical assumptions. In particular, the existing transformers for PARITY rely on such impractical assumptions as length-dependent positional encoding, hardmax, layernorm without a regularisation parameter, or incompatibility with causal masking. We show that these assumptions can be removed, at the cost of increasing the number of layers from two to four. Specifically, we show that PARITY can be computed by a four-layer transformer using softmax attention, length-independent and polynomially bounded positional encoding, no layernorm, and compatible with both causal and non-causal masking.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Lower bounds for one-layer transformers that compute parity
cs.LG 2026-05 unverdicted novelty 7.0

One-layer self-attention with rational or ReLU post-processing cannot sign-represent parity unless heads times degree scales linearly with input length.