Tight Sample Complexity of Transformers
Pith reviewed 2026-06-27 17:31 UTC · model grok-4.3
The pith
Depth-L transformers with W parameters have VC dimension O(L W log(T W)).
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We tightly characterize the VC dimension of depth-L Transformers with a total of W parameters, mapping an input sequence of length T to a single output, establishing an upper bound of O(L W log (T W)) and a nearly matching lower bound of Ω(L W log (T W / L)). We further tightly characterize the sample complexity of chain-of-thought learning using such a Transformer, showing teacher forcing learns with sample complexity O(L W log ((T+T') W)) and that any learning rule that uses chain-of-thought data requires at least Ω(L W log ((T+T') W / L)) examples, where T is the input length and T' is the number of autoregressive steps.
What carries the argument
VC dimension of the depth-L transformer hypothesis class with W parameters on length-T inputs
If this is right
- Teacher forcing achieves sample complexity O(L W log((T+T')W)) for chain-of-thought.
- Any learning rule using chain-of-thought data requires Ω(L W log((T+T')W / L)) examples.
- The VC dimension scales with L W log(T W), determining generalization guarantees.
Where Pith is reading between the lines
- The logarithmic dependence on T suggests that longer sequences do not increase sample needs dramatically.
- Similar bounds might hold for variants of transformers if the parameter counting is analogous.
- Empirical tests on small models could check if observed generalization matches the predicted sample complexity.
Load-bearing premise
The class of functions realized by the transformers allows direct application of VC dimension bounds without further restrictions on weights or positional encodings.
What would settle it
Observing a number of distinct labelings by such transformers on T-length sequences that exceeds O(L W log(T W)) or is less than Ω(L W log(T W / L)) would contradict the bounds.
Figures
read the original abstract
We tightly characterize the VC dimension of depth-$L$ Transformers with a total of $W$ parameters, mapping an input sequence of length $T$ to a single output, establishing an upper bound of $O(L W \log (T W))$ and a nearly matching lower bound of $\Omega(L W \log (T W / L))$. We further tightly characterize the sample complexity of chain-of-thought learning using such a Transformer, showing teacher forcing (i.e. selecting a predictor consistent with the entire chain-of-thought on training data) learns with sample complexity $O\left(L W \log \left(\left(T+T^{\prime}\right) W\right)\right)$ and that any learning rule that uses chain-of-thought data requires at least $\Omega\left(L W \log \left(\left(T+T^{\prime}\right) W / L\right)\right)$ examples, where $T$ is the input length and $T^{\prime}$ is the number of autoregressive steps.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to tightly characterize the VC dimension of depth-L Transformers with total W parameters (mapping input sequence of length T to a single output) via an upper bound of O(L W log (T W)) and a nearly matching lower bound of Ω(L W log (T W / L)). It further claims tight sample-complexity results for chain-of-thought learning, with teacher forcing achieving O(L W log ((T + T') W)) and any rule using such data requiring at least Ω(L W log ((T + T') W / L)) examples.
Significance. If the modeling assumptions hold and the counting arguments are valid, the results would supply the first nearly tight VC-dimension and sample-complexity characterizations for transformers, a notable contribution to the theoretical analysis of modern architectures.
major comments (3)
- [Model definition / hypothesis class] The central claims rest on the hypothesis class being defined precisely enough for standard Sauer-Shelah or covering-number arguments to apply directly. The manuscript must specify the exact input domain (real vectors vs. discrete tokens), weight ranges, normalization, and whether positional encodings are learned or fixed, because softmax over T positions is transcendental and can alter the growth function.
- [VC-dimension upper and lower bounds] The upper-bound derivation O(L W log (T W)) and the matching lower-bound construction Ω(L W log (T W / L)) must be shown to remain valid once the attention mechanism and any restrictions on the parameter space are fully stated; otherwise the claimed tightness may not hold inside the stated architecture.
- [Chain-of-thought sample complexity] The chain-of-thought extension inherits the same modeling assumptions; the sample-complexity statements O(L W log ((T + T') W)) and Ω(L W log ((T + T') W / L)) therefore require the same explicit formalization of the autoregressive steps and teacher-forcing predictor class.
minor comments (1)
- [Notation and model parameters] Clarify how the total parameter count W is computed across all layers and heads, and whether it includes positional-encoding parameters.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive report. The comments highlight the need for greater explicitness in the model definition and proof details. We address each point below and will revise the manuscript to incorporate the requested clarifications while preserving the core technical contributions.
read point-by-point responses
-
Referee: [Model definition / hypothesis class] The central claims rest on the hypothesis class being defined precisely enough for standard Sauer-Shelah or covering-number arguments to apply directly. The manuscript must specify the exact input domain (real vectors vs. discrete tokens), weight ranges, normalization, and whether positional encodings are learned or fixed, because softmax over T positions is transcendental and can alter the growth function.
Authors: We agree that explicit specification is required for the arguments to be fully rigorous. The manuscript models inputs as sequences of discrete tokens drawn from a finite vocabulary, with real-valued weights whose total count is W (no artificial bounds beyond the parameter count), fixed sinusoidal positional encodings, and standard layer normalization. The upper bound proceeds via a covering-number argument that discretizes the range of the softmax (which is Lipschitz on bounded domains) to obtain a finite growth function; the lower bound is combinatorial and independent of the transcendental character. In revision we will add a dedicated model-definition subsection stating these choices verbatim and including a short lemma justifying the discretization step for the covering numbers. revision: yes
-
Referee: [VC-dimension upper and lower bounds] The upper-bound derivation O(L W log (T W)) and the matching lower-bound construction Ω(L W log (T W / L)) must be shown to remain valid once the attention mechanism and any restrictions on the parameter space are fully stated; otherwise the claimed tightness may not hold inside the stated architecture.
Authors: The upper bound counts the number of distinct sign patterns realizable by composing L layers, each with at most W/L parameters, using the fact that each attention head and feed-forward block induces a function whose shattering capacity is controlled by its parameter count; the argument already incorporates the standard multi-head attention mechanism. The lower bound is realized by a direct embedding of a hard VC-dimension instance into the weight matrices of a depth-L transformer with the same attention structure. Once the model section is expanded as described above, both proofs will be cross-referenced to the precise architecture, confirming that no additional restrictions invalidate the stated bounds. We will insert these cross-references and a short verification paragraph in the revision. revision: yes
-
Referee: [Chain-of-thought sample complexity] The chain-of-thought extension inherits the same modeling assumptions; the sample-complexity statements O(L W log ((T + T') W)) and Ω(L W log ((T + T') W / L)) therefore require the same explicit formalization of the autoregressive steps and teacher-forcing predictor class.
Authors: The chain-of-thought results are obtained by viewing the teacher-forcing predictor as a single transformer that maps an extended sequence of length T+T' to the full output chain; the VC-dimension bounds therefore transfer directly. In revision we will add a formal definition of the teacher-forcing hypothesis class (the set of all depth-L transformers that are consistent with the observed chain on the training data) and restate the sample-complexity claims as immediate corollaries of the VC-dimension theorem already proved for the extended length. This makes the inheritance explicit without altering the numerical bounds. revision: yes
Circularity Check
No circularity: VC bounds from first-principles parameter counting
full rationale
The paper derives upper and lower bounds on VC dimension and sample complexity directly from the total parameter count W, depth L, and sequence lengths T/T' via standard Sauer-Shelah-style counting on the hypothesis class. No equations reduce a claimed prediction to a fitted quantity by construction, no self-citation chains carry the central argument, and no ansatz or uniqueness theorem is imported from prior author work. The derivation is self-contained against external VC-dimension theory and does not rely on the target result to define its inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Standard VC-dimension theory applies directly to the class of functions realized by depth-L transformers with W parameters
Reference graph
Works this paper leans on
-
[1]
Transactions of the Association for Computational Linguistics , volume=
Saturated transformers are constant-depth threshold circuits , author=. Transactions of the Association for Computational Linguistics , volume=. 2022 , publisher=
2022
-
[2]
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
Average-Hard Attention Transformers are Constant-Depth Uniform Threshold Circuits , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
-
[3]
Journal of Machine Learning Research , volume=
Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks , author=. Journal of Machine Learning Research , volume=
-
[4]
IEEE transactions on electronic computers , volume=
Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition , author=. IEEE transactions on electronic computers , volume=. 1965 , publisher=
1965
-
[5]
Neural computation , volume=
What size net gives valid generalization? , author=. Neural computation , volume=. 1989 , publisher=
1989
-
[6]
Theory of Probability & Its Applications , volume=
On the uniform convergence of relative frequencies of events to their probabilities , author=. Theory of Probability & Its Applications , volume=. 1971 , publisher=
1971
-
[7]
Neural computation , volume=
Almost linear VC dimension bounds for piecewise polynomial networks , author=. Neural computation , volume=. 1998 , publisher=
1998
-
[8]
Pattern Recognition Letters , volume=
On the VC-dimension of depth-four threshold circuits , author=. Pattern Recognition Letters , volume=. 1993 , publisher=
1993
-
[9]
Proceedings of the 7th Annual Conference on Computational Learning Theory , pages=
VC dimension in circuit complexity , author=. Proceedings of the 7th Annual Conference on Computational Learning Theory , pages=
-
[10]
Neural Computation , volume=
Neural nets with superlinear VC-dimension , author=. Neural Computation , volume=. 1994 , publisher=
1994
-
[11]
Advances in neural information processing systems , volume=
On the number of linear regions of deep neural networks , author=. Advances in neural information processing systems , volume=
-
[12]
2025 , eprint=
A Theory of Learning with Autoregressive Chain of Thought , author=. 2025 , eprint=
2025
-
[13]
Bell system technical journal , volume=
Prediction and entropy of printed English , author=. Bell system technical journal , volume=. 1951 , publisher=
1951
-
[14]
Neural computation , volume=
A learning algorithm for continually running fully recurrent neural networks , author=. Neural computation , volume=. 1989 , publisher=
1989
-
[15]
Advances in neural information processing systems , volume=
Attention is all you need , author=. Advances in neural information processing systems , volume=
-
[16]
Advances in neural information processing systems , volume=
Language models are few-shot learners , author=. Advances in neural information processing systems , volume=
-
[17]
Advances in neural information processing systems , volume=
Chain-of-thought prompting elicits reasoning in large language models , author=. Advances in neural information processing systems , volume=
-
[18]
Neural computation , volume=
Long short-term memory , author=. Neural computation , volume=. 1997 , publisher=
1997
-
[19]
International Conference on Machine Learning , pages=
Inductive biases and variable creation in self-attention mechanisms , author=. International Conference on Machine Learning , pages=. 2022 , organization=
2022
-
[20]
arXiv preprint arXiv:2309.06979 , year=
Auto-regressive next-token predictors are universal learners , author=. arXiv preprint arXiv:2309.06979 , year=
-
[21]
Sparks of Artificial General Intelligence: Early experiments with GPT-4
Sparks of artificial general intelligence: Early experiments with gpt-4 , author=. arXiv preprint arXiv:2303.12712 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[22]
OpenAI blog , volume=
Language models are unsupervised multitask learners , author=. OpenAI blog , volume=
-
[23]
Gpt-4 technical report , author=. arXiv preprint arXiv:2303.08774 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[24]
LLaMA: Open and Efficient Foundation Language Models
Llama: Open and efficient foundation language models , author=. arXiv preprint arXiv:2302.13971 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[25]
Advances in Neural Information Processing Systems , volume=
Towards revealing the mystery behind chain of thought: a theoretical perspective , author=. Advances in Neural Information Processing Systems , volume=
-
[26]
Show Your Work: Scratchpads for Intermediate Computation with Language Models
Show your work: Scratchpads for intermediate computation with language models , author=. arXiv preprint arXiv:2112.00114 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[27]
arXiv preprint arXiv:2310.13088 , year=
Sequence Length Independent Norm-Based Generalization Bounds for Transformers , author=. arXiv preprint arXiv:2310.13088 , year=
-
[28]
International Conference on Learning Representations , year=
Are Transformers universal approximators of sequence-to-sequence functions? , author=. International Conference on Learning Representations , year=
-
[29]
Attention is Turing-Complete , journal=
P. Attention is Turing-Complete , journal=. 2021 , volume=
2021
-
[30]
arXiv preprint arXiv:2301.13196 , year=
Looped transformers as programmable computers , author=. arXiv preprint arXiv:2301.13196 , year=
-
[31]
Transactions of the Association for Computational Linguistics , volume=
Theoretical limitations of self-attention in neural sequence models , author=. Transactions of the Association for Computational Linguistics , volume=
-
[32]
Transactions of the Association for Computational Linguistics , volume=
Formal language recognition by hard attention transformers: Perspectives from circuit complexity , author=. Transactions of the Association for Computational Linguistics , volume=
-
[33]
International Conference on Machine Learning , pages=
Transformers learn in-context by gradient descent , author=. International Conference on Machine Learning , pages=. 2023 , organization=
2023
-
[34]
The Eleventh International Conference on Learning Representations , year=
What learning algorithm is in-context learning? Investigations with linear models , author=. The Eleventh International Conference on Learning Representations , year=
-
[35]
arXiv preprint arXiv:2306.04637 , year=
Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection , author=. arXiv preprint arXiv:2306.04637 , year=
-
[36]
arXiv preprint arXiv:2308.16898 , year=
Transformers as support vector machines , author=. arXiv preprint arXiv:2308.16898 , year=
-
[37]
Advances in Neural Information Processing Systems , volume=
Vision Transformers provably learn spatial structure , author=. Advances in Neural Information Processing Systems , volume=
-
[38]
arXiv preprint arXiv:2305.16380 , year=
Scan and Snap: Understanding Training Dynamics and Token Composition in 1-layer Transformer , author=. arXiv preprint arXiv:2305.16380 , year=
-
[39]
arXiv preprint arXiv:2212.14852 , year=
An Analysis of Attention via the Lens of Exchangeability and Latent Variable Models , author=. arXiv preprint arXiv:2212.14852 , year=
-
[40]
Advances in neural information processing systems , volume=
Spectrally-normalized margin bounds for neural networks , author=. Advances in neural information processing systems , volume=
-
[41]
A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks
A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks , author=. arXiv preprint arXiv:1707.09564 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[42]
Conference On Learning Theory , pages=
Size-independent sample complexity of neural networks , author=. Conference On Learning Theory , pages=. 2018 , organization=
2018
-
[43]
arXiv preprint arXiv:2310.07923 , year=
The expresssive power of transformers with chain of thought , author=. arXiv preprint arXiv:2310.07923 , year=
-
[44]
2024 , eprint=
Chain of Thought Empowers Transformers to Solve Inherently Serial Problems , author=. 2024 , eprint=
2024
-
[45]
Layer normalization , author=. arXiv preprint arXiv:1607.06450 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[46]
1999 , publisher=
Neural Network Learning: Theoretical Foundations , author=. 1999 , publisher=
1999
-
[47]
Nature , volume=
Learning representations by back-propagating errors , author=. Nature , volume=. 1986 , publisher=
1986
-
[48]
arXiv preprint arXiv:2402.04084 , year=
Provably learning a multi-head attention layer , author=. arXiv preprint arXiv:2402.04084 , year=
-
[49]
Communications of the ACM , volume=
A theory of the learnable , author=. Communications of the ACM , volume=. 1984 , publisher=
1984
-
[50]
Machine Learning , volume=
On learning sets and functions , author=. Machine Learning , volume=. 1989 , publisher=
1989
-
[51]
Learnability and the
Blumer, Anselm and Ehrenfeucht, Andr. Learnability and the. Journal of the Association for Computing Machinery , volume =. 1989 , publisher =
1989
-
[52]
Polynomial Bounds for
Karpinski, Marek and Macintyre, Angus , journal =. Polynomial Bounds for. 1997 , doi =
1997
-
[53]
arXiv preprint arXiv:2503.14337 , year=
Pencil: Long thoughts with short memory , author=. arXiv preprint arXiv:2503.14337 , year=
-
[54]
Statistically Meaningful Approximation:
Wei, Colin and Chen, Yining and Ma, Tengyu , booktitle=. Statistically Meaningful Approximation:
-
[55]
7th International Conference on Learning Representations (ICLR) , year=
On the Turing Completeness of Modern NLP Models , author=. 7th International Conference on Learning Representations (ICLR) , year=
-
[56]
The Twelfth International Conference on Learning Representations (ICLR) , year=
Chain of Thought Empowers Transformers to Solve Inherently Serial Problems , author=. The Twelfth International Conference on Learning Representations (ICLR) , year=
-
[57]
The Parallelism Tradeoff:
Merrill, William and Sabharwal, Ashish , booktitle=. The Parallelism Tradeoff:
-
[58]
The Eleventh International Conference on Learning Representations (ICLR) , year=
Transformers Learn Shortcuts to Automata , author=. The Eleventh International Conference on Learning Representations (ICLR) , year=
-
[59]
Advances in Neural Information Processing Systems (NeurIPS) , year=
Are Transformers universal approximators of sequence-to-sequence functions? , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=
-
[60]
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) , year=
On the Ability and Limitations of Transformers to Recognize Formal Languages , author=. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) , year=
2020
-
[61]
The Twelfth International Conference on Learning Representations (ICLR) , year=
The Expressive Power of Transformers with Chain of Thought , author=. The Twelfth International Conference on Learning Representations (ICLR) , year=
-
[62]
The Impact of Positional Encoding on Length Generalization in
Kazemnejad, Amirhossein and Padhi, Inkit and Ramamurthy, Karthikeyan Natesan and Das, Payel and Reddy, Siva , booktitle=. The Impact of Positional Encoding on Length Generalization in
-
[63]
GLU Variants Improve Transformer
Glu variants improve transformer , author=. arXiv preprint arXiv:2002.05202 , year=
work page internal anchor Pith review Pith/arXiv arXiv 2002
-
[64]
2009 , publisher=
Neural network learning: Theoretical foundations , author=. 2009 , publisher=
2009
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.