Tight Sample Complexity of Transformers

Chenxiao Yang; Nathan Srebro; Zhiyuan Li

arxiv: 2606.09731 · v1 · pith:XYKQSRSCnew · submitted 2026-06-08 · 💻 cs.LG

Tight Sample Complexity of Transformers

Chenxiao Yang , Nathan Srebro , Zhiyuan Li This is my paper

Pith reviewed 2026-06-27 17:31 UTC · model grok-4.3

classification 💻 cs.LG

keywords transformersVC dimensionsample complexitychain-of-thoughtlearning theoryattention mechanismsgeneralization bounds

0 comments

The pith

Depth-L transformers with W parameters have VC dimension O(L W log(T W)).

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes tight upper and lower bounds on the VC dimension of transformers. For depth L and W parameters on sequences of length T, this dimension is bounded by O(L W log(T W)) from above and Ω(L W log(T W / L)) from below. These bounds directly imply the number of examples needed to learn the model. The same tight characterization applies to chain-of-thought learning, where using the full chain allows sample complexity O(L W log((T + T') W)). This matters for understanding how much data transformers require to generalize beyond training examples.

Core claim

We tightly characterize the VC dimension of depth-L Transformers with a total of W parameters, mapping an input sequence of length T to a single output, establishing an upper bound of O(L W log (T W)) and a nearly matching lower bound of Ω(L W log (T W / L)). We further tightly characterize the sample complexity of chain-of-thought learning using such a Transformer, showing teacher forcing learns with sample complexity O(L W log ((T+T') W)) and that any learning rule that uses chain-of-thought data requires at least Ω(L W log ((T+T') W / L)) examples, where T is the input length and T' is the number of autoregressive steps.

What carries the argument

VC dimension of the depth-L transformer hypothesis class with W parameters on length-T inputs

If this is right

Teacher forcing achieves sample complexity O(L W log((T+T')W)) for chain-of-thought.
Any learning rule using chain-of-thought data requires Ω(L W log((T+T')W / L)) examples.
The VC dimension scales with L W log(T W), determining generalization guarantees.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The logarithmic dependence on T suggests that longer sequences do not increase sample needs dramatically.
Similar bounds might hold for variants of transformers if the parameter counting is analogous.
Empirical tests on small models could check if observed generalization matches the predicted sample complexity.

Load-bearing premise

The class of functions realized by the transformers allows direct application of VC dimension bounds without further restrictions on weights or positional encodings.

What would settle it

Observing a number of distinct labelings by such transformers on T-length sequences that exceeds O(L W log(T W)) or is less than Ω(L W log(T W / L)) would contradict the bounds.

Figures

Figures reproduced from arXiv: 2606.09731 by Chenxiao Yang, Nathan Srebro, Zhiyuan Li.

**Figure 1.** Figure 1: Recursive retrieval gadget for sample (j, ℓ, i). The context positions enumerate all B-bit label configurations, while the query register r0 stores an encoded list of target addresses (sj,1, . . . , sj,L). At layer k, hard attention selects the context position t ∗ k nearest to the current register, the value head reads its i-th bit, and the FFN shifts the register to expose the next address. A gate output… view at source ↗

**Figure 2.** Figure 2: Top (Trace Shattering Dimension TSdim): for b = 1 the entire trace matches the groundtruth; otherwise some tokens differ. Bottom (Answer Shattering Dimension ASdim): for each input, all generators share the same thought prefix and differ only in the final answer. Main Result. Since answer shattering is a stronger requirement (varying the final token while keeping the prefix fixed necessarily varies the fu… view at source ↗

read the original abstract

We tightly characterize the VC dimension of depth-$L$ Transformers with a total of $W$ parameters, mapping an input sequence of length $T$ to a single output, establishing an upper bound of $O(L W \log (T W))$ and a nearly matching lower bound of $\Omega(L W \log (T W / L))$. We further tightly characterize the sample complexity of chain-of-thought learning using such a Transformer, showing teacher forcing (i.e. selecting a predictor consistent with the entire chain-of-thought on training data) learns with sample complexity $O\left(L W \log \left(\left(T+T^{\prime}\right) W\right)\right)$ and that any learning rule that uses chain-of-thought data requires at least $\Omega\left(L W \log \left(\left(T+T^{\prime}\right) W / L\right)\right)$ examples, where $T$ is the input length and $T^{\prime}$ is the number of autoregressive steps.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives the first nearly tight VC-dimension bounds for transformers and extends them to chain-of-thought, but the formalization of the hypothesis class needs close checking.

read the letter

The main thing to know is that this paper supplies the first nearly matching upper and lower bounds on VC dimension for depth-L transformers with W parameters on length-T sequences: O(L W log(T W)) above and Ω(L W log(T W / L)) below. They carry the same style of bound over to the sample complexity of chain-of-thought learning under teacher forcing.

That is new. Earlier work on neural-net complexity gave looser or one-sided results; here both directions sit within a log factor of each other, and the CoT extension is a direct, relevant addition.

The paper does the counting from the parameter space without obvious circularity, and the rates scale sensibly with depth, width, and sequence length. That is useful for anyone who wants concrete sample-complexity numbers for sequence models.

The soft spot is exactly the one flagged in the stress test. Transformers compute with softmax over T positions and learned or fixed positional encodings; the input domain is real vectors or tokens. Standard Sauer-Shelah or covering arguments do not apply automatically to a class with transcendental functions and no explicit weight bounds. If the upper bound proof discretizes or restricts the class in a way that changes the growth function, or if the lower-bound construction cannot be realized inside the stated architecture, the claimed tightness slips. The CoT bounds inherit the same modeling choice. Without the full proofs it is impossible to tell how tightly they closed those gaps.

This is for learning theorists who track architecture-specific generalization. A reader who needs benchmark rates for data requirements in transformers will find the numbers worth looking at, even if the constants need later tightening. It deserves a serious referee because the question is central, the bounds are sharp on paper, and the only way to know whether the formalization holds is to have experts examine the details.

Referee Report

3 major / 1 minor

Summary. The paper claims to tightly characterize the VC dimension of depth-L Transformers with total W parameters (mapping input sequence of length T to a single output) via an upper bound of O(L W log (T W)) and a nearly matching lower bound of Ω(L W log (T W / L)). It further claims tight sample-complexity results for chain-of-thought learning, with teacher forcing achieving O(L W log ((T + T') W)) and any rule using such data requiring at least Ω(L W log ((T + T') W / L)) examples.

Significance. If the modeling assumptions hold and the counting arguments are valid, the results would supply the first nearly tight VC-dimension and sample-complexity characterizations for transformers, a notable contribution to the theoretical analysis of modern architectures.

major comments (3)

[Model definition / hypothesis class] The central claims rest on the hypothesis class being defined precisely enough for standard Sauer-Shelah or covering-number arguments to apply directly. The manuscript must specify the exact input domain (real vectors vs. discrete tokens), weight ranges, normalization, and whether positional encodings are learned or fixed, because softmax over T positions is transcendental and can alter the growth function.
[VC-dimension upper and lower bounds] The upper-bound derivation O(L W log (T W)) and the matching lower-bound construction Ω(L W log (T W / L)) must be shown to remain valid once the attention mechanism and any restrictions on the parameter space are fully stated; otherwise the claimed tightness may not hold inside the stated architecture.
[Chain-of-thought sample complexity] The chain-of-thought extension inherits the same modeling assumptions; the sample-complexity statements O(L W log ((T + T') W)) and Ω(L W log ((T + T') W / L)) therefore require the same explicit formalization of the autoregressive steps and teacher-forcing predictor class.

minor comments (1)

[Notation and model parameters] Clarify how the total parameter count W is computed across all layers and heads, and whether it includes positional-encoding parameters.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful and constructive report. The comments highlight the need for greater explicitness in the model definition and proof details. We address each point below and will revise the manuscript to incorporate the requested clarifications while preserving the core technical contributions.

read point-by-point responses

Referee: [Model definition / hypothesis class] The central claims rest on the hypothesis class being defined precisely enough for standard Sauer-Shelah or covering-number arguments to apply directly. The manuscript must specify the exact input domain (real vectors vs. discrete tokens), weight ranges, normalization, and whether positional encodings are learned or fixed, because softmax over T positions is transcendental and can alter the growth function.

Authors: We agree that explicit specification is required for the arguments to be fully rigorous. The manuscript models inputs as sequences of discrete tokens drawn from a finite vocabulary, with real-valued weights whose total count is W (no artificial bounds beyond the parameter count), fixed sinusoidal positional encodings, and standard layer normalization. The upper bound proceeds via a covering-number argument that discretizes the range of the softmax (which is Lipschitz on bounded domains) to obtain a finite growth function; the lower bound is combinatorial and independent of the transcendental character. In revision we will add a dedicated model-definition subsection stating these choices verbatim and including a short lemma justifying the discretization step for the covering numbers. revision: yes
Referee: [VC-dimension upper and lower bounds] The upper-bound derivation O(L W log (T W)) and the matching lower-bound construction Ω(L W log (T W / L)) must be shown to remain valid once the attention mechanism and any restrictions on the parameter space are fully stated; otherwise the claimed tightness may not hold inside the stated architecture.

Authors: The upper bound counts the number of distinct sign patterns realizable by composing L layers, each with at most W/L parameters, using the fact that each attention head and feed-forward block induces a function whose shattering capacity is controlled by its parameter count; the argument already incorporates the standard multi-head attention mechanism. The lower bound is realized by a direct embedding of a hard VC-dimension instance into the weight matrices of a depth-L transformer with the same attention structure. Once the model section is expanded as described above, both proofs will be cross-referenced to the precise architecture, confirming that no additional restrictions invalidate the stated bounds. We will insert these cross-references and a short verification paragraph in the revision. revision: yes
Referee: [Chain-of-thought sample complexity] The chain-of-thought extension inherits the same modeling assumptions; the sample-complexity statements O(L W log ((T + T') W)) and Ω(L W log ((T + T') W / L)) therefore require the same explicit formalization of the autoregressive steps and teacher-forcing predictor class.

Authors: The chain-of-thought results are obtained by viewing the teacher-forcing predictor as a single transformer that maps an extended sequence of length T+T' to the full output chain; the VC-dimension bounds therefore transfer directly. In revision we will add a formal definition of the teacher-forcing hypothesis class (the set of all depth-L transformers that are consistent with the observed chain on the training data) and restate the sample-complexity claims as immediate corollaries of the VC-dimension theorem already proved for the extended length. This makes the inheritance explicit without altering the numerical bounds. revision: yes

Circularity Check

0 steps flagged

No circularity: VC bounds from first-principles parameter counting

full rationale

The paper derives upper and lower bounds on VC dimension and sample complexity directly from the total parameter count W, depth L, and sequence lengths T/T' via standard Sauer-Shelah-style counting on the hypothesis class. No equations reduce a claimed prediction to a fitted quantity by construction, no self-citation chains carry the central argument, and no ansatz or uniqueness theorem is imported from prior author work. The derivation is self-contained against external VC-dimension theory and does not rely on the target result to define its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The derivation relies on standard results from statistical learning theory applied to a transformer hypothesis class whose precise definition is not supplied in the abstract.

axioms (1)

domain assumption Standard VC-dimension theory applies directly to the class of functions realized by depth-L transformers with W parameters
The upper and lower bounds are obtained by counting the shattering capacity of this class.

pith-pipeline@v0.9.1-grok · 5698 in / 1334 out tokens · 27371 ms · 2026-06-27T17:31:39.522851+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

64 extracted references · 17 canonical work pages · 7 internal anchors

[1]

Transactions of the Association for Computational Linguistics , volume=

Saturated transformers are constant-depth threshold circuits , author=. Transactions of the Association for Computational Linguistics , volume=. 2022 , publisher=

2022
[2]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Average-Hard Attention Transformers are Constant-Depth Uniform Threshold Circuits , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
[3]

Journal of Machine Learning Research , volume=

Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks , author=. Journal of Machine Learning Research , volume=
[4]

IEEE transactions on electronic computers , volume=

Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition , author=. IEEE transactions on electronic computers , volume=. 1965 , publisher=

1965
[5]

Neural computation , volume=

What size net gives valid generalization? , author=. Neural computation , volume=. 1989 , publisher=

1989
[6]

Theory of Probability & Its Applications , volume=

On the uniform convergence of relative frequencies of events to their probabilities , author=. Theory of Probability & Its Applications , volume=. 1971 , publisher=

1971
[7]

Neural computation , volume=

Almost linear VC dimension bounds for piecewise polynomial networks , author=. Neural computation , volume=. 1998 , publisher=

1998
[8]

Pattern Recognition Letters , volume=

On the VC-dimension of depth-four threshold circuits , author=. Pattern Recognition Letters , volume=. 1993 , publisher=

1993
[9]

Proceedings of the 7th Annual Conference on Computational Learning Theory , pages=

VC dimension in circuit complexity , author=. Proceedings of the 7th Annual Conference on Computational Learning Theory , pages=
[10]

Neural Computation , volume=

Neural nets with superlinear VC-dimension , author=. Neural Computation , volume=. 1994 , publisher=

1994
[11]

Advances in neural information processing systems , volume=

On the number of linear regions of deep neural networks , author=. Advances in neural information processing systems , volume=
[12]

2025 , eprint=

A Theory of Learning with Autoregressive Chain of Thought , author=. 2025 , eprint=

2025
[13]

Bell system technical journal , volume=

Prediction and entropy of printed English , author=. Bell system technical journal , volume=. 1951 , publisher=

1951
[14]

Neural computation , volume=

A learning algorithm for continually running fully recurrent neural networks , author=. Neural computation , volume=. 1989 , publisher=

1989
[15]

Advances in neural information processing systems , volume=

Attention is all you need , author=. Advances in neural information processing systems , volume=
[16]

Advances in neural information processing systems , volume=

Language models are few-shot learners , author=. Advances in neural information processing systems , volume=
[17]

Advances in neural information processing systems , volume=

Chain-of-thought prompting elicits reasoning in large language models , author=. Advances in neural information processing systems , volume=
[18]

Neural computation , volume=

Long short-term memory , author=. Neural computation , volume=. 1997 , publisher=

1997
[19]

International Conference on Machine Learning , pages=

Inductive biases and variable creation in self-attention mechanisms , author=. International Conference on Machine Learning , pages=. 2022 , organization=

2022
[20]

arXiv preprint arXiv:2309.06979 , year=

Auto-regressive next-token predictors are universal learners , author=. arXiv preprint arXiv:2309.06979 , year=

work page arXiv
[21]

Sparks of Artificial General Intelligence: Early experiments with GPT-4

Sparks of artificial general intelligence: Early experiments with gpt-4 , author=. arXiv preprint arXiv:2303.12712 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[22]

OpenAI blog , volume=

Language models are unsupervised multitask learners , author=. OpenAI blog , volume=
[23]

GPT-4 Technical Report

Gpt-4 technical report , author=. arXiv preprint arXiv:2303.08774 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[24]

LLaMA: Open and Efficient Foundation Language Models

Llama: Open and efficient foundation language models , author=. arXiv preprint arXiv:2302.13971 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[25]

Advances in Neural Information Processing Systems , volume=

Towards revealing the mystery behind chain of thought: a theoretical perspective , author=. Advances in Neural Information Processing Systems , volume=
[26]

Show Your Work: Scratchpads for Intermediate Computation with Language Models

Show your work: Scratchpads for intermediate computation with language models , author=. arXiv preprint arXiv:2112.00114 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[27]

arXiv preprint arXiv:2310.13088 , year=

Sequence Length Independent Norm-Based Generalization Bounds for Transformers , author=. arXiv preprint arXiv:2310.13088 , year=

work page arXiv
[28]

International Conference on Learning Representations , year=

Are Transformers universal approximators of sequence-to-sequence functions? , author=. International Conference on Learning Representations , year=
[29]

Attention is Turing-Complete , journal=

P. Attention is Turing-Complete , journal=. 2021 , volume=

2021
[30]

arXiv preprint arXiv:2301.13196 , year=

Looped transformers as programmable computers , author=. arXiv preprint arXiv:2301.13196 , year=

work page arXiv
[31]

Transactions of the Association for Computational Linguistics , volume=

Theoretical limitations of self-attention in neural sequence models , author=. Transactions of the Association for Computational Linguistics , volume=
[32]

Transactions of the Association for Computational Linguistics , volume=

Formal language recognition by hard attention transformers: Perspectives from circuit complexity , author=. Transactions of the Association for Computational Linguistics , volume=
[33]

International Conference on Machine Learning , pages=

Transformers learn in-context by gradient descent , author=. International Conference on Machine Learning , pages=. 2023 , organization=

2023
[34]

The Eleventh International Conference on Learning Representations , year=

What learning algorithm is in-context learning? Investigations with linear models , author=. The Eleventh International Conference on Learning Representations , year=
[35]

arXiv preprint arXiv:2306.04637 , year=

Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection , author=. arXiv preprint arXiv:2306.04637 , year=

work page arXiv
[36]

arXiv preprint arXiv:2308.16898 , year=

Transformers as support vector machines , author=. arXiv preprint arXiv:2308.16898 , year=

work page arXiv
[37]

Advances in Neural Information Processing Systems , volume=

Vision Transformers provably learn spatial structure , author=. Advances in Neural Information Processing Systems , volume=
[38]

arXiv preprint arXiv:2305.16380 , year=

Scan and Snap: Understanding Training Dynamics and Token Composition in 1-layer Transformer , author=. arXiv preprint arXiv:2305.16380 , year=

work page arXiv
[39]

arXiv preprint arXiv:2212.14852 , year=

An Analysis of Attention via the Lens of Exchangeability and Latent Variable Models , author=. arXiv preprint arXiv:2212.14852 , year=

work page arXiv
[40]

Advances in neural information processing systems , volume=

Spectrally-normalized margin bounds for neural networks , author=. Advances in neural information processing systems , volume=
[41]

A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks

A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks , author=. arXiv preprint arXiv:1707.09564 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[42]

Conference On Learning Theory , pages=

Size-independent sample complexity of neural networks , author=. Conference On Learning Theory , pages=. 2018 , organization=

2018
[43]

arXiv preprint arXiv:2310.07923 , year=

The expresssive power of transformers with chain of thought , author=. arXiv preprint arXiv:2310.07923 , year=

work page arXiv
[44]

2024 , eprint=

Chain of Thought Empowers Transformers to Solve Inherently Serial Problems , author=. 2024 , eprint=

2024
[45]

Layer Normalization

Layer normalization , author=. arXiv preprint arXiv:1607.06450 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[46]

1999 , publisher=

Neural Network Learning: Theoretical Foundations , author=. 1999 , publisher=

1999
[47]

Nature , volume=

Learning representations by back-propagating errors , author=. Nature , volume=. 1986 , publisher=

1986
[48]

arXiv preprint arXiv:2402.04084 , year=

Provably learning a multi-head attention layer , author=. arXiv preprint arXiv:2402.04084 , year=

work page arXiv
[49]

Communications of the ACM , volume=

A theory of the learnable , author=. Communications of the ACM , volume=. 1984 , publisher=

1984
[50]

Machine Learning , volume=

On learning sets and functions , author=. Machine Learning , volume=. 1989 , publisher=

1989
[51]

Learnability and the

Blumer, Anselm and Ehrenfeucht, Andr. Learnability and the. Journal of the Association for Computing Machinery , volume =. 1989 , publisher =

1989
[52]

Polynomial Bounds for

Karpinski, Marek and Macintyre, Angus , journal =. Polynomial Bounds for. 1997 , doi =

1997
[53]

arXiv preprint arXiv:2503.14337 , year=

Pencil: Long thoughts with short memory , author=. arXiv preprint arXiv:2503.14337 , year=

work page arXiv
[54]

Statistically Meaningful Approximation:

Wei, Colin and Chen, Yining and Ma, Tengyu , booktitle=. Statistically Meaningful Approximation:
[55]

7th International Conference on Learning Representations (ICLR) , year=

On the Turing Completeness of Modern NLP Models , author=. 7th International Conference on Learning Representations (ICLR) , year=
[56]

The Twelfth International Conference on Learning Representations (ICLR) , year=

Chain of Thought Empowers Transformers to Solve Inherently Serial Problems , author=. The Twelfth International Conference on Learning Representations (ICLR) , year=
[57]

The Parallelism Tradeoff:

Merrill, William and Sabharwal, Ashish , booktitle=. The Parallelism Tradeoff:
[58]

The Eleventh International Conference on Learning Representations (ICLR) , year=

Transformers Learn Shortcuts to Automata , author=. The Eleventh International Conference on Learning Representations (ICLR) , year=
[59]

Advances in Neural Information Processing Systems (NeurIPS) , year=

Are Transformers universal approximators of sequence-to-sequence functions? , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=
[60]

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) , year=

On the Ability and Limitations of Transformers to Recognize Formal Languages , author=. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) , year=

2020
[61]

The Twelfth International Conference on Learning Representations (ICLR) , year=

The Expressive Power of Transformers with Chain of Thought , author=. The Twelfth International Conference on Learning Representations (ICLR) , year=
[62]

The Impact of Positional Encoding on Length Generalization in

Kazemnejad, Amirhossein and Padhi, Inkit and Ramamurthy, Karthikeyan Natesan and Das, Payel and Reddy, Siva , booktitle=. The Impact of Positional Encoding on Length Generalization in
[63]

GLU Variants Improve Transformer

Glu variants improve transformer , author=. arXiv preprint arXiv:2002.05202 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2002
[64]

2009 , publisher=

Neural network learning: Theoretical foundations , author=. 2009 , publisher=

2009

[1] [1]

Transactions of the Association for Computational Linguistics , volume=

Saturated transformers are constant-depth threshold circuits , author=. Transactions of the Association for Computational Linguistics , volume=. 2022 , publisher=

2022

[2] [2]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Average-Hard Attention Transformers are Constant-Depth Uniform Threshold Circuits , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

[3] [3]

Journal of Machine Learning Research , volume=

Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks , author=. Journal of Machine Learning Research , volume=

[4] [4]

IEEE transactions on electronic computers , volume=

Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition , author=. IEEE transactions on electronic computers , volume=. 1965 , publisher=

1965

[5] [5]

Neural computation , volume=

What size net gives valid generalization? , author=. Neural computation , volume=. 1989 , publisher=

1989

[6] [6]

Theory of Probability & Its Applications , volume=

On the uniform convergence of relative frequencies of events to their probabilities , author=. Theory of Probability & Its Applications , volume=. 1971 , publisher=

1971

[7] [7]

Neural computation , volume=

Almost linear VC dimension bounds for piecewise polynomial networks , author=. Neural computation , volume=. 1998 , publisher=

1998

[8] [8]

Pattern Recognition Letters , volume=

On the VC-dimension of depth-four threshold circuits , author=. Pattern Recognition Letters , volume=. 1993 , publisher=

1993

[9] [9]

Proceedings of the 7th Annual Conference on Computational Learning Theory , pages=

VC dimension in circuit complexity , author=. Proceedings of the 7th Annual Conference on Computational Learning Theory , pages=

[10] [10]

Neural Computation , volume=

Neural nets with superlinear VC-dimension , author=. Neural Computation , volume=. 1994 , publisher=

1994

[11] [11]

Advances in neural information processing systems , volume=

On the number of linear regions of deep neural networks , author=. Advances in neural information processing systems , volume=

[12] [12]

2025 , eprint=

A Theory of Learning with Autoregressive Chain of Thought , author=. 2025 , eprint=

2025

[13] [13]

Bell system technical journal , volume=

Prediction and entropy of printed English , author=. Bell system technical journal , volume=. 1951 , publisher=

1951

[14] [14]

Neural computation , volume=

A learning algorithm for continually running fully recurrent neural networks , author=. Neural computation , volume=. 1989 , publisher=

1989

[15] [15]

Advances in neural information processing systems , volume=

Attention is all you need , author=. Advances in neural information processing systems , volume=

[16] [16]

Advances in neural information processing systems , volume=

Language models are few-shot learners , author=. Advances in neural information processing systems , volume=

[17] [17]

Advances in neural information processing systems , volume=

Chain-of-thought prompting elicits reasoning in large language models , author=. Advances in neural information processing systems , volume=

[18] [18]

Neural computation , volume=

Long short-term memory , author=. Neural computation , volume=. 1997 , publisher=

1997

[19] [19]

International Conference on Machine Learning , pages=

Inductive biases and variable creation in self-attention mechanisms , author=. International Conference on Machine Learning , pages=. 2022 , organization=

2022

[20] [20]

arXiv preprint arXiv:2309.06979 , year=

Auto-regressive next-token predictors are universal learners , author=. arXiv preprint arXiv:2309.06979 , year=

work page arXiv

[21] [21]

Sparks of Artificial General Intelligence: Early experiments with GPT-4

Sparks of artificial general intelligence: Early experiments with gpt-4 , author=. arXiv preprint arXiv:2303.12712 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[22] [22]

OpenAI blog , volume=

Language models are unsupervised multitask learners , author=. OpenAI blog , volume=

[23] [23]

GPT-4 Technical Report

Gpt-4 technical report , author=. arXiv preprint arXiv:2303.08774 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[24] [24]

LLaMA: Open and Efficient Foundation Language Models

Llama: Open and efficient foundation language models , author=. arXiv preprint arXiv:2302.13971 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[25] [25]

Advances in Neural Information Processing Systems , volume=

Towards revealing the mystery behind chain of thought: a theoretical perspective , author=. Advances in Neural Information Processing Systems , volume=

[26] [26]

Show Your Work: Scratchpads for Intermediate Computation with Language Models

Show your work: Scratchpads for intermediate computation with language models , author=. arXiv preprint arXiv:2112.00114 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[27] [27]

arXiv preprint arXiv:2310.13088 , year=

Sequence Length Independent Norm-Based Generalization Bounds for Transformers , author=. arXiv preprint arXiv:2310.13088 , year=

work page arXiv

[28] [28]

International Conference on Learning Representations , year=

Are Transformers universal approximators of sequence-to-sequence functions? , author=. International Conference on Learning Representations , year=

[29] [29]

Attention is Turing-Complete , journal=

P. Attention is Turing-Complete , journal=. 2021 , volume=

2021

[30] [30]

arXiv preprint arXiv:2301.13196 , year=

Looped transformers as programmable computers , author=. arXiv preprint arXiv:2301.13196 , year=

work page arXiv

[31] [31]

Transactions of the Association for Computational Linguistics , volume=

Theoretical limitations of self-attention in neural sequence models , author=. Transactions of the Association for Computational Linguistics , volume=

[32] [32]

Transactions of the Association for Computational Linguistics , volume=

Formal language recognition by hard attention transformers: Perspectives from circuit complexity , author=. Transactions of the Association for Computational Linguistics , volume=

[33] [33]

International Conference on Machine Learning , pages=

Transformers learn in-context by gradient descent , author=. International Conference on Machine Learning , pages=. 2023 , organization=

2023

[34] [34]

The Eleventh International Conference on Learning Representations , year=

What learning algorithm is in-context learning? Investigations with linear models , author=. The Eleventh International Conference on Learning Representations , year=

[35] [35]

arXiv preprint arXiv:2306.04637 , year=

Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection , author=. arXiv preprint arXiv:2306.04637 , year=

work page arXiv

[36] [36]

arXiv preprint arXiv:2308.16898 , year=

Transformers as support vector machines , author=. arXiv preprint arXiv:2308.16898 , year=

work page arXiv

[37] [37]

Advances in Neural Information Processing Systems , volume=

Vision Transformers provably learn spatial structure , author=. Advances in Neural Information Processing Systems , volume=

[38] [38]

arXiv preprint arXiv:2305.16380 , year=

Scan and Snap: Understanding Training Dynamics and Token Composition in 1-layer Transformer , author=. arXiv preprint arXiv:2305.16380 , year=

work page arXiv

[39] [39]

arXiv preprint arXiv:2212.14852 , year=

An Analysis of Attention via the Lens of Exchangeability and Latent Variable Models , author=. arXiv preprint arXiv:2212.14852 , year=

work page arXiv

[40] [40]

Advances in neural information processing systems , volume=

Spectrally-normalized margin bounds for neural networks , author=. Advances in neural information processing systems , volume=

[41] [41]

A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks

A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks , author=. arXiv preprint arXiv:1707.09564 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[42] [42]

Conference On Learning Theory , pages=

Size-independent sample complexity of neural networks , author=. Conference On Learning Theory , pages=. 2018 , organization=

2018

[43] [43]

arXiv preprint arXiv:2310.07923 , year=

The expresssive power of transformers with chain of thought , author=. arXiv preprint arXiv:2310.07923 , year=

work page arXiv

[44] [44]

2024 , eprint=

Chain of Thought Empowers Transformers to Solve Inherently Serial Problems , author=. 2024 , eprint=

2024

[45] [45]

Layer Normalization

Layer normalization , author=. arXiv preprint arXiv:1607.06450 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[46] [46]

1999 , publisher=

Neural Network Learning: Theoretical Foundations , author=. 1999 , publisher=

1999

[47] [47]

Nature , volume=

Learning representations by back-propagating errors , author=. Nature , volume=. 1986 , publisher=

1986

[48] [48]

arXiv preprint arXiv:2402.04084 , year=

Provably learning a multi-head attention layer , author=. arXiv preprint arXiv:2402.04084 , year=

work page arXiv

[49] [49]

Communications of the ACM , volume=

A theory of the learnable , author=. Communications of the ACM , volume=. 1984 , publisher=

1984

[50] [50]

Machine Learning , volume=

On learning sets and functions , author=. Machine Learning , volume=. 1989 , publisher=

1989

[51] [51]

Learnability and the

Blumer, Anselm and Ehrenfeucht, Andr. Learnability and the. Journal of the Association for Computing Machinery , volume =. 1989 , publisher =

1989

[52] [52]

Polynomial Bounds for

Karpinski, Marek and Macintyre, Angus , journal =. Polynomial Bounds for. 1997 , doi =

1997

[53] [53]

arXiv preprint arXiv:2503.14337 , year=

Pencil: Long thoughts with short memory , author=. arXiv preprint arXiv:2503.14337 , year=

work page arXiv

[54] [54]

Statistically Meaningful Approximation:

Wei, Colin and Chen, Yining and Ma, Tengyu , booktitle=. Statistically Meaningful Approximation:

[55] [55]

7th International Conference on Learning Representations (ICLR) , year=

On the Turing Completeness of Modern NLP Models , author=. 7th International Conference on Learning Representations (ICLR) , year=

[56] [56]

The Twelfth International Conference on Learning Representations (ICLR) , year=

Chain of Thought Empowers Transformers to Solve Inherently Serial Problems , author=. The Twelfth International Conference on Learning Representations (ICLR) , year=

[57] [57]

The Parallelism Tradeoff:

Merrill, William and Sabharwal, Ashish , booktitle=. The Parallelism Tradeoff:

[58] [58]

The Eleventh International Conference on Learning Representations (ICLR) , year=

Transformers Learn Shortcuts to Automata , author=. The Eleventh International Conference on Learning Representations (ICLR) , year=

[59] [59]

Advances in Neural Information Processing Systems (NeurIPS) , year=

Are Transformers universal approximators of sequence-to-sequence functions? , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=

[60] [60]

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) , year=

On the Ability and Limitations of Transformers to Recognize Formal Languages , author=. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) , year=

2020

[61] [61]

The Twelfth International Conference on Learning Representations (ICLR) , year=

The Expressive Power of Transformers with Chain of Thought , author=. The Twelfth International Conference on Learning Representations (ICLR) , year=

[62] [62]

The Impact of Positional Encoding on Length Generalization in

Kazemnejad, Amirhossein and Padhi, Inkit and Ramamurthy, Karthikeyan Natesan and Das, Payel and Reddy, Siva , booktitle=. The Impact of Positional Encoding on Length Generalization in

[63] [63]

GLU Variants Improve Transformer

Glu variants improve transformer , author=. arXiv preprint arXiv:2002.05202 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2002

[64] [64]

2009 , publisher=

Neural network learning: Theoretical foundations , author=. 2009 , publisher=

2009