Learning Theory of Transformers: Local-to-Global Approximation via Softmax Partition of Unity
Recognition: 2 theorem links
Pith reviewed 2026-05-12 02:33 UTC · model grok-4.3
The pith
A Transformer with two encoder blocks approximates α-Hölder functions uniformly using O(ε^{-d/α}) parameters via softmax partition of unity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper proves that a dense Transformer equipped with only two encoder blocks and standard single-hidden-layer point-wise feed-forward networks achieves a uniform ε-approximation error for α-Hölder continuous functions with α ∈ (0,1] using O(ε^{-d/α}) total parameters. The construction relies on the attention mechanism achieving spatial localization through affine transformations, allowing the softmax to aggregate local approximations into a global one.
What carries the argument
The softmax partition of unity, realized by configuring the attention weights through affine input transformations to localize and blend local approximations from the feed-forward networks.
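One way to make this mechanism concrete is a minimal 1-D numerical sketch (not the paper's construction: the piecewise-constant local approximants, the grid of centers, and the sharpness M below are illustrative choices). Scores that are affine in x, pushed through a softmax, give a partition of unity that blends local values into a global approximation of a Hölder function.

```python
# Minimal sketch: softmax partition of unity blending local approximants.
# Assumed setup, not the paper's construction; constants are illustrative.
import numpy as np

alpha = 0.7                                   # Holder exponent of the target
f = lambda x: np.abs(x - 0.4) ** alpha        # an alpha-Holder target on [0, 1]

K = 64                                        # number of local cells / centers
centers = (np.arange(K) + 0.5) / K
M = 4.0 * K ** 2                              # sharpness of the localization

x = np.linspace(0.0, 1.0, 2001)
# Scores affine in x: 2*M*<x, c_i> - M*c_i^2.  Their softmax equals a
# normalized Gaussian bump centered at c_i, since the x^2 term cancels.
scores = 2.0 * M * np.outer(x, centers) - M * centers ** 2
scores -= scores.max(axis=1, keepdims=True)   # numerical stability
beta = np.exp(scores)
beta /= beta.sum(axis=1, keepdims=True)       # softmax partition of unity

local_values = f(centers)                     # crude local approximants: constants
approx = beta @ local_values                  # global aggregation
print("sup-norm error:", np.max(np.abs(approx - f(x))))
```

Refining the grid and the local approximants (the paper uses small feed-forward networks rather than constants) drives the sup-norm error down at the rate the core claim describes.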
If this is right
- The empirical risk minimizer over this Transformer class attains a generalization error bound of O(n^{-2α/(2α+d)} log n) for training set size n (a heuristic balancing sketch follows this list).
- The approximation and generalization results hold for target functions on compact Riemannian manifolds as well as the unit cube.
- The studied architecture is dense, shallow and wide, employing softmax activation and sinusoidal positional encodings similar to practical Transformers.
- These rates are near minimax-optimal, matching the lower bounds up to a logarithmic factor.
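A heuristic bias-variance balance (our sketch, not the paper's proof) shows how the O(ε^{-d/α}) parameter count translates into the stated rate: with N ≈ ε^{-d/α} parameters the squared approximation error is ε², the estimation error of the empirical risk minimizer scales like N log n / n, and balancing the two terms fixes the exponent.

```latex
% Heuristic balance (sketch, not the paper's argument):
\[
  \mathrm{risk}(n)\;\lesssim\;
  \underbrace{\varepsilon^{2}}_{\text{approximation}}
  \;+\;
  \underbrace{\frac{\varepsilon^{-d/\alpha}\log n}{n}}_{\text{estimation}},
  \qquad
  \varepsilon \asymp n^{-\frac{\alpha}{2\alpha+d}}
  \;\Longrightarrow\;
  \mathrm{risk}(n) \;=\; \mathcal{O}\!\Big(n^{-\frac{2\alpha}{2\alpha+d}}\log n\Big).
\]
```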
Where Pith is reading between the lines
- The result indicates that increasing depth beyond two blocks may not be essential for achieving optimal approximation rates in this setting, with width and parameter scaling being more critical.
- It connects the expressive power of attention to classical partition of unity methods in approximation theory, suggesting similar constructions could work for other sequence models.
- Practitioners might achieve better sample efficiency by ensuring attention layers can form localized partitions, potentially through specific initialization or training regimes.
Load-bearing premise
The attention mechanism can be configured via affine transformations of the input to achieve the spatial localization required for the softmax to act as a partition of unity that aggregates the local approximations without introducing uncontrolled error.
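A short way to see why affine scores can deliver this localization (the notation matches the Lean-theorem mapping further down, which quotes the paper's Lemma 1): softmax is unchanged by adding the i-independent term M_g(‖x‖² − r_g²) to every score, so scores affine in x reproduce normalized Gaussian bumps centered at the cube centers, and sharpening M_g concentrates each weight on its own cube.

```latex
% Softmax invariance under i-independent shifts turns affine scores into
% Gaussian bumps (sketch; M_g, r_g, c_i as in the Lean-theorem mapping):
\[
  \beta_i(x)
  = \frac{\exp\!\big(2 M_g \langle x, c_i\rangle - M_g \|c_i\|^{2}\big)}
         {\sum_j \exp\!\big(2 M_g \langle x, c_j\rangle - M_g \|c_j\|^{2}\big)}
  = \frac{\exp\!\big(M_g (r_g^{2} - \|x - c_i\|^{2})\big)}
         {\sum_j \exp\!\big(M_g (r_g^{2} - \|x - c_j\|^{2})\big)},
  \qquad
  \sum_i \beta_i(x) = 1 .
\]
```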
What would settle it
An explicit counterexample or numerical test where the approximation error for a Hölder-α function remains larger than ε despite using O(ε^{-d/α}) parameters in a two-block Transformer with the described attention setup.
Original abstract
This paper investigates the learning theory of Transformer networks for regression tasks on the compact Euclidean domain $[0,1]^d$ and $d$-dimensional compact Riemannian manifolds. We propose a novel constructive approximation framework for Transformers that builds local approximations of the target function and aggregates them into a global approximation via softmax partition of unity. This approach leverages the attention mechanism to achieve spatial localization through affine transformations of the input. The softmax activation plays a crucial role in aggregating local approximations to a global output. From an approximation perspective, we prove that a dense Transformer equipped with only two encoder blocks and standard single-hidden-layer point-wise feed-forward networks can achieve a uniform $\varepsilon$-approximation error for $\alpha$-H\"older continuous functions with $\alpha \in (0,1]$ using $\mathcal{O}(\varepsilon^{-d/\alpha})$ total parameters. Building upon this approximation guarantee, we establish a near minimax-optimal generalization error bound of order $\mathcal{O}\big(n^{-\frac{2\alpha}{2\alpha+d}} \log n\big)$ for the empirical risk minimizer, where $n$ is the training data size. The Transformer architecture studied in this paper is dense, shallow and wide, and employs softmax activation and sinusoidal positional encodings, closely reflecting practical implementations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that a dense Transformer with exactly two encoder blocks and single-hidden-layer pointwise FFNs can achieve uniform ε-approximation of any α-Hölder function (α ∈ (0,1]) on [0,1]^d or compact Riemannian manifolds using O(ε^{-d/α}) total parameters. The proof proceeds by constructing local Hölder approximants on a partition of small cubes and using the first encoder block to realize a softmax partition of unity via affine input transformations, with the second block performing the aggregation; a near-minimax generalization bound O(n^{-2α/(2α+d)} log n) for the ERM is then derived from this approximation guarantee.
Significance. If the localization and error-control arguments close, the result would be significant: it supplies a constructive, parameter-optimal approximation theory for a shallow, practical Transformer architecture (softmax attention plus sinusoidal encodings) that matches the classical rate for neural networks while explaining how attention can be configured to perform spatial localization and global aggregation. The explicit two-block construction and the link to empirical-risk minimization are strengths.
major comments (2)
- [§3.2] Localization via attention: the claim that affine maps of the input coordinates can be chosen so that the resulting softmax weights form a partition of unity supported essentially on each target cube requires an explicit uniform tail bound. Without a quantitative estimate showing that the total leakage mass outside each cube is at most ε/2 (uniformly over all cubes and all functions in the Hölder ball), the aggregation step in the proof of the main approximation theorem cannot be guaranteed to preserve the O(ε) error; any polynomial or sub-exponential tail would inflate the required number of local approximants or the width of the FFNs. (A numerical illustration of this leakage follows this list.)
- [Theorem 4.1] Statement and proof: the uniform approximation error is stated to be O(ε) with the cited parameter count, yet the derivation appears to treat the softmax output as exactly localized once the attention scores are set by affine transforms. An explicit discretization or truncation argument that makes the residual overlap vanish at rate ε (independent of d and α) is needed; otherwise the O(ε^{-d/α}) bound is not closed.
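The first concern is quantitative, so a rough numerical probe helps frame it. This is an assumed 1-D Gaussian-bump realization of the softmax weights, not the paper's proof: it measures the worst-case mass landing on cells more than one cell-width away as the score sharpness grows, which is exactly the kind of uniform tail the report asks to see bounded.

```python
# Assumed 1-D Gaussian-bump realization of the softmax weights (illustrative,
# not the paper's construction): worst-case mass on non-adjacent cells.
import numpy as np

K = 32                                                   # number of cells on [0, 1]
centers = (np.arange(K) + 0.5) / K
x = np.linspace(0.0, 1.0, 5001)
far = np.abs(x[:, None] - centers[None, :]) > 1.0 / K    # non-adjacent cells

for M in [K ** 2, 4 * K ** 2, 16 * K ** 2, 64 * K ** 2]:
    scores = 2.0 * M * np.outer(x, centers) - M * centers ** 2
    scores -= scores.max(axis=1, keepdims=True)          # numerical stability
    beta = np.exp(scores)
    beta /= beta.sum(axis=1, keepdims=True)
    leakage = np.max(np.sum(beta * far, axis=1))         # worst case over x
    print(f"M = {M:6d}: max leakage to non-adjacent cells = {leakage:.2e}")
```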
minor comments (3)
- [§2] The definition of 'dense' Transformer (width scaling, exact embedding dimension, and how positional encodings are added) should be stated once in §2 before being used in the construction.
- [§3.1] Notation for the Hölder ball and the local approximants (e.g., the constant C_α in the local polynomial or spline approximants) is introduced inconsistently between the abstract and §3.1; a single global constant should be fixed.
- [Figure 1] Figure 1 (schematic of the two-block architecture) would benefit from explicit labels on the attention-score computation and the cube partition to match the proof steps.
Simulated Author's Rebuttal
We thank the referee for the insightful comments on the localization and approximation arguments in our paper. We will revise the manuscript to incorporate explicit quantitative bounds on the softmax tails and a truncation argument in the proof, which can be done without changing the stated rates. This addresses the concerns while preserving the main contributions.
Point-by-point responses
Referee: [§3.2] the claim that affine maps of the input coordinates can be chosen so that the resulting softmax weights form a partition of unity supported essentially on each target cube requires an explicit uniform tail bound. Without a quantitative estimate showing that the total leakage mass outside each cube is at most ε/2 (uniformly over all cubes and all functions in the Hölder ball), the aggregation step in the proof of the main approximation theorem cannot be guaranteed to preserve the O(ε) error; any polynomial or sub-exponential tail would inflate the required number of local approximants or the width of the FFNs.
Authors: We agree that an explicit bound is needed for completeness. The revised manuscript will include Lemma 3.2 providing a uniform tail bound: by scaling the affine transformations with a factor of O(log(1/ε)), the softmax leakage outside each cube is bounded by ε/2 uniformly over all cubes and Hölder functions. This scaling increases the attention parameters by a logarithmic factor, which is absorbed into the O(ε^{-d/α}) bound since the dominant term comes from the number of local approximants and FFN widths. Because the leakage decays exponentially in this scaling factor, the parameter count is not inflated. revision: yes
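A back-of-envelope version of the claimed bound (our δ and constants, chosen for illustration): if every cube at which leakage must be controlled has its center farther from x than x's own center c_i by a squared-distance gap of at least δ, then

```latex
% With squared-distance gap \delta to the competing centers, leakage decays
% exponentially in the scaling M_g (our sketch of the authors' claim):
\[
  \sum_{j:\,\|x-c_j\|^{2} \,\ge\, \|x-c_i\|^{2}+\delta} \beta_j(x)
  \;\le\; K \, e^{-M_g \delta}
  \;\le\; \frac{\varepsilon}{2}
  \quad\text{once}\quad
  M_g \;\ge\; \frac{1}{\delta}\log\!\frac{2K}{\varepsilon}
  \;=\; \mathcal{O}\!\big(\log(1/\varepsilon)\big),
\]
```

with the last equality using K = O(ε^{-d/α}). Boundary points, where the gap to the adjacent cube vanishes, would need the separate observation that neighboring local approximants remain accurate there.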
Referee: [Theorem 4.1] the uniform approximation error is stated to be O(ε) with the cited parameter count, yet the derivation appears to treat the softmax output as exactly localized once the attention scores are set by affine transforms. An explicit discretization or truncation argument that makes the residual overlap vanish at rate ε (independent of d and α) is needed; otherwise the O(ε^{-d/α}) bound is not closed.
Authors: We acknowledge the need for an explicit argument. In the proof of Theorem 4.1, we will add a truncation step: discard softmax weights smaller than ε/(2K) where K = O(ε^{-d/α}) is the number of cubes, and bound the error from the discarded parts using the uniform boundedness of the local approximants (≤ C from Hölder norm). The resulting overlap error is ≤ ε/2, with the constant independent of d and α (d and α affect only K, already in the parameter count). This closes the O(ε) approximation with the given parameters. revision: yes
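A hedged reconstruction of the discarded-mass bound (the threshold is written with the constant C explicit, slightly differently from the authors' stated ε/(2K)): with K cubes, local approximants uniformly bounded by C, and threshold τ = ε/(2CK),

```latex
% Contribution of the discarded (sub-threshold) softmax weights:
\[
  \Big|\sum_{i:\,\beta_i(x) < \tau} \beta_i(x)\, f_i(x)\Big|
  \;\le\; K \,\tau\, C
  \;=\; \frac{\varepsilon}{2}
  \qquad \text{uniformly in } x,
\]
```

so the truncation contributes at most ε/2, consistent with the authors' accounting of the overall O(ε) error.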
Circularity Check
No circularity; derivation combines classical Hölder approximation with explicit constructive realization of softmax partition of unity
full rationale
The paper constructs local approximants on cubes using standard single-hidden-layer networks (drawing on classical approximation theory for α-Hölder functions) and realizes the aggregation step by configuring the first encoder block's attention scores via affine input maps so that softmax produces the desired weights. The global uniform error bound and the subsequent generalization rate O(n^{-2α/(2α+d)} log n) are then obtained from standard empirical-risk-minimization arguments that do not refer back to any fitted parameter or self-referential normalization inside the paper. No step equates a claimed prediction to a quantity defined by the same construction, and no load-bearing premise rests solely on a self-citation whose content is itself unverified.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The target function belongs to the α-Hölder class on the compact domain or manifold.
- domain assumption: The attention mechanism can realize arbitrary affine transformations that localize each head to a desired region.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · status: unclear · mapped to β_i(x) = exp(M_g(r_g² − ‖x−c_i‖²)) / ∑_j exp(M_g(r_g² − ‖x−c_j‖²)) = exp(2M_g⟨x, c_i⟩ − M_g‖c_i‖²) / ∑_j exp(2M_g⟨x, c_j⟩ − M_g‖c_j‖²) (Lemma 1, Eq. 1–3); a two-block Transformer realizes this via affine Q/K/V maps (Theorem 1 proof sketch)
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · status: unclear · mapped to the O(ε^{-d/α}) parameter count and the n^{-2α/(2α+d)} log n generalization rate (Theorems 1–4)