Learning Theory of Transformers: Local-to-Global Approximation via Softmax Partition of Unity
Recognition: 2 theorem links
Pith reviewed 2026-05-12 02:33 UTC · model grok-4.3
The pith
A Transformer with two encoder blocks approximates α-Hölder functions uniformly using O(ε^{-d/α}) parameters via softmax partition of unity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper proves that a dense Transformer equipped with only two encoder blocks and standard single-hidden-layer point-wise feed-forward networks achieves a uniform ε-approximation error for α-Hölder continuous functions with α ∈ (0,1] using O(ε^{-d/α}) total parameters. The construction relies on the attention mechanism achieving spatial localization through affine transformations, allowing the softmax to aggregate local approximations into a global one.
What carries the argument
The softmax partition of unity, realized by configuring the attention weights through affine input transformations to localize and blend local approximations from the feed-forward networks.
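One way to make this mechanism concrete is a minimal 1-D numerical sketch (not the paper's construction: the piecewise-constant local approximants, the grid of centers, and the sharpness M below are illustrative choices). Scores that are affine in x, pushed through a softmax, give a partition of unity that blends local values into a global approximation of a Hölder function.

```python
# Minimal sketch: softmax partition of unity blending local approximants.
# Assumed setup, not the paper's construction; constants are illustrative.
import numpy as np

alpha = 0.7                                   # Holder exponent of the target
f = lambda x: np.abs(x - 0.4) ** alpha        # an alpha-Holder target on [0, 1]

K = 64                                        # number of local cells / centers
centers = (np.arange(K) + 0.5) / K
M = 4.0 * K ** 2                              # sharpness of the localization

x = np.linspace(0.0, 1.0, 2001)
# Scores affine in x: 2*M*<x, c_i> - M*c_i^2.  Their softmax equals a
# normalized Gaussian bump centered at c_i, since the x^2 term cancels.
scores = 2.0 * M * np.outer(x, centers) - M * centers ** 2
scores -= scores.max(axis=1, keepdims=True)   # numerical stability
beta = np.exp(scores)
beta /= beta.sum(axis=1, keepdims=True)       # softmax partition of unity

local_values = f(centers)                     # crude local approximants: constants
approx = beta @ local_values                  # global aggregation
print("sup-norm error:", np.max(np.abs(approx - f(x))))
```

Refining the grid and the local approximants (the paper uses small feed-forward networks rather than constants) drives the sup-norm error down at the rate the core claim describes.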
If this is right
- The empirical risk minimizer over this Transformer class attains a generalization error bound of O(n^{-2α/(2α+d)} log n) for training set size n (a heuristic balancing sketch follows this list).
- The approximation and generalization results hold for target functions on compact Riemannian manifolds as well as the unit cube.
- The studied architecture is dense, shallow and wide, employing softmax activation and sinusoidal positional encodings similar to practical Transformers.
- These rates are near minimax-optimal, matching the lower bounds up to a logarithmic factor.
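A heuristic bias-variance balance (our sketch, not the paper's proof) shows how the O(ε^{-d/α}) parameter count translates into the stated rate: with N ≈ ε^{-d/α} parameters the squared approximation error is ε², the estimation error of the empirical risk minimizer scales like N log n / n, and balancing the two terms fixes the exponent.

```latex
% Heuristic balance (sketch, not the paper's argument):
\[
  \mathrm{risk}(n)\;\lesssim\;
  \underbrace{\varepsilon^{2}}_{\text{approximation}}
  \;+\;
  \underbrace{\frac{\varepsilon^{-d/\alpha}\log n}{n}}_{\text{estimation}},
  \qquad
  \varepsilon \asymp n^{-\frac{\alpha}{2\alpha+d}}
  \;\Longrightarrow\;
  \mathrm{risk}(n) \;=\; \mathcal{O}\!\Big(n^{-\frac{2\alpha}{2\alpha+d}}\log n\Big).
\]
```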
Where Pith is reading between the lines
- The result indicates that increasing depth beyond two blocks may not be essential for achieving optimal approximation rates in this setting, with width and parameter scaling being more critical.
- It connects the expressive power of attention to classical partition of unity methods in approximation theory, suggesting similar constructions could work for other sequence models.
- Practitioners might achieve better sample efficiency by ensuring attention layers can form localized partitions, potentially through specific initialization or training regimes.
Load-bearing premise
The attention mechanism can be configured via affine transformations of the input to achieve the spatial localization required for the softmax to act as a partition of unity that aggregates the local approximations without introducing uncontrolled error.
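A short way to see why affine scores can deliver this localization (the notation matches the Lean-theorem mapping further down, which quotes the paper's Lemma 1): softmax is unchanged by adding the i-independent term M_g(‖x‖² − r_g²) to every score, so scores affine in x reproduce normalized Gaussian bumps centered at the cube centers, and sharpening M_g concentrates each weight on its own cube.

```latex
% Softmax invariance under i-independent shifts turns affine scores into
% Gaussian bumps (sketch; M_g, r_g, c_i as in the Lean-theorem mapping):
\[
  \beta_i(x)
  = \frac{\exp\!\big(2 M_g \langle x, c_i\rangle - M_g \|c_i\|^{2}\big)}
         {\sum_j \exp\!\big(2 M_g \langle x, c_j\rangle - M_g \|c_j\|^{2}\big)}
  = \frac{\exp\!\big(M_g (r_g^{2} - \|x - c_i\|^{2})\big)}
         {\sum_j \exp\!\big(M_g (r_g^{2} - \|x - c_j\|^{2})\big)},
  \qquad
  \sum_i \beta_i(x) = 1 .
\]
```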
What would settle it
An explicit counterexample or numerical test where the approximation error for a Hölder-α function remains larger than ε despite using O(ε^{-d/α}) parameters in a two-block Transformer with the described attention setup.
Original abstract
This paper investigates the learning theory of Transformer networks for regression tasks on the compact Euclidean domain $[0,1]^d$ and $d$-dimensional compact Riemannian manifolds. We propose a novel constructive approximation framework for Transformers that builds local approximations of the target function and aggregates them into a global approximation via softmax partition of unity. This approach leverages the attention mechanism to achieve spatial localization through affine transformations of the input. The softmax activation plays a crucial role in aggregating local approximations to a global output. From an approximation perspective, we prove that a dense Transformer equipped with only two encoder blocks and standard single-hidden-layer point-wise feed-forward networks can achieve a uniform $\varepsilon$-approximation error for $\alpha$-H\"older continuous functions with $\alpha \in (0,1]$ using $\mathcal{O}(\varepsilon^{-d/\alpha})$ total parameters. Building upon this approximation guarantee, we establish a near minimax-optimal generalization error bound of order $\mathcal{O}\big(n^{-\frac{2\alpha}{2\alpha+d}} \log n\big)$ for the empirical risk minimizer, where $n$ is the training data size. The Transformer architecture studied in this paper is dense, shallow and wide, and employs softmax activation and sinusoidal positional encodings, closely reflecting practical implementations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that a dense Transformer with exactly two encoder blocks and single-hidden-layer pointwise FFNs can achieve uniform ε-approximation of any α-Hölder function (α ∈ (0,1]) on [0,1]^d or compact Riemannian manifolds using O(ε^{-d/α}) total parameters. The proof proceeds by constructing local Hölder approximants on a partition of small cubes and using the first encoder block to realize a softmax partition of unity via affine input transformations, with the second block performing the aggregation; a near-minimax generalization bound O(n^{-2α/(2α+d)} log n) for the ERM is then derived from this approximation guarantee.
Significance. If the localization and error-control arguments close, the result would be significant: it supplies a constructive, parameter-optimal approximation theory for a shallow, practical Transformer architecture (softmax attention plus sinusoidal encodings) that matches the classical rate for neural networks while explaining how attention can be configured to perform spatial localization and global aggregation. The explicit two-block construction and the link to empirical-risk minimization are strengths.
major comments (2)
- [§3.2] Localization via attention: the claim that affine maps of the input coordinates can be chosen so that the resulting softmax weights form a partition of unity supported essentially on each target cube requires an explicit uniform tail bound. Without a quantitative estimate showing that the total leakage mass outside each cube is at most ε/2 (uniformly over all cubes and all functions in the Hölder ball), the aggregation step in the proof of the main approximation theorem cannot be guaranteed to preserve the O(ε) error; any polynomial or sub-exponential tail would inflate the required number of local approximants or the width of the FFNs. (A numerical illustration of this leakage follows this list.)
- [Theorem 4.1] Statement and proof: the uniform approximation error is stated to be O(ε) with the cited parameter count, yet the derivation appears to treat the softmax output as exactly localized once the attention scores are set by affine transforms. An explicit discretization or truncation argument that makes the residual overlap vanish at rate ε (independent of d and α) is needed; otherwise the O(ε^{-d/α}) bound is not closed.
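The first concern is quantitative, so a rough numerical probe helps frame it. This is an assumed 1-D Gaussian-bump realization of the softmax weights, not the paper's proof: it measures the worst-case mass landing on cells more than one cell-width away as the score sharpness grows, which is exactly the kind of uniform tail the report asks to see bounded.

```python
# Assumed 1-D Gaussian-bump realization of the softmax weights (illustrative,
# not the paper's construction): worst-case mass on non-adjacent cells.
import numpy as np

K = 32                                                   # number of cells on [0, 1]
centers = (np.arange(K) + 0.5) / K
x = np.linspace(0.0, 1.0, 5001)
far = np.abs(x[:, None] - centers[None, :]) > 1.0 / K    # non-adjacent cells

for M in [K ** 2, 4 * K ** 2, 16 * K ** 2, 64 * K ** 2]:
    scores = 2.0 * M * np.outer(x, centers) - M * centers ** 2
    scores -= scores.max(axis=1, keepdims=True)          # numerical stability
    beta = np.exp(scores)
    beta /= beta.sum(axis=1, keepdims=True)
    leakage = np.max(np.sum(beta * far, axis=1))         # worst case over x
    print(f"M = {M:6d}: max leakage to non-adjacent cells = {leakage:.2e}")
```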
minor comments (3)
- [§2] The definition of 'dense' Transformer (width scaling, exact embedding dimension, and how positional encodings are added) should be stated once in §2 before being used in the construction.
- [§3.1] Notation for the Hölder ball and the local approximants (e.g., the constant C_α in the local polynomial or spline approximants) is introduced inconsistently between the abstract and §3.1; a single global constant should be fixed.
- [Figure 1] Figure 1 (schematic of the two-block architecture) would benefit from explicit labels on the attention-score computation and the cube partition to match the proof steps.
Simulated Author's Rebuttal
We thank the referee for the insightful comments on the localization and approximation arguments in our paper. We will revise the manuscript to incorporate explicit quantitative bounds on the softmax tails and a truncation argument in the proof, which can be done without changing the stated rates. This addresses the concerns while preserving the main contributions.
Point-by-point responses
Referee: [§3.2] the claim that affine maps of the input coordinates can be chosen so that the resulting softmax weights form a partition of unity supported essentially on each target cube requires an explicit uniform tail bound. Without a quantitative estimate showing that the total leakage mass outside each cube is at most ε/2 (uniformly over all cubes and all functions in the Hölder ball), the aggregation step in the proof of the main approximation theorem cannot be guaranteed to preserve the O(ε) error; any polynomial or sub-exponential tail would inflate the required number of local approximants or the width of the FFNs.
Authors: We agree that an explicit bound is needed for completeness. The revised manuscript will include Lemma 3.2 providing a uniform tail bound: by scaling the affine transformations with a factor of O(log(1/ε)), the softmax leakage outside each cube is bounded by ε/2 uniformly over all cubes and Hölder functions. This scaling increases the attention parameters by a logarithmic factor, which is absorbed into the O(ε^{-d/α}) bound since the dominant term comes from the number of local approximants and FFN widths. Because the leakage decays exponentially in this scaling factor, the parameter count is not inflated. revision: yes
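A back-of-envelope version of the claimed bound (our δ and constants, chosen for illustration): if every cube at which leakage must be controlled has its center farther from x than x's own center c_i by a squared-distance gap of at least δ, then

```latex
% With squared-distance gap \delta to the competing centers, leakage decays
% exponentially in the scaling M_g (our sketch of the authors' claim):
\[
  \sum_{j:\,\|x-c_j\|^{2} \,\ge\, \|x-c_i\|^{2}+\delta} \beta_j(x)
  \;\le\; K \, e^{-M_g \delta}
  \;\le\; \frac{\varepsilon}{2}
  \quad\text{once}\quad
  M_g \;\ge\; \frac{1}{\delta}\log\!\frac{2K}{\varepsilon}
  \;=\; \mathcal{O}\!\big(\log(1/\varepsilon)\big),
\]
```

with the last equality using K = O(ε^{-d/α}). Boundary points, where the gap to the adjacent cube vanishes, would need the separate observation that neighboring local approximants remain accurate there.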
Referee: [Theorem 4.1] the uniform approximation error is stated to be O(ε) with the cited parameter count, yet the derivation appears to treat the softmax output as exactly localized once the attention scores are set by affine transforms. An explicit discretization or truncation argument that makes the residual overlap vanish at rate ε (independent of d and α) is needed; otherwise the O(ε^{-d/α}) bound is not closed.
Authors: We acknowledge the need for an explicit argument. In the proof of Theorem 4.1, we will add a truncation step: discard softmax weights smaller than ε/(2K) where K = O(ε^{-d/α}) is the number of cubes, and bound the error from the discarded parts using the uniform boundedness of the local approximants (≤ C from Hölder norm). The resulting overlap error is ≤ ε/2, with the constant independent of d and α (d and α affect only K, already in the parameter count). This closes the O(ε) approximation with the given parameters. revision: yes
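A hedged reconstruction of the discarded-mass bound (the threshold is written with the constant C explicit, slightly differently from the authors' stated ε/(2K)): with K cubes, local approximants uniformly bounded by C, and threshold τ = ε/(2CK),

```latex
% Contribution of the discarded (sub-threshold) softmax weights:
\[
  \Big|\sum_{i:\,\beta_i(x) < \tau} \beta_i(x)\, f_i(x)\Big|
  \;\le\; K \,\tau\, C
  \;=\; \frac{\varepsilon}{2}
  \qquad \text{uniformly in } x,
\]
```

so the truncation contributes at most ε/2, consistent with the authors' accounting of the overall O(ε) error.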
Circularity Check
No circularity; derivation combines classical Hölder approximation with explicit constructive realization of softmax partition of unity
full rationale
The paper constructs local approximants on cubes using standard single-hidden-layer networks (drawing on classical approximation theory for α-Hölder functions) and realizes the aggregation step by configuring the first encoder block's attention scores via affine input maps so that softmax produces the desired weights. The global uniform error bound and the subsequent generalization rate O(n^{-2α/(2α+d)} log n) are then obtained from standard empirical-risk-minimization arguments that do not refer back to any fitted parameter or self-referential normalization inside the paper. No step equates a claimed prediction to a quantity defined by the same construction, and no load-bearing premise rests solely on a self-citation whose content is itself unverified.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The target function belongs to the α-Hölder class on the compact domain or manifold.
- domain assumption: The attention mechanism can realize arbitrary affine transformations that localize each head to a desired region.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · status: unclear · mapped to β_i(x) = exp(M_g(r_g² − ‖x−c_i‖²)) / ∑_j exp(M_g(r_g² − ‖x−c_j‖²)) = exp(2M_g⟨x, c_i⟩ − M_g‖c_i‖²) / ∑_j exp(2M_g⟨x, c_j⟩ − M_g‖c_j‖²) (Lemma 1, Eq. 1–3); a two-block Transformer realizes this via affine Q/K/V maps (Theorem 1 proof sketch)
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · status: unclear · mapped to the O(ε^{-d/α}) parameter count and the n^{-2α/(2α+d)} log n generalization rate (Theorems 1–4)