pith. machine review for the scientific record.

arxiv: 2605.08811 · v1 · submitted 2026-05-09 · 📊 stat.ML · cs.LG

Recognition: 2 theorem links


Learning Theory of Transformers: Local-to-Global Approximation via Softmax Partition of Unity

Wenjing Liao, Zhongjie Shi

Pith reviewed 2026-05-12 02:33 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords Transformer networks · Approximation theory · Hölder continuous functions · Generalization error bounds · Attention mechanism · Partition of unity · Regression on manifolds

The pith

A Transformer with two encoder blocks approximates α-Hölder functions uniformly using O(ε^{-d/α}) parameters via softmax partition of unity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves that Transformers can approximate continuous functions on compact sets at rates matching the best known for neural networks. The key idea is to use the attention mechanism to build local approximations of the target at different points of the domain and then blend them, with the softmax function acting as a smooth partition of unity. With just two encoder blocks and simple feed-forward layers, the total number of parameters needed to get within ε of any α-Hölder function is on the order of ε^{-d/α}. This approximation result in turn implies that a model trained on data generalizes with an error that is nearly the best possible for any method. The analysis covers both Euclidean cubes and Riemannian manifolds, and the architecture stays close to those used in practice.

Core claim

The paper proves that a dense Transformer equipped with only two encoder blocks and standard single-hidden-layer point-wise feed-forward networks achieves a uniform ε-approximation error for α-Hölder continuous functions with α ∈ (0,1] using O(ε^{-d/α}) total parameters. The construction relies on the attention mechanism achieving spatial localization through affine transformations, allowing the softmax to aggregate local approximations into a global one.

What carries the argument

The softmax partition of unity, realized by configuring the attention weights through affine input transformations to localize and blend local approximations from the feed-forward networks.
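
A minimal numerical sketch of this mechanism (the 1-D grid, the scale β, and the score form are illustrative assumptions, not the paper's construction). The natural bump score −β(x − c_j)² expands as −βx² + 2βc_j x − βc_j²; the −βx² term is shared by every j and cancels in the softmax, so scores affine in x suffice:

```python
import numpy as np

# Grid of K cube centers on [0, 1]; beta sets how sharply softmax localizes.
K, beta = 10, 200.0
centers = (np.arange(K) + 0.5) / K            # c_j = (j + 1/2) / K

def pou_weights(x):
    """Softmax partition-of-unity weights from scores affine in x."""
    scores = 2.0 * beta * centers * x - beta * centers**2   # affine in x
    scores -= scores.max()                                  # numerical stability
    w = np.exp(scores)
    return w / w.sum()

w = pou_weights(0.37)
assert abs(w.sum() - 1.0) < 1e-12     # weights sum to one: a partition of unity
print("dominant cube:", w.argmax())                      # -> 3, the cube holding x
print("mass outside 3 nearest cubes:", 1.0 - np.sort(w)[-3:].sum())
```

Scaling β like K² log(1/ε) drives the off-cube mass below any target ε; the referee report below presses on exactly this quantitative point.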

If this is right

  • The empirical risk minimizer over this Transformer class attains a generalization error bound of O(n^{-2α/(2α+d)} log n) for training set size n.
  • The approximation and generalization results hold for target functions on compact Riemannian manifolds as well as the unit cube.
  • The studied architecture is dense, shallow and wide, employing softmax activation and sinusoidal positional encodings similar to practical Transformers.
  • These rates are near minimax-optimal, matching the lower bounds up to a logarithmic factor; a back-of-envelope balancing argument, sketched just after this list, shows how the parameter count drives the sample rate.
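
A hedged back-of-envelope version of that balancing (the N log N / n complexity term is a standard heuristic stand-in for the paper's covering-number bound, not its exact form):

```latex
\[
  \text{excess risk} \;\lesssim\;
  \underbrace{\varepsilon^{2}}_{\text{squared approximation error}}
  \;+\;
  \underbrace{\frac{N \log N}{n}}_{\text{estimation error}},
  \qquad N = \mathcal{O}\big(\varepsilon^{-d/\alpha}\big).
\]
% Balancing the two terms, eps^2 ~ eps^{-d/alpha} / n, gives
\[
  \varepsilon \asymp n^{-\frac{\alpha}{2\alpha+d}}
  \;\Longrightarrow\;
  \text{excess risk} \;=\; \mathcal{O}\big(n^{-\frac{2\alpha}{2\alpha+d}} \log n\big),
\]
% matching the minimax lower bound up to the logarithmic factor.
```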

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The result indicates that increasing depth beyond two blocks may not be essential for achieving optimal approximation rates in this setting, with width and parameter scaling being more critical.
  • It connects the expressive power of attention to classical partition of unity methods in approximation theory, suggesting similar constructions could work for other sequence models.
  • Practitioners might achieve better sample efficiency by ensuring attention layers can form localized partitions, potentially through specific initialization or training regimes.

Load-bearing premise

The attention mechanism can be configured via affine transformations of the input to achieve the spatial localization required for the softmax to act as a partition of unity that aggregates the local approximations without introducing uncontrolled error.

What would settle it

An explicit counterexample or numerical test where the approximation error for a Hölder-α function remains larger than ε despite using O(ε^{-d/α}) parameters in a two-block Transformer with the described attention setup.
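
A skeleton of such a test in one dimension (a simplifying assumption throughout: it probes the softmax-PoU approximant directly, with constant local pieces f(c_j), rather than a trained two-block Transformer). If the construction is sound, the sup error should track K^{-α}, i.e. ε with K = ε^{-1/α} patches; in d dimensions the grid needs K^d cubes, recovering the O(ε^{-d/α}) count:

```python
import numpy as np

alpha = 0.5
f = lambda x: np.abs(x - 0.5) ** alpha        # a Hölder-alpha test function

def sup_error(K, beta):
    """Sup-norm error of the softmax-PoU approximant with K local patches."""
    centers = (np.arange(K) + 0.5) / K
    xs = np.linspace(0.0, 1.0, 5001)
    scores = 2.0 * beta * np.outer(xs, centers) - beta * centers**2
    scores -= scores.max(axis=1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=1, keepdims=True)
    approx = w @ f(centers)                   # blend constant local approximants
    return np.abs(approx - f(xs)).max()

for K in (8, 32, 128):
    beta = 4.0 * K**2 * np.log(K)             # localization scale (an assumption)
    print(K, round(float(sup_error(K, beta)), 4), round(K**-alpha, 4))
```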

Figures

Figures reproduced from arXiv: 2605.08811 by Wenjing Liao, Zhongjie Shi.

Figure 1
Figure 1: Geometric illustration of the softmax partition of unity. [PITH_FULL_IMAGE:figures/full_fig_p006_1.png]
read the original abstract

This paper investigates the learning theory of Transformer networks for regression tasks on the compact Euclidean domain $[0,1]^d$ and $d$-dimensional compact Riemannian manifolds. We propose a novel constructive approximation framework for Transformers that builds local approximations of the target function and aggregates them into a global approximation via softmax partition of unity. This approach leverages the attention mechanism to achieve spatial localization through affine transformations of the input. The softmax activation plays a crucial role in aggregating local approximations to a global output. From an approximation perspective, we prove that a dense Transformer equipped with only two encoder blocks and standard single-hidden-layer point-wise feed-forward networks can achieve a uniform $\varepsilon$-approximation error for $\alpha$-H\"older continuous functions with $\alpha \in (0,1]$ using $\mathcal{O}(\varepsilon^{-d/\alpha})$ total parameters. Building upon this approximation guarantee, we establish a near minimax-optimal generalization error bound of order $\mathcal{O}\big(n^{-\frac{2\alpha}{2\alpha+d}} \log n\big)$ for the empirical risk minimizer, where $n$ is the training data size. The Transformer architecture studied in this paper is dense, shallow and wide, and employs softmax activation and sinusoidal positional encodings, closely reflecting practical implementations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript claims that a dense Transformer with exactly two encoder blocks and single-hidden-layer pointwise FFNs can achieve uniform ε-approximation of any α-Hölder function (α ∈ (0,1]) on [0,1]^d or compact Riemannian manifolds using O(ε^{-d/α}) total parameters. The proof proceeds by constructing local Hölder approximants on a partition of small cubes and using the first encoder block to realize a softmax partition of unity via affine input transformations, with the second block performing the aggregation; a near-minimax generalization bound O(n^{-2α/(2α+d)} log n) for the ERM is then derived from this approximation guarantee.

Significance. If the localization and error-control arguments close, the result would be significant: it supplies a constructive, parameter-optimal approximation theory for a shallow, practical Transformer architecture (softmax attention plus sinusoidal encodings) that matches the classical rate for neural networks while explaining how attention can be configured to perform spatial localization and global aggregation. The explicit two-block construction and the link to empirical-risk minimization are strengths.

major comments (2)
  1. [§3.2] Localization via attention: the claim that affine maps of the input coordinates can be chosen so that the resulting softmax weights form a partition of unity supported essentially on each target cube requires an explicit uniform tail bound. Without a quantitative estimate showing that the total leakage mass outside each cube is at most ε/2 (uniformly over all cubes and all functions in the Hölder ball), the aggregation step in the proof of the main approximation theorem cannot be guaranteed to preserve the O(ε) error; any polynomial or sub-exponential tail would inflate the required number of local approximants or the width of the FFNs.
  2. [Theorem 4.1] The theorem and its proof: the uniform approximation error is stated to be O(ε) with the cited parameter count, yet the derivation appears to treat the softmax output as exactly localized once the attention scores are set by affine transforms. An explicit discretization or truncation argument that makes the residual overlap vanish at rate ε (independent of d and α) is needed; otherwise the O(ε^{-d/α}) bound is not closed.
minor comments (3)
  1. [§2] The definition of 'dense' Transformer (width scaling, exact embedding dimension, and how positional encodings are added) should be stated once in §2 before being used in the construction.
  2. [§3.1] Notation for the Hölder ball and the local approximants (e.g., the constant C_α in the local polynomial or spline approximants) is introduced inconsistently between the abstract and §3.1; a single global constant should be fixed.
  3. [Figure 1] Figure 1 (schematic of the two-block architecture) would benefit from explicit labels on the attention-score computation and the cube partition to match the proof steps.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the insightful comments on the localization and approximation arguments in our paper. We will revise the manuscript to incorporate explicit quantitative bounds on the softmax tails and a truncation argument in the proof, which can be done without changing the stated rates. This addresses the concerns while preserving the main contributions.

read point-by-point responses
  1. Referee: [§3.2] the claim that affine maps of the input coordinates can be chosen so that the resulting softmax weights form a partition of unity supported essentially on each target cube requires an explicit uniform tail bound. Without a quantitative estimate showing that the total leakage mass outside each cube is at most ε/2 (uniformly over all cubes and all functions in the Hölder ball), the aggregation step in the proof of the main approximation theorem cannot be guaranteed to preserve the O(ε) error; any polynomial or sub-exponential tail would inflate the required number of local approximants or the width of the FFNs.

    Authors: We agree that an explicit bound is needed for completeness. The revised manuscript will include Lemma 3.2, a uniform tail bound: scaling the affine transformations by a factor of O(log(1/ε)) drives the softmax leakage outside each cube below ε/2, uniformly over all cubes and Hölder functions. This scaling increases the attention parameters by a logarithmic factor, which is absorbed into the O(ε^{-d/α}) bound since the dominant term comes from the number of local approximants and the FFN widths. Because the leakage decays exponentially in the scaling factor, the parameter count is not inflated. revision: yes
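
For concreteness, the shape of the estimate such a Lemma 3.2 would need to deliver, reconstructed here from the softmax form under an assumed one-dimensional grid of K cubes of width 1/K (our sketch, not the authors' statement):

```latex
% For x in cube j, a center at grid distance m = |i - j| >= 2 carries a
% score deficit of order beta * m^2 / K^2, so the far-field mass obeys
\[
  \sum_{|i-j| \ge 2} w_i(x)
  \;\le\; C\, e^{-c\,\beta / K^{2}}
  \;\le\; \frac{\varepsilon}{2}
  \qquad \text{once} \qquad
  \beta \;\gtrsim\; K^{2} \log\frac{1}{\varepsilon}.
\]
% Mass shared with the two adjacent cubes near a boundary needs no tail
% bound: by Hölder continuity the neighbouring local approximants differ
% by O(K^{-alpha}) on the overlap, so that leakage costs only O(eps).
```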

  2. Referee: [Theorem 4.1] the uniform approximation error is stated to be O(ε) with the cited parameter count, yet the derivation appears to treat the softmax output as exactly localized once the attention scores are set by affine transforms. An explicit discretization or truncation argument that makes the residual overlap vanish at rate ε (independent of d and α) is needed; otherwise the O(ε^{-d/α}) bound is not closed.

    Authors: We acknowledge the need for an explicit argument. In the proof of Theorem 4.1, we will add a truncation step: discard softmax weights smaller than ε/(2K) where K = O(ε^{-d/α}) is the number of cubes, and bound the error from the discarded parts using the uniform boundedness of the local approximants (≤ C from Hölder norm). The resulting overlap error is ≤ ε/2, with the constant independent of d and α (d and α affect only K, already in the parameter count). This closes the O(ε) approximation with the given parameters. revision: yes
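
The discarded-mass bound this response appeals to is one line; a sketch under its own assumptions, with S the kept index set and C the uniform bound on the local approximants g_j:

```latex
\[
  \Big| \sum_{j \notin S} w_j(x)\, g_j(x) \Big|
  \;\le\; \sum_{j \notin S} w_j(x)\, \|g_j\|_{\infty}
  \;\le\; K \cdot \frac{\varepsilon}{2K} \cdot C
  \;=\; \frac{C\varepsilon}{2},
  \qquad
  S = \Big\{\, j : w_j(x) \ge \tfrac{\varepsilon}{2K} \,\Big\}.
\]
```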

Circularity Check

0 steps flagged

No circularity; derivation combines classical Hölder approximation with explicit constructive realization of softmax partition of unity

full rationale

The paper constructs local approximants on cubes using standard single-hidden-layer networks (drawing on classical approximation theory for α-Hölder functions) and realizes the aggregation step by configuring the first encoder block's attention scores via affine input maps so that softmax produces the desired weights. The global uniform error bound and the subsequent generalization rate O(n^{-2α/(2α+d)} log n) are then obtained from standard empirical-risk-minimization arguments that do not refer back to any fitted parameter or self-referential normalization inside the paper. No step equates a claimed prediction to a quantity defined by the same construction, and no load-bearing premise rests solely on a self-citation whose content is itself unverified.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the mathematical fact that softmax can form a partition of unity and that attention layers can realize the required affine localizations; these are treated as properties of the architecture rather than derived from data.

axioms (2)
  • domain assumption The target function belongs to the α-Hölder class on the compact domain or manifold.
    This is the function class for which the uniform approximation and generalization rates are stated.
  • domain assumption The attention mechanism can realize arbitrary affine transformations that localize each head to a desired region.
    This enables the local approximations that are later aggregated by softmax; a minimal sketch in attention form follows.
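
A minimal sketch of the second axiom in attention form (the dimensions, weights, and landmark tokens below are illustrative assumptions, not the paper's construction): with queries built from the input token and keys from fixed landmark tokens, the pre-softmax score is affine in x, which is all the partition-of-unity construction needs.

```python
import numpy as np

beta, K = 200.0, 10
centers = (np.arange(K) + 0.5) / K   # landmark positions c_j (hypothetical)

# Augment the input with a constant coordinate so the bilinear score q.k
# can realize an affine function of x: x_tilde = (x, 1).
W_Q = np.array([[2.0 * beta, 0.0],
                [0.0,        1.0]])

def attention_scores(x):
    q = W_Q @ np.array([x, 1.0])                        # query from input token
    keys = np.stack([np.array([c, -beta * c**2]) for c in centers])
    return keys @ q                                     # = 2*beta*c_j*x - beta*c_j**2

s = attention_scores(0.37)
w = np.exp(s - s.max()); w /= w.sum()                   # softmax over landmarks
print(w.argmax())                                       # -> 3, the cube containing x
```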

pith-pipeline@v0.9.0 · 5523 in / 1534 out tokens · 67675 ms · 2026-05-12T02:33:15.859354+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages

  1. [1] Yu Bai, Fan Chen, Huan Wang, Caiming Xiong, and Song Mei. Transformers as statisticians: Provable in-context learning with in-context algorithm selection. Advances in Neural Information Processing Systems, 36:57125–57211, 2023.
  2. [2] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901, 2020.
  3. [3] Minshuo Chen, Haoming Jiang, Wenjing Liao, and Tuo Zhao. Efficient approximation of deep ReLU networks for functions on low dimensional manifolds. Advances in Neural Information Processing Systems, 32, 2019.
  4. [4] Minshuo Chen, Haoming Jiang, Wenjing Liao, and Tuo Zhao. Nonparametric regression on low-dimensional manifolds using deep ReLU networks: Function approximation and statistical recovery. Information and Inference: A Journal of the IMA, 11(4):1203–1253, 2022.
  5. [5] Charles K Chui, Shao-Bo Lin, and Ding-Xuan Zhou. Deep net tree structure for balance of capacity and approximation ability. Frontiers in Applied Mathematics and Statistics, 5:46, 2019.
  6. [6] Alexander Cloninger and Timo Klock. A deep network construction that adapts to intrinsic dimensionality beyond the domain. Neural Networks, 141:404–419, 2021.
  7. [7] Felipe Cucker and Ding Xuan Zhou. Learning Theory: An Approximation Theory Viewpoint, volume 24. Cambridge University Press, 2007.
  8. [8] Biraj Dahal, Alexander Havrilla, Minshuo Chen, Tuo Zhao, and Wenjing Liao. On deep generative models for approximation and estimation of distributions on manifolds. Advances in Neural Information Processing Systems, 35:10615–10628, 2022.
  9. [9] Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Baobao Chang, et al. A survey on in-context learning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1107–1128, 2024.
  10. [10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
  11. [11] Benjamin L Edelman, Surbhi Goel, Sham Kakade, and Cyril Zhang. Inductive biases and variable creation in self-attention mechanisms. In International Conference on Machine Learning, pages 5793–5831. PMLR, 2022.
  12. [12] Herbert Federer. Curvature measures. Transactions of the American Mathematical Society, 93(3):418–491, 1959.
  13. [13] Shivam Garg, Dimitris Tsipras, Percy S Liang, and Gregory Valiant. What can transformers learn in-context? A case study of simple function classes. Advances in Neural Information Processing Systems, 35:30583–30598, 2022.
  14. [14] Christopher R Genovese, Marco Perone Pacifico, Isabella Verdinelli, Larry Wasserman, et al. Minimax manifold estimation. Journal of Machine Learning Research, 13:1263–1291, 2012.
  15. [15] Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014.
  16. [16] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6645–6649. IEEE, 2013.
  17. [17] Iryna Gurevych, Michael Kohler, and Gözde Gül Şahin. On the rate of convergence of a classifier based on a transformer encoder. IEEE Transactions on Information Theory, 68(12):8139–8155, 2022.
  18. [18] Alex Havrilla and Wenjing Liao. Understanding scaling laws with statistical and approximation theory for transformer neural networks on intrinsically low-dimensional data. Advances in Neural Information Processing Systems, 37:42162–42210, 2024.
  19. [19] Haotian Jiang and Qianxiao Li. Approximation rate of the transformer architecture for sequence modeling. Advances in Neural Information Processing Systems, 37:68926–68955, 2024.
  20. [20] Yuling Jiao, Yanming Lai, Yang Wang, and Bokai Yan. Transformers can overcome the curse of dimensionality: A theoretical study from an approximation perspective. Journal of Machine Learning Research, 27(50):1–34, 2026.
  21. [21] Tokio Kajitsuka and Issei Sato. Are transformers with one layer self-attention using low-rank weight matrices universal approximators? In The Twelfth International Conference on Learning Representations, 2024.
  22. [22] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 2012.
  23. [23] John M Lee. Riemannian Manifolds: An Introduction to Curvature. Springer Science & Business Media, 2006.
  24. [24] Hao Liu, Minshuo Chen, Tuo Zhao, and Wenjing Liao. Besov function approximation and binary classification on low-dimensional manifolds using convolutional residual networks. In International Conference on Machine Learning, pages 6770–6780. PMLR, 2021.
  25. [25] Hao Liu, Alex Havrilla, Rongjie Lai, and Wenjing Liao. Deep nonparametric estimation of intrinsic data structures by chart autoencoders: Generalization error and robustness. Applied and Computational Harmonic Analysis, 68:101602, 2024.
  26. [26] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9):1–35, 2023.
  27. [27] Tong Mao, Zhongjie Shi, and Ding-Xuan Zhou. Theory of deep convolutional neural networks III: Approximating radial functions. Neural Networks, 144:778–790, 2021.
  28. [28] John Moody and Christian J Darken. Fast learning in networks of locally-tuned processing units. Neural Computation, 1(2):281–294, 1989.
  29. [29] Ryumei Nakada and Masaaki Imaizumi. Adaptive approximation and generalization of deep neural network with intrinsic dimensionality. Journal of Machine Learning Research, 21(174):1–38, 2020.
  30. [30] Partha Niyogi, Stephen Smale, and Shmuel Weinberger. Finding the homology of submanifolds with high confidence from random samples. Discrete & Computational Geometry, 39(1):419–441, 2008.
  31. [31] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  32. [32] Johannes Schmidt-Hieber. Nonparametric regression using deep neural networks with ReLU activation function. The Annals of Statistics, 48(4):1875–1897, 2020.
  33. [33] Zhaiming Shen, Alexander Hsu, Rongjie Lai, and Wenjing Liao. Understanding in-context learning on structured manifolds: Bridging attention to kernel methods. In The Fourteenth International Conference on Learning Representations, 2026.
  34. [34] Donald Shepard. A two-dimensional interpolation function for irregularly-spaced data. In Proceedings of the 1968 23rd ACM National Conference, pages 517–524, 1968.
  35. [35] Zhongjie Shi, Jun Fan, Linhao Song, Ding-Xuan Zhou, and Johan AK Suykens. Nonlinear functional regression by functional deep neural network with kernel embedding. Journal of Machine Learning Research, 26(284):1–49, 2025.
  36. [36] Zhongjie Shi, Zhiying Fang, and Yuan Cao. Approximation and estimation capability of vision transformers for hierarchical compositional models. Applied and Computational Harmonic Analysis, page 101849, 2025.
  37. [37] Zhongjie Shi, Fanghui Liu, Yuan Cao, and Johan AK Suykens. Can overfitted deep neural networks in adversarial training generalize? An approximation viewpoint. SIAM Journal on Mathematics of Data Science, 8(2):225–256, 2026.
  38. [38] Zhongjie Shi, Zhan Yu, and Ding-Xuan Zhou. Learning theory of distribution regression with neural networks. Constructive Approximation, 62(1):61–104, 2025.
  39. [39] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.
  40. [40] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021.
  41. [41] Shokichi Takakura and Taiji Suzuki. Approximation and estimation ability of transformers for sequence-to-sequence functions with infinite dimensional input. In International Conference on Machine Learning, pages 33416–33447. PMLR, 2023.
  42. [42] Loring W Tu. Manifolds. In An Introduction to Manifolds, pages 47–83. Springer, 2011.
  43. [43] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
  44. [44] Colin Wei, Yining Chen, and Tengyu Ma. Statistically meaningful approximation: a case study on approximating Turing machines with transformers. In Advances in Neural Information Processing Systems, volume 35, pages 12071–12083, 2022.
  45. [45] Dmitry Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103–114, 2017.
  46. [46] Dmitry Yarotsky. Optimal approximation of continuous functions by very deep ReLU networks. In Conference on Learning Theory, pages 639–649. PMLR, 2018.
  47. [47] Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J Reddi, and Sanjiv Kumar. Are transformers universal approximators of sequence-to-sequence functions? In International Conference on Learning Representations, 2020.
  48. [48] Chulhee Yun, Yin-Wen Chang, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank Reddi, and Sanjiv Kumar. O(n) connections are expressive enough: Universal approximability of sparse transformers. In Advances in Neural Information Processing Systems, volume 33, pages 13783–13794, 2020.
  49. [49] Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big Bird: Transformers for longer sequences. In Advances in Neural Information Processing Systems, volume 33, pages 17283–17297, 2020.
  50. [50] Yufeng Zhang, Boyi Liu, Qi Cai, Lingxiao Wang, and Zhaoran Wang. An analysis of attention via the lens of exchangeability and latent variable models. arXiv preprint arXiv:2212.14852, 2022.