pith. machine review for the scientific record.

arxiv: 2605.07463 · v1 · submitted 2026-05-08 · 💻 cs.LG

Recognition: 2 theorem links

· Lean Theorem

Approximation Error Upper and Lower Bounds for Hölder Class with Transformers

Jerry Zhijian Yang, Xiliang Lu, Xin He, Yuling Jiao

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:55 UTC · model grok-4.3

classification 💻 cs.LG
keywords: transformers, approximation bounds, Hölder class, neural network expressivity, depth complexity, regression risk, upper and lower bounds

The pith

Transformers can approximate any bounded Hölder function to accuracy ε with O(ε^{-d0/α}) blocks, but require at least Ω(ε^{-d0/(4α)}) blocks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to determine how many Transformer blocks are required to approximate functions from the Hölder class to any given accuracy ε. It proves that O(ε^{-d0/α}) blocks suffice for d0-dimensional inputs of smoothness α ∈ (0,1], and it gives the first rigorous lower bound of Ω(ε^{-d0/(4α)}) blocks. These results quantify the depth needed for accurate approximation and extend to excess risk rates in regression. A sympathetic reader would care because the bounds make concrete how deep Transformers must be to handle a wide class of smooth functions reliably.
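To make the scaling concrete, a minimal sketch of the two block-count rates follows; the constants hidden in the O(·) and Ω(·) notation are not specified in the paper, so c_upper and c_lower below are hypothetical placeholders.

```python
def upper_bound_blocks(eps: float, d0: int, alpha: float, c_upper: float = 1.0) -> float:
    """Blocks sufficient for uniform error eps: O(eps^(-d0/alpha)), constant unspecified."""
    return c_upper * eps ** (-d0 / alpha)

def lower_bound_blocks(eps: float, d0: int, alpha: float, c_lower: float = 1.0) -> float:
    """Blocks necessary for uniform error eps: Omega(eps^(-d0/(4*alpha))), constant unspecified."""
    return c_lower * eps ** (-d0 / (4 * alpha))

# Example: d0 = 2, alpha = 1 (Lipschitz), eps = 0.1 gives roughly 100 blocks sufficient
# and roughly 3 blocks necessary, up to the unknown constants.
print(upper_bound_blocks(0.1, 2, 1.0), lower_bound_blocks(0.1, 2, 1.0))
```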

Core claim

We prove that a Transformer network composed of at most O(ε^{-d0/α}) blocks can approximate any bounded Hölder function with d0-dimensional input and smoothness α∈(0,1] under any accuracy ε>0. Leveraging the VC-dimension upper bound, we are the first to rigorously prove that Transformers demand for at least Ω(ε^{-d0/(4α)}) blocks to achieve the ε approximation accuracy. We extend the derived results for standard Transformers to a general regression task and establish the corresponding excess risk rates demonstrating Transformers' empirical effectiveness in real-world settings.

What carries the argument

Standard Transformer architecture with softmax attention, ReLU activations, and residual connections, used to approximate functions in the bounded Hölder class.
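A minimal numpy sketch of one such block is given below, following the general residual form shown in Figure 1; the single-head layout and the weight names (W_Q, W_K, W_V, W_O, W_1, W_2) are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def transformer_block(X, W_Q, W_K, W_V, W_O, W_1, W_2):
    """One residual block: single-head softmax self-attention plus a ReLU feed-forward layer.

    X has shape (d, L): a sequence of L tokens of dimension d. The weight matrices are
    placeholders chosen only for shape compatibility, not the paper's construction.
    """
    # Attention sub-layer with residual connection, in the spirit of
    # Attn(X) = X + W_O W_V X · softmax[(W_K X)^T (W_Q X)]
    scores = (W_K @ X).T @ (W_Q @ X)                     # (L, L) attention scores
    A = X + W_O @ (W_V @ X) @ softmax(scores, axis=0)    # residual attention update
    # Feed-forward sub-layer (ReLU) with residual connection
    return A + W_2 @ np.maximum(W_1 @ A, 0.0)
```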

Load-bearing premise

The target functions are bounded and Hölder continuous with smoothness α in (0,1], and the lower bound relies on a known upper bound for the VC-dimension of the Transformer hypothesis class.
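For reference, the standard definition of the function class in play reads as follows; the paper's exact radius and norm conventions may differ.

```latex
% Standard definition (normalization assumed, not quoted from the paper): the bounded
% H\"older class with smoothness \alpha \in (0,1] on the unit cube.
\[
  \mathcal{H}^{\alpha}_{d_0}(M)
  \;=\;
  \Bigl\{\, f : [0,1]^{d_0} \to \mathbb{R} \;:\;
      \|f\|_{\infty} \le M,\;
      |f(x) - f(y)| \le M \,\|x - y\|_{\infty}^{\alpha}
      \;\;\forall\, x, y \in [0,1]^{d_0}
  \,\Bigr\}.
\]
```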

What would settle it

For small ε, d0, and α, construct a worst-case bounded Hölder function and check whether any Transformer with fewer than c · ε^{-d0/(4α)} blocks can achieve approximation error below ε.
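A hypothetical skeleton of that test is sketched below. The caller must supply `best_sup_error(k, target, d0)`, the smallest uniform error any k-block Transformer can reach on the target, which is the genuinely hard (and in general infeasible) step; every name here is illustrative, and the target is just a valid bounded Hölder function, not a certified worst case.

```python
def holder_target(d0: int, alpha: float):
    """A bounded alpha-Hölder candidate target (illustrative, not a provably worst-case choice)."""
    def f(x):  # x: any sequence of d0 coordinates in [0, 1]
        return (sum(abs(xi - 0.5) for xi in x) / d0) ** alpha
    return f

def refutes_lower_bound(eps: float, d0: int, alpha: float, best_sup_error, c: float = 1.0) -> bool:
    """True if some Transformer with fewer than c * eps^(-d0/(4*alpha)) blocks already
    reaches uniform error below eps, which would contradict the claimed lower bound."""
    target = holder_target(d0, alpha)
    budget = int(c * eps ** (-d0 / (4 * alpha)))
    return any(best_sup_error(k, target, d0) < eps for k in range(1, budget))
```

Such a search can only produce counter-evidence; it cannot certify the lower bound, since no finite experiment exhausts all k-block Transformers.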

Figures

Figures reproduced from arXiv: 2605.07463 by Jerry Zhijian Yang, Xiliang Lu, Xin He, Yuling Jiao.

Figure 1
Figure 1. Pipeline of the standard Transformer function class: bridging the vector form and sequence form. For an input sequence X ∈ R^{d×L} (L tokens of dimension d), each Transformer block maps R^{d×L} to R^{d×L} through a residual softmax-attention layer of the form Attn(X) = X + Σ_{i=1}^{r_1} W_O^i W_V^i X σ_S[⋯]. view at source ↗
Figure 2
Figure 2. Error decomposition of the excess risk. The expected excess risk between any obtainable estimator f_s(τ) and the target function f* is decomposed into three parts (Lemma D.1), given an i.i.d. sample set D = {z^(i) = (x^(i), y^(i))}_{i=1}^N and the target function space H^α_{d_0,1}. view at source ↗
read the original abstract

We explore the expressive power of Transformers by establishing precise approximation error upper and lower bounds for H\"{o}lder class. Specifically, a new approximation upper bound is derived for the standard Transformer architecture equipped with Softmax operators, ReLU activation functions, and residual connections. We prove that a Transformer network composed of at most $\mathcal{O}(\varepsilon^{-{d_{0}}/{\alpha}})$ blocks can approximate any bounded H\"{o}lder function with $d_{0}$-dimensional input and smoothness $\alpha\in(0,1]$ under any accuracy $\varepsilon>0$. In the case of approximation lower bounds, leveraging the VC-dimension upper bound, we are the first to rigorously prove that Transformers demand for at least $\Omega(\varepsilon^{-{d_{0}}/({4\alpha})})$ blocks to achieve the $\varepsilon$ approximation accuracy. As a final step, we extend the derived results for standard Transformers to a general regression task and establish the corresponding excess risk rates demonstrating Transformers' empirical effectiveness in real-world settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript derives approximation upper and lower bounds for the Hölder class using standard Transformer networks (softmax attention, ReLU activations, residual connections). It proves that O(ε^{-d_0/α}) blocks suffice to ε-approximate any bounded Hölder function on d_0-dimensional inputs with smoothness α ∈ (0,1]. For the lower bound, it combines the metric entropy of the Hölder class with a VC-dimension upper bound from prior literature to claim that Ω(ε^{-d_0/(4α)}) blocks are necessary. The results are extended to a regression setting yielding corresponding excess risk rates.

Significance. If the bounds hold, this provides the first rigorous lower bound on the number of Transformer blocks needed for Hölder approximation, which is significant for understanding depth requirements in attention-based models. The explicit upper-bound construction and the regression extension strengthen the theoretical analysis and link to empirical performance.

major comments (2)
  1. [Lower bound section] Lower bound argument: The Ω(ε^{-d_0/(4α)}) claim is obtained by combining Hölder metric entropy (order ε^{-d_0/α}) with an external upper bound on VC-dimension of k-block Transformers that is assumed to grow at most as O(k^4). No verification is supplied that the specific architecture (softmax attention + residual connections) satisfies the hypotheses of the cited VC result; softmax is not piecewise-polynomial and residuals alter the compositional structure, so the polynomial degree and thus the 1/4 factor in the exponent may not hold.
  2. [Upper bound section] Upper bound construction: While an explicit construction is claimed to achieve the O(ε^{-d_0/α}) rate, the manuscript should explicitly state the norm in which approximation holds (e.g., uniform on the compact domain) and confirm that the boundedness assumption on the target function is used without circularity in the residual-connection analysis.
minor comments (3)
  1. [Abstract] Abstract: 'demand for at least' is grammatically awkward; replace with 'require at least'.
  2. [Throughout] Notation: Ensure consistent subscript formatting (d_0 vs d0) and math-mode usage for all parameters throughout the text and equations.
  3. [References] References: When citing the prior VC-dimension result, include the specific theorem or proposition number used to obtain the polynomial degree.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the insightful comments that help improve the clarity and rigor of our work. We address the major comments point-by-point below, indicating the planned revisions.

read point-by-point responses
  1. Referee: [Lower bound section] Lower bound argument: The Ω(ε^{-d_0/(4α)}) claim is obtained by combining Hölder metric entropy (order ε^{-d_0/α}) with an external upper bound on VC-dimension of k-block Transformers that is assumed to grow at most as O(k^4). No verification is supplied that the specific architecture (softmax attention + residual connections) satisfies the hypotheses of the cited VC result; softmax is not piecewise-polynomial and residuals alter the compositional structure, so the polynomial degree and thus the 1/4 factor in the exponent may not hold.

    Authors: We thank the referee for highlighting this subtlety. The cited VC-dimension result is applied to our architecture under the assumption that the overall function class remains polynomially bounded in the number of blocks; however, we agree that direct verification for softmax attention and residual connections is not provided. In the revision we will add an appendix lemma that bounds the VC dimension of the specific Transformer class by O(k^4) via a piecewise-polynomial approximation of the softmax (with error controlled by the boundedness assumption) and a direct accounting of residual additions as linear operations. If the resulting degree changes the exponent, we will state the adjusted lower bound explicitly. This addresses the concern without altering the main claim structure. revision: partial

  2. Referee: [Upper bound section] Upper bound construction: While an explicit construction is claimed to achieve the O(ε^{-d_0/α}) rate, the manuscript should explicitly state the norm in which approximation holds (e.g., uniform on the compact domain) and confirm that the boundedness assumption on the target function is used without circularity in the residual-connection analysis.

    Authors: We agree these clarifications strengthen the presentation. The upper bound holds in the uniform norm over the compact domain [0,1]^{d_0}, which we will state explicitly in the theorem statement and proof. The boundedness assumption (|f| ≤ M) is an a priori property of the target function and is used only to derive uniform bounds on intermediate activations via induction on the residual blocks; the induction hypothesis controls the error reduction at each step independently of ε, so no circularity arises. A short remark will be added after the construction to make this explicit. revision: yes
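A minimal sketch of the kind of induction the response describes, with hypothetical notation (X^{(k)} the activations after k residual blocks, B_j a uniform bound on the j-th residual update); the paper's actual estimate may differ.

```latex
% Illustrative only: uniform control of intermediate activations across residual blocks.
\[
  \|X^{(k)}\|_{\infty}
  \;\le\; \|X^{(k-1)}\|_{\infty} + \|\mathrm{Block}_k(X^{(k-1)})\|_{\infty}
  \;\le\; \|X^{(0)}\|_{\infty} + \sum_{j=1}^{k} B_j .
\]
```

In a bound of this shape, the target's a priori bound |f| ≤ M enters only through the B_j, while the per-block error reduction is tracked separately, which is the sense in which no circular dependence on ε would arise.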

Circularity Check

0 steps flagged

No significant circularity; upper bound from explicit construction, lower bound from external VC citation

full rationale

The paper's upper bound is obtained by an explicit construction of a Transformer with O(ε^{-d0/α}) blocks that approximates bounded Hölder functions (standard residual+softmax+ReLU architecture). The lower bound combines the known metric entropy of the Hölder class (order ε^{-d0/α}) with a VC-dimension upper bound taken from prior literature, yielding the Ω(ε^{-d0/(4α)}) rate. No equation or claim reduces a derived rate to a quantity defined by the same rate, no parameter is fitted and then renamed as a prediction, and the cited VC result is external rather than a self-citation chain. The derivation is therefore self-contained against standard tools in approximation theory and statistical learning.
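Schematically, the counting argument described here combines the two ingredients as follows, with T_k denoting the class of k-block Transformers; constants and logarithmic factors are suppressed, and the paper's exact statement may differ.

```latex
% Metric entropy of the Hölder class and the assumed VC-dimension bound:
\[
  \log \mathcal{N}\bigl(\varepsilon,\ \mathcal{H}^{\alpha}_{d_0},\ \|\cdot\|_{\infty}\bigr)
  \;\asymp\; \varepsilon^{-d_0/\alpha},
  \qquad
  \mathrm{VCdim}(\mathcal{T}_k) \;\lesssim\; k^{4}.
\]
% If k-block Transformers uniformly eps-approximate every Hölder function, their capacity
% must dominate the entropy of the class, forcing
\[
  k^{4} \;\gtrsim\; \varepsilon^{-d_0/\alpha}
  \quad\Longrightarrow\quad
  k \;\gtrsim\; \varepsilon^{-d_0/(4\alpha)} .
\]
```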

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The claims rest on the standard definition of the Hölder class, boundedness of the target functions, and an external upper bound on the VC-dimension of the Transformer class; no free parameters are fitted and no new entities are postulated.

axioms (2)
  • domain assumption Standard definition and properties of the Hölder class on bounded domains
    Invoked to define the function class being approximated.
  • standard math Existence of an upper bound on the VC-dimension of the Transformer hypothesis class
    Used to obtain the lower bound on the number of blocks.

pith-pipeline@v0.9.0 · 5486 in / 1413 out tokens · 35572 ms · 2026-05-11T01:55:39.992857+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 2 internal anchors

  1. [1]

    GPT-4 Technical Report

    Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Neural Network Learning: Theoretical Foundations

    Anthony, M. and Bartlett, P. L. Neural network learning: Theoretical foundations. Cambridge University Press, 2009

  3. [3]

    Universal approximation bounds for superpositions of a sigmoidal function

    Barron, A. R. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3): 930–945, 1993

  4. [4]

    Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks

    Bartlett, P. L., Harvey, N., Liaw, C., and Mehrabian, A. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. Journal of Machine Learning Research, 20(63): 1–17, 2019

  5. [5]

    On the capabilities of multilayer perceptrons

    Baum, E. B. On the capabilities of multilayer perceptrons. Journal of Complexity, 4(3): 193–215, 1988

  6. [6]

    Approximation by superpositions of a sigmoidal function

    Cybenko, G. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4): 303–314, 1989

  7. [7]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=YicbFdNTTy

  8. [8]

    Inductive biases and variable creation in self-attention mechanisms

    Edelman, B. L., Goel, S., Kakade, S., and Zhang, C. Inductive biases and variable creation in self-attention mechanisms. In International Conference on Machine Learning, pp. 5793–5831. PMLR, 2022

  9. [9]

    Attention Mechanism, Transformers, BERT, and GPT: Tutorial and Survey

    Ghojogh, B. and Ghodsi, A. Attention Mechanism, Transformers, BERT, and GPT: Tutorial and Survey. Working paper or preprint, December 2020. URL https://hal.science/hal-04637647

  10. [10]

    On the rate of convergence of a classifier based on a transformer encoder

    Gurevych, I., Kohler, M., and Şahin, G. G. On the rate of convergence of a classifier based on a transformer encoder. IEEE Transactions on Information Theory, 68(12): 8139–8155, 2022

  11. [11]

    Multilayer feedforward networks are universal approximators

    Hornik, K., Stinchcombe, M., and White, H. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5): 359–366, 1989

  12. [12]

    On statistical rates and provably efficient criteria of latent diffusion transformers (DiTs)

    Hu, J. Y.-C., Wu, W., Li, Z., Pi, S., Song, Z., and Liu, H. On statistical rates and provably efficient criteria of latent diffusion transformers (DiTs). In Advances in Neural Information Processing Systems, volume 37, pp. 31562–31628, 2024. URL https://proceedings.neurips.cc/paper_files/paper/2024/file/37f6bed18b9b404f53dcaec4607c4fb7-Paper-Conference.pdf

  13. [13]

    Fundamental limits of prompt tuning transformers: Universality, capacity and efficiency

    Hu, J. Y.-C., Wang, W.-P., Gilani, A., Li, C., Song, Z., and Liu, H. Fundamental limits of prompt tuning transformers: Universality, capacity and efficiency. In The Thirteenth International Conference on Learning Representations, 2025a. URL https://openreview.net/forum?id=jDpdQPMosW

  14. [14]

    On statistical rates of conditional diffusion transformers: Approximation, estimation and minimax optimality

    Hu, J. Y.-C., Wu, W., Lee, Y.-C., Huang, Y.-C., Chen, M., and Liu, H. On statistical rates of conditional diffusion transformers: Approximation, estimation and minimax optimality. In The Thirteenth International Conference on Learning Representations, 2025b. URL https://openreview.net/forum?id=c54apoozCS

  15. [15]

    Learning capability and storage capacity of two-hidden-layer feedforward networks

    Huang, G.-B. Learning capability and storage capacity of two-hidden-layer feedforward networks. IEEE Transactions on Neural Networks, 14(2): 274–281, 2003

  16. [16]

    Convergence analysis of flow matching in latent space with transformers

    Jiao, Y., Lai, Y., Wang, Y., and Yan, B. Convergence analysis of flow matching in latent space with transformers. arXiv preprint arXiv:2404.02538, 2024

  17. [17]

    Approximation bounds for transformer networks with application to regression

    Jiao, Y., Lai, Y., Sun, D., Wang, Y., and Yan, B. Approximation bounds for transformer networks with application to regression. arXiv preprint arXiv:2504.12175, 2025

  18. [18]

    Transformers can overcome the curse of dimensionality: A theoretical study from an approximation perspective

    Jiao, Y., Lai, Y., Wang, Y., and Yan, B. Transformers can overcome the curse of dimensionality: A theoretical study from an approximation perspective. Journal of Machine Learning Research, 27(50): 1–34, 2026

  19. [19]

    Are transformers with one layer self-attention using low-rank weight matrices universal approximators?

    Kajitsuka, T. and Sato, I. Are transformers with one layer self-attention using low-rank weight matrices universal approximators? In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. URL https://openreview.net/forum?id=nJnky5K944

  20. [20]

    On the optimal memorization capacity of transformers

    Kajitsuka, T. and Sato, I. On the optimal memorization capacity of transformers. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=UGVYezlLcZ

  21. [21]

    Provable memorization capacity of transformers

    Kim, J., Kim, M., and Mozafari, B. Provable memorization capacity of transformers. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. URL https://openreview.net/forum?id=8JCg5xJCTPR

  22. [22]

    Universal regular conditional distributions via probabilistic transformers

    Kratsios, A. Universal regular conditional distributions via probabilistic transformers. Constructive Approximation, 57(3): 1145–1212, 2023

  23. [23]

    Universal approximation under constraints is possible with transformers

    Kratsios, A., Zamanlooy, B., Liu, T., and Dokmanic, I. Universal approximation under constraints is possible with transformers. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. URL https://openreview.net/forum?id=JGO8CvG5S9

  24. [24]

    Memorization capacity of multi-head attention in transformers

    Mahdavi, S., Liao, R., and Thrampoulidis, C. Memorization capacity of multi-head attention in transformers. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. URL https://openreview.net/forum?id=MrR3rMxqqv

  25. [25]

    Optimization landscape and expressivity of deep CNNs

    Nguyen, Q. and Hein, M. Optimization landscape and expressivity of deep CNNs. In International Conference on Machine Learning, pp. 3730–3739. PMLR, 2018

  26. [26]

    Provable memorization via deep neural networks using sub-linear parameters

    Park, S., Lee, J., Yun, C., and Shin, J. Provable memorization via deep neural networks using sub-linear parameters. In Conference on Learning Theory, pp. 3627–3661. PMLR, 2021

  27. [27]

    Prompting a pretrained transformer can be a universal approximator

    Petrov, A., Torr, P., and Bibi, A. Prompting a pretrained transformer can be a universal approximator. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. URL https://openreview.net/forum?id=3mQ6ZKTSQl

  28. [28]

    Nonparametric regression using deep neural networks with ReLU activation function

    Schmidt-Hieber, J. Nonparametric regression using deep neural networks with ReLU activation function. The Annals of Statistics, 48(4): 1875–1897, 2020. doi:10.1214/19-AOS1875. URL https://doi.org/10.1214/19-AOS1875

  29. [29]

    Optimal approximation rate of ReLU networks in terms of width and depth

    Shen, Z., Yang, H., and Zhang, S. Optimal approximation rate of ReLU networks in terms of width and depth. Journal de Mathématiques Pures et Appliquées, 157: 101–135, 2022

  30. [30]

    Sharp bounds on the approximation rates, metric entropy, and n-widths of shallow neural networks

    Siegel, J. W. and Xu, J. Sharp bounds on the approximation rates, metric entropy, and n-widths of shallow neural networks. Foundations of Computational Mathematics, 24(2): 481–537, 2024

  31. [31]

    Approximation and estimation ability of transformers for sequence-to-sequence functions with infinite dimensional input

    Takakura, S. and Suzuki, T. Approximation and estimation ability of transformers for sequence-to-sequence functions with infinite dimensional input. In International Conference on Machine Learning, pp. 33416–33447. PMLR, 2023

  32. [32]

    Gemini: A Family of Highly Capable Multimodal Models

    Team, G., Anil, R., Borgeaud, S., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., Millican, K., et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

  33. [33]

    On the optimal memorization power of ReLU neural networks

    Vardi, G., Yehudai, G., and Shamir, O. On the optimal memorization power of ReLU neural networks. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=MkTPtnjeYTV

  34. [34]

    Attention is all you need

    Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

  35. [35]

    On expressive power of looped transformers: Theoretical analysis and enhancement via timestep encoding

    Xu, K. and Sato, I. On expressive power of looped transformers: Theoretical analysis and enhancement via timestep encoding. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=H4BuhRezCV

  36. [36]

    On the rates of convergence for learning with convolutional neural networks

    Yang, Y., Feng, H., and Zhou, D.-X. On the rates of convergence for learning with convolutional neural networks. SIAM Journal on Mathematics of Data Science, 7(4): 1755–1772, 2025

  37. [37]

    Error bounds for approximations with deep ReLU networks

    Yarotsky, D. Error bounds for approximations with deep ReLU networks. Neural Networks, 94: 103–114, 2017. ISSN 0893-6080. doi:10.1016/j.neunet.2017.07.002. URL https://www.sciencedirect.com/science/article/pii/S0893608017301545

  38. [38]

    The phase diagram of approximation rates for deep neural networks

    Yarotsky, D. and Zhevnerchuk, A. The phase diagram of approximation rates for deep neural networks. Advances in Neural Information Processing Systems, 33: 13005–13015, 2020

  39. [39]

    Do transformers really perform badly for graph representation?

    Ying, C., Cai, T., Luo, S., Zheng, S., Ke, G., He, D., Shen, Y., and Liu, T.-Y. Do transformers really perform badly for graph representation? In Advances in Neural Information Processing Systems, volume 34, pp. 28877–28888, 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/f1c1592588411002af340cbaedd6fc33-Paper.pdf

  40. [40]

    Are transformers universal approximators of sequence-to-sequence functions?

    Yun, C., Bhojanapalli, S., Rawat, A. S., Reddi, S., and Kumar, S. Are transformers universal approximators of sequence-to-sequence functions? In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=ByxRM0Ntvr

  41. [41]

    Theory of deep convolutional neural networks: Downsampling

    Zhou, D.-X. Theory of deep convolutional neural networks: Downsampling. Neural Networks, 124: 319–327, 2020