pith. machine review for the scientific record.

arxiv: 2605.07463 · v1 · submitted 2026-05-08 · 💻 cs.LG

Recognition: 2 theorem links

· Lean Theorem

Approximation Error Upper and Lower Bounds for Hölder Class with Transformers

Jerry Zhijian Yang, Xiliang Lu, Xin He, Yuling Jiao

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:55 UTC · model grok-4.3

classification 💻 cs.LG
keywords: transformers, approximation bounds, Hölder class, neural network expressivity, depth complexity, regression risk, upper and lower bounds

The pith

Transformers can approximate any bounded Hölder function to accuracy ε with O(ε^{-d0/α}) blocks, but require at least Ω(ε^{-d0/(4α)}) blocks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to determine how many Transformer blocks are required to approximate functions from the Hölder class to any given accuracy ε. It proves that O(ε^{-d0/α}) blocks suffice for d0-dimensional inputs of smoothness α ∈ (0,1], and it gives the first rigorous lower bound of Ω(ε^{-d0/(4α)}) blocks. These results quantify the depth needed for accurate approximation and extend to excess risk rates in regression. A sympathetic reader would care because the bounds make concrete how deep Transformers must be to handle a wide class of smooth functions reliably.
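To make the scaling concrete, a minimal sketch of the two block-count rates follows; the constants hidden in the O(·) and Ω(·) notation are not specified in the paper, so c_upper and c_lower below are hypothetical placeholders.

```python
def upper_bound_blocks(eps: float, d0: int, alpha: float, c_upper: float = 1.0) -> float:
    """Blocks sufficient for uniform error eps: O(eps^(-d0/alpha)), constant unspecified."""
    return c_upper * eps ** (-d0 / alpha)

def lower_bound_blocks(eps: float, d0: int, alpha: float, c_lower: float = 1.0) -> float:
    """Blocks necessary for uniform error eps: Omega(eps^(-d0/(4*alpha))), constant unspecified."""
    return c_lower * eps ** (-d0 / (4 * alpha))

# Example: d0 = 2, alpha = 1 (Lipschitz), eps = 0.1 gives roughly 100 blocks sufficient
# and roughly 3 blocks necessary, up to the unknown constants.
print(upper_bound_blocks(0.1, 2, 1.0), lower_bound_blocks(0.1, 2, 1.0))
```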

Core claim

We prove that a Transformer network composed of at most O(ε^{-d0/α}) blocks can approximate any bounded Hölder function with d0-dimensional input and smoothness α∈(0,1] under any accuracy ε>0. Leveraging the VC-dimension upper bound, we are the first to rigorously prove that Transformers demand for at least Ω(ε^{-d0/(4α)}) blocks to achieve the ε approximation accuracy. We extend the derived results for standard Transformers to a general regression task and establish the corresponding excess risk rates demonstrating Transformers' empirical effectiveness in real-world settings.

What carries the argument

Standard Transformer architecture with softmax attention, ReLU activations, and residual connections, used to approximate functions in the bounded Hölder class.
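A minimal numpy sketch of one such block is given below, following the general residual form shown in Figure 1; the single-head layout and the weight names (W_Q, W_K, W_V, W_O, W_1, W_2) are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def transformer_block(X, W_Q, W_K, W_V, W_O, W_1, W_2):
    """One residual block: single-head softmax self-attention plus a ReLU feed-forward layer.

    X has shape (d, L): a sequence of L tokens of dimension d. The weight matrices are
    placeholders chosen only for shape compatibility, not the paper's construction.
    """
    # Attention sub-layer with residual connection, in the spirit of
    # Attn(X) = X + W_O W_V X · softmax[(W_K X)^T (W_Q X)]
    scores = (W_K @ X).T @ (W_Q @ X)                     # (L, L) attention scores
    A = X + W_O @ (W_V @ X) @ softmax(scores, axis=0)    # residual attention update
    # Feed-forward sub-layer (ReLU) with residual connection
    return A + W_2 @ np.maximum(W_1 @ A, 0.0)
```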

Load-bearing premise

The target functions are bounded and Hölder continuous with smoothness α in (0,1], and the lower bound relies on a known upper bound for the VC-dimension of the Transformer hypothesis class.
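For reference, the standard definition of the function class in play reads as follows; the paper's exact radius and norm conventions may differ.

```latex
% Standard definition (normalization assumed, not quoted from the paper): the bounded
% H\"older class with smoothness \alpha \in (0,1] on the unit cube.
\[
  \mathcal{H}^{\alpha}_{d_0}(M)
  \;=\;
  \Bigl\{\, f : [0,1]^{d_0} \to \mathbb{R} \;:\;
      \|f\|_{\infty} \le M,\;
      |f(x) - f(y)| \le M \,\|x - y\|_{\infty}^{\alpha}
      \;\;\forall\, x, y \in [0,1]^{d_0}
  \,\Bigr\}.
\]
```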

What would settle it

For small ε, d0, and α, construct a worst-case bounded Hölder function and check whether any Transformer with fewer than c · ε^{-d0/(4α)} blocks can achieve approximation error below ε.
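A hypothetical skeleton of that test is sketched below. The caller must supply `best_sup_error(k, target, d0)`, the smallest uniform error any k-block Transformer can reach on the target, which is the genuinely hard (and in general infeasible) step; every name here is illustrative, and the target is just a valid bounded Hölder function, not a certified worst case.

```python
def holder_target(d0: int, alpha: float):
    """A bounded alpha-Hölder candidate target (illustrative, not a provably worst-case choice)."""
    def f(x):  # x: any sequence of d0 coordinates in [0, 1]
        return (sum(abs(xi - 0.5) for xi in x) / d0) ** alpha
    return f

def refutes_lower_bound(eps: float, d0: int, alpha: float, best_sup_error, c: float = 1.0) -> bool:
    """True if some Transformer with fewer than c * eps^(-d0/(4*alpha)) blocks already
    reaches uniform error below eps, which would contradict the claimed lower bound."""
    target = holder_target(d0, alpha)
    budget = int(c * eps ** (-d0 / (4 * alpha)))
    return any(best_sup_error(k, target, d0) < eps for k in range(1, budget))
```

Such a search can only produce counter-evidence; it cannot certify the lower bound, since no finite experiment exhausts all k-block Transformers.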

Figures

Figures reproduced from arXiv: 2605.07463 by Jerry Zhijian Yang, Xiliang Lu, Xin He, Yuling Jiao.

Figure 1
Figure 1. Pipeline of the standard Transformer function class: bridging the vector form and sequence form. For an input sequence X ∈ R^{d×L} (L tokens of dimension d), each Transformer block maps R^{d×L} to R^{d×L} through a residual softmax-attention layer of the form Attn(X) = X + Σ_{i=1}^{r_1} W_O^i W_V^i X σ_S[⋯]. view at source ↗
Figure 2
Figure 2. Error decomposition of the excess risk. The expected excess risk between any obtainable estimator f_s(τ) and the target function f* is decomposed into three parts (Lemma D.1), given an i.i.d. sample set D = {z^(i) = (x^(i), y^(i))}_{i=1}^N and the target function space H^α_{d_0,1}. view at source ↗
read the original abstract

We explore the expressive power of Transformers by establishing precise approximation error upper and lower bounds for H\"{o}lder class. Specifically, a new approximation upper bound is derived for the standard Transformer architecture equipped with Softmax operators, ReLU activation functions, and residual connections. We prove that a Transformer network composed of at most $\mathcal{O}(\varepsilon^{-{d_{0}}/{\alpha}})$ blocks can approximate any bounded H\"{o}lder function with $d_{0}$-dimensional input and smoothness $\alpha\in(0,1]$ under any accuracy $\varepsilon>0$. In the case of approximation lower bounds, leveraging the VC-dimension upper bound, we are the first to rigorously prove that Transformers demand for at least $\Omega(\varepsilon^{-{d_{0}}/({4\alpha})})$ blocks to achieve the $\varepsilon$ approximation accuracy. As a final step, we extend the derived results for standard Transformers to a general regression task and establish the corresponding excess risk rates demonstrating Transformers' empirical effectiveness in real-world settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript derives approximation upper and lower bounds for the Hölder class using standard Transformer networks (softmax attention, ReLU activations, residual connections). It proves that O(ε^{-d_0/α}) blocks suffice to ε-approximate any bounded Hölder function on d_0-dimensional inputs with smoothness α ∈ (0,1]. For the lower bound, it combines the metric entropy of the Hölder class with a VC-dimension upper bound from prior literature to claim that Ω(ε^{-d_0/(4α)}) blocks are necessary. The results are extended to a regression setting yielding corresponding excess risk rates.

Significance. If the bounds hold, this provides the first rigorous lower bound on the number of Transformer blocks needed for Hölder approximation, which is significant for understanding depth requirements in attention-based models. The explicit upper-bound construction and the regression extension strengthen the theoretical analysis and link to empirical performance.

major comments (2)
  1. [Lower bound section] Lower bound argument: The Ω(ε^{-d_0/(4α)}) claim is obtained by combining Hölder metric entropy (order ε^{-d_0/α}) with an external upper bound on VC-dimension of k-block Transformers that is assumed to grow at most as O(k^4). No verification is supplied that the specific architecture (softmax attention + residual connections) satisfies the hypotheses of the cited VC result; softmax is not piecewise-polynomial and residuals alter the compositional structure, so the polynomial degree and thus the 1/4 factor in the exponent may not hold.
  2. [Upper bound section] Upper bound construction: While an explicit construction is claimed to achieve the O(ε^{-d_0/α}) rate, the manuscript should explicitly state the norm in which approximation holds (e.g., uniform on the compact domain) and confirm that the boundedness assumption on the target function is used without circularity in the residual-connection analysis.
minor comments (3)
  1. [Abstract] Abstract: 'demand for at least' is grammatically awkward; replace with 'require at least'.
  2. [Throughout] Notation: Ensure consistent subscript formatting (d_0 vs d0) and math-mode usage for all parameters throughout the text and equations.
  3. [References] References: When citing the prior VC-dimension result, include the specific theorem or proposition number used to obtain the polynomial degree.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the insightful comments that help improve the clarity and rigor of our work. We address the major comments point-by-point below, indicating the planned revisions.

read point-by-point responses
  1. Referee: [Lower bound section] Lower bound argument: The Ω(ε^{-d_0/(4α)}) claim is obtained by combining Hölder metric entropy (order ε^{-d_0/α}) with an external upper bound on VC-dimension of k-block Transformers that is assumed to grow at most as O(k^4). No verification is supplied that the specific architecture (softmax attention + residual connections) satisfies the hypotheses of the cited VC result; softmax is not piecewise-polynomial and residuals alter the compositional structure, so the polynomial degree and thus the 1/4 factor in the exponent may not hold.

    Authors: We thank the referee for highlighting this subtlety. The cited VC-dimension result is applied to our architecture under the assumption that the overall function class remains polynomially bounded in the number of blocks; however, we agree that direct verification for softmax attention and residual connections is not provided. In the revision we will add an appendix lemma that bounds the VC dimension of the specific Transformer class by O(k^4) via a piecewise-polynomial approximation of the softmax (with error controlled by the boundedness assumption) and a direct accounting of residual additions as linear operations. If the resulting degree changes the exponent, we will state the adjusted lower bound explicitly. This addresses the concern without altering the main claim structure. revision: partial

  2. Referee: [Upper bound section] Upper bound construction: While an explicit construction is claimed to achieve the O(ε^{-d_0/α}) rate, the manuscript should explicitly state the norm in which approximation holds (e.g., uniform on the compact domain) and confirm that the boundedness assumption on the target function is used without circularity in the residual-connection analysis.

    Authors: We agree these clarifications strengthen the presentation. The upper bound holds in the uniform norm over the compact domain [0,1]^{d_0}, which we will state explicitly in the theorem statement and proof. The boundedness assumption (|f| ≤ M) is an a priori property of the target function and is used only to derive uniform bounds on intermediate activations via induction on the residual blocks; the induction hypothesis controls the error reduction at each step independently of ε, so no circularity arises. A short remark will be added after the construction to make this explicit. revision: yes
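A minimal sketch of the kind of induction the response describes, with hypothetical notation (X^{(k)} the activations after k residual blocks, B_j a uniform bound on the j-th residual update); the paper's actual estimate may differ.

```latex
% Illustrative only: uniform control of intermediate activations across residual blocks.
\[
  \|X^{(k)}\|_{\infty}
  \;\le\; \|X^{(k-1)}\|_{\infty} + \|\mathrm{Block}_k(X^{(k-1)})\|_{\infty}
  \;\le\; \|X^{(0)}\|_{\infty} + \sum_{j=1}^{k} B_j .
\]
```

In a bound of this shape, the target's a priori bound |f| ≤ M enters only through the B_j, while the per-block error reduction is tracked separately, which is the sense in which no circular dependence on ε would arise.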

Circularity Check

0 steps flagged

No significant circularity; upper bound from explicit construction, lower bound from external VC citation

full rationale

The paper's upper bound is obtained by an explicit construction of a Transformer with O(ε^{-d0/α}) blocks that approximates bounded Hölder functions (standard residual+softmax+ReLU architecture). The lower bound combines the known metric entropy of the Hölder class (order ε^{-d0/α}) with a VC-dimension upper bound taken from prior literature, yielding the Ω(ε^{-d0/(4α)}) rate. No equation or claim reduces a derived rate to a quantity defined by the same rate, no parameter is fitted and then renamed as a prediction, and the cited VC result is external rather than a self-citation chain. The derivation is therefore self-contained against standard tools in approximation theory and statistical learning.
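Schematically, the counting argument described here combines the two ingredients as follows, with T_k denoting the class of k-block Transformers; constants and logarithmic factors are suppressed, and the paper's exact statement may differ.

```latex
% Metric entropy of the Hölder class and the assumed VC-dimension bound:
\[
  \log \mathcal{N}\bigl(\varepsilon,\ \mathcal{H}^{\alpha}_{d_0},\ \|\cdot\|_{\infty}\bigr)
  \;\asymp\; \varepsilon^{-d_0/\alpha},
  \qquad
  \mathrm{VCdim}(\mathcal{T}_k) \;\lesssim\; k^{4}.
\]
% If k-block Transformers uniformly eps-approximate every Hölder function, their capacity
% must dominate the entropy of the class, forcing
\[
  k^{4} \;\gtrsim\; \varepsilon^{-d_0/\alpha}
  \quad\Longrightarrow\quad
  k \;\gtrsim\; \varepsilon^{-d_0/(4\alpha)} .
\]
```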

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The claims rest on the standard definition of the Hölder class, boundedness of the target functions, and an external upper bound on the VC-dimension of the Transformer class; no free parameters are fitted and no new entities are postulated.

axioms (2)
  • domain assumption Standard definition and properties of the Hölder class on bounded domains
    Invoked to define the function class being approximated.
  • standard math Existence of an upper bound on the VC-dimension of the Transformer hypothesis class
    Used to obtain the lower bound on the number of blocks.

pith-pipeline@v0.9.0 · 5486 in / 1413 out tokens · 35572 ms · 2026-05-11T01:55:39.992857+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 2 internal anchors

  1. [1]

    GPT-4 Technical Report

    Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Neural Network Learning: Theoretical Foundations

    Anthony, M. and Bartlett, P. L. Neural network learning: Theoretical foundations. Cambridge University Press, 2009

  3. [3]

    Universal approximation bounds for superpositions of a sigmoidal function

    Barron, A. R. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3): 930–945, 1993

  4. [4]

    Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks

    Bartlett, P. L., Harvey, N., Liaw, C., and Mehrabian, A. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. Journal of Machine Learning Research, 20(63): 1–17, 2019

  5. [5]

    On the capabilities of multilayer perceptrons

    Baum, E. B. On the capabilities of multilayer perceptrons. Journal of Complexity, 4(3): 193–215, 1988

  6. [6]

    Approximation by superpositions of a sigmoidal function

    Cybenko, G. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4): 303–314, 1989

  7. [7]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=YicbFdNTTy

  8. [8]

    Inductive biases and variable creation in self-attention mechanisms

    Edelman, B. L., Goel, S., Kakade, S., and Zhang, C. Inductive biases and variable creation in self-attention mechanisms. In International Conference on Machine Learning, pp. 5793–5831. PMLR, 2022

  9. [9]

    Attention Mechanism, Transformers, BERT, and GPT: Tutorial and Survey

    Ghojogh, B. and Ghodsi, A. Attention Mechanism, Transformers, BERT, and GPT: Tutorial and Survey. Working paper or preprint, December 2020. URL https://hal.science/hal-04637647

  10. [10]

    On the rate of convergence of a classifier based on a transformer encoder

    Gurevych, I., Kohler, M., and Şahin, G. G. On the rate of convergence of a classifier based on a transformer encoder. IEEE Transactions on Information Theory, 68(12): 8139–8155, 2022

  11. [11]

    Multilayer feedforward networks are universal approximators

    Hornik, K., Stinchcombe, M., and White, H. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5): 359–366, 1989

  12. [12]

    On statistical rates and provably efficient criteria of latent diffusion transformers (DiTs)

    Hu, J. Y.-C., Wu, W., Li, Z., Pi, S., Song, Z., and Liu, H. On statistical rates and provably efficient criteria of latent diffusion transformers (DiTs). In Advances in Neural Information Processing Systems, volume 37, pp. 31562–31628, 2024. URL https://proceedings.neurips.cc/paper_files/paper/2024/file/37f6bed18b9b404f53dcaec4607c4fb7-Paper-Conference.pdf

  13. [13]

    Fundamental limits of prompt tuning transformers: Universality, capacity and efficiency

    Hu, J. Y.-C., Wang, W.-P., Gilani, A., Li, C., Song, Z., and Liu, H. Fundamental limits of prompt tuning transformers: Universality, capacity and efficiency. In The Thirteenth International Conference on Learning Representations, 2025a. URL https://openreview.net/forum?id=jDpdQPMosW

  14. [14]

    On statistical rates of conditional diffusion transformers: Approximation, estimation and minimax optimality

    Hu, J. Y.-C., Wu, W., Lee, Y.-C., Huang, Y.-C., Chen, M., and Liu, H. On statistical rates of conditional diffusion transformers: Approximation, estimation and minimax optimality. In The Thirteenth International Conference on Learning Representations, 2025b. URL https://openreview.net/forum?id=c54apoozCS

  15. [15]

    Learning capability and storage capacity of two-hidden-layer feedforward networks

    Huang, G.-B. Learning capability and storage capacity of two-hidden-layer feedforward networks. IEEE Transactions on Neural Networks, 14(2): 274–281, 2003

  16. [16]

    Convergence analysis of flow matching in latent space with transformers

    Jiao, Y., Lai, Y., Wang, Y., and Yan, B. Convergence analysis of flow matching in latent space with transformers. arXiv preprint arXiv:2404.02538, 2024

  17. [17]

    Approximation bounds for transformer networks with application to regression

    Jiao, Y., Lai, Y., Sun, D., Wang, Y., and Yan, B. Approximation bounds for transformer networks with application to regression. arXiv preprint arXiv:2504.12175, 2025

  18. [18]

    Transformers can overcome the curse of dimensionality: A theoretical study from an approximation perspective

    Jiao, Y., Lai, Y., Wang, Y., and Yan, B. Transformers can overcome the curse of dimensionality: A theoretical study from an approximation perspective. Journal of Machine Learning Research, 27(50): 1–34, 2026

  19. [19]

    Are transformers with one layer self-attention using low-rank weight matrices universal approximators?

    Kajitsuka, T. and Sato, I. Are transformers with one layer self-attention using low-rank weight matrices universal approximators? In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. URL https://openreview.net/forum?id=nJnky5K944

  20. [20]

    On the optimal memorization capacity of transformers

    Kajitsuka, T. and Sato, I. On the optimal memorization capacity of transformers. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=UGVYezlLcZ

  21. [21]

    Provable memorization capacity of transformers

    Kim, J., Kim, M., and Mozafari, B. Provable memorization capacity of transformers. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. URL https://openreview.net/forum?id=8JCg5xJCTPR

  22. [22]

    Universal regular conditional distributions via probabilistic transformers

    Kratsios, A. Universal regular conditional distributions via probabilistic transformers. Constructive Approximation, 57(3): 1145–1212, 2023

  23. [23]

    Universal approximation under constraints is possible with transformers

    Kratsios, A., Zamanlooy, B., Liu, T., and Dokmanic, I. Universal approximation under constraints is possible with transformers. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. URL https://openreview.net/forum?id=JGO8CvG5S9

  24. [24]

    Memorization capacity of multi-head attention in transformers

    Mahdavi, S., Liao, R., and Thrampoulidis, C. Memorization capacity of multi-head attention in transformers. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. URL https://openreview.net/forum?id=MrR3rMxqqv

  25. [25]

    Optimization landscape and expressivity of deep CNNs

    Nguyen, Q. and Hein, M. Optimization landscape and expressivity of deep CNNs. In International Conference on Machine Learning, pp. 3730–3739. PMLR, 2018

  26. [26]

    Provable memorization via deep neural networks using sub-linear parameters

    Park, S., Lee, J., Yun, C., and Shin, J. Provable memorization via deep neural networks using sub-linear parameters. In Conference on Learning Theory, pp. 3627–3661. PMLR, 2021

  27. [27]

    Prompting a pretrained transformer can be a universal approximator

    Petrov, A., Torr, P., and Bibi, A. Prompting a pretrained transformer can be a universal approximator. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. URL https://openreview.net/forum?id=3mQ6ZKTSQl

  28. [28]

    Nonparametric regression using deep neural networks with ReLU activation function

    Schmidt-Hieber, J. Nonparametric regression using deep neural networks with ReLU activation function. The Annals of Statistics, 48(4): 1875–1897, 2020. doi:10.1214/19-AOS1875. URL https://doi.org/10.1214/19-AOS1875

  29. [29]

    Optimal approximation rate of ReLU networks in terms of width and depth

    Shen, Z., Yang, H., and Zhang, S. Optimal approximation rate of ReLU networks in terms of width and depth. Journal de Mathématiques Pures et Appliquées, 157: 101–135, 2022

  30. [30]

    Sharp bounds on the approximation rates, metric entropy, and n-widths of shallow neural networks

    Siegel, J. W. and Xu, J. Sharp bounds on the approximation rates, metric entropy, and n-widths of shallow neural networks. Foundations of Computational Mathematics, 24(2): 481–537, 2024

  31. [31]

    Approximation and estimation ability of transformers for sequence-to-sequence functions with infinite dimensional input

    Takakura, S. and Suzuki, T. Approximation and estimation ability of transformers for sequence-to-sequence functions with infinite dimensional input. In International Conference on Machine Learning, pp. 33416–33447. PMLR, 2023

  32. [32]

    Gemini: A Family of Highly Capable Multimodal Models

    Team, G., Anil, R., Borgeaud, S., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., Millican, K., et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

  33. [33]

    On the optimal memorization power of ReLU neural networks

    Vardi, G., Yehudai, G., and Shamir, O. On the optimal memorization power of ReLU neural networks. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=MkTPtnjeYTV

  34. [34]

    Attention is all you need

    Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

  35. [35]

    On expressive power of looped transformers: Theoretical analysis and enhancement via timestep encoding

    Xu, K. and Sato, I. On expressive power of looped transformers: Theoretical analysis and enhancement via timestep encoding. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=H4BuhRezCV

  36. [36]

    On the rates of convergence for learning with convolutional neural networks

    Yang, Y., Feng, H., and Zhou, D.-X. On the rates of convergence for learning with convolutional neural networks. SIAM Journal on Mathematics of Data Science, 7(4): 1755–1772, 2025

  37. [37]

    Error bounds for approximations with deep ReLU networks

    Yarotsky, D. Error bounds for approximations with deep ReLU networks. Neural Networks, 94: 103–114, 2017. ISSN 0893-6080. doi:10.1016/j.neunet.2017.07.002. URL https://www.sciencedirect.com/science/article/pii/S0893608017301545

  38. [38]

    The phase diagram of approximation rates for deep neural networks

    Yarotsky, D. and Zhevnerchuk, A. The phase diagram of approximation rates for deep neural networks. Advances in Neural Information Processing Systems, 33: 13005–13015, 2020

  39. [39]

    Do transformers really perform badly for graph representation?

    Ying, C., Cai, T., Luo, S., Zheng, S., Ke, G., He, D., Shen, Y., and Liu, T.-Y. Do transformers really perform badly for graph representation? In Advances in Neural Information Processing Systems, volume 34, pp. 28877–28888, 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/f1c1592588411002af340cbaedd6fc33-Paper.pdf

  40. [40]

    Are transformers universal approximators of sequence-to-sequence functions?

    Yun, C., Bhojanapalli, S., Rawat, A. S., Reddi, S., and Kumar, S. Are transformers universal approximators of sequence-to-sequence functions? In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=ByxRM0Ntvr

  41. [41]

    Theory of deep convolutional neural networks: Downsampling

    Zhou, D.-X. Theory of deep convolutional neural networks: Downsampling. Neural Networks, 124: 319–327, 2020