pith. sign in

arxiv: 2605.20988 · v1 · pith:FQLV235Rnew · submitted 2026-05-20 · 💻 cs.LG · cs.AI

A Sharper Picture of Generalization in Transformers

Pith reviewed 2026-05-21 05:57 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords transformersgeneralizationPAC-BayesFourier spectraboolean functionsflat minimasharpnesssparse spectra
0
0 comments X

The pith

Transformers can implement any boolean function with sparsity at most the context length using flat minima, which then yield non-vacuous PAC-Bayes generalization bounds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates generalization in transformers on boolean domains through the lens of Fourier spectra rather than Rademacher complexity. It establishes that sparse, low-degree spectra permit the construction of low-sharpness flat minima capable of realizing the target functions. By proving the existence of such minima for any boolean function whose sparsity does not exceed the context length and then applying PAC-Bayes bounds to an idealized learner that uses them, the authors obtain generalization guarantees that remain non-vacuous. Empirical evaluations and mechanistic interpretability experiments are used to check whether real trained transformers behave in ways consistent with these constructions.

Core claim

We show that sparse spectra concentrated on low-degree components enable low-sharpness constructions with good generalization properties. Our idea is to show the existence of flat minima implementing any boolean function of sparsity no greater than the context length, and then apply a PAC-Bayes bound to an idealized low-sharpness learner, resulting in a non-vacuous generalization bound.

What carries the argument

Existence of flat (low-sharpness) minima that exactly implement any sparse boolean function whose sparsity is bounded by the context length; these minima serve as the basis for applying PAC-Bayes theory to obtain non-vacuous bounds.

If this is right

  • Any boolean function whose sparsity is at most the context length admits a flat-minimum realization inside the transformer parameter space.
  • PAC-Bayes bounds applied to an idealized learner that selects among such low-sharpness solutions produce non-vacuous generalization guarantees.
  • Empirical predictions derived from the low-sharpness construction are testable on concrete transformer training runs.
  • Mechanistic interpretability analysis can reveal whether real transformers exhibit weight configurations consistent with the flat-minima construction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the flat-minima construction holds, transformers may implicitly select low-sharpness solutions when the data distribution favors sparse low-degree functions.
  • Similar existence arguments could be attempted for other architectures that admit a notion of sharpness and can represent boolean functions exactly.
  • Increasing context length would immediately extend the class of functions for which non-vacuous bounds are guaranteed under the same argument.

Load-bearing premise

The existence of flat minima implementing any boolean function of sparsity no greater than the context length.

What would settle it

Training or optimization runs that fail to locate any flat minimum realizing a specific sparse boolean function whose sparsity is at most the context length, or measurements showing that the resulting PAC-Bayes bound is still vacuous despite using the idealized low-sharpness learner.

Figures

Figures reproduced from arXiv: 2605.20988 by Paul Lintilhac, Sair Shaikh.

Figure 1
Figure 1. Figure 1: (left) A plot showing our semi-analytic generalization bound for [PITH_FULL_IMAGE:figures/full_fig_p008_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Sharpness and Frobenius norm comparisons between the exact construction and the learned [PITH_FULL_IMAGE:figures/full_fig_p010_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: This plot shows the empirical 90th−percentile perturbation of the sharpness (the trace of the loss Hessian), and how it depends on the degree Df and sparsity ω of the target function it expresses, as well as the sequence length T. each line is a different magnitude of the perturbation, and naturally the size of the perturbation always increases as σ increases. While qualitatively similar to our analytic bo… view at source ↗
Figure 5
Figure 5. Figure 5: Left: A plot showing the combined attention matrix W for a function learned with an architecture [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗
Figure 4
Figure 4. Figure 4: This plot shows our analytic bound on the perturbation to the sharpness over the same grid of [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗
Figure 6
Figure 6. Figure 6: This plot shows a comparison of the bound with analytic worst-case bound on [PITH_FULL_IMAGE:figures/full_fig_p031_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: A diagram showing the high-level dependency graph of our 1.5-layer transformer construction. [PITH_FULL_IMAGE:figures/full_fig_p035_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Left: A plot showing the upper bound on the maximum degree of the target function obtained [PITH_FULL_IMAGE:figures/full_fig_p046_8.png] view at source ↗
read the original abstract

We study transformers' generalization behavior on boolean domains from the perspective of the Fourier Spectra of their target functions. In contrast to prior work (Edelman et al., 2022; Trauger and Tewari, 2024), which derived generalization bounds from Rademacher complexity, we investigate the feasibility of obtaining generalization bounds via PAC-Bayes theory. We show that sparse spectra concentrated on low-degree components enable low-sharpness constructions with good generalization properties. Our idea is to show the existence of flat minima implementing any boolean function of sparsity no greater than the context length, and then apply a PAC-Bayes bound to an idealized low-sharpness learner, resulting in a non-vacuous generalization bound. We evaluate predictions empirically and conduct a mechanistic interpretability study to support the realism of our theoretical construction in real transformers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that sparse Fourier spectra concentrated on low-degree components enable low-sharpness constructions in transformers that implement any boolean function whose sparsity is at most the context length. The authors establish the existence of such flat minima, then apply a PAC-Bayes bound to an idealized low-sharpness learner to obtain a non-vacuous generalization bound. This approach is positioned as an alternative to prior Rademacher-complexity analyses, with supporting empirical evaluations of the theoretical predictions and a mechanistic interpretability study examining the construction's realism in trained models.

Significance. If the existence result is shown with explicit, function-independent control of sharpness and the resulting PAC-Bayes bound is rigorously non-vacuous, the work would supply a useful theoretical account of why transformers can generalize on sparse boolean tasks. It would highlight the interplay between Fourier sparsity, flat minima, and generalization, offering a concrete alternative to complexity-based bounds and potentially explaining empirical observations of flat minima. The empirical validation and interpretability analysis add practical relevance.

major comments (2)
  1. [§3.2, Theorem 1] §3.2, Theorem 1: The existence construction for flat minima must explicitly bound the sharpness measure (e.g., Hessian trace or largest eigenvalue) by a quantity depending only on context length and sparsity level, independent of the particular boolean function or its Fourier coefficients. If sharpness scales with the target function, the idealized low-sharpness learner cannot be defined uniformly and the subsequent PAC-Bayes bound collapses to vacuous for some sparse functions.
  2. [§4, Eq. (8)] §4, Eq. (8): The PAC-Bayes application to the idealized learner requires a prior that is independent of the data yet yields a controlled KL term; it is unclear whether the bound remains non-vacuous when the posterior is centered at the constructed flat minimum for arbitrary sparse spectra up to the context length.
minor comments (2)
  1. [Figure 3] The caption of Figure 3 should specify the exact sharpness metric plotted and the number of random seeds used for the error bars.
  2. Notation for the Fourier support size S and context length n should be introduced consistently in the introduction rather than first appearing in the theoretical section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments highlight important points about uniformity in our constructions and the PAC-Bayes application. We respond to each major comment below.

read point-by-point responses
  1. Referee: [§3.2, Theorem 1] The existence construction for flat minima must explicitly bound the sharpness measure (e.g., Hessian trace or largest eigenvalue) by a quantity depending only on context length and sparsity level, independent of the particular boolean function or its Fourier coefficients. If sharpness scales with the target function, the idealized low-sharpness learner cannot be defined uniformly and the subsequent PAC-Bayes bound collapses to vacuous for some sparse functions.

    Authors: We agree that an explicit function-independent bound is necessary to ensure the idealized learner is well-defined uniformly. In the construction underlying Theorem 1, the transformer parameters are set by routing each sparse low-degree Fourier term through dedicated attention heads and MLP channels using a fixed template whose scaling depends only on context length and sparsity level; the resulting Hessian trace (or largest eigenvalue) is then bounded by a quantity polynomial in the context length and linear in the sparsity level, with no dependence on the numerical values of the Fourier coefficients. We will revise the theorem statement and proof to state this bound explicitly. revision: yes

  2. Referee: [§4, Eq. (8)] The PAC-Bayes application to the idealized learner requires a prior that is independent of the data yet yields a controlled KL term; it is unclear whether the bound remains non-vacuous when the posterior is centered at the constructed flat minimum for arbitrary sparse spectra up to the context length.

    Authors: The prior is a fixed, data-independent Gaussian over parameter space whose variance is chosen once to cover the entire family of constructed minima for all sparse spectra up to context length. The posterior is a Gaussian centered at the flat minimum whose covariance is inversely proportional to the (uniformly bounded) sharpness; the resulting KL term is therefore bounded by a function of context length and sparsity alone. Consequently the PAC-Bayes bound remains non-vacuous whenever the constructed model achieves zero empirical risk, which it does by design. We will add a clarifying paragraph after Equation (8) that makes the data-independence and uniform KL control explicit. revision: partial

Circularity Check

0 steps flagged

No circularity: existence construction and PAC-Bayes application are independent of the target bound

full rationale

The paper's chain proceeds by first establishing (via construction or proof) the existence of flat minima in transformer parameter space that realize any boolean function whose Fourier support size is at most the context length, then feeding that idealized low-sharpness learner into a standard PAC-Bayes bound to obtain a non-vacuous generalization guarantee. No step reduces the claimed existence or the resulting bound to a fitted parameter, a self-citation chain, or a redefinition of the target quantity; the abstract and skeptic summary both treat the existence result as a separate, load-bearing mathematical claim rather than a tautology or data-dependent fit. The derivation is therefore self-contained against external benchmarks once the existence statement is accepted.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central construction rests on one key unproven existence statement extracted from the abstract; no free parameters or invented entities are described.

axioms (1)
  • ad hoc to paper Existence of flat minima implementing any boolean function of sparsity no greater than the context length
    This existence statement is required to define the idealized low-sharpness learner to which the PAC-Bayes bound is applied.

pith-pipeline@v0.9.0 · 5661 in / 1499 out tokens · 45560 ms · 2026-05-21T05:57:41.515603+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 2 internal anchors

  1. [1]

    Generalization on the unseen, logic reasoning and degree curriculum

    Abbe, E., Bengio, S., Lotfi, A., and Rizk, K. Generalization on the unseen, logic reasoning and degree curriculum. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.),International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 ofProceedings of Machine Learning Research,...

  2. [2]

    How far can transformers reason? the globalitybarrierandinductivescratchpad

    Abbé, E., Bengio, S., Lotfi, A., Sandon, C., and Saremi, O. How far can transformers reason? the globalitybarrierandinductivescratchpad. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URLhttps://openreview.net/forum?id=FoGwiFXzuN

  3. [3]

    Testing low-degree polynomials over GF(2)

    Alon, N., Kaufman, T., Krivelevich, M., Litsyn, S., and Ron, D. Testing low-degree polynomials over GF(2). InApproximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pp. 188–199. Springer, 2003

  4. [4]

    A user-friendly introduction to pac-bayes bounds

    Alquier, P. A user-friendly introduction to pac-bayes bounds. 2023

  5. [5]

    Exploring length generalization in large language models.Advances in Neural Information Processing Systems, 35:38546–38556, 2022

    Anil, C., Wu, Y., Andreassen, A., Lewkowycz, A., Misra, V., Ramasesh, V., Slone, A., Gur-Ari, G., Dyer, E., and Neyshabur, B. Exploring length generalization in large language models.Advances in Neural Information Processing Systems, 35:38546–38556, 2022

  6. [6]

    On the ability and limitations of transformers to recognize formal languages

    Bhattamishra, S., Ahuja, K., and Goyal, N. On the ability and limitations of transformers to recognize formal languages. In Webber, B., Cohn, T., He, Y., and Liu, Y. (eds.),Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pp. 7096–7116. Association for Computational Linguisti...

  7. [7]

    Simplicity bias in transformers and their ability tolearnsparsebooleanfunctions

    Bhattamishra, S., Patel, A., Kanade, V., and Blunsom, P. Simplicity bias in transformers and their ability tolearnsparsebooleanfunctions. InRogers, A., Boyd-Graber, J.L., andOkazaki, N.(eds.),Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pp. 5767...

  8. [8]

    A., Charton, F., and Kempe, J

    Cabannes, V., Arnal, C., Bouaziz, W., Yang, X. A., Charton, F., and Kempe, J. Iteration head: A mechanistic study of chain-of-thought. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URLhttps://openreview.net/forum?id=QBCxWpOt5w

  9. [9]

    and Cholak, P

    Chiang, D. and Cholak, P. Overcoming a theoretical limitation of self-attention. In Muresan, S., Nakov, P., and Villavicencio, A. (eds.),Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pp. 7654–7664. Association for Computational Linguistics, 2022....

  10. [10]

    On the optimization and generalization of multi-head attention.Transactions on Machine Learning Research, 2023

    Deora, P., Ghaderi, R., Taheri, H., and Thrampoulidis, C. On the optimization and generalization of multi-head attention.Transactions on Machine Learning Research, 2023

  11. [11]

    L., Goel, S., Kakade, S., and Zhang, C

    Edelman, B. L., Goel, S., Kakade, S., and Zhang, C. Inductive biases and variable creation in self-attention mechanisms. InInternational Conference on Machine Learning, pp. 5793–5831. PMLR, 2022

  12. [12]

    A., Shpilka, A., and Wimmer, K

    Gopalan, P., O’Donnell, R., Servedio, R. A., Shpilka, A., and Wimmer, K. Testing fourier dimensionality and sparsity.SIAM Journal on Computing, 40(4):1075–1100, 2011. doi: 10.1137/100785429. URL https://doi.org/10.1137/100785429

  13. [13]

    Theoretical limitations of self-attention in neural sequence models.Transactions of the Association for Computational Linguistics, 8:156–171, 2020

    Hahn, M. Theoretical limitations of self-attention in neural sequence models.Transactions of the Association for Computational Linguistics, 8:156–171, 2020

  14. [14]

    and Rofin, M

    Hahn, M. and Rofin, M. Why are sensitive functions hard for transformers? In Ku, L.-W., Martins, A., and Srikumar, V. (eds.),Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 14973–15008, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-l...

  15. [15]

    On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

    Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. On large-batch training for deep learning: Generalization gap and sharp minima.CoRR, abs/1609.04836, 2016. URL http://arxiv.org/abs/1609.04836

  16. [16]

    and Suzuki, T

    Kim, J. and Suzuki, T. Transformers provably solve parity efficiently with chain of thought. InNeurIPS 2024 Workshop on Mathematics of Modern Machine Learning, 2024. URLhttps://openreview.net/ forum?id=E7HwPhfX1B

  17. [17]

    and Massart, P

    Laurent, B. and Massart, P. Adaptive estimation of a quadratic functional by model selection.The Annals of Statistics, 28(5):1302–1338, 2000. doi: 10.1214/aos/1015957395

  18. [18]

    T., Goel, S., Krishnamurthy, A., and Zhang, C

    Liu, B., Ash, J. T., Goel, S., Krishnamurthy, A., and Zhang, C. Transformers learn shortcuts to automata,

  19. [19]

    URLhttps://arxiv.org/abs/2210.10749

  20. [20]

    and Sabharwal, A

    Merrill, W. and Sabharwal, A. The parallelism tradeoff: Limitations of log-precision transformers. Transactions of the Association for Computational Linguistics, 11:531–545, 2023

  21. [21]

    Exploring Generalization in Deep Learning

    Neyshabur, B., Bhojanapalli, S., McAllester, D., and Srebro, N. Exploring generalization in deep learning. CoRR, abs/1706.08947, 2017. URLhttp://arxiv.org/abs/1706.08947

  22. [22]

    and Loizou, N

    Oikonomou, D. and Loizou, N. Sharpness-aware minimization: General analysis and improved rates. In The Thirteenth International Conference on Learning Representations, 2025. URLhttps://openreview. net/forum?id=8rvqpiTTFv

  23. [23]

    Representational strengths and limitations of transformers

    Sanford, C., Hsu, D., and Telgarsky, M. Representational strengths and limitations of transformers. CoRR, abs/2306.02896, 2023. doi: 10.48550/ARXIV.2306.02896. URLhttps://doi.org/10.48550/ arXiv.2306.02896

  24. [24]

    Transformers, parallel computation, and logarithmic depth

    Sanford, C., Hsu, D., and Telgarsky, M. Transformers, parallel computation, and logarithmic depth. InForty-first International Conference on Machine Learning, 2024. URLhttps://openreview.net/ forum?id=QCZabhKQhB

  25. [25]

    Transformers as recognizers of formal languages: A survey on expressivity.CoRR, abs/2311.00208, 2023

    Strobl, L., Merrill, W., Weiss, G., Chiang, D., and Angluin, D. Transformers as recognizers of formal languages: A survey on expressivity.CoRR, abs/2311.00208, 2023. doi: 10.48550/ARXIV.2311.00208. URLhttps://doi.org/10.48550/arXiv.2311.00208

  26. [26]

    and Tewari, A

    Trauger, J. and Tewari, A. Sequence length independent norm-based generalization bounds for trans- formers. InInternational Conference on Artificial Intelligence and Statistics, pp. 1405–1413. PMLR, 2024

  27. [27]

    Thinking like transformers

    Weiss, G., Goldberg, Y., and Yahav, E. Thinking like transformers. InInternational Conference on Machine Learning, pp. 11080–11090. PMLR, 2021

  28. [28]

    half-layer

    Yi, K. and Zhang, Q. Nearly optimal sparse Fourier transform in the general case. InProceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing (STOC), pp. 590–600. ACM, 2019. 16 Contents 1 Theoretical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2 Experiments . . . . . . . . . . . . . . ...

  29. [29]

    remainder perturbations

    (see also 19 in the supplementary material), with probability1−δ,such an R.V. is upper bounded as ∥ϵt,:∥≤1√ d √ d+ 2 √ dlog( 1 δ) + 2log(1 δ) To upper bound the maximum ofTsuch variables, we apply a union bound: max t∈T ∥ϵt,:∥≤1 d ( d+ 2 √ dlog(T δ) + 2log(T δ) ) Note that per our definition ofd, d>8log(T) ϵ2p .Therefore, with probability at least1−δ, max...

  30. [30]

    intra-layer derivative

    provide a bound intended for losses that are possibly unbounded, but have a tail distribution that is sub-gaussian. Mathematically, we assume that for some positive constantΣ,our loss function satisfies: E[et[l(Θ,x,f)−E[l(Θ,x,f)]]]≤e Σ2t2 2 . Then with probability at least1−δ,the following bound holds: EΘ∼Q[L(fΘ )]≤EΘ∼Q[ˆL(fΘ )] +λΣ2 2m + KL(Q∥P) + ln1 δ ...

  31. [31]

    remainder perturbations

    + (8d + 6)(Df + 1) )) block matrix. To bound the maximum eigenvalue of the Hessian, we want to find max||v||=1⟨v,∇2 ΘT(X,Θ)v⟩,where v=concat(s,p,q,r,η,γ,ν,), =concat(s 1,...,sd+1,p 1,...,pd+1,q 1,...,qd+1,r 1,...rd+1,η1,...,ηd+1,γ1,ν1,...,ν4(Df+1)), wheres i∈R,pi,qi,ri,νi∈Rd+1,ηi,γ1∈R4(Df+1).Written out explicitly, we seek to upper bound ∇2 ΘT(X,Θ)  =...