pith. machine review for the scientific record. sign in

arxiv: 2604.17219 · v1 · submitted 2026-04-19 · 📊 stat.ML · cs.LG

Recognition: unknown

PAC-Bayes Bounds for Gibbs Posteriors via Singular Learning Theory

Authors on Pith no claims yet

Pith reviewed 2026-05-10 06:21 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords PAC-Bayes boundsGibbs posteriorssingular learning theorygeneralization boundsoverparameterized modelslow-rank matrix completionReLU neural networks
0
0 comments X

The pith

PAC-Bayes generalization bounds for Gibbs posteriors are derived explicitly using singular learning theory.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives explicit non-asymptotic PAC-Bayes generalization bounds for Gibbs posteriors, which are data-dependent distributions obtained by tilting a prior with empirical risk. This is done by analyzing a marginal-type integral over the parameter space with tools from singular learning theory, yielding characterizations of posterior risk that adapt to data structure and intrinsic complexity. Unlike classical bounds that rely on uniform laws and metric entropy, these posterior-averaged bounds apply to overparameterized singular models. Applications to low-rank matrix completion and ReLU neural network tasks demonstrate that the bounds are analytically tractable and tighter than complexity-based alternatives. This approach shows potential for precise finite-sample guarantees in modern machine learning models.

Core claim

The authors derive explicit non-asymptotic PAC-Bayes bounds for Gibbs posteriors by analyzing the marginal integral in the bound using singular learning theory, obtaining explicit and practically meaningful characterizations of the posterior risk that apply to overparameterized and singular models such as those in low-rank matrix completion and ReLU neural network regression and classification, and which are substantially tighter than classical complexity-based bounds.

What carries the argument

The marginal-type integral over the parameter space, analyzed using singular learning theory to characterize the posterior risk.

If this is right

  • The bounds apply to overparameterized models without needing explicit metric entropy control.
  • They adapt to the data structure and intrinsic model complexity.
  • In applications to low-rank matrix completion, the bounds are analytically tractable.
  • In ReLU neural network regression and classification, the bounds are substantially tighter than classical ones.
  • The results highlight the potential for precise finite-sample generalization guarantees in modern singular models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This framework could be extended to derive similar bounds for other types of posteriors or learning algorithms in singular regimes.
  • It opens the door to using singular learning theory more broadly in generalization analysis beyond PAC-Bayes.
  • Practitioners might use these bounds for better model selection in overparameterized settings where traditional bounds are vacuous.

Load-bearing premise

That the tools from singular learning theory can be applied to obtain explicit and practically meaningful characterizations of the posterior risk for the specific singular models considered, such as low-rank matrix completion and ReLU networks.

What would settle it

An explicit computation in one of the applications, such as low-rank matrix completion, where the bound derived via singular learning theory fails to be tighter than a classical complexity-based PAC-Bayes bound, or where the marginal integral cannot be evaluated explicitly.

Figures

Figures reproduced from arXiv: 2604.17219 by Chenyang Wang, Yun Yang.

Figure 1
Figure 1. Figure 1: Recursive blow-up scheme and candidate poles [PITH_FULL_IMAGE:figures/full_fig_p039_1.png] view at source ↗
read the original abstract

We derive explicit non-asymptotic PAC-Bayes generalization bounds for Gibbs posteriors, that is, data-dependent distributions over model parameters obtained by exponentially tilting a prior with the empirical risk. Unlike classical worst-case complexity bounds based on uniform laws of large numbers, which require explicit control of the model space in terms of metric entropy (integrals), our analysis yields posterior-averaged risk bounds that can be applied to overparameterized models and adapt to the data structure and the intrinsic model complexity. The bound involves a marginal-type integral over the parameter space, which we analyze using tools from singular learning theory to obtain explicit and practically meaningful characterizations of the posterior risk. Applications to low-rank matrix completion and ReLU neural network regression and classification show that the resulting bounds are analytically tractable and substantially tighter than classical complexity-based bounds. Our results highlight the potential of PAC-Bayes analysis for precise finite-sample generalization guarantees in modern overparameterized and singular models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper derives explicit non-asymptotic PAC-Bayes generalization bounds for Gibbs posteriors (data-dependent distributions obtained by exponentially tilting a prior with empirical risk). Unlike classical uniform-convergence bounds requiring metric entropy control, the analysis expresses the bound via a marginal integral ∫ exp(−n R̂(θ)) π(dθ) over parameter space and applies singular learning theory (SLT) tools to obtain explicit, data-adaptive characterizations of posterior risk. Applications to low-rank matrix completion and ReLU neural network regression/classification are presented, with claims that the resulting bounds are analytically tractable and substantially tighter than classical complexity-based bounds.

Significance. If the non-asymptotic character is rigorously preserved and the SLT analysis supplies explicit finite-n bounds with controlled remainders, the work would offer a valuable route to precise generalization guarantees for singular, overparameterized models where standard entropy integrals are intractable. The combination of PAC-Bayes with SLT for explicit posterior-averaged risk is a promising direction, and the applications demonstrate potential practical utility beyond worst-case analysis.

major comments (2)
  1. [§3] §3 (main PAC-Bayes theorem): The derivation claims an explicit non-asymptotic bound by analyzing the marginal integral with SLT, yet the leading-term approximation (via learning coefficient λ and multiplicity) is used without a uniform non-asymptotic remainder estimate valid for all finite n. This risks reducing the bound to an asymptotic statement, undermining the central non-asymptotic claim and the comparison to classical bounds for practical sample sizes.
  2. [§5.2] §5.2 (ReLU network applications): The explicit bounds for regression and classification rely on a specific resolution of singularities for the ReLU loss; it is unclear whether the SLT expansion holds uniformly across the parameter space or whether additional error terms from the singularity resolution are controlled in the finite-n PAC-Bayes inequality.
minor comments (2)
  1. [§2] The notation for the temperature parameter in the Gibbs posterior definition should be introduced earlier and used consistently when stating the marginal integral.
  2. [§5] Figure 1 (bound comparison plots) would benefit from error bars or explicit sample-size annotations to clarify the regime where the new bounds are tighter.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed report. The comments correctly identify points where the non-asymptotic character and the control of approximation errors in the SLT analysis require clearer exposition. We address each major comment below and have revised the manuscript to strengthen the rigor of the finite-n claims without altering the core contributions.

read point-by-point responses
  1. Referee: [§3] §3 (main PAC-Bayes theorem): The derivation claims an explicit non-asymptotic bound by analyzing the marginal integral with SLT, yet the leading-term approximation (via learning coefficient λ and multiplicity) is used without a uniform non-asymptotic remainder estimate valid for all finite n. This risks reducing the bound to an asymptotic statement, undermining the central non-asymptotic claim and the comparison to classical bounds for practical sample sizes.

    Authors: The PAC-Bayes inequality itself is stated and proved exactly for any finite n, expressing the generalization gap in terms of the marginal integral ∫ exp(−n R̂(θ)) π(dθ). Singular learning theory is applied only to obtain an explicit, closed-form characterization of this integral. While the leading terms (involving the learning coefficient λ and multiplicity) are used for tractability, the underlying PAC-Bayes statement remains non-asymptotic. To address the concern about remainders, we have added a new remark in the revised §3 that recalls standard non-asymptotic error bounds from the SLT literature (of order O((log n)/n)) and shows how they propagate into the final PAC-Bayes expression, yielding a fully rigorous finite-n bound with explicit remainder. This preserves the comparison to classical bounds for practical n while keeping the expressions analytically tractable. revision: partial

  2. Referee: [§5.2] §5.2 (ReLU network applications): The explicit bounds for regression and classification rely on a specific resolution of singularities for the ReLU loss; it is unclear whether the SLT expansion holds uniformly across the parameter space or whether additional error terms from the singularity resolution are controlled in the finite-n PAC-Bayes inequality.

    Authors: We agree that uniformity of the singularity resolution must be justified. The manuscript employs a global resolution of singularities for the ReLU loss that covers the entire parameter space via a finite atlas of charts; the marginal integral is then bounded by summing local SLT expansions over this atlas. In the revised §5.2 we have inserted a dedicated paragraph that (i) recalls the covering argument, (ii) states the uniform control on the remainder terms arising from the resolution (again O((log n)/n) uniformly in the charts), and (iii) verifies that these remainders are absorbed into the PAC-Bayes bound without affecting the leading explicit terms. This makes the finite-n guarantee fully rigorous for both the regression and classification settings. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external SLT tools for the marginal integral.

full rationale

The paper starts from the standard PAC-Bayes inequality for Gibbs posteriors, expresses the bound via the marginal integral ∫ exp(−n R̂(θ)) π(dθ), and invokes singular learning theory (an established external body of work) to obtain explicit characterizations of that integral for the singular models in the applications. No step reduces the claimed non-asymptotic bound to a fitted quantity, a self-citation chain, or a redefinition of the target quantity; the SLT analysis supplies the explicit form without the paper re-deriving or assuming its own result. The derivation chain therefore remains independent of its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based on abstract only; the central claim rests on standard domain assumptions from PAC-Bayes and singular learning theory without new free parameters or invented entities visible.

axioms (1)
  • domain assumption Singular learning theory provides tools to analyze marginal-type integrals over parameter spaces for singular models to obtain explicit posterior risk characterizations.
    Invoked directly to derive the explicit bounds and tractable characterizations for the applications.

pith-pipeline@v0.9.0 · 5454 in / 1304 out tokens · 34112 ms · 2026-05-10T06:21:39.199703+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

71 extracted references · 9 canonical work pages · 2 internal anchors

  1. [1]

    arXiv preprint arXiv:2303.15739 , year=

    Bayesian Free Energy of Deep ReLU Neural Network in Overparametrized Cases , author=. arXiv preprint arXiv:2303.15739 , year=

  2. [2]

    The Annals of Statistics , volume =

    Bayesian fractional posteriors , author =. The Annals of Statistics , volume =

  3. [3]

    The Annals of Statistics , volume=

    -variational inference with statistical guarantees , author=. The Annals of Statistics , volume=. 2020 , publisher=

  4. [4]

    Bayesian and likelihood methods in statistics and econometrics , volume=

    The validity of posterior expansions based on Laplace’s method , author=. Bayesian and likelihood methods in statistics and econometrics , volume=

  5. [5]

    The Annals of Statistics , volume =

    Concentration of tempered posteriors and of their variational approximations , author =. The Annals of Statistics , volume =

  6. [6]

    Neural Computation , volume=

    Algebraic analysis for nonidentifiable learning machines , author=. Neural Computation , volume=. 2001 , publisher=

  7. [7]

    The Annals of Statistics , pages=

    From -Entropy to KL-Entropy: Analysis of Minimum Information Complexity Density Estimation , author=. The Annals of Statistics , pages=. 2006 , publisher=

  8. [8]

    Journal of Machine Learning Research , volume=

    A likelihood approach to nonparametric estimation of a singular distribution using deep generative models , author=. Journal of Machine Learning Research , volume=

  9. [9]

    2009 , publisher=

    Algebraic geometry and statistical learning theory , author=. 2009 , publisher=

  10. [10]

    Neural Networks , volume=

    Stochastic complexities of reduced rank regression in Bayesian estimation , author=. Neural Networks , volume=. 2005 , publisher=

  11. [11]

    arXiv preprint arXiv:2501.12747 , year=

    Singular leaning coefficients and efficiency in learning theory , author=. arXiv preprint arXiv:2501.12747 , year=

  12. [12]

    , Cottet, V

    Bayesian matrix completion: prior specification , author=. arXiv preprint arXiv:1406.1440 , year=

  13. [13]

    Bayesian Analysis , volume=

    Bayesian uncertainty quantification for low-rank matrix completion , author=. Bayesian Analysis , volume=. 2023 , publisher=

  14. [14]

    Information and Inference: A Journal of the IMA , volume=

    Size-independent sample complexity of neural networks , author=. Information and Inference: A Journal of the IMA , volume=. 2020 , publisher=

  15. [15]

    User-friendly introduction to

    User-friendly introduction to PAC-Bayes bounds , author=. arXiv preprint arXiv:2110.11216 , year=

  16. [16]

    IEEE Transactions on Information Theory , volume=

    Information-theoretic upper and lower bounds for statistical estimation , author=. IEEE Transactions on Information Theory , volume=. 2006 , publisher=

  17. [17]

    Bernoulli , volume=

    Gibbs posterior concentration rates under sub-exponential type losses , author=. Bernoulli , volume=. 2023 , publisher=

  18. [18]

    Annals of Statistics , volume=

    Posterior convergence rates of Dirichlet mixtures at smooth densities , author=. Annals of Statistics , volume=. 2007 , publisher=

  19. [19]

    Advances in Neural Information Processing Systems , volume=

    PAC-Bayes compression bounds so tight that they can explain generalization , author=. Advances in Neural Information Processing Systems , volume=

  20. [20]

    Communications of the ACM , volume=

    Understanding deep learning (still) requires rethinking generalization , author=. Communications of the ACM , volume=. 2021 , publisher=

  21. [21]

    Advances in Neural Information Processing Systems , volume=

    Uniform convergence may be unable to explain generalization in deep learning , author=. Advances in Neural Information Processing Systems , volume=

  22. [22]

    International Conference on Learning Representations , year=

    A Bayesian Perspective on Generalization and Stochastic Gradient Descent , author=. International Conference on Learning Representations , year=

  23. [23]

    Advances in neural information processing systems , volume=

    Exploring generalization in deep learning , author=. Advances in neural information processing systems , volume=

  24. [24]

    International Conference on Learning Representations , year=

    Non-vacuous Generalization Bounds at the ImageNet Scale: a PAC-Bayesian Compression Approach , author=. International Conference on Learning Representations , year=

  25. [25]

    International Conference on Machine Learning , pages=

    Non-Vacuous Generalization Bounds for Large Language Models , author=. International Conference on Machine Learning , pages=. 2024 , organization=

  26. [26]

    Advances in neural information processing systems , volume=

    Bayesian deep learning and a probabilistic perspective of generalization , author=. Advances in neural information processing systems , volume=

  27. [27]

    Journal of Statistical Planning and Inference , volume=

    Gibbs posterior inference on multivariate quantiles , author=. Journal of Statistical Planning and Inference , volume=. 2022 , publisher=

  28. [28]

    Advances in Neural Information Processing Systems , volume=

    PAC-Bayes analysis beyond the usual bounds , author=. Advances in Neural Information Processing Systems , volume=

  29. [29]

    Journal of the Royal Statistical Society Series B: Statistical Methodology , volume=

    Asymptotic behaviour of the posterior distribution in overfitted mixture models , author=. Journal of the Royal Statistical Society Series B: Statistical Methodology , volume=. 2011 , publisher=

  30. [30]

    Estimating the Number of Components in Finite Mixture Models via Variational Approximation

    Estimating the Number of Components in Finite Mixture Models via Variational Approximation , author=. arXiv preprint arXiv:2404.16746 , year=

  31. [31]

    Journal of the Royal Statistical Society Series B: Statistical Methodology , volume=

    A general framework for updating belief distributions , author=. Journal of the Royal Statistical Society Series B: Statistical Methodology , volume=. 2016 , publisher=

  32. [32]

    International Conference on Algorithmic Learning Theory , pages=

    The safe Bayesian: learning the learning rate via the mixability gap , author=. International Conference on Algorithmic Learning Theory , pages=. 2012 , organization=

  33. [33]

    arXiv preprint arXiv:0712.0248 , year=

    PAC-Bayesian supervised classification: the thermodynamics of statistical learning , author=. arXiv preprint arXiv:0712.0248 , year=

  34. [34]

    Journal of Machine Learning Research , volume=

    On the properties of variational approximations of Gibbs posteriors , author=. Journal of Machine Learning Research , volume=

  35. [35]

    International Conference on Algorithmic Learning Theory , pages=

    Bayesian methods for low-rank matrix estimation: short survey and theoretical study , author=. International Conference on Algorithmic Learning Theory , pages=. 2013 , organization=

  36. [36]

    Proceedings of the twelfth annual conference on Computational learning theory , pages=

    PAC-Bayesian model averaging , author=. Proceedings of the twelfth annual conference on Computational learning theory , pages=

  37. [37]

    Advances in Neural Information Processing Systems , volume=

    PAC-Bayesian theory meets Bayesian inference , author=. Advances in Neural Information Processing Systems , volume=

  38. [38]

    Journal of machine learning research , volume=

    PAC-Bayesian generalisation error bounds for Gaussian process classification , author=. Journal of machine learning research , volume=

  39. [39]

    Proceedings of the IEEE , volume=

    Matrix completion with noise , author=. Proceedings of the IEEE , volume=. 2010 , publisher=

  40. [40]

    Mathematical Programming Computation , volume=

    Parallel stochastic gradient algorithms for large-scale matrix completion , author=. Mathematical Programming Computation , volume=. 2013 , publisher=

  41. [41]

    Conference On Learning Theory , pages=

    Size-independent sample complexity of neural networks , author=. Conference On Learning Theory , pages=. 2018 , organization=

  42. [42]

    International conference on machine learning , pages=

    Rademacher complexity for adversarially robust generalization , author=. International conference on machine learning , pages=. 2019 , organization=

  43. [43]

    Scandinavian Actuarial Journal , volume=

    Model misspecification, Bayesian versus credibility estimation, and Gibbs posteriors , author=. Scandinavian Actuarial Journal , volume=. 2020 , publisher=

  44. [44]

    The Annals of Statistics , pages=

    Estimating the dimension of a model , author=. The Annals of Statistics , pages=

  45. [45]

    Communications in Statistics—Theory and Methods , volume=

    A Bayesian learning coefficient of generalization error and Vandermonde matrix-type singularities , author=. Communications in Statistics—Theory and Methods , volume=. 2010 , publisher=

  46. [46]

    IEICE Trans , volume=

    Resolution of singularities and the generalization error with Bayesian estimation for layered neural network , author=. IEICE Trans , volume=

  47. [47]

    Neural networks , volume=

    Singularities in mixture models and upper bounds of stochastic complexity , author=. Neural networks , volume=. 2003 , publisher=

  48. [48]

    Journal of Machine Learning Research , volume=

    Classification with deep neural networks and logistic loss , author=. Journal of Machine Learning Research , volume=

  49. [49]

    Journal of Machine Learning Research , volume=

    When does gradient descent with logistic loss find interpolating two-layer networks? , author=. Journal of Machine Learning Research , volume=

  50. [50]

    International Conference on Learning Representations , year =

    Gradient Descent Maximizes the Margin of Homogeneous Neural Networks , author=. International Conference on Learning Representations , year =

  51. [51]

    Conference on learning theory , pages=

    Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss , author=. Conference on learning theory , pages=. 2020 , organization=

  52. [52]

    Advances in neural information processing systems , volume=

    Empirical risk minimization under fairness constraints , author=. Advances in neural information processing systems , volume=

  53. [53]

    mixup: Beyond Empirical Risk Minimization

    mixup: Beyond empirical risk minimization , author=. arXiv preprint arXiv:1710.09412 , year=

  54. [54]

    IEEE transactions on neural networks , volume=

    An overview of statistical learning theory , author=. IEEE transactions on neural networks , volume=. 1999 , publisher=

  55. [55]

    2014 , publisher=

    Understanding machine learning: From theory to algorithms , author=. 2014 , publisher=

  56. [56]

    arXiv preprint arXiv:2208.04284 , year=

    On rademacher complexity-based generalization bounds for deep learning , author=. arXiv preprint arXiv:2208.04284 , year=

  57. [57]

    The Annals of Statistics , pages=

    Local Rademacher Complexities and Oracle Inequalities in Risk Minimization , author=. The Annals of Statistics , pages=. 2006 , publisher=

  58. [58]

    2019 , publisher=

    High-dimensional statistics: A non-asymptotic viewpoint , author=. 2019 , publisher=

  59. [59]

    Conference on Learning Theory , pages=

    Learning with square loss: Localization through offset rademacher complexity , author=. Conference on Learning Theory , pages=. 2015 , organization=

  60. [60]

    Complex powers and asymptotic expansions. II. , author=. Journal f

  61. [61]

    Proceedings of the London Mathematical Society , volume=

    Zeta functions for curves and log canonical models , author=. Proceedings of the London Mathematical Society , volume=. 1997 , publisher=

  62. [62]

    Journal of the Royal Statistical Society Series B: Statistical Methodology , volume=

    A Bayesian information criterion for singular models , author=. Journal of the Royal Statistical Society Series B: Statistical Methodology , volume=. 2017 , publisher=

  63. [63]

    Journal of the American Statistical Association , volume=

    Convexity, classification, and risk bounds , author=. Journal of the American Statistical Association , volume=. 2006 , publisher=

  64. [64]

    Journal of Machine Learning Research , volume=

    Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks , author=. Journal of Machine Learning Research , volume=

  65. [65]

    The Annals of Statistics , volume=

    Statistical behavior and consistency of classification methods based on convex risk minimization , author=. The Annals of Statistics , volume=. 2004 , publisher=

  66. [66]

    The Annals of Statistics , volume=

    Optimal aggregation of classifiers in statistical learning , author=. The Annals of Statistics , volume=. 2004 , publisher=

  67. [67]

    Algorithmic learning theory , pages=

    A tight excess risk bound via a unified PAC-Bayesian--Rademacher--Shtarkov--MDL complexity , author=. Algorithmic learning theory , pages=. 2019 , organization=

  68. [68]

    Conference on Learning Theory , pages=

    Distribution-dependent analysis of Gibbs-ERM principle , author=. Conference on Learning Theory , pages=. 2019 , organization=

  69. [69]

    arXiv preprint arXiv:2502.11071 , year=

    Generalization of the Gibbs algorithm with high probability at low temperatures , author=. arXiv preprint arXiv:2502.11071 , year=

  70. [70]

    International conference on machine learning , pages=

    Generalization bounds using data-dependent fractal dimensions , author=. International conference on machine learning , pages=. 2023 , organization=

  71. [71]

    IEEE Transactions on Information Theory , volume=

    Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements , author=. IEEE Transactions on Information Theory , volume=. 2011 , publisher=