From Privacy to Generalization: Linear Max-Information Bounds for DP-SGD
Pith reviewed 2026-06-29 23:05 UTC · model grok-4.3
The pith
DP-SGD's approximate max-information is at most linear in dataset size, yielding PAC-Bayes bounds with learnable priors.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We prove a finite-sample bound on the approximate max-information of DP-SGD that is at most linear in the dataset size, from which we obtain a general-purpose PAC-Bayes generalization bound in which the prior can be learned by DP-SGD, as well as a generalization bound for DP-SGD-trained models with a complexity term that is fully explicit and controlled by the optimization hyperparameters.
What carries the argument
Approximate max-information of the DP-SGD output distribution, bounded linearly in sample size through analysis of its standard Gaussian noise addition mechanism.
Load-bearing premise
The finite-sample max-information bound for DP-SGD holds under the standard noise-addition mechanism of DP-SGD and the definition of approximate max-information, without requiring further restrictions on data distribution or model class beyond those implicit in the DP-SGD setup.
What would settle it
A calculation or dataset where the approximate max-information of standard DP-SGD exceeds linear scaling in the number of samples would disprove the claimed bound.
Figures
read the original abstract
Understanding the relationship between generalization and privacy remains a central challenge in modern machine learning theory, particularly for deep networks trained by variants of differentially private stochastic gradient descent (DP-SGD). In this work we make progress on this persistent open problem by proving a finite-sample bound on the approximate max-information of DP-SGD that exhibits scaling properties comparable with (Dwork et al, 2015)'s classic result for $\epsilon$-differentially private algorithms, namely at most linear in the dataset size. From our result we obtain a general-purpose PAC-Bayes generalization bound in which the necessary prior distribution can be learned by DP-SGD, as well as a generalization bound for DP-SGD-trained models themselves, with a complexity term that is fully explicit and controlled by the optimization hyperparameters.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to prove a finite-sample bound on the approximate max-information of DP-SGD (via the standard Gaussian noise + clipping + composition mechanism) that is at most linear in the dataset size n. From this it derives a general-purpose PAC-Bayes generalization bound in which the prior may be learned by DP-SGD itself, together with an explicit generalization bound for DP-SGD-trained models whose complexity term is controlled by the optimization hyperparameters.
Significance. If the stated max-information bound holds, the work supplies a direct, finite-sample bridge from DP-SGD properties to generalization via approximate max-information, extending the linear scaling of Dwork et al. (2015) to the DP-SGD setting. This yields PAC-Bayes corollaries with privately learned priors and fully explicit, hyperparameter-controlled complexity terms, which would be a useful theoretical tool for analyzing the privacy-generalization tradeoff in private deep learning.
minor comments (2)
- [Abstract] Abstract: the phrase 'comparable with (Dwork et al, 2015)' should be expanded to a full citation on first use and the reference list entry should be checked for completeness.
- The definition of approximate max-information and the precise statement of the linear bound (including dependence on privacy parameters ε, δ and noise scale) should appear in the main text before the PAC-Bayes corollaries are derived, to make the logical flow self-contained.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our manuscript and the recommendation of minor revision. The referee's summary correctly captures the main claims regarding the linear max-information bound for DP-SGD and the resulting PAC-Bayes corollaries.
Circularity Check
No significant circularity; derivation self-contained from DP definitions
full rationale
The central result is a finite-sample bound on approximate max-information for the standard DP-SGD mechanism (Gaussian noise, clipping, composition), derived directly from the definition of approximate max-information and DP properties. This bound is then used to obtain PAC-Bayes corollaries with explicit complexity terms controlled by hyperparameters. No load-bearing step reduces by construction to a fitted parameter, self-definition, or self-citation chain; the cited Dwork et al. (2015) result is external and the derivation does not rename known empirical patterns or smuggle ansatzes. The paper is self-contained against external DP benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption DP-SGD satisfies the standard (ε,δ)-differential privacy definition via its noise mechanism
- domain assumption Approximate max-information admits a finite-sample linear bound under DP
Reference graph
Works this paper leans on
-
[1]
Generalization in Adaptive Data Analysis and Holdout Reuse
Dwork, C., Feldman, V ., Hardt, M., Pitassi, T., Reingold, O., and Roth, A. Generalization in adaptive data analysis and holdout reuse.arXiv:1506.02629, 2015a. Dwork, C., Feldman, V ., Hardt, M., Pitassi, T., Reingold, O., and Roth, A. Generalization in adaptive data analysis and holdout reuse. InConference on Neural Information Processing Systems (NeurIP...
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
Feldman, V . and Shenfeld, M. Privacy amplification by random allocation.arXiv:2502.08202,
-
[3]
García-Pérez, D., Parrado-Hernández, E., and Shawe-Taylor, J. Some theoretical improvements on the tightness of PAC-Bayes risk certificates for neural networks.arXiv preprint arXiv:2510.07935,
-
[4]
Generalization of Gibbs and Langevin Monte Carlo Algorithms in the Interpolation Regime
Maurer, A., Mirzaei, E., and Pontil, M. Generalization of Gibbs and Langevin Monte Carlo algorithms in the interpolation regime.arXiv:2510.06028,
work page internal anchor Pith review Pith/arXiv arXiv
-
[5]
A., Dvijotham, K., Ganesh, A., Henzinger, M., Katz, J., McKenna, R., McMahan, H
11 Pillutla, K., Upadhyay, J., Choquette-Choo, C. A., Dvijotham, K., Ganesh, A., Henzinger, M., Katz, J., McKenna, R., McMahan, H. B., Rush, K., Steinke, T., and Thakurta, A. Correlated noise mechanisms for differentially private learning.arXiv:2506.08201,
-
[6]
Rivasplata, O., Tankasali, V . M., and Szepesvári, C. PAC-Bayes with backprop.arXiv:1908.07380,
-
[7]
Max-Information, Differential Privacy, and Post-Selection Hypothesis Testing
Rogers, R., Roth, A., Smith, A., and Thakkar, O. Max-information, differential privacy, and post-selection hypothesis testing.arXiv: 1604.03924,
work page internal anchor Pith review Pith/arXiv arXiv
-
[8]
Sinha, A., Mesnard, T., McKenna, R., Liu, D., Choquette-Choo, C. A., Huang, Y ., Yu, D., Kaissis, G., Charles, Z., Liu, R., Chua, L., Kamath, P., Manurangsi, P., He, S., Zhang, C., Ghazi, B., Pigem, B. D. B., Eruvbetine, P., Warkentin, T., Joulin, A., and Kumar, R. VaultGemma: A differentially private gemma model. arXiv:2510.15001,
-
[9]
Most commonly, in supervised classification, X would contain input-output pairs, Y, could be the model parameters, e.g
A.1 Learning Setup The learning setup in which we work (input space X , model space Y, loss function ℓ:X × Y →R +) provides a unified notation for a number of standard learning scenarios. Most commonly, in supervised classification, X would contain input-output pairs, Y, could be the model parameters, e.g. of a neural network, and ℓ measures if the model ...
2011
-
[10]
•θ t =θ t−1 − ηt√ˆvt+ϵ ˆmt, for •ˆmt = mt 1−βt 1 withm t =β 1mt−1 + (1−β 1)ut, and •ˆvt = vt 1−βt 2 withv t =β 2vt−1 + (1−β 2)u2 t
additionally perform a component- wise normalization of the updates, e.g. •θ t =θ t−1 − ηt√ˆvt+ϵ ˆmt, for •ˆmt = mt 1−βt 1 withm t =β 1mt−1 + (1−β 1)ut, and •ˆvt = vt 1−βt 2 withv t =β 2vt−1 + (1−β 2)u2 t . One can see that in all of them, the model parameters are a deterministic linear (DP-SGD) or non-linear (DP-Adam) combination of the update vectors, s...
2006
-
[11]
Then it holds for anyt∈ {1,
Lemma 9.In the setting above, letν=m ζ2 σ2 . Then it holds for anyt∈ {1, . . . , T}: E ˜S h ∥ ˜Gt −µ t∥2 Y t−1 0 i ≤ν,E ˜S h ∥ ˜Gt −µ t∥ Y t−1 0 i ≤ √ν,(49) and E h ∥Gt −µ t∥2 Y t−1 0 i ≤ν,E h ∥Gt −µ t∥ Y t−1 0 i ≤ √ν.(50) Note that compared to Lemma 3, the constant is ν instead of 2ν here, because for DP-SGD we can exploit that Ψ is a sum of independent ...
2018
-
[12]
As a consequence of the monotonicity, we obtain upper and lower bounds
≤ log 1.25√log 2 + log 1.25−3− p log 2−log 2≈ −4.03<0(91) This confirms that R′ is strictly decreasing as a function of b, and therefore R is strictly increasing as a function of β. As a consequence of the monotonicity, we obtain upper and lower bounds. For any α∈(0,3]andβ∈(0, 1 2], that is R(α, β)≤R(α= 3, β= 1 2)≈1.95<2,(92) R(α, β)≥lim β′→0 R(α= 0, β ′)...
2025
-
[13]
All reported guarantees are based only on the training data
data with 60,000 training sam- ples, and CIFAR-10 (Krizhevsky & Hinton, 2009)50,000 training samples. All reported guarantees are based only on the training data. Models.For MNIST, we use a fully-connected MLP with three 600-unit hidden layers and a 10-class output layer. For CIFAR-10, we use a nine-layer ConvNet with six convolutional layers followed by ...
2009
-
[14]
For all experiments, µρ and Σρ are trained by optimizing the corresponding bounds
The posterior is also a Gaussian distribution ρ=N(µ ρ,Σ ρ),(96) with diagonal Σρ. For all experiments, µρ and Σρ are trained by optimizing the corresponding bounds. Training the DP-SGD prior.For training the prior, we train a deterministic model based on the DP-SGD algorithm (Algorithm 1). For CIFAR-10, we use batch size m= 5000 , clipping norm from ζ∈ {0...
2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.