From Privacy to Generalization: Linear Max-Information Bounds for DP-SGD

Christoph H. Lampert; Hossein Zakerinia

arxiv: 2605.26222 · v1 · pith:BFKO6DMDnew · submitted 2026-05-25 · 💻 cs.LG · stat.ML

From Privacy to Generalization: Linear Max-Information Bounds for DP-SGD

Christoph H. Lampert , Hossein Zakerinia This is my paper

Pith reviewed 2026-06-29 23:05 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords differential privacyDP-SGDmax-informationPAC-Bayesgeneralization boundsprivacy

0 comments

The pith

DP-SGD's approximate max-information is at most linear in dataset size, yielding PAC-Bayes bounds with learnable priors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves a finite-sample upper bound on the approximate max-information of DP-SGD that grows at most linearly with the size of the training set. This bound mirrors the scaling of classic differential privacy results and directly implies a PAC-Bayes generalization bound in which the required prior distribution over hypotheses can be produced by applying DP-SGD itself to the data. A second consequence is a generalization bound for models trained by DP-SGD whose complexity penalty is expressed explicitly in terms of the algorithm's noise scale, learning rate, and other optimization choices. The result addresses the link between privacy mechanisms and generalization without imposing extra assumptions on the data distribution or the model architecture.

Core claim

We prove a finite-sample bound on the approximate max-information of DP-SGD that is at most linear in the dataset size, from which we obtain a general-purpose PAC-Bayes generalization bound in which the prior can be learned by DP-SGD, as well as a generalization bound for DP-SGD-trained models with a complexity term that is fully explicit and controlled by the optimization hyperparameters.

What carries the argument

Approximate max-information of the DP-SGD output distribution, bounded linearly in sample size through analysis of its standard Gaussian noise addition mechanism.

Load-bearing premise

The finite-sample max-information bound for DP-SGD holds under the standard noise-addition mechanism of DP-SGD and the definition of approximate max-information, without requiring further restrictions on data distribution or model class beyond those implicit in the DP-SGD setup.

What would settle it

A calculation or dataset where the approximate max-information of standard DP-SGD exceeds linear scaling in the number of samples would disprove the claimed bound.

Figures

Figures reproduced from arXiv: 2605.26222 by Christoph H. Lampert, Hossein Zakerinia.

**Figure 1.** Figure 1: The coefficient constants for identical ϵ are comparable between Dwork et al. (2015b)’s expressions for ϵ-DP and our results for the Gaussian mechanism. Note that both results apply in different situations, as the Gaussian mechanism is not ϵ-DP for any ϵ. which holds because the effect of α cancels out in the first term. Then, we determine a value q that balances the exponents, i.e. we have to solve the sy… view at source ↗

**Figure 2.** Figure 2: Numeric values of complexity terms in Theorem 2 and Corollary 3 for different hyperpa [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗

read the original abstract

Understanding the relationship between generalization and privacy remains a central challenge in modern machine learning theory, particularly for deep networks trained by variants of differentially private stochastic gradient descent (DP-SGD). In this work we make progress on this persistent open problem by proving a finite-sample bound on the approximate max-information of DP-SGD that exhibits scaling properties comparable with (Dwork et al, 2015)'s classic result for $\epsilon$-differentially private algorithms, namely at most linear in the dataset size. From our result we obtain a general-purpose PAC-Bayes generalization bound in which the necessary prior distribution can be learned by DP-SGD, as well as a generalization bound for DP-SGD-trained models themselves, with a complexity term that is fully explicit and controlled by the optimization hyperparameters.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper claims a linear-in-n max-information bound for standard DP-SGD that directly yields PAC-Bayes bounds with a DP-learned prior and explicit hyperparameter-controlled complexity.

read the letter

The central result is a finite-sample bound showing that approximate max-information for DP-SGD stays at most linear in dataset size, matching the scaling from Dwork et al 2015 but now for the clipped noisy SGD case. From there they derive a general PAC-Bayes bound that lets the prior itself be fit by DP-SGD, plus a direct generalization bound for the trained model whose complexity term depends only on the noise scale, clipping threshold, and step count.

The new part is the explicit tailoring to DP-SGD's mechanism and the resulting forms that keep the prior learnable under privacy. If the steps go through cleanly, this removes a long-standing barrier between privacy accounting and data-dependent generalization bounds. The abstract is precise about what is claimed and what assumptions are not needed.

The derivation is the part that needs checking. The bound has to survive the composition over many SGD steps, the clipping operation, and the finite-sample nature of max-information without extra looseness that would make the final PAC-Bayes terms useless. The paper states it works under the usual Gaussian mechanism and DP-SGD setup, and the stress-test note finds no hidden distributional assumptions, which is consistent with the abstract. Still, the actual constants and how the approximation parameter in max-information is controlled will determine whether the bounds are informative in practice.

This is for readers who already work on private optimization or PAC-Bayes with learned priors. Someone trying to turn a DP-SGD run into a non-vacuous generalization guarantee will get the most out of it. The work is coherent on its own terms and engages the right prior results, so it deserves a serious referee to verify the proof steps and check the finite-sample handling.

Referee Report

0 major / 2 minor

Summary. The paper claims to prove a finite-sample bound on the approximate max-information of DP-SGD (via the standard Gaussian noise + clipping + composition mechanism) that is at most linear in the dataset size n. From this it derives a general-purpose PAC-Bayes generalization bound in which the prior may be learned by DP-SGD itself, together with an explicit generalization bound for DP-SGD-trained models whose complexity term is controlled by the optimization hyperparameters.

Significance. If the stated max-information bound holds, the work supplies a direct, finite-sample bridge from DP-SGD properties to generalization via approximate max-information, extending the linear scaling of Dwork et al. (2015) to the DP-SGD setting. This yields PAC-Bayes corollaries with privately learned priors and fully explicit, hyperparameter-controlled complexity terms, which would be a useful theoretical tool for analyzing the privacy-generalization tradeoff in private deep learning.

minor comments (2)

[Abstract] Abstract: the phrase 'comparable with (Dwork et al, 2015)' should be expanded to a full citation on first use and the reference list entry should be checked for completeness.
The definition of approximate max-information and the precise statement of the linear bound (including dependence on privacy parameters ε, δ and noise scale) should appear in the main text before the PAC-Bayes corollaries are derived, to make the logical flow self-contained.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our manuscript and the recommendation of minor revision. The referee's summary correctly captures the main claims regarding the linear max-information bound for DP-SGD and the resulting PAC-Bayes corollaries.

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained from DP definitions

full rationale

The central result is a finite-sample bound on approximate max-information for the standard DP-SGD mechanism (Gaussian noise, clipping, composition), derived directly from the definition of approximate max-information and DP properties. This bound is then used to obtain PAC-Bayes corollaries with explicit complexity terms controlled by hyperparameters. No load-bearing step reduces by construction to a fitted parameter, self-definition, or self-citation chain; the cited Dwork et al. (2015) result is external and the derivation does not rename known empirical patterns or smuggle ansatzes. The paper is self-contained against external DP benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The result rests on standard differential privacy definitions and the finite-sample max-information framework; no free parameters, new entities, or ad-hoc axioms are indicated in the abstract.

axioms (2)

domain assumption DP-SGD satisfies the standard (ε,δ)-differential privacy definition via its noise mechanism
Invoked to obtain the max-information scaling.
domain assumption Approximate max-information admits a finite-sample linear bound under DP
Central premise enabling the generalization results.

pith-pipeline@v0.9.1-grok · 5662 in / 1472 out tokens · 41564 ms · 2026-06-29T23:05:35.503338+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

14 extracted references · 8 canonical work pages · 3 internal anchors

[1]

Generalization in Adaptive Data Analysis and Holdout Reuse

Dwork, C., Feldman, V ., Hardt, M., Pitassi, T., Reingold, O., and Roth, A. Generalization in adaptive data analysis and holdout reuse.arXiv:1506.02629, 2015a. Dwork, C., Feldman, V ., Hardt, M., Pitassi, T., Reingold, O., and Roth, A. Generalization in adaptive data analysis and holdout reuse. InConference on Neural Information Processing Systems (NeurIP...

work page internal anchor Pith review Pith/arXiv arXiv
[2]

and Shenfeld, M

Feldman, V . and Shenfeld, M. Privacy amplification by random allocation.arXiv:2502.08202,

work page arXiv
[3]

Some theoretical improvements on the tightness of PAC-Bayes risk certificates for neural networks.arXiv preprint arXiv:2510.07935,

García-Pérez, D., Parrado-Hernández, E., and Shawe-Taylor, J. Some theoretical improvements on the tightness of PAC-Bayes risk certificates for neural networks.arXiv preprint arXiv:2510.07935,

work page arXiv
[4]

Generalization of Gibbs and Langevin Monte Carlo Algorithms in the Interpolation Regime

Maurer, A., Mirzaei, E., and Pontil, M. Generalization of Gibbs and Langevin Monte Carlo algorithms in the interpolation regime.arXiv:2510.06028,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

A., Dvijotham, K., Ganesh, A., Henzinger, M., Katz, J., McKenna, R., McMahan, H

11 Pillutla, K., Upadhyay, J., Choquette-Choo, C. A., Dvijotham, K., Ganesh, A., Henzinger, M., Katz, J., McKenna, R., McMahan, H. B., Rush, K., Steinke, T., and Thakurta, A. Correlated noise mechanisms for differentially private learning.arXiv:2506.08201,

work page arXiv
[6]

M., and Szepesvári, C

Rivasplata, O., Tankasali, V . M., and Szepesvári, C. PAC-Bayes with backprop.arXiv:1908.07380,

work page arXiv 1908
[7]

Max-Information, Differential Privacy, and Post-Selection Hypothesis Testing

Rogers, R., Roth, A., Smith, A., and Thakkar, O. Max-information, differential privacy, and post-selection hypothesis testing.arXiv: 1604.03924,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

A., Huang, Y ., Yu, D., Kaissis, G., Charles, Z., Liu, R., Chua, L., Kamath, P., Manurangsi, P., He, S., Zhang, C., Ghazi, B., Pigem, B

Sinha, A., Mesnard, T., McKenna, R., Liu, D., Choquette-Choo, C. A., Huang, Y ., Yu, D., Kaissis, G., Charles, Z., Liu, R., Chua, L., Kamath, P., Manurangsi, P., He, S., Zhang, C., Ghazi, B., Pigem, B. D. B., Eruvbetine, P., Warkentin, T., Joulin, A., and Kumar, R. VaultGemma: A differentially private gemma model. arXiv:2510.15001,

work page arXiv
[9]

Most commonly, in supervised classification, X would contain input-output pairs, Y, could be the model parameters, e.g

A.1 Learning Setup The learning setup in which we work (input space X , model space Y, loss function ℓ:X × Y →R +) provides a unified notation for a number of standard learning scenarios. Most commonly, in supervised classification, X would contain input-output pairs, Y, could be the model parameters, e.g. of a neural network, and ℓ measures if the model ...

2011
[10]

•θ t =θ t−1 − ηt√ˆvt+ϵ ˆmt, for •ˆmt = mt 1−βt 1 withm t =β 1mt−1 + (1−β 1)ut, and •ˆvt = vt 1−βt 2 withv t =β 2vt−1 + (1−β 2)u2 t

additionally perform a component- wise normalization of the updates, e.g. •θ t =θ t−1 − ηt√ˆvt+ϵ ˆmt, for •ˆmt = mt 1−βt 1 withm t =β 1mt−1 + (1−β 1)ut, and •ˆvt = vt 1−βt 2 withv t =β 2vt−1 + (1−β 2)u2 t . One can see that in all of them, the model parameters are a deterministic linear (DP-SGD) or non-linear (DP-Adam) combination of the update vectors, s...

2006
[11]

Then it holds for anyt∈ {1,

Lemma 9.In the setting above, letν=m ζ2 σ2 . Then it holds for anyt∈ {1, . . . , T}: E ˜S h ∥ ˜Gt −µ t∥2 Y t−1 0 i ≤ν,E ˜S h ∥ ˜Gt −µ t∥ Y t−1 0 i ≤ √ν,(49) and E h ∥Gt −µ t∥2 Y t−1 0 i ≤ν,E h ∥Gt −µ t∥ Y t−1 0 i ≤ √ν.(50) Note that compared to Lemma 3, the constant is ν instead of 2ν here, because for DP-SGD we can exploit that Ψ is a sum of independent ...

2018
[12]

As a consequence of the monotonicity, we obtain upper and lower bounds

≤ log 1.25√log 2 + log 1.25−3− p log 2−log 2≈ −4.03<0(91) This confirms that R′ is strictly decreasing as a function of b, and therefore R is strictly increasing as a function of β. As a consequence of the monotonicity, we obtain upper and lower bounds. For any α∈(0,3]andβ∈(0, 1 2], that is R(α, β)≤R(α= 3, β= 1 2)≈1.95<2,(92) R(α, β)≥lim β′→0 R(α= 0, β ′)...

2025
[13]

All reported guarantees are based only on the training data

data with 60,000 training sam- ples, and CIFAR-10 (Krizhevsky & Hinton, 2009)50,000 training samples. All reported guarantees are based only on the training data. Models.For MNIST, we use a fully-connected MLP with three 600-unit hidden layers and a 10-class output layer. For CIFAR-10, we use a nine-layer ConvNet with six convolutional layers followed by ...

2009
[14]

For all experiments, µρ and Σρ are trained by optimizing the corresponding bounds

The posterior is also a Gaussian distribution ρ=N(µ ρ,Σ ρ),(96) with diagonal Σρ. For all experiments, µρ and Σρ are trained by optimizing the corresponding bounds. Training the DP-SGD prior.For training the prior, we train a deterministic model based on the DP-SGD algorithm (Algorithm 1). For CIFAR-10, we use batch size m= 5000 , clipping norm from ζ∈ {0...

2025

[1] [1]

Generalization in Adaptive Data Analysis and Holdout Reuse

Dwork, C., Feldman, V ., Hardt, M., Pitassi, T., Reingold, O., and Roth, A. Generalization in adaptive data analysis and holdout reuse.arXiv:1506.02629, 2015a. Dwork, C., Feldman, V ., Hardt, M., Pitassi, T., Reingold, O., and Roth, A. Generalization in adaptive data analysis and holdout reuse. InConference on Neural Information Processing Systems (NeurIP...

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

and Shenfeld, M

Feldman, V . and Shenfeld, M. Privacy amplification by random allocation.arXiv:2502.08202,

work page arXiv

[3] [3]

Some theoretical improvements on the tightness of PAC-Bayes risk certificates for neural networks.arXiv preprint arXiv:2510.07935,

García-Pérez, D., Parrado-Hernández, E., and Shawe-Taylor, J. Some theoretical improvements on the tightness of PAC-Bayes risk certificates for neural networks.arXiv preprint arXiv:2510.07935,

work page arXiv

[4] [4]

Generalization of Gibbs and Langevin Monte Carlo Algorithms in the Interpolation Regime

Maurer, A., Mirzaei, E., and Pontil, M. Generalization of Gibbs and Langevin Monte Carlo algorithms in the interpolation regime.arXiv:2510.06028,

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

A., Dvijotham, K., Ganesh, A., Henzinger, M., Katz, J., McKenna, R., McMahan, H

11 Pillutla, K., Upadhyay, J., Choquette-Choo, C. A., Dvijotham, K., Ganesh, A., Henzinger, M., Katz, J., McKenna, R., McMahan, H. B., Rush, K., Steinke, T., and Thakurta, A. Correlated noise mechanisms for differentially private learning.arXiv:2506.08201,

work page arXiv

[6] [6]

M., and Szepesvári, C

Rivasplata, O., Tankasali, V . M., and Szepesvári, C. PAC-Bayes with backprop.arXiv:1908.07380,

work page arXiv 1908

[7] [7]

Max-Information, Differential Privacy, and Post-Selection Hypothesis Testing

Rogers, R., Roth, A., Smith, A., and Thakkar, O. Max-information, differential privacy, and post-selection hypothesis testing.arXiv: 1604.03924,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

A., Huang, Y ., Yu, D., Kaissis, G., Charles, Z., Liu, R., Chua, L., Kamath, P., Manurangsi, P., He, S., Zhang, C., Ghazi, B., Pigem, B

Sinha, A., Mesnard, T., McKenna, R., Liu, D., Choquette-Choo, C. A., Huang, Y ., Yu, D., Kaissis, G., Charles, Z., Liu, R., Chua, L., Kamath, P., Manurangsi, P., He, S., Zhang, C., Ghazi, B., Pigem, B. D. B., Eruvbetine, P., Warkentin, T., Joulin, A., and Kumar, R. VaultGemma: A differentially private gemma model. arXiv:2510.15001,

work page arXiv

[9] [9]

Most commonly, in supervised classification, X would contain input-output pairs, Y, could be the model parameters, e.g

A.1 Learning Setup The learning setup in which we work (input space X , model space Y, loss function ℓ:X × Y →R +) provides a unified notation for a number of standard learning scenarios. Most commonly, in supervised classification, X would contain input-output pairs, Y, could be the model parameters, e.g. of a neural network, and ℓ measures if the model ...

2011

[10] [10]

•θ t =θ t−1 − ηt√ˆvt+ϵ ˆmt, for •ˆmt = mt 1−βt 1 withm t =β 1mt−1 + (1−β 1)ut, and •ˆvt = vt 1−βt 2 withv t =β 2vt−1 + (1−β 2)u2 t

additionally perform a component- wise normalization of the updates, e.g. •θ t =θ t−1 − ηt√ˆvt+ϵ ˆmt, for •ˆmt = mt 1−βt 1 withm t =β 1mt−1 + (1−β 1)ut, and •ˆvt = vt 1−βt 2 withv t =β 2vt−1 + (1−β 2)u2 t . One can see that in all of them, the model parameters are a deterministic linear (DP-SGD) or non-linear (DP-Adam) combination of the update vectors, s...

2006

[11] [11]

Then it holds for anyt∈ {1,

Lemma 9.In the setting above, letν=m ζ2 σ2 . Then it holds for anyt∈ {1, . . . , T}: E ˜S h ∥ ˜Gt −µ t∥2 Y t−1 0 i ≤ν,E ˜S h ∥ ˜Gt −µ t∥ Y t−1 0 i ≤ √ν,(49) and E h ∥Gt −µ t∥2 Y t−1 0 i ≤ν,E h ∥Gt −µ t∥ Y t−1 0 i ≤ √ν.(50) Note that compared to Lemma 3, the constant is ν instead of 2ν here, because for DP-SGD we can exploit that Ψ is a sum of independent ...

2018

[12] [12]

As a consequence of the monotonicity, we obtain upper and lower bounds

≤ log 1.25√log 2 + log 1.25−3− p log 2−log 2≈ −4.03<0(91) This confirms that R′ is strictly decreasing as a function of b, and therefore R is strictly increasing as a function of β. As a consequence of the monotonicity, we obtain upper and lower bounds. For any α∈(0,3]andβ∈(0, 1 2], that is R(α, β)≤R(α= 3, β= 1 2)≈1.95<2,(92) R(α, β)≥lim β′→0 R(α= 0, β ′)...

2025

[13] [13]

All reported guarantees are based only on the training data

data with 60,000 training sam- ples, and CIFAR-10 (Krizhevsky & Hinton, 2009)50,000 training samples. All reported guarantees are based only on the training data. Models.For MNIST, we use a fully-connected MLP with three 600-unit hidden layers and a 10-class output layer. For CIFAR-10, we use a nine-layer ConvNet with six convolutional layers followed by ...

2009

[14] [14]

For all experiments, µρ and Σρ are trained by optimizing the corresponding bounds

The posterior is also a Gaussian distribution ρ=N(µ ρ,Σ ρ),(96) with diagonal Σρ. For all experiments, µρ and Σρ are trained by optimizing the corresponding bounds. Training the DP-SGD prior.For training the prior, we train a deterministic model based on the DP-SGD algorithm (Algorithm 1). For CIFAR-10, we use batch size m= 5000 , clipping norm from ζ∈ {0...

2025