On the Generalization of Knowledge Distillation: An Information-Theoretic View

Bingying Li; Haiyun He

arXiv: 2605.13143 · v2 · pith:U3E6Y3MTnew · submitted 2026-05-13 · 💻 cs.IT · cs.LG · math.IT

Pith reviewed 2026-05-19 17:58 UTC · model grok-4.3

classification 💻 cs.IT · cs.LG · math.IT

keywords: knowledge distillation, generalization bounds, Kullback-Leibler divergence, stochastic processes, algorithmic stability, loss sharpness, information theory

The pith

Modeling teacher and student training as coupled processes yields generalization bounds via their KL divergence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper models teacher and student training as coupled stochastic processes and defines a distillation divergence as the Kullback-Leibler divergence between their stochastic kernels. Using this measure, it derives an upper bound on the student's generalization error in terms of the teacher's gap under a sub-Gaussian assumption via algorithmic stability, along with a lower bound under a central condition that depends more directly on the divergence. The work also presents a loss-sharpness-aware bound showing that local flatness in the teacher can tighten the result, and decomposes the divergence explicitly in a linear Gaussian setting into bias, variance, and rank-bottleneck components.

Core claim

By treating teacher and student training as coupled stochastic processes, the authors introduce the distillation divergence as the Kullback-Leibler divergence between the corresponding stochastic kernels. This quantity allows derivation of an upper generalization bound for the student relative to the teacher's gap under sub-Gaussian assumptions through algorithmic stability, and a lower bound under a central condition with sharper dependence on the divergence. A loss-sharpness-aware refinement shows that the teacher's local flatness strictly improves the bound, while a linear Gaussian case study decomposes the divergence into interpretable bias, variance, and rank-bottleneck costs.

What carries the argument

The distillation divergence, defined as the Kullback-Leibler divergence between the stochastic kernels of the teacher and student training processes, quantifies the difference between the two processes and transfers generalization properties from teacher to student.

If this is right

If the distillation divergence remains small, the student's generalization gap stays close to the teacher's gap.
The lower bound implies that large distillation divergence prevents the student from generalizing much better than the teacher.
Incorporating the teacher's loss sharpness yields a strictly tighter bound when the teacher is locally flat.
In linear Gaussian models the divergence breaks into bias, variance, and rank costs that can guide choices such as model architecture or training rank.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Minimizing the distillation divergence during training could serve as a practical objective that improves distillation results beyond standard soft-label losses.
The coupled-process view may extend to other transfer settings by defining analogous divergences between source and target training kernels.
Testing the bounds on nonlinear networks would reveal whether the linear Gaussian decomposition offers useful design rules outside the analyzed case.

Load-bearing premise

Teacher and student training can be modeled as coupled stochastic processes whose kernels admit a well-defined KL divergence that can be bounded or decomposed in the stated ways.

What would settle it

An experiment or calculation showing that the student's generalization error exceeds the derived upper bound even when the distillation divergence is small and the sub-Gaussian assumption holds.

Original abstract

Knowledge distillation is widely used to improve generalization in practice, yet its theoretical understanding remains elusive. In the standard distillation setting, a teacher model provides soft predictions to guide the training of a student model. We model teacher and student training as coupled stochastic processes and introduce a distillation divergence, defined as the Kullback-Leibler divergence between these two stochastic kernels. Within this framework, we derive two generalization bounds for the student model relative to the teacher's generalization gap: an upper bound under a sub-Gaussian assumption via algorithmic stability, and a lower bound under a central condition with sharper dependence on the distillation divergence. We further develop a loss-sharpness-aware bound with an explicit tightness regime, showing that the teacher's local flatness can strictly tighten the bound. Additionally, in a linear Gaussian case study, the distillation divergence admits an interpretable decomposition into bias, variance, and rank-bottleneck costs, yielding practical guidance for distillation design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper defines a distillation divergence from coupled teacher-student stochastic processes and derives upper/lower generalization bounds plus a linear-Gaussian decomposition, but the joint-process modeling may not cover standard fixed-teacher distillation.

The main takeaway is that they treat teacher and student training as coupled stochastic processes, introduce a distillation divergence as the KL between the two kernels, and then bound the student's generalization gap relative to the teacher's using that quantity. An upper bound comes from sub-Gaussian assumptions and algorithmic stability; a lower bound uses a central condition for tighter dependence on the divergence. They also give a sharpness-aware version that shows how teacher flatness can tighten the result, and in the linear-Gaussian case they decompose the divergence into bias, variance, and rank-bottleneck terms. These objects and the explicit bounds look new relative to the cited prior work on distillation theory. The decomposition in particular could be useful for thinking about simple settings where distillation helps or hurts.

The coupled-process assumption is the clearest soft spot. Standard knowledge distillation pre-trains the teacher and then runs student training separately, so the joint law required for a direct KL between kernels is an idealization. If the paper does not supply approximation arguments or show how the bounds carry over to the sequential case, the results apply mainly to a simultaneous-training regime that is not the usual one. The sub-Gaussian and central-condition assumptions are standard in this literature but still need to be checked against actual distillation dynamics.

This work is aimed at theorists who care about information-theoretic views of generalization in distillation. A reader focused on practical large-model distillation might get limited immediate guidance, though the Gaussian decomposition offers some design intuition. The paper shows clear engagement with existing bounds and distillation results, so it deserves a serious referee to examine the derivations and the modeling choices. I would send it to peer review.

Referee Report

3 major / 2 minor

Summary. The paper models teacher and student training in knowledge distillation as coupled stochastic processes and defines a distillation divergence as the KL divergence between their stochastic kernels. It derives an upper bound on the student's generalization gap (relative to the teacher's) under a sub-Gaussian assumption via algorithmic stability, a lower bound under a central condition with sharper dependence on the divergence, a loss-sharpness-aware bound with an explicit tightness regime for the teacher's local flatness, and an interpretable bias-variance-rank decomposition of the divergence in a linear-Gaussian case study.

Significance. If the coupled-process modeling and bounds hold with the stated assumptions, the work provides a useful information-theoretic lens on distillation that explicitly ties the divergence to generalization gaps and offers practical design guidance via the linear-case decomposition. The combination of stability-based upper bounds, central-condition lower bounds, and the sharpness tightness regime is a constructive contribution to KD theory.

major comments (3)

[Framework section] The central construction defines the distillation divergence via KL between stochastic kernels of coupled teacher-student processes, which presupposes a joint law over simultaneous training dynamics. This modeling choice is load-bearing for both the upper and lower bounds, yet standard KD practice fixes a pre-trained teacher and trains the student independently; the manuscript should clarify whether the bounds extend to the sequential regime or require additional approximation arguments.
[Generalization bounds: upper bound via algorithmic stability] The sub-Gaussian assumption and stability parameter are invoked to relate the student gap to the teacher gap with explicit dependence on the distillation divergence. The paper should verify or discuss how these assumptions are satisfied for typical neural-network losses in distillation, as violation would weaken the claimed explicit dependence.
[Linear-Gaussian case study] The bias-variance-rank decomposition of the distillation divergence is presented as yielding practical guidance. The derivation should be shown to follow directly from the KL definition of the divergence without post-hoc parameter fitting, and the regime of validity (e.g., when the rank-bottleneck term dominates) should be stated explicitly.

minor comments (2)

[Abstract] The phrase 'sharper dependence on the distillation divergence' for the lower bound could be made more precise by indicating the functional form of the dependence if space permits.
[Notation and definitions] Ensure consistent use of symbols for the stochastic kernels and the distillation divergence across the framework and bound derivations to avoid reader confusion.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback and positive assessment of the paper's significance. We address each major comment below and plan to incorporate revisions to clarify the modeling choices and strengthen the discussions.

Referee: [Framework section] The central construction defines the distillation divergence via KL between stochastic kernels of coupled teacher-student processes, which presupposes a joint law over simultaneous training dynamics. This modeling choice is load-bearing for both the upper and lower bounds, yet standard KD practice fixes a pre-trained teacher and trains the student independently; the manuscript should clarify whether the bounds extend to the sequential regime or require additional approximation arguments.

Authors: We agree that clarifying this point is important. Our framework models the processes as coupled to rigorously define the distillation divergence and derive the bounds. In the standard sequential setting, the pre-trained teacher can be viewed as having a fixed stochastic kernel, and the joint law can be approximated by the product of the teacher's converged distribution and the student's training process. We will add a paragraph in the Framework section discussing this approximation and how the bounds apply with minor modifications.
revision: yes

Referee: [Generalization bounds: upper bound via algorithmic stability] The sub-Gaussian assumption and stability parameter are invoked to relate the student gap to the teacher gap with explicit dependence on the distillation divergence. The paper should verify or discuss how these assumptions are satisfied for typical neural-network losses in distillation, as violation would weaken the claimed explicit dependence.

Authors: The sub-Gaussian assumption is a common technical condition in algorithmic stability analyses and is satisfied for losses with bounded range or sub-Gaussian tails, which can be ensured in distillation by using temperature scaling to control the softness of the predictions. We will include a discussion in the relevant section on how this assumption holds for typical KD losses like the KL divergence between teacher and student outputs, assuming bounded logits or appropriate regularization.
revision: yes

Referee: [Linear-Gaussian case study] The bias-variance-rank decomposition of the distillation divergence is presented as yielding practical guidance. The derivation should be shown to follow directly from the KL definition of the divergence without post-hoc parameter fitting, and the regime of validity (e.g., when the rank-bottleneck term dominates) should be stated explicitly.

Authors: The decomposition is obtained directly by applying the closed-form KL divergence formula for multivariate Gaussians to the stochastic kernels defined in our framework, resulting in separate terms for bias (mean shift), variance (covariance mismatch), and rank (dimensionality reduction effect). There is no post-hoc fitting involved. We will explicitly delineate the validity regime in the case study, specifying that the rank-bottleneck dominates in low-rank student models or when the teacher's feature space has higher effective rank.
revision: yes
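The mean-shift and covariance-mismatch terms mentioned in the last response can be illustrated in one dimension, where the closed-form Gaussian KL splits into exactly two non-negative parts. This is a minimal sketch under that simplifying assumption, not the paper's matrix-normal derivation; the function names are hypothetical, and the rank term has no analogue in the univariate case.

```python
import math

def gaussian_kl_terms(mean_q, var_q, mean_p, var_p):
    """Split KL(N(mean_q, var_q) || N(mean_p, var_p)) into two non-negative parts."""
    # Part driven only by the difference of the means.
    mean_shift = (mean_q - mean_p) ** 2 / (2.0 * var_p)
    # Part driven only by the ratio of the variances.
    ratio = var_q / var_p
    covariance_mismatch = 0.5 * (ratio - 1.0 - math.log(ratio))
    return mean_shift, covariance_mismatch

def gaussian_kl(mean_q, var_q, mean_p, var_p):
    """Total univariate Gaussian KL as the sum of the two parts."""
    return sum(gaussian_kl_terms(mean_q, var_q, mean_p, var_p))
```

With equal variances only the mean-shift part survives, and with equal means only the variance part does, which is the sense in which the two costs can be read separately.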

Circularity Check

0 steps flagged

Framework and bounds are self-contained; no reduction to inputs by construction

The paper defines a distillation divergence via KL between kernels of coupled teacher-student stochastic processes, then derives generalization bounds (upper via sub-Gaussian + stability; lower via central condition) that explicitly depend on this quantity and the teacher's gap. This is a standard modeling choice followed by derivation under stated assumptions, not a self-definitional loop or fitted input renamed as prediction. The linear-Gaussian decomposition is an analysis of the defined divergence rather than a tautology. No load-bearing self-citations, uniqueness theorems, or ansatzes smuggled via prior work are indicated. The derivation chain remains independent of the target results.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The framework rests on modeling teacher and student as coupled stochastic processes and on two domain assumptions for the bounds; the divergence itself is an invented quantity whose properties are derived rather than measured.

axioms (2)

domain assumption: sub-Gaussian assumption on the loss for the algorithmic-stability upper bound. Invoked to obtain the upper bound via stability arguments.
domain assumption: central condition for the lower bound. Required to obtain sharper dependence on the distillation divergence.

invented entities (1)

distillation divergence (no independent evidence)
purpose: Quantify the difference between teacher and student stochastic kernels via KL divergence.
Defined as the KL between the two kernels; used as the central quantity in all bounds.

pith-pipeline@v0.9.0 · 5687 in / 1479 out tokens · 35082 ms · 2026-05-19T17:58:38.186019+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: We model teacher and student training as coupled stochastic processes and introduce a distillation divergence, defined as the Kullback–Leibler divergence between these two stochastic kernels.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: Theorem 1 (Distillation Generalization Upper Bound) ... $\mathrm{gen}_S \le \mathrm{gen}_T + \sigma\sqrt{2K_n}$

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

V. APPENDIX

A. Preliminary Tools
Donsker–Varadhan change of measure inequality:

Lemma 6 (Donsker–Varadhan inequality). Let $P, Q$ be probability measures on the same measurable space with $Q \ll P$. For any measurable function $g$ with $\mathbb{E}_P[e^g] < \infty$,

\[ \mathbb{E}_Q[g] \le \mathrm{KL}(Q \| P) + \log \mathbb{E}_P[e^g]. \tag{10} \]

Proof. Define the Radon–Nikodym derivative $r := \frac{dQ}{dP}$. Then $\mathbb{E}_P[r] = 1$ and $\mathrm{KL}(Q \| P) = \mathbb{E}_Q\left[\log \frac{dQ}{dP}\right] = \mathbb{E}_P[r \log r]$. By Jensen's inequa...
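Inequality (10) is easy to sanity-check on a finite sample space. The sketch below is an illustration only: the distributions `p`, `q` and the test function `g_arbitrary` are made-up values, and the helper names are hypothetical. It also uses the standard fact that the choice $g = \log \frac{dQ}{dP}$ attains equality.

```python
import math

def kl_divergence(q, p):
    """KL(q || p) for discrete distributions given as equal-length lists."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def donsker_varadhan_gap(q, p, g):
    """Right side minus left side of (10); non-negative whenever (10) holds."""
    lhs = sum(qi * gi for qi, gi in zip(q, g))
    log_mgf = math.log(sum(pi * math.exp(gi) for pi, gi in zip(p, g)))
    return kl_divergence(q, p) + log_mgf - lhs

# Illustrative distributions on three points (assumed values).
p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]
g_arbitrary = [1.0, -2.0, 0.5]
# g = log(dQ/dP) makes E_P[e^g] = 1 and E_Q[g] = KL(Q || P), so the gap vanishes.
g_optimal = [math.log(qi / pi) for qi, pi in zip(q, p)]
```

Calling `donsker_varadhan_gap(q, p, g_arbitrary)` gives a strictly positive slack, while `g_optimal` drives the slack to zero up to floating-point error.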
A standard sub-Gaussian mgf bound:

Lemma 7 (Sub-Gaussian mgf bound). Let $X$ be $\sigma^2$-sub-Gaussian, meaning $\log \mathbb{E}[\exp(\lambda(X - \mathbb{E}[X]))] \le \frac{\lambda^2 \sigma^2}{2}$ for all $\lambda \in \mathbb{R}$. Then for any $\lambda \in \mathbb{R}$,

\[ \log \mathbb{E}[e^{\lambda X}] \le \lambda \mathbb{E}[X] + \frac{\lambda^2 \sigma^2}{2}. \tag{11} \]

Proof. By definition, $\mathbb{E}[e^{\lambda X}] = \mathbb{E}\left[e^{\lambda(X - \mathbb{E}[X])}\right] \cdot e^{\lambda \mathbb{E}[X]}$. Taking logs and applying the sub-Gaussian condition gives (11).

B. Proof of Theorem 1
Statement: Theorem 8 (Distillation generalization upper bound). Assume that $h(D_n, f_T)$ is $\sigma^2$-sub-Gaussian under $\Delta_{D_n, f_T}$. Then

\[ \mathrm{gen}_S \le \mathrm{gen}_T + \sigma \sqrt{2 K_n}. \tag{12} \]
Proof. Fix any $\lambda > 0$ and choose in Lemma 6 $P = \Delta_{D_n, f_T}$, $Q = \Delta_{\hat{D}_n, f_S}$, $g(d, f) = \lambda h(d, f)$. Then (10) gives

\[ \mathbb{E}_{\Delta_{\hat{D}_n, f_S}}[\lambda h(\hat{D}_n, f_S)] \le \mathrm{KL}(\Delta_{\hat{D}_n, f_S} \| \Delta_{D_n, f_T}) + \log \mathbb{E}_{\Delta_{D_n, f_T}}[e^{\lambda h(D_n, f_T)}]. \]

By definitions, the left side equals $\lambda\, \mathrm{gen}_S$ and the KL term equals $K_n$, so

\[ \lambda\, \mathrm{gen}_S \le K_n + \log \mathbb{E}_{\Delta_{D_n, f_T}}[e^{\lambda h(D_n, f_T)}]. \tag{13} \]

Since $h(D_n, f_T)$ is $\sigma^2$-sub-Gaussian with mean $\mathrm{gen}_T$, L...
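One way to see where the square root in (12) comes from: combining (13) with Lemma 7 gives $\mathrm{gen}_S \le \mathrm{gen}_T + K_n/\lambda + \lambda\sigma^2/2$ for every $\lambda > 0$, and minimizing the right side over $\lambda$ yields the closed form. The sketch below checks this numerically. The numeric values and function names are illustrative assumptions, not taken from the paper, and the paper's own final step is truncated in this extract.

```python
import math

def lambda_bound(gen_t, sigma, k_n, lam):
    """Bound on gen_S for a fixed lam > 0: gen_T + K_n/lam + lam * sigma^2 / 2."""
    return gen_t + k_n / lam + lam * sigma ** 2 / 2.0

def upper_bound(gen_t, sigma, k_n):
    """Closed form of (12): gen_T + sigma * sqrt(2 K_n)."""
    return gen_t + sigma * math.sqrt(2.0 * k_n)

def best_lambda(sigma, k_n):
    """Minimizer of lambda_bound over lam > 0 (requires k_n > 0)."""
    return math.sqrt(2.0 * k_n) / sigma

# Hypothetical values for the teacher gap, sub-Gaussian proxy, and divergence.
gen_t, sigma, k_n = 0.05, 0.5, 0.02
lam_star = best_lambda(sigma, k_n)
# Brute-force minimum over a grid of lam values, for comparison with the closed form.
grid_min = min(lambda_bound(gen_t, sigma, k_n, 0.01 * i) for i in range(1, 500))
```

The grid minimum never falls below `upper_bound(gen_t, sigma, k_n)` and matches it at `lam_star`, which is consistent with (12) being the optimized version of the per-$\lambda$ bound.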
Statement: Proposition 9 (Sub-Gaussianity via stability). Assume the loss is bounded: $\ell(f, z) \in [a, b]$ for all $f, z$. Assume the teacher algorithm $A_T$ is $\beta$-uniformly stable: for any neighboring datasets $D_n$ and $D_n^{(i)}$ differing in one example,

\[ \sup_{z \in \mathcal{Z}} \left| \ell(A_T(D_n), z) - \ell(A_T(D_n^{(i)}), z) \right| \le \beta. \]

Let $H(D_n) := L_P(A_T(D_n)) - L_{D_n}(A_T(D_n))$. Then $H(D_n)$ satisfies bounded differen...
Proof. Let $D_n = (z_1, \dots, z_n)$ and let $D_n^{(i)} = (z_1, \dots, z_{i-1}, z_i', z_{i+1}, \dots, z_n)$ be a neighboring dataset. Define $f := A_T(D_n)$, $f' := A_T(D_n^{(i)})$. Then

\[ H(D_n) = \mathbb{E}_{Z \sim P}[\ell(f, Z)] - \frac{1}{n} \sum_{j=1}^{n} \ell(f, z_j), \qquad H(D_n^{(i)}) = \mathbb{E}_{Z \sim P}[\ell(f', Z)] - \frac{1}{n} \sum_{j=1}^{n} \ell(f', z_j^{(i)}), \]

where $z_j^{(i)} = z_j$ for $j \ne i$ and $z_i^{(i)} = z_i'$. By uniform stability and taking expecta...
Central condition definition: Definition 2 ($(\eta, c)$-central condition). A random variable $X$ satisfies the $(\eta, c)$-central condition under $P$ if $\eta > 0$ and $0 < c \le 1$ and

\[ \log \mathbb{E}_P[e^{-\eta X}] \le -c \eta\, \mathbb{E}_P[X]. \tag{22} \]
Statement: Theorem 10 (Distillation generalization lower bound). Assume $h(D_n, f_T)$ satisfies the $(\eta, c)$-central condition under $\Delta_{D_n, f_T}$. Then

\[ \mathrm{gen}_S \ge c \cdot \mathrm{gen}_T - \frac{1}{\eta} K_n. \tag{23} \]
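Read together, (12) and (23) bracket the student's gap around the teacher's. The sketch below evaluates both sides for given constants, assuming both the sub-Gaussian and the central-condition assumptions hold at once; the function name and the numbers in the usage note are hypothetical.

```python
import math

def student_gap_interval(gen_t, k_n, sigma, eta, c):
    """Return (lower, upper) bounds on gen_S from (23) and (12)."""
    lower = c * gen_t - k_n / eta                   # linear in K_n
    upper = gen_t + sigma * math.sqrt(2.0 * k_n)    # grows like sqrt(K_n)
    return lower, upper
```

For example, `student_gap_interval(0.1, 0.0, 0.5, 2.0, 0.8)` collapses to roughly `(0.08, 0.1)`: with zero divergence the student's gap is pinned between `c * gen_t` and `gen_t`, and the interval widens on both sides as `k_n` grows.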
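Proof. Apply Lemma 6 with $P = \Delta_{D_n, f_T}$, $Q = \Delta_{\hat{D}_n, f_S}$, $g(d, f) = -\eta\, h(d, f)$. Then

\[ \mathbb{E}_{\Delta_{\hat{D}_n, f_S}}[-\eta\, h(\hat{D}_n, f_S)] \le K_n + \log \mathbb{E}_{\Delta_{D_n, f_T}}[e^{-\eta\, h(D_n, f_T)}]. \tag{24} \]

By assumption, $h(D_n, f_T)$ satisfies (22) with $X = h(D_n, f_T)$ and $P = \Delta_{D_n, f_T}$:

\[ \log \mathbb{E}_{\Delta_{D_n, f_T}}[e^{-\eta\, h(D_n, f_T)}] \le -c \eta\, \mathbb{E}_{\Delta_{D_n, f_T}}[h(D_n, f_T)] = -c \eta\, \mathrm{gen}_T. \tag{25} \]

Substitute (25) into (24): $-\eta\, \mathbb{E}_{\Delta_{\hat{D}_n, f_S}}[h(\hat{D}_n, f_S)] \le K_n$ ...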
Sub-Gaussianity implies a central condition for small $\eta$:

Remark 2 (Deriving a valid $(\eta, c)$ from sub-Gaussianity). Assume $X$ is $\sigma^2$-sub-Gaussian with mean $\mu = \mathbb{E}[X] > 0$. Then for any $\eta > 0$,

\[ \log \mathbb{E}[e^{-\eta X}] \le -\eta \mu + \frac{\eta^2 \sigma^2}{2} = -\eta \mu \left( 1 - \frac{\eta \sigma^2}{2 \mu} \right). \]

Thus $X$ satisfies the $(\eta, c)$-central condition with $c \le 1 - \frac{\eta \sigma^2}{2 \mu}$ provided that $0 < \eta < \frac{2 \mu}{\sigma^2}$.

E. Linear Gaussian Case Study (Detailed KL Decomposition)
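Remark 2 can be turned into a small helper that returns the largest admissible $c$ for a given $\eta$. This is a sketch with hypothetical function names; the Gaussian check relies on the standard fact that for $X \sim \mathcal{N}(\mu, \sigma^2)$ the log moment generating function equals $-\eta\mu + \eta^2\sigma^2/2$ exactly, so the central condition holds with equality at that $c$.

```python
def central_constant(eta, sigma, mu):
    """Largest c allowed by Remark 2: c = 1 - eta * sigma^2 / (2 * mu)."""
    if mu <= 0:
        raise ValueError("Remark 2 assumes a positive mean")
    if not 0 < eta < 2.0 * mu / sigma ** 2:
        raise ValueError("eta must lie in (0, 2 * mu / sigma^2)")
    return 1.0 - eta * sigma ** 2 / (2.0 * mu)

def gaussian_log_mgf(eta, sigma, mu):
    """Exact log E[exp(-eta * X)] for X ~ N(mu, sigma^2)."""
    return -eta * mu + eta ** 2 * sigma ** 2 / 2.0
```

The range check on `eta` is what keeps the returned constant inside the interval $(0, 1]$ required by Definition 2.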
Matrix normal definition and basic identities: Definition 3 (Matrix normal distribution). A random matrix $A \in \mathbb{R}^{k \times n}$ follows a matrix normal distribution $A \sim \mathcal{MN}(M, U, V)$ if $\mathrm{vec}(A) \sim \mathcal{N}(\mathrm{vec}(M), V \otimes U)$, where $M \in \mathbb{R}^{k \times n}$, $U \in \mathbb{R}^{k \times k}$, $V \in \mathbb{R}^{n \times n}$. We use the vectorization identity $\mathrm{vec}(W X) = (X^\top \otimes I_k)\, \mathrm{vec}(W)$, valid for $W \in \mathbb{R}^{k \times d}$ and $X \in \mathbb{R}^{d \times n}$. For Gaussians with the same covariance, we use...
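The vectorization identity can be checked directly on a small example. The sketch below uses plain Python lists and column-major `vec`; the helper names and the matrices `W`, `X` are illustrative assumptions.

```python
def matmul(a, b):
    """Matrix product of two nested-list matrices."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def identity(k):
    return [[1 if i == j else 0 for j in range(k)] for i in range(k)]

def vec(a):
    """Column-major vectorization: stack the columns of a."""
    return [a[i][j] for j in range(len(a[0])) for i in range(len(a))]

def kron(a, b):
    """Kronecker product a (x) b as a nested list."""
    return [[a[i][j] * b[p][q] for j in range(len(a[0])) for q in range(len(b[0]))]
            for i in range(len(a)) for p in range(len(b))]

def matvec(m, v):
    return [sum(row[c] * v[c] for c in range(len(v))) for row in m]

W = [[1, 2, 3], [4, 5, 6]]       # k = 2, d = 3
X = [[1, 0], [2, 1], [0, 3]]     # d = 3, n = 2

lhs = vec(matmul(W, X))
rhs = matvec(kron(transpose(X), identity(2)), vec(W))
```

Both `lhs` and `rhs` come out as the same length-4 vector, which is the identity the appendix uses to move between matrix and vector forms of the Gaussian.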
Generative model: Collect features and labels column-wise into $X \in \mathbb{R}^{d \times n}$ and $Y \in \mathbb{R}^{k \times n}$ so that $D_n = (X, Y)$. Assume a noisy linear label channel

\[ Y \mid X \sim \mathcal{MN}(W^\star X, I_k, \nu^2 I_n), \]

where $W^\star \in \mathbb{R}^{k \times d}$ is the ground-truth linear map and $\nu > 0$ is the noise level.
Teacher as a Gibbs learner (closed form): Let the teacher parameter be $W \in \mathbb{R}^{k \times d}$ with prior $p_0(W) = \mathcal{MN}(0, I_k, \lambda^{-1} I_d)$. Given $D_n = (X, Y)$, define the Gibbs posterior with inverse temperature $\beta_T$:

\[ q_T(W \mid D_n) \propto p_0(W) \exp\left( -\frac{\beta_T}{2 \nu^2} \| Y - W X \|_F^2 \right). \]

Lemma 11 (Closed form of $q_T$). The posterior is matrix normal:

$q_T(W \mid D_n) = \mathcal{MN}(\bar{W}_T, I_k, \Sigma_T)$, $\Sigma_T = \left( \lambda I_d + \frac{\beta_T}{\nu^2} X X^\top \right)^{-1}$, ...
The prior implies $w \sim \mathcal{N}(0, \lambda^{-1} I_{kd})$. Hence the (unnormalized) log density is quadratic in $w$ with precision

\[ \lambda I_{kd} + \frac{\beta_T}{\nu^2} B_X^\top B_X = \lambda I_{kd} + \frac{\beta_T}{\nu^2} (X X^\top \otimes I_k), \]

so the covariance is $\left( \lambda I_d + \frac{\beta_T}{\nu^2} X X^\top \right)^{-1} \otimes I_k = \Sigma_T \otimes I_k$. The mean is the corresponding linear term mapped back to matrix form, yielding $\bar{W}_T = \frac{\beta_T}{\nu^2} Y X^\top \Sigma_T$.
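In the scalar case $k = d = 1$ the closed form reduces to a ridge-style estimate, which makes it easy to confirm that the stated mean is the stationary point of the unnormalized log density. The sketch below is that special case only; the function names and data are hypothetical.

```python
def teacher_posterior_scalar(xs, ys, lam, beta_t, nu):
    """Mean and variance of q_T for k = d = 1, following the closed form above."""
    precision = lam + beta_t / nu ** 2 * sum(x * x for x in xs)
    sigma_t = 1.0 / precision
    mean = beta_t / nu ** 2 * sum(y * x for x, y in zip(xs, ys)) * sigma_t
    return mean, sigma_t

def log_density_gradient(w, xs, ys, lam, beta_t, nu):
    """Derivative in w of -lam*w^2/2 - beta_t/(2 nu^2) * sum_i (y_i - w*x_i)^2."""
    return -lam * w + beta_t / nu ** 2 * sum(x * (y - w * x) for x, y in zip(xs, ys))

# Illustrative data: labels exactly equal to 2 * x, shrunk toward 0 by the prior.
xs, ys = [1.0, 2.0], [2.0, 4.0]
mean, variance = teacher_posterior_scalar(xs, ys, lam=1.0, beta_t=1.0, nu=1.0)
```

Here the gradient of the log density vanishes at `mean`, and raising `beta_t` or lowering `lam` pulls the mean closer to the least-squares slope.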
Pseudo-data generation: Sample $W_T \sim q_T(\cdot \mid D_n)$ and generate pseudo labels through the same noisy channel: $\hat{Y} \mid (W_T, X) \sim \mathcal{MN}(W_T X, I_k, \nu^2 I_n)$, then define $\hat{D}_n = (X, \hat{Y})$.
Student capacity constraint via a rank bottleneck: Let the student parameter be $\Theta \in \mathbb{R}^{k \times d}$. Introduce the rank-$\kappa$ map $M^\star(W_T, X)$ as a best rank-$\kappa$ approximation in prediction space:

\[ M^\star(W_T, X) \in \arg\min_{M:\ \mathrm{rank}(M) = \kappa} \| W_T X - W_T M X \|_F^2. \]

Define a local Gaussian student conditional kernel

\[ q_S(\Theta \mid W_T, X) = \mathcal{MN}(W_T M^\star(W_T, X), I_k, \Sigma_S), \qquad \Sigma_S \succ 0. \]
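Process-level KL and its two terms: Define the distillation divergence $K_n := \mathrm{KL}(\Delta_{\hat{D}_n, f_S} \| \Delta_{D_n, f_T})$. Using the KL chain rule on the dataset-model pair,

\[ K_n = \underbrace{\mathrm{KL}(\Delta_{\hat{D}_n} \| \Delta_{D_n})}_{\text{Dataset shift}} + \underbrace{\mathbb{E}_{\hat{D}_n}\left[ \mathrm{KL}(\Delta_{f_S \mid \hat{D}_n} \| \Delta_{f_T \mid D_n}) \right]}_{\text{Algorithm shift}}. \tag{27} \]

We now bound the two terms.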
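The chain rule behind (27) can be verified on a toy joint distribution over (dataset, model) pairs. The sketch below treats both processes as distributions on the same small finite space, which is a simplifying assumption; the dictionary layout, names, and probabilities are all hypothetical.

```python
import math

def marginal(joint):
    """Marginal over the first coordinate of a {(d, f): prob} dictionary."""
    out = {}
    for (d, _f), prob in joint.items():
        out[d] = out.get(d, 0.0) + prob
    return out

def kl(q, p):
    """KL(q || p) for dictionaries sharing the same keys."""
    return sum(qv * math.log(qv / p[key]) for key, qv in q.items() if qv > 0)

def chain_rule_terms(q_joint, p_joint):
    """Return (dataset_shift, algorithm_shift) whose sum equals KL(q_joint || p_joint)."""
    q_d, p_d = marginal(q_joint), marginal(p_joint)
    dataset_shift = kl(q_d, p_d)
    algorithm_shift = 0.0
    for (d, f), qv in q_joint.items():
        if qv > 0:
            q_cond = qv / q_d[d]                 # student model law given the dataset
            p_cond = p_joint[(d, f)] / p_d[d]    # teacher model law given the dataset
            algorithm_shift += qv * math.log(q_cond / p_cond)
    return dataset_shift, algorithm_shift

# Toy student-side (q) and teacher-side (p) joint laws, assumed values.
q_joint = {("a", "u"): 0.1, ("a", "v"): 0.3, ("b", "u"): 0.4, ("b", "v"): 0.2}
p_joint = {("a", "u"): 0.25, ("a", "v"): 0.25, ("b", "u"): 0.3, ("b", "v"): 0.2}
dataset_shift, algorithm_shift = chain_rule_terms(q_joint, p_joint)
```

Both terms are non-negative and add up to the joint KL, so shrinking either the dataset shift or the algorithm shift lowers the total divergence.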
Dataset shift bound and bias-variance decomposition:

a) Step 1. Condition on $(X, D_n)$ and use convexity of KL. Given $X$ and $D_n$, $\hat{Y} \mid (X, D_n)$ is a mixture over $W_T \mid D_n$:

\[ \hat{Y} \mid (X, D_n) \sim \int q_T(W_T \mid D_n)\, \mathcal{MN}(W_T X, I_k, \nu^2 I_n)\, dW_T. \]

The real label law (given $X$) is $\mathcal{MN}(W^\star X, I_k, \nu^2 I_n)$. By convexity of KL in its first argument, $\mathrm{KL}(\hat{Y} \mid X, D_n \,\|\, Y \mid X) \le \mathbb{E}_{W_T \mid D_n}\big[ \mathrm{KL}\big( \mathcal{MN}(W_T X, I_k, \nu$...
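The convexity step used above says that the KL of a mixture is at most the mixture of the KLs. A discrete sketch of that inequality follows; the weights, components, and reference distribution are made-up values standing in for the law of $W_T \mid D_n$, the per-$W_T$ label laws, and the real label law.

```python
import math

def kl(q, p):
    """KL(q || p) for discrete distributions given as equal-length lists."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def mixture(weights, components):
    """Pointwise weighted average of the component distributions."""
    size = len(components[0])
    return [sum(w * comp[i] for w, comp in zip(weights, components)) for i in range(size)]

weights = [0.4, 0.6]                               # stand-in for q_T(W_T | D_n)
components = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]    # stand-in for the per-W_T label laws
reference = [0.3, 0.3, 0.4]                        # stand-in for the real label law

kl_of_mixture = kl(mixture(weights, components), reference)
mixture_of_kls = sum(w * kl(comp, reference) for w, comp in zip(weights, components))
```

In this example `kl_of_mixture` is much smaller than `mixture_of_kls`, which illustrates that the bound obtained by pushing the expectation outside the KL can be loose.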
[47]

Since the student kernel is conditionally Gaussian given latentW T (andX),q S(Θ| ˆDn)is generally a mixture overW T

Algorithm shift bound and the rank-bottleneck decomposition:Define the (expected) algorithm-shift term KLalg :=E ˆDn h KL(∆fS | ˆDn ∥∆ fT |Dn) i . Since the student kernel is conditionally Gaussian given latentW T (andX),q S(Θ| ˆDn)is generally a mixture overW T . By convexity of KL in the first argument, conditioning and then averaging yields the reducti...

Final compact decomposition of $K_n$: Combine (27), (33), and (39):

$$K_n \le \mathbb{E}_{D_n}\Big[\underbrace{\tfrac{1}{2\nu^2}\mathrm{Bias}(D_n) + k\,\mathrm{Var}(D_n)}_{\text{Teacher prediction error}} + \underbrace{\mathrm{Apx}(D_n)}_{\text{Student capacity / rank bottleneck}} + \underbrace{\mathrm{Cov}(\Sigma_S, \Sigma_T)}_{\text{Geometry mismatch}} + \underbrace{\mathrm{Spread}(D_n)}_{\text{Posterior spread}}\Big].$$

This yields an interpretable checklist: improve teacher bias/variance to tighten dataset shif...

Definitions: Let $U$ be a random perturbation, independent of all other randomness, uniformly distributed on the Euclidean ball $\{u : \|u\| \le \rho\}$. For any dataset $D$ and model $f$, define empirical and population sharpness:

$$S_D(f) := \mathbb{E}_U[L_D(f+U)] - L_D(f), \qquad S_P(f) := \mathbb{E}_U[L_P(f+U)] - L_P(f). \tag{40}$$

Define the perturbed generalization gap

$$h_U(D, f) := \mathbb{E}_U[h(D, f+U)] = \mathbb{E}_U[L_P(f+U) - L_D(f+U)]. \tag{41}$$

Statement: Theorem 12 (Sharpness-aware distillation generalization bound). Assume:

• (i) (Local optimality in expectation) Under $\Delta_{\hat{D}_n, f_S}$,
$$\mathbb{E}[L_P(f_S)] \le \mathbb{E}\big[\mathbb{E}_U[L_P(f_S+U)]\big]. \tag{42}$$
• (ii) Under the teacher process $\Delta_{D_n, f_T}$, both $h_U(D_n, f_T)$ and $S_{D_n}(f_T)$ are sub-Gaussian with proxies $\sigma_u^2$ and $\nu^2$, respectively.

Then

$$\mathrm{gen}_S \le \mathrm{gen}_T + \mathbb{E}_{\Delta_{D_n, f_T}}[S_P(f_T)] + (\sigma_u + \nu)\sqrt{2K_n}. \tag{43}$$

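Proof. By definition, $\mathrm{gen}_S = \mathbb{E}_{\Delta_{\hat{D}_n, f_S}}\big[L_P(f_S) - L_{\hat{D}_n}(f_S)\big]$. Using (42),

$$\mathrm{gen}_S \le \mathbb{E}_{\Delta_{\hat{D}_n, f_S}}\big[\mathbb{E}_U[L_P(f_S+U)] - L_{\hat{D}_n}(f_S)\big].$$

Add and subtract $\mathbb{E}_U[L_{\hat{D}_n}(f_S+U)]$ inside the expectation:

$$\mathrm{gen}_S \le \mathbb{E}_{\Delta_{\hat{D}_n, f_S}}\big[\mathbb{E}_U\big(L_P(f_S+U) - L_{\hat{D}_n}(f_S+U)\big)\big] + \mathbb{E}_{\Delta_{\hat{D}_n, f_S}}\big[\mathbb{E}_U\big(L_{\hat{D}_n}(f_S+U)\big) - L_{\hat{D}_n}(f_S)\big].$$

Recognize the two terms using (41) and (40): $\mathrm{gen}_S \le \mathbb{E}_{\Delta_{\hat{D}_n, f_S}}[h_U(\hat{D}_n, f_S)] + \mathbb{E}_{\Delta_{\hat{D}}}$...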

Assumptions used: We use the following conditions, matching the main text:

• (A1) Bounded loss: $\ell(\cdot; z) \in [a, b]$.
• (A2) Parameter stability: for neighboring datasets $D, D^{(i)}$, $\|A_T(D) - A_T(D^{(i)})\| \le \kappa/n$.
• (A3) Global Lipschitz: for all $z$ and all $f, f'$, $|\ell(f; z) - \ell(f'; z)| \le L\|f - f'\|$.
• (A4) Local regularity on $B(f_T, 2\rho)$: for all $z$, $\ell(\cdot; z)$ is $\alpha$-smooth and $\|\nabla \ell(f_T; z)\| \le g_0$. Th...

Bounded differences for the (unperturbed) teacher gap: Lemma 13 (Bounded differences for $h(D_n, f_T)$ under (A1)–(A3)). Let $f_T = A_T(D_n)$ and $f_T' = A_T(D_n^{(i)})$ for neighboring datasets. Then

$$|h(D_n, f_T) - h(D_n^{(i)}, f_T')| \le \frac{2\kappa L + (b - a)}{n}. \tag{48}$$

Consequently, $h(D_n, f_T)$ is sub-Gaussian with proxy $\sigma_0 = \frac{1}{\sqrt{n}}\big(\kappa L + \frac{b-a}{2}\big)$.

Proof. Write $h(D, f) = L_P(f) - L_D(f)$. Then $|h(D$...
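The step from (48) to the proxy $\sigma_0$ can be made concrete. The sketch below assumes the standard bounded-differences fact that a function of $n$ independent samples with per-coordinate differences $c$ is sub-Gaussian with variance proxy $n c^2 / 4$; the source proof is cut off here, so this is a reading of the argument rather than a quote. Function names and numeric values are illustrative.

```python
import math

def bounded_difference_constant(kappa, lipschitz, a, b, n):
    """Per-sample bounded-difference constant on the right side of (48)."""
    return (2 * kappa * lipschitz + (b - a)) / n

def sub_gaussian_proxy(c, n):
    """Proxy sigma for n bounded differences of equal size c: sigma^2 = n * c^2 / 4."""
    return math.sqrt(n) * c / 2

# Hypothetical constants for illustration only.
kappa, lipschitz, a, b, n = 2.0, 1.5, 0.0, 1.0, 100
c = bounded_difference_constant(kappa, lipschitz, a, b, n)
sigma_0 = sub_gaussian_proxy(c, n)

# Closed form stated in Lemma 13.
sigma_0_closed_form = (kappa * lipschitz + (b - a) / 2) / math.sqrt(n)
```

Both expressions agree, and the proxy decays at the rate $1/\sqrt{n}$ when $\kappa$, $L$, and $b - a$ are held fixed.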

Bounded differences for the perturbed gap $h_U$: Lemma 14 (Proxy for $\sigma_u(\rho)$ under (A1), (A2), (A4)). Under (A1), (A2), (A4), the perturbed gap $h_U(D_n, f_T)$ is sub-Gaussian with proxy

$$\sigma_u(\rho) = \frac{1}{\sqrt{n}}\Big(\kappa(g_0 + \alpha\rho) + \frac{b-a}{2}\Big).$$

Proof. The proof repeats Lemma 13, replacing the global Lipschitz constant $L$ by the local Lipschitz scale $(g_0 + \alpha\rho)$ valid on the perturbation region. Step ...

Bounded differences for empirical sharpness and the proxy $\nu(\rho)$: Lemma 15 (Proxy for $\nu(\rho)$ under (A2), (A4)). Under (A2) and (A4), assume moreover that for any neighboring datasets $D, D^{(i)}$, if we set $f := A_T(D)$, $f' := A_T(D^{(i)})$, then the $\rho$-neighborhood of the line segment joining $f$ and $f'$ lies inside the local region on which the Hessian bound in (A4) is valid. Th...

Bounding the population sharpness by curvature: Lemma 16 (Population sharpness bound under (A5′)). Assume $\sup_{\|v\| \le \rho} \|\nabla^2 L_P(f_T + v)\|_{\mathrm{op}} \le \tau_{\mathrm{op}}$. Then

$$\mathbb{E}[S_P(f_T)] \le \tfrac{1}{2}\tau_{\mathrm{op}}\rho^2.$$

Proof. Recall $S_P(f_T) = \mathbb{E}_U\big[L_P(f_T + U) - L_P(f_T)\big]$. By Taylor's theorem with integral remainder,

$$L_P(f_T + u) - L_P(f_T) = \langle \nabla L_P(f_T), u \rangle + \int_0^1 (1 - t)\, u^\top \nabla^2 L_P(f_T + tu)\, u \, dt.$$

Taking expectation ov...
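As a worked illustration of Lemma 16, consider a quadratic population loss. This example is not from the paper: the symbols $H$, $f^\star$, and the parameter dimension $p$ are introduced here only for illustration. For $U$ uniform on the ball of radius $\rho$ in $\mathbb{R}^p$, $\mathbb{E}[U] = 0$ and $\mathbb{E}[UU^\top] = \frac{\rho^2}{p+2} I_p$, so the sharpness is available in closed form.

```latex
% Assumed quadratic loss: L_P(f) = \tfrac12 (f - f^\star)^\top H (f - f^\star),
% with H \succeq 0 and \|H\|_{\mathrm{op}} = \tau_{\mathrm{op}}.
\begin{aligned}
S_P(f_T)
  &= \mathbb{E}_U\big[\langle H(f_T - f^\star), U\rangle\big]
     + \tfrac12\,\mathbb{E}_U\big[U^\top H U\big] \\
  &= 0 + \tfrac12\,\mathrm{tr}\big(H\,\mathbb{E}[UU^\top]\big)
   = \frac{\rho^2\,\mathrm{tr}(H)}{2(p+2)} \\
  &\le \frac{p}{p+2}\cdot\tfrac12\,\tau_{\mathrm{op}}\rho^2
   \;\le\; \tfrac12\,\tau_{\mathrm{op}}\rho^2 .
\end{aligned}
```

In this case the lemma's bound holds with room to spare whenever the Hessian spectrum is not concentrated at $\tau_{\mathrm{op}}$, since $\mathrm{tr}(H) \le p\,\tau_{\mathrm{op}}$ is then strict.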

Baseline and sharpness-aware bounds: Define

$$B_{\mathrm{std}} := \mathrm{gen}_T + \sigma_0\sqrt{2K_n}, \qquad B_{\mathrm{sh}}(\rho) := \mathrm{gen}_T + \mathbb{E}[S_P(f_T)] + (\sigma_u(\rho) + \nu(\rho))\sqrt{2K_n}.$$

We seek conditions under which $B_{\mathrm{sh}}(\rho) < B_{\mathrm{std}}$.

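Proof. A sufficient condition for $B_{\mathrm{sh}}(\rho) < B_{\mathrm{std}}$ is

$$\mathbb{E}[S_P(f_T)] < \big(\sigma_0 - \sigma_u(\rho) - \nu(\rho)\big)\sqrt{2K_n}.$$

By Lemma 16, $\mathbb{E}[S_P(f_T)] \le \tfrac{1}{2}\tau_{\mathrm{op}}\rho^2$. Moreover, using the standard proxy $\sigma_0 = \frac{1}{\sqrt{n}}\big(\kappa L + \frac{b-a}{2}\big)$, the local proxy $\sigma_u(\rho) \le \frac{1}{\sqrt{n}}\big(\kappa(g_0 + \alpha\rho) + \frac{b-a}{2}\big)$, and Lemma 15, $\nu(\rho) = \frac{1}{\sqrt{n}}\big(\frac{\alpha\kappa}{2}\rho + g_0\rho + \frac{\alpha}{2}\rho^2\big)$, we obtain the lower bound

$$\sigma_0 - \sigma_u(\rho) - \nu(\rho) \ge \frac{1}{\sqrt{n}}\big(A_0 - A_1\rho - A_2\rho^2\big),$$

where $A_0 := \kappa($...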
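The comparison between the two bounds can be sketched numerically. The code below uses only the Python standard library and treats every quantity as a user-supplied number; the function names and the example values are hypothetical and are not results from the paper.

```python
import math

def baseline_bound(gen_T, sigma_0, K_n):
    """B_std = gen_T + sigma_0 * sqrt(2 K_n)."""
    return gen_T + sigma_0 * math.sqrt(2 * K_n)

def sharpness_aware_bound(gen_T, exp_sharpness, sigma_u, nu, K_n):
    """B_sh = gen_T + E[S_P(f_T)] + (sigma_u + nu) * sqrt(2 K_n)."""
    return gen_T + exp_sharpness + (sigma_u + nu) * math.sqrt(2 * K_n)

def sharpness_bound_is_tighter(exp_sharpness, sigma_0, sigma_u, nu, K_n):
    """Check E[S_P(f_T)] < (sigma_0 - sigma_u - nu) * sqrt(2 K_n)."""
    return exp_sharpness < (sigma_0 - sigma_u - nu) * math.sqrt(2 * K_n)

# Hypothetical values: a flat teacher (small sharpness, small local proxies).
gen_T, K_n = 0.05, 0.5
sigma_0, sigma_u, nu = 0.35, 0.20, 0.05
exp_sharpness = 0.02

B_std = baseline_bound(gen_T, sigma_0, K_n)
B_sh = sharpness_aware_bound(gen_T, exp_sharpness, sigma_u, nu, K_n)
tighter = sharpness_bound_is_tighter(exp_sharpness, sigma_0, sigma_u, nu, K_n)
```

With these numbers the sharpness-aware bound is the smaller of the two. The gain requires the local proxies $\sigma_u + \nu$ to fall below the global proxy $\sigma_0$ by enough to pay for the added sharpness term.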

V. APPENDIX

A. Preliminary Tools

Donsker–Varadhan change of measure inequality: Lemma 6 (Donsker–Varadhan inequality). Let $P, Q$ be probability measures on the same measurable space with $Q \ll P$. For any measurable function $g$ with $\mathbb{E}_P[e^g] < \infty$,

$$\mathbb{E}_Q[g] \le \mathrm{KL}(Q \,\|\, P) + \log \mathbb{E}_P[e^g]. \tag{10}$$

Proof. Define the Radon–Nikodym derivative $r := \frac{dQ}{dP}$. Then $\mathbb{E}_P[r] = 1$ and $\mathrm{KL}(Q \,\|\, P) = \mathbb{E}_Q\big[\log \frac{dQ}{dP}\big] = \mathbb{E}_P[r \log r]$. By Jensen's inequa...
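Inequality (10) is easy to check on a finite sample space. The sketch below uses only the Python standard library; the distributions `P`, `Q` and the test function `g` are arbitrary illustrative choices. It also evaluates the gap at $g = \log \frac{dQ}{dP}$, where (10) holds with equality.

```python
import math

def kl(q, p):
    """KL(Q || P) for discrete distributions given as aligned lists."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def dv_gap(q, p, g):
    """Right side minus left side of (10); nonnegative whenever Q << P."""
    lhs = sum(qi * gi for qi, gi in zip(q, g))
    rhs = kl(q, p) + math.log(sum(pi * math.exp(gi) for pi, gi in zip(p, g)))
    return rhs - lhs

P = [0.5, 0.3, 0.2]
Q = [0.2, 0.2, 0.6]
g = [1.0, -0.5, 2.0]                      # arbitrary test function
gap = dv_gap(Q, P, g)

# The log-likelihood ratio attains equality in (10).
g_star = [math.log(qi / pi) for qi, pi in zip(Q, P)]
gap_star = dv_gap(Q, P, g_star)
```

The equality case shows why the bound is tight enough to serve as a change-of-measure tool: the KL term is exactly the price of moving the expectation from $P$ to $Q$ for the worst-case $g$.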

A standard sub-Gaussian mgf bound: Lemma 7 (Sub-Gaussian mgf bound). Let $X$ be $\sigma^2$-sub-Gaussian, meaning $\log \mathbb{E}[\exp(\lambda(X - \mathbb{E}[X]))] \le \frac{\lambda^2\sigma^2}{2}$ for all $\lambda \in \mathbb{R}$. Then for any $\lambda \in \mathbb{R}$,

$$\log \mathbb{E}[e^{\lambda X}] \le \lambda\,\mathbb{E}[X] + \frac{\lambda^2\sigma^2}{2}. \tag{11}$$

Proof. By definition, $\mathbb{E}[e^{\lambda X}] = \mathbb{E}\big[e^{\lambda(X - \mathbb{E}[X])}\big] \cdot e^{\lambda \mathbb{E}[X]}$. Taking logs and applying the sub-Gaussian condition gives (11).

B. Proof of Theorem 1

Statement: Theorem 8 (Distillation generalization upper bound). Assume that $h(D_n, f_T)$ is $\sigma^2$-sub-Gaussian under $\Delta_{D_n, f_T}$. Then

$$\mathrm{gen}_S \le \mathrm{gen}_T + \sigma\sqrt{2K_n}. \tag{12}$$

Proof. Fix any $\lambda > 0$ and choose in Lemma 6

$$P = \Delta_{D_n, f_T}, \qquad Q = \Delta_{\hat{D}_n, f_S}, \qquad g(d, f) = \lambda\, h(d, f).$$

Then (10) gives

$$\mathbb{E}_{\Delta_{\hat{D}_n, f_S}}[\lambda h(\hat{D}_n, f_S)] \le \mathrm{KL}(\Delta_{\hat{D}_n, f_S} \,\|\, \Delta_{D_n, f_T}) + \log \mathbb{E}_{\Delta_{D_n, f_T}}\big[e^{\lambda h(D_n, f_T)}\big].$$

By definitions, the left side equals $\lambda\,\mathrm{gen}_S$ and the KL term equals $K_n$, so

$$\lambda\,\mathrm{gen}_S \le K_n + \log \mathbb{E}_{\Delta_{D_n, f_T}}\big[e^{\lambda h(D_n, f_T)}\big]. \tag{13}$$

Since $h(D_n, f_T)$ is $\sigma^2$-sub-Gaussian with mean $\mathrm{gen}_T$, L...
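The source text is cut off at this point. A plausible completion, assuming the argument proceeds by applying Lemma 7 to (13) and then optimizing over $\lambda$ (with $K_n > 0$), runs as follows; it recovers (12) exactly.

```latex
\begin{aligned}
\lambda\,\mathrm{gen}_S
  &\le K_n + \lambda\,\mathrm{gen}_T + \frac{\lambda^2\sigma^2}{2}
  && \text{(Lemma 7 applied to (13))} \\
\mathrm{gen}_S
  &\le \mathrm{gen}_T + \frac{K_n}{\lambda} + \frac{\lambda\sigma^2}{2}
  && \text{(divide by } \lambda > 0\text{)} \\
\lambda^\star
  &= \frac{\sqrt{2K_n}}{\sigma}
  && \text{(minimizer of the right side)} \\
\frac{K_n}{\lambda^\star} + \frac{\lambda^\star\sigma^2}{2}
  &= \frac{\sigma\sqrt{2K_n}}{2} + \frac{\sigma\sqrt{2K_n}}{2}
   = \sigma\sqrt{2K_n}.
\end{aligned}
```

When $K_n = 0$ the same conclusion follows by letting $\lambda \to 0^+$ in the second line.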

Statement: Proposition 9 (Sub-Gaussianity via stability). Assume the loss is bounded: $\ell(f, z) \in [a, b]$ for all $f, z$. Assume the teacher algorithm $A_T$ is $\beta$-uniformly stable: for any neighboring datasets $D_n$ and $D_n^{(i)}$ differing in one example,

$$\sup_{z \in \mathcal{Z}} \big|\ell(A_T(D_n), z) - \ell(A_T(D_n^{(i)}), z)\big| \le \beta.$$

Let $H(D_n) := L_P(A_T(D_n)) - L_{D_n}(A_T(D_n))$. Then $H(D_n)$ satisfies bounded differen...

Proof. Let $D_n = (z_1, \ldots, z_n)$ and let $D_n^{(i)} = (z_1, \ldots, z_{i-1}, z_i', z_{i+1}, \ldots, z_n)$ be a neighboring dataset. Define $f := A_T(D_n)$, $f' := A_T(D_n^{(i)})$. Then

$$H(D_n) = \mathbb{E}_{Z \sim P}[\ell(f, Z)] - \frac{1}{n}\sum_{j=1}^{n} \ell(f, z_j), \qquad H(D_n^{(i)}) = \mathbb{E}_{Z \sim P}[\ell(f', Z)] - \frac{1}{n}\sum_{j=1}^{n} \ell(f', z_j^{(i)}),$$

where $z_j^{(i)} = z_j$ for $j \ne i$ and $z_i^{(i)} = z_i'$. By uniform stability and taking expecta...

Central condition definition: Definition 2 ($(\eta, c)$-central condition). A random variable $X$ satisfies the $(\eta, c)$-central condition under $P$ if $\eta > 0$ and $0 < c \le 1$ and

$$\log \mathbb{E}_P[e^{-\eta X}] \le -c\,\eta\,\mathbb{E}_P[X]. \tag{22}$$

Statement: Theorem 10 (Distillation generalization lower bound). Assume $h(D_n, f_T)$ satisfies the $(\eta, c)$-central condition under $\Delta_{D_n, f_T}$. Then

$$\mathrm{gen}_S \ge c \cdot \mathrm{gen}_T - \frac{1}{\eta} K_n. \tag{23}$$

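Proof. Apply Lemma 6 with

$$P = \Delta_{D_n, f_T}, \qquad Q = \Delta_{\hat{D}_n, f_S}, \qquad g(d, f) = -\eta\, h(d, f).$$

Then

$$\mathbb{E}_{\Delta_{\hat{D}_n, f_S}}[-\eta\, h(\hat{D}_n, f_S)] \le K_n + \log \mathbb{E}_{\Delta_{D_n, f_T}}\big[e^{-\eta\, h(D_n, f_T)}\big]. \tag{24}$$

By assumption, $h(D_n, f_T)$ satisfies (22) with $X = h(D_n, f_T)$ and $P = \Delta_{D_n, f_T}$:

$$\log \mathbb{E}_{\Delta_{D_n, f_T}}\big[e^{-\eta\, h(D_n, f_T)}\big] \le -c\,\eta\,\mathbb{E}_{\Delta_{D_n, f_T}}[h(D_n, f_T)] = -c\,\eta\,\mathrm{gen}_T. \tag{25}$$

Substitute (25) into (24): $-\eta\,\mathbb{E}_{\Delta_{\hat{D}_n, f_S}}[h(\hat{D}_n, f_S)] \le K_n$ ...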
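A small sketch of the lower bound (23) as a function of its inputs, using only the Python standard library. The function name and the numeric values are illustrative, and $(\eta, c)$ are taken as given rather than derived.

```python
import math

def distillation_lower_bound(gen_T, eta, c, K_n):
    """Right side of (23): c * gen_T - K_n / eta."""
    if not (eta > 0 and 0 < c <= 1):
        raise ValueError("central condition requires eta > 0 and 0 < c <= 1")
    return c * gen_T - K_n / eta

# Hypothetical values for illustration.
gen_T, eta, c = 0.1, 2.0, 0.8
lower_small_K = distillation_lower_bound(gen_T, eta, c, K_n=0.01)
lower_large_K = distillation_lower_bound(gen_T, eta, c, K_n=0.04)
```

The lower bound decreases linearly in $K_n$, whereas the upper bound (12) grows with $\sqrt{2K_n}$, which is the sharper dependence on the divergence noted in the abstract.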

Sub-Gaussianity implies a central condition for small $\eta$: Remark 2 (Deriving a valid $(\eta, c)$ from sub-Gaussianity). Assume $X$ is $\sigma^2$-sub-Gaussian with mean $\mu = \mathbb{E}[X] > 0$. Then for any $\eta > 0$,

$$\log \mathbb{E}[e^{-\eta X}] \le -\eta\mu + \frac{\eta^2\sigma^2}{2} = -\eta\mu\Big(1 - \frac{\eta\sigma^2}{2\mu}\Big).$$

Thus $X$ satisfies the $(\eta, c)$-central condition with $c \le 1 - \frac{\eta\sigma^2}{2\mu}$, provided that $0 < \eta < \frac{2\mu}{\sigma^2}$.

E. Linear Gaussian Case Study (Detailed KL Decomposition)

Matrix normal definition and basic identities: Definition 3 (Matrix normal distribution). A random matrix $A \in \mathbb{R}^{k \times n}$ follows a matrix normal distribution $A \sim \mathcal{MN}(M, U, V)$ if $\mathrm{vec}(A) \sim \mathcal{N}(\mathrm{vec}(M), V \otimes U)$, where $M \in \mathbb{R}^{k \times n}$, $U \in \mathbb{R}^{k \times k}$, $V \in \mathbb{R}^{n \times n}$. We use the vectorization identity $\mathrm{vec}(WX) = (X^\top \otimes I_k)\,\mathrm{vec}(W)$, valid for $W \in \mathbb{R}^{k \times d}$ and $X \in \mathbb{R}^{d \times n}$. For Gaussians with the same covariance, we use...
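The vectorization identity can be verified numerically. The sketch below assumes NumPy and uses column-stacking for $\mathrm{vec}(\cdot)$, which is the convention under which the identity holds as written; the matrix sizes are arbitrary.

```python
import numpy as np

def vec(A):
    """Column-stacking vectorization (Fortran order)."""
    return A.reshape(-1, order="F")

rng = np.random.default_rng(0)
k, d, n = 3, 4, 5                         # hypothetical sizes
W = rng.normal(size=(k, d))
X = rng.normal(size=(d, n))

lhs = vec(W @ X)                          # length k * n
rhs = np.kron(X.T, np.eye(k)) @ vec(W)    # (n*k x d*k) times (d*k,)
```

Note that NumPy's default `reshape` is row-major, so omitting `order="F"` would stack rows instead of columns and the identity would fail in this form.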

Generative model: Collect features and labels column-wise into $X \in \mathbb{R}^{d \times n}$ and $Y \in \mathbb{R}^{k \times n}$ so that $D_n = (X, Y)$. Assume a noisy linear label channel

$$Y \mid X \sim \mathcal{MN}(W^\star X, I_k, \nu^2 I_n),$$

where $W^\star \in \mathbb{R}^{k \times d}$ is the ground-truth linear map and $\nu > 0$ is the noise level.

Teacher as a Gibbs learner (closed form): Let the teacher parameter be $W \in \mathbb{R}^{k \times d}$ with prior $p_0(W) = \mathcal{MN}(0, I_k, \lambda^{-1} I_d)$. Given $D_n = (X, Y)$, define the Gibbs posterior with inverse temperature $\beta_T$:

$$q_T(W \mid D_n) \propto p_0(W)\exp\Big(-\frac{\beta_T}{2\nu^2}\|Y - WX\|_F^2\Big).$$

Lemma 11 (Closed form of $q_T$). The posterior is matrix normal:

$$q_T(W \mid D_n) = \mathcal{MN}(\bar{W}_T, I_k, \Sigma_T), \qquad \Sigma_T = \Big(\lambda I_d + \frac{\beta_T}{\nu^2} XX^\top\Big)^{-1},$$ ...
