On the Generalization of Knowledge Distillation: An Information-Theoretic View
Pith reviewed 2026-05-19 17:58 UTC · model grok-4.3
The pith
Modeling teacher and student training as coupled processes yields generalization bounds via their KL divergence.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By treating teacher and student training as coupled stochastic processes, the authors introduce the distillation divergence: the Kullback-Leibler divergence between the corresponding stochastic kernels. From this quantity they derive an upper bound on the student's generalization gap relative to the teacher's under a sub-Gaussian assumption via algorithmic stability, and a lower bound under a central condition with sharper dependence on the divergence. A loss-sharpness-aware refinement shows that the teacher's local flatness can strictly tighten the bound within an explicit regime, while a linear Gaussian case study decomposes the divergence into interpretable bias, variance, and rank-bottleneck costs.
What carries the argument
The distillation divergence, defined as the Kullback-Leibler divergence between the stochastic kernels of the teacher and student training processes, quantifies the difference between the two processes and transfers generalization properties from teacher to student.
If this is right
- If the distillation divergence remains small, the student's generalization gap stays close to the teacher's gap.
- The lower bound implies that when the distillation divergence is small, the student's generalization gap cannot fall much below a constant fraction of the teacher's; a large divergence makes this floor vacuous.
- Incorporating the teacher's loss sharpness yields a strictly tighter bound when the teacher is locally flat.
- In linear Gaussian models the divergence breaks into bias, variance, and rank costs that can guide choices such as model architecture or training rank.
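The first two consequences can be made concrete with a minimal sketch that evaluates the two bounds quoted later on this page, gen_S ≤ gen_T + σ√(2K_n) and gen_S ≥ c·gen_T − K_n/η. The function names and the numbers in the loop are illustrative assumptions, not values from the paper.

```python
import math


def student_gap_upper_bound(gen_teacher, sigma, k_n):
    """Upper bound on the student gap: gen_T + sigma * sqrt(2 * K_n)."""
    if sigma < 0 or k_n < 0:
        raise ValueError("sigma and k_n must be non-negative")
    return gen_teacher + sigma * math.sqrt(2.0 * k_n)


def student_gap_lower_bound(gen_teacher, c, eta, k_n):
    """Lower bound on the student gap: c * gen_T - K_n / eta."""
    if not (0.0 < c <= 1.0) or eta <= 0 or k_n < 0:
        raise ValueError("need 0 < c <= 1, eta > 0 and k_n >= 0")
    return c * gen_teacher - k_n / eta


# Hypothetical inputs chosen only to show how the interval widens with K_n.
GEN_TEACHER, SIGMA, C, ETA = 0.10, 0.5, 0.5, 2.0
for k_n in (0.0, 0.01, 0.1, 1.0):
    low = student_gap_lower_bound(GEN_TEACHER, C, ETA, k_n)
    high = student_gap_upper_bound(GEN_TEACHER, SIGMA, k_n)
    print(f"K_n={k_n:<5} lower={low:+.3f} upper={high:+.3f}")
```

As K_n grows, the upper bound loosens at rate √K_n while the lower bound loosens linearly, so both statements are informative only when the divergence is small.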
Where Pith is reading between the lines
- Minimizing the distillation divergence during training could serve as a practical objective that improves distillation results beyond standard soft-label losses.
- The coupled-process view may extend to other transfer settings by defining analogous divergences between source and target training kernels.
- Testing the bounds on nonlinear networks would reveal whether the linear Gaussian decomposition offers useful design rules outside the analyzed case.
Load-bearing premise
Teacher and student training can be modeled as coupled stochastic processes whose kernels admit a well-defined KL divergence that can be bounded or decomposed in the stated ways.
What would settle it
An experiment or calculation showing that the student's generalization error exceeds the derived upper bound even when the distillation divergence is small and the sub-Gaussian assumption holds.
Original abstract
Knowledge distillation is widely used to improve generalization in practice, yet its theoretical understanding remains elusive. In the standard distillation setting, a teacher model provides soft predictions to guide the training of a student model. We model teacher and student training as coupled stochastic processes and introduce a distillation divergence, defined as the Kullback-Leibler divergence between these two stochastic kernels. Within this framework, we derive two generalization bounds for the student model relative to the teacher's generalization gap: an upper bound under a sub-Gaussian assumption via algorithmic stability, and a lower bound under a central condition with sharper dependence on the distillation divergence. We further develop a loss-sharpness-aware bound with an explicit tightness regime, showing that the teacher's local flatness can strictly tighten the bound. Additionally, in a linear Gaussian case study, the distillation divergence admits an interpretable decomposition into bias, variance, and rank-bottleneck costs, yielding practical guidance for distillation design.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper models teacher and student training in knowledge distillation as coupled stochastic processes and defines a distillation divergence as the KL divergence between their stochastic kernels. It derives an upper bound on the student's generalization gap (relative to the teacher's) under a sub-Gaussian assumption via algorithmic stability, a lower bound under a central condition with sharper dependence on the divergence, a loss-sharpness-aware bound with an explicit tightness regime for the teacher's local flatness, and an interpretable bias-variance-rank decomposition of the divergence in a linear-Gaussian case study.
Significance. If the coupled-process modeling and bounds hold with the stated assumptions, the work provides a useful information-theoretic lens on distillation that explicitly ties the divergence to generalization gaps and offers practical design guidance via the linear-case decomposition. The combination of stability-based upper bounds, central-condition lower bounds, and the sharpness tightness regime is a constructive contribution to KD theory.
major comments (3)
- [Framework section] The central construction defines the distillation divergence via KL between stochastic kernels of coupled teacher-student processes, which presupposes a joint law over simultaneous training dynamics. This modeling choice is load-bearing for both the upper and lower bounds, yet standard KD practice fixes a pre-trained teacher and trains the student independently; the manuscript should clarify whether the bounds extend to the sequential regime or require additional approximation arguments.
- [Generalization bounds, upper bound via algorithmic stability] The sub-Gaussian assumption and stability parameter are invoked to relate the student gap to the teacher gap with explicit dependence on the distillation divergence. The paper should verify or discuss how these assumptions are satisfied for typical neural-network losses in distillation, as violation would weaken the claimed explicit dependence.
- [Linear-Gaussian case study] The bias-variance-rank decomposition of the distillation divergence is presented as yielding practical guidance. The derivation should be shown to follow directly from the KL definition of the divergence without post-hoc parameter fitting, and the regime of validity (e.g., when the rank-bottleneck term dominates) should be stated explicitly.
minor comments (2)
- [Abstract] The phrase 'sharper dependence on the distillation divergence' for the lower bound could be made more precise by indicating the functional form of the dependence if space permits.
- [Notation and definitions] Ensure consistent use of symbols for the stochastic kernels and the distillation divergence across the framework and bound derivations to avoid reader confusion.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and positive assessment of the paper's significance. We address each major comment below and plan to incorporate revisions to clarify the modeling choices and strengthen the discussions.
Point-by-point responses
- Referee: [Framework section] The central construction defines the distillation divergence via KL between stochastic kernels of coupled teacher-student processes, which presupposes a joint law over simultaneous training dynamics. This modeling choice is load-bearing for both the upper and lower bounds, yet standard KD practice fixes a pre-trained teacher and trains the student independently; the manuscript should clarify whether the bounds extend to the sequential regime or require additional approximation arguments.
  Authors: We agree that clarifying this point is important. Our framework models the processes as coupled to rigorously define the distillation divergence and derive the bounds. In the standard sequential setting, the pre-trained teacher can be viewed as having a fixed stochastic kernel, and the joint law can be approximated by the product of the teacher's converged distribution and the student's training process. We will add a paragraph in the Framework section discussing this approximation and how the bounds apply with minor modifications.
  Revision: yes
- Referee: [Generalization bounds, upper bound via algorithmic stability] The sub-Gaussian assumption and stability parameter are invoked to relate the student gap to the teacher gap with explicit dependence on the distillation divergence. The paper should verify or discuss how these assumptions are satisfied for typical neural-network losses in distillation, as violation would weaken the claimed explicit dependence.
  Authors: The sub-Gaussian assumption is a common technical condition in algorithmic stability analyses and is satisfied for losses with bounded range or sub-Gaussian tails, which can be ensured in distillation by using temperature scaling to control the softness of the predictions. We will include a discussion in the relevant section on how this assumption holds for typical KD losses like the KL divergence between teacher and student outputs, assuming bounded logits or appropriate regularization.
  Revision: yes
- Referee: [Linear-Gaussian case study] The bias-variance-rank decomposition of the distillation divergence is presented as yielding practical guidance. The derivation should be shown to follow directly from the KL definition of the divergence without post-hoc parameter fitting, and the regime of validity (e.g., when the rank-bottleneck term dominates) should be stated explicitly.
  Authors: The decomposition is obtained directly by applying the closed-form KL divergence formula for multivariate Gaussians to the stochastic kernels defined in our framework, resulting in separate terms for bias (mean shift), variance (covariance mismatch), and rank (dimensionality reduction effect). There is no post-hoc fitting involved. We will explicitly delineate the validity regime in the case study, specifying that the rank-bottleneck dominates in low-rank student models or when the teacher's feature space has higher effective rank.
  Revision: yes
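The closed-form Gaussian KL that the last response relies on splits naturally into a mean-shift part and a covariance-mismatch part. The sketch below shows that generic two-term split for ordinary multivariate Gaussians. The function name is hypothetical, and the paper's own bias, variance, and rank terms are defined for matrix-normal kernels and are not reproduced here.

```python
import numpy as np


def gaussian_kl_terms(mu0, cov0, mu1, cov1):
    """Return (mean_term, cov_term) with KL(N(mu0,cov0) || N(mu1,cov1)) = sum."""
    mu0 = np.asarray(mu0, dtype=float)
    mu1 = np.asarray(mu1, dtype=float)
    cov0 = np.asarray(cov0, dtype=float)
    cov1 = np.asarray(cov1, dtype=float)
    dim = mu0.shape[0]

    # Mean-shift part: 0.5 * (mu1 - mu0)^T cov1^{-1} (mu1 - mu0).
    diff = mu1 - mu0
    mean_term = 0.5 * float(diff @ np.linalg.solve(cov1, diff))

    # Covariance part: 0.5 * (tr(cov1^{-1} cov0) - dim + log det cov1 - log det cov0).
    trace_term = float(np.trace(np.linalg.solve(cov1, cov0)))
    _, logdet0 = np.linalg.slogdet(cov0)
    _, logdet1 = np.linalg.slogdet(cov1)
    cov_term = 0.5 * (trace_term - dim + logdet1 - logdet0)
    return mean_term, cov_term


# Illustrative inputs only: a pure mean shift, then a pure covariance mismatch.
eye = np.eye(2)
shift_only = gaussian_kl_terms([0.0, 0.0], eye, [1.0, 0.0], eye)
cov_only = gaussian_kl_terms([0.0, 0.0], 2.0 * eye, [0.0, 0.0], eye)
print(shift_only, cov_only)
```

When the two covariances coincide, the second term vanishes and the divergence reduces to a scaled squared distance between means, which is the situation the appendix exploits for the dataset-shift term.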
Circularity Check
Framework and bounds are self-contained; no reduction to inputs by construction
Full rationale
The paper defines a distillation divergence via KL between kernels of coupled teacher-student stochastic processes, then derives generalization bounds (upper via sub-Gaussian + stability; lower via central condition) that explicitly depend on this quantity and the teacher's gap. This is a standard modeling choice followed by derivation under stated assumptions, not a self-definitional loop or fitted input renamed as prediction. The linear-Gaussian decomposition is an analysis of the defined divergence rather than a tautology. No load-bearing self-citations, uniqueness theorems, or ansatzes smuggled via prior work are indicated. The derivation chain remains independent of the target results.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: sub-Gaussian assumption on the loss, used for the algorithmic-stability upper bound
- domain assumption: central condition, used for the lower bound
invented entities (1)
- distillation divergence: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Tag: unclear. Relation between the paper passage and the cited Recognition theorem.
  Passage: We model teacher and student training as coupled stochastic processes and introduce a distillation divergence, defined as the Kullback–Leibler divergence between these two stochastic kernels.
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction
  Tag: unclear. Relation between the paper passage and the cited Recognition theorem.
  Passage: Theorem 1 (Distillation Generalization Upper Bound) ... $\mathrm{gen}_S \le \mathrm{gen}_T + \sigma\sqrt{2K_n}$
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
- [2] Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu, “Deep mutual learning,” 2017. [Online]. Available: https://arxiv.org/abs/1706.00384
- [3] S. You, C. Xu, C. Xu, and D. Tao, “Learning from multiple teacher networks,” in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017, pp. 1285–1294.
- [4] M. Phuong and C. Lampert, “Towards understanding knowledge distillation,” in International Conference on Machine Learning. PMLR, 2019, pp. 5142–5151.
- [5] L. J. Ba and R. Caruana, “Do deep nets really need to be deep?” 2014. [Online]. Available: https://arxiv.org/abs/1312.6184
- [6] D. Lopez-Paz, L. Bottou, B. Schölkopf, and V. Vapnik, “Unifying distillation and privileged information,” 2016. [Online]. Available: https://arxiv.org/abs/1511.03643
- [7] V. Vapnik and R. Izmailov, “Learning using privileged information: Similarity control and knowledge transfer,” Journal of Machine Learning Research, vol. 16, no. 61, pp. 2023–2049, 2015. [Online]. Available: http://jmlr.org/papers/v16/vapnik15b.html
- [8] D. Hsu, Z. Ji, M. Telgarsky, and L. Wang, “Generalization bounds via distillation,” 2021. [Online]. Available: https://arxiv.org/abs/2104.05641
- [9] A. K. Menon, A. S. Rawat, S. Reddi, S. Kim, and S. Kumar, “A statistical perspective on distillation,” in Proceedings of the 38th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, M. Meila and T. Zhang, Eds., vol
- [11] M. Safaryan, A. Peste, and D. Alistarh, “Knowledge distillation performs partial variance reduction,” 2023. [Online]. Available: https://arxiv.org/abs/2305.17581
- [12] L. Yuan, F. E. H. Tay, G. Li, T. Wang, and J. Feng, “Revisiting knowledge distillation via label smoothing regularization,” 2021. [Online]. Available: https://arxiv.org/abs/1909.11723
- [13] G. Ji and Z. Zhu, “Knowledge distillation in wide neural networks: Risk bound, data efficiency and imperfect teacher,” 2020. [Online]. Available: https://arxiv.org/abs/2010.10090
- [14] M. Pham, M. Cho, A. Joshi, and C. Hegde, “Revisiting self-distillation.” [Online]. Available: https://arxiv.org/abs/2206.08491
- [16] S. Hochreiter and J. Schmidhuber, “Flat minima,” Neural Computation, vol. 9, no. 1, pp. 1–42, Jan. 1997. [Online]. Available: https://doi.org/10.1162/neco.1997.9.1.1
- [17] T. Zhang, M. Xue, J. Zhang, H. Zhang, Y. Wang, L. Cheng, J. Song, and M. Song, “Generalization matters: Loss minima flattening via parameter hybridization for efficient online knowledge distillation.” [Online]. Available: https://arxiv.org/abs/2303.14666
- [19] P. Foret, A. Kleiner, H. Mobahi, and B. Neyshabur, “Sharpness-aware minimization for efficiently improving generalization,” 2021. [Online]. Available: https://arxiv.org/abs/2010.01412
- [20] Z. Peng, J. Zhang, Y. Wang, L. Qi, Y. Shi, and Y. Gao, “Leveraging flatness to improve information-theoretic generalization bounds for SGD,” 2026. [Online]. Available: https://arxiv.org/abs/2601.01465
- [21] N. Tishby, F. C. Pereira, and W. Bialek, “The information bottleneck method,” 2000. [Online]. Available: https://arxiv.org/abs/physics/0004057
- [22] C. Wang, Q. Yang, R. Huang, S. Song, and G. Huang, “Efficient knowledge distillation from model checkpoints,” 2022. [Online]. Available: https://arxiv.org/abs/2210.06458
- [23] S. Ahn, S. X. Hu, A. Damianou, N. D. Lawrence, and Z. Dai, “Variational information distillation for knowledge transfer,” 2019. [Online]. Available: https://arxiv.org/abs/1904.05835
- [24] Y. Tian, D. Krishnan, and P. Isola, “Contrastive representation distillation,” 2022. [Online]. Available: https://arxiv.org/abs/1910.10699
- [25] L. Ye, S. M. Hamidi, R. Tan, and E.-H. Yang, “Bayes conditional distribution estimation for knowledge distillation based on conditional mutual information,” 2024. [Online]. Available: https://arxiv.org/abs/2401.08732
- [26] A. Xu and M. Raginsky, “Information-theoretic analysis of generalization capability of learning algorithms,” 2017. [Online]. Available: https://arxiv.org/abs/1705.07809
- [27] X. Wu, J. H. Manton, U. Aickelin, and J. Zhu, “Fast rate information-theoretic bounds on generalization errors,” 2025. [Online]. Available: https://arxiv.org/abs/2303.14658
- [28] P. Chen, H.-F. Yu, I. Dhillon, and C.-J. Hsieh, “Drone: Data-aware low-rank compression for large NLP models,” Advances in Neural Information Processing Systems, vol. 34, pp. 29321–29334, 2021.
V. APPENDIX
A. Preliminary Tools
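- [29] Donsker–Varadhan change of measure inequality: Lemma 6 (Donsker–Varadhan inequality). Let $P, Q$ be probability measures on the same measurable space with $Q \ll P$. For any measurable function $g$ with $\mathbb{E}_P[e^{g}] < \infty$, $\mathbb{E}_Q[g] \le \mathrm{KL}(Q \,\|\, P) + \log \mathbb{E}_P[e^{g}]$. (10) Proof. Define the Radon–Nikodym derivative $r := \frac{dQ}{dP}$. Then $\mathbb{E}_P[r] = 1$ and $\mathrm{KL}(Q \,\|\, P) = \mathbb{E}_Q\!\left[\log \frac{dQ}{dP}\right] = \mathbb{E}_P[r \log r]$. By Jensen’s inequa...
- [30] A standard sub-Gaussian mgf bound: Lemma 7 (Sub-Gaussian mgf bound). Let $X$ be $\sigma^2$-sub-Gaussian, meaning $\log \mathbb{E}[\exp(\lambda(X - \mathbb{E}[X]))] \le \frac{\lambda^2 \sigma^2}{2}$ for all $\lambda \in \mathbb{R}$. Then for any $\lambda \in \mathbb{R}$, $\log \mathbb{E}[e^{\lambda X}] \le \lambda \mathbb{E}[X] + \frac{\lambda^2 \sigma^2}{2}$. (11) Proof. By definition, $\mathbb{E}[e^{\lambda X}] = \mathbb{E}\!\left[e^{\lambda(X - \mathbb{E}[X])}\right] \cdot e^{\lambda \mathbb{E}[X]}$. Taking logs and applying the sub-Gaussian condition gives (11).
B. Proof of Theorem 1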
- [31] Statement: Theorem 8 (Distillation generalization upper bound). Assume that $h(D_n, f_T)$ is $\sigma^2$-sub-Gaussian under $\Delta_{D_n, f_T}$. Then $\mathrm{gen}_S \le \mathrm{gen}_T + \sigma\sqrt{2K_n}$. (12)
- [32] Proof. Fix any $\lambda > 0$ and choose in Lemma 6 $P = \Delta_{D_n, f_T}$, $Q = \Delta_{\hat D_n, f_S}$, $g(d, f) = \lambda\, h(d, f)$. Then (10) gives $\mathbb{E}_{\Delta_{\hat D_n, f_S}}[\lambda h(\hat D_n, f_S)] \le \mathrm{KL}(\Delta_{\hat D_n, f_S} \,\|\, \Delta_{D_n, f_T}) + \log \mathbb{E}_{\Delta_{D_n, f_T}}[e^{\lambda h(D_n, f_T)}]$. By definitions, the left side equals $\lambda\, \mathrm{gen}_S$ and the KL term equals $K_n$, so $\lambda\, \mathrm{gen}_S \le K_n + \log \mathbb{E}_{\Delta_{D_n, f_T}}[e^{\lambda h(D_n, f_T)}]$. (13) Since $h(D_n, f_T)$ is $\sigma^2$-sub-Gaussian with mean $\mathrm{gen}_T$, L...
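The proof excerpt stops shortly after (13). The remaining step is presumably the usual optimization over $\lambda$; the following is a reconstruction sketch from Lemma 7 and (13) under that assumption, not the paper's own text. It assumes $K_n > 0$ and $\sigma > 0$; for $K_n = 0$ one lets $\lambda \to 0^{+}$ instead.

```latex
\begin{align*}
\lambda\,\mathrm{gen}_S
  &\le K_n + \lambda\,\mathrm{gen}_T + \frac{\lambda^2\sigma^2}{2}
  && \text{(Lemma 7 applied to (13))} \\
\mathrm{gen}_S
  &\le \mathrm{gen}_T + \frac{K_n}{\lambda} + \frac{\lambda\sigma^2}{2}
  && \text{(divide by } \lambda > 0\text{)} \\
\lambda^\star = \frac{\sqrt{2K_n}}{\sigma}
  &\;\Longrightarrow\;
  \frac{K_n}{\lambda^\star} + \frac{\lambda^\star\sigma^2}{2}
  = \sigma\sqrt{\tfrac{K_n}{2}} + \sigma\sqrt{\tfrac{K_n}{2}}
  = \sigma\sqrt{2K_n}.
\end{align*}
```

Substituting $\lambda^\star$ into the second line recovers the stated bound (12).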
- [33] Statement: Proposition 9 (Sub-Gaussianity via stability). Assume the loss is bounded: $\ell(f, z) \in [a, b]$ for all $f, z$. Assume the teacher algorithm $A_T$ is $\beta$-uniformly stable: for any neighboring datasets $D_n$ and $D_n^{(i)}$ differing in one example, $\sup_{z \in \mathcal{Z}} \left|\ell(A_T(D_n), z) - \ell(A_T(D_n^{(i)}), z)\right| \le \beta$. Let $H(D_n) := L_P(A_T(D_n)) - L_{D_n}(A_T(D_n))$. Then $H(D_n)$ satisfies bounded differen...
- [34] Proof. Let $D_n = (z_1, \dots, z_n)$ and let $D_n^{(i)} = (z_1, \dots, z_{i-1}, z_i', z_{i+1}, \dots, z_n)$ be a neighboring dataset. Define $f := A_T(D_n)$, $f' := A_T(D_n^{(i)})$. Then $H(D_n) = \mathbb{E}_{Z \sim P}[\ell(f, Z)] - \frac{1}{n}\sum_{j=1}^{n} \ell(f, z_j)$ and $H(D_n^{(i)}) = \mathbb{E}_{Z \sim P}[\ell(f', Z)] - \frac{1}{n}\sum_{j=1}^{n} \ell(f', z_j^{(i)})$, where $z_j^{(i)} = z_j$ for $j \ne i$ and $z_i^{(i)} = z_i'$. By uniform stability and taking expecta...
- [35] Central condition definition: Definition 2 ($(\eta, c)$-central condition). A random variable $X$ satisfies the $(\eta, c)$-central condition under $P$ if $\eta > 0$ and $0 < c \le 1$ and $\log \mathbb{E}_P[e^{-\eta X}] \le -c\,\eta\, \mathbb{E}_P[X]$. (22)
- [36] Statement: Theorem 10 (Distillation generalization lower bound). Assume $h(D_n, f_T)$ satisfies the $(\eta, c)$-central condition under $\Delta_{D_n, f_T}$. Then $\mathrm{gen}_S \ge c \cdot \mathrm{gen}_T - \frac{1}{\eta} K_n$. (23)
- [37] Proof. Apply Lemma 6 with $P = \Delta_{D_n, f_T}$, $Q = \Delta_{\hat D_n, f_S}$, $g(d, f) = -\eta\, h(d, f)$. Then $\mathbb{E}_{\Delta_{\hat D_n, f_S}}[-\eta\, h(\hat D_n, f_S)] \le K_n + \log \mathbb{E}_{\Delta_{D_n, f_T}}[e^{-\eta\, h(D_n, f_T)}]$. (24) By assumption, $h(D_n, f_T)$ satisfies (22) with $X = h(D_n, f_T)$ and $P = \Delta_{D_n, f_T}$: $\log \mathbb{E}_{\Delta_{D_n, f_T}}[e^{-\eta\, h(D_n, f_T)}] \le -c\,\eta\, \mathbb{E}_{\Delta_{D_n, f_T}}[h(D_n, f_T)] = -c\,\eta\, \mathrm{gen}_T$. (25) Substitute (25) into (24): $-\eta\, \mathbb{E}_{\Delta_{\hat D_n, f_S}}[h(\hat D_n, f_S)] \le K_n$ ...
- [38] Sub-Gaussianity implies a central condition for small $\eta$: Remark 2 (Deriving a valid $(\eta, c)$ from sub-Gaussianity). Assume $X$ is $\sigma^2$-sub-Gaussian with mean $\mu = \mathbb{E}[X] > 0$. Then for any $\eta > 0$, $\log \mathbb{E}[e^{-\eta X}] \le -\eta\mu + \frac{\eta^2\sigma^2}{2} = -\eta\mu\left(1 - \frac{\eta\sigma^2}{2\mu}\right)$. Thus $X$ satisfies the $(\eta, c)$-central condition with $c \le 1 - \frac{\eta\sigma^2}{2\mu}$, provided that $0 < \eta < \frac{2\mu}{\sigma^2}$.
E. Linear Gaussian Case Study (Detailed KL Decomposition)
- [39] Matrix normal definition and basic identities: Definition 3 (Matrix normal distribution). A random matrix $A \in \mathbb{R}^{k \times n}$ follows a matrix normal distribution $A \sim \mathcal{MN}(M, U, V)$ if $\mathrm{vec}(A) \sim \mathcal{N}(\mathrm{vec}(M), V \otimes U)$, where $M \in \mathbb{R}^{k \times n}$, $U \in \mathbb{R}^{k \times k}$, $V \in \mathbb{R}^{n \times n}$. We use the vectorization identity $\mathrm{vec}(WX) = (X^\top \otimes I_k)\, \mathrm{vec}(W)$, valid for $W \in \mathbb{R}^{k \times d}$ and $X \in \mathbb{R}^{d \times n}$. For Gaussians with the same covariance, we use...
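The vectorization identity in this excerpt is easy to check numerically. The sketch below assumes the column-stacking convention for vec (the one under which the identity holds) and uses arbitrary small dimensions; the helper name `vec` and the random inputs are illustrative.

```python
import numpy as np


def vec(matrix):
    """Column-stacking vectorization (Fortran order)."""
    return np.asarray(matrix).flatten(order="F")


rng = np.random.default_rng(0)
k, d, n = 2, 3, 4
W = rng.standard_normal((k, d))
X = rng.standard_normal((d, n))

# Left side: vectorize the product directly.
lhs = vec(W @ X)
# Right side: (X^T kron I_k) applied to vec(W); shapes are (n*k, d*k) @ (d*k,).
rhs = np.kron(X.T, np.eye(k)) @ vec(W)

identity_holds = bool(np.allclose(lhs, rhs))
print(identity_holds)
```

With row-major flattening (NumPy's default) the Kronecker factors would have to be swapped, so the `order="F"` argument is what ties the code to the convention used in the text.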
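- [40] Generative model: Collect features and labels column-wise into $X \in \mathbb{R}^{d \times n}$ and $Y \in \mathbb{R}^{k \times n}$ so that $D_n = (X, Y)$. Assume a noisy linear label channel $Y \mid X \sim \mathcal{MN}(W^\star X, I_k, \nu^2 I_n)$, where $W^\star \in \mathbb{R}^{k \times d}$ is the ground-truth linear map and $\nu > 0$ is the noise level.
- [41] Teacher as a Gibbs learner (closed form): Let the teacher parameter be $W \in \mathbb{R}^{k \times d}$ with prior $p_0(W) = \mathcal{MN}(0, I_k, \lambda^{-1} I_d)$. Given $D_n = (X, Y)$, define the Gibbs posterior with inverse temperature $\beta_T$: $q_T(W \mid D_n) \propto p_0(W) \exp\!\left(-\frac{\beta_T}{2\nu^2} \|Y - WX\|_F^2\right)$. Lemma 11 (Closed form of $q_T$). The posterior is matrix normal: $q_T(W \mid D_n) = \mathcal{MN}(\bar W_T, I_k, \Sigma_T)$, $\Sigma_T = \left(\lambda I_d + \frac{\beta_T}{\nu^2} X X^\top\right)^{-1}$, ...
- [42] The prior implies $w \sim \mathcal{N}(0, \lambda^{-1} I_{kd})$. Hence the (unnormalized) log density is quadratic in $w$ with precision $\lambda I_{kd} + \frac{\beta_T}{\nu^2} B_X^\top B_X = \lambda I_{kd} + \frac{\beta_T}{\nu^2} (X X^\top \otimes I_k)$, so the covariance is $\left(\lambda I_d + \frac{\beta_T}{\nu^2} X X^\top\right)^{-1} \otimes I_k = \Sigma_T \otimes I_k$. The mean is the corresponding linear term mapped back to matrix form, yielding $\bar W_T = \frac{\beta_T}{\nu^2}\, Y X^\top \Sigma_T$.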
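The closed form for the teacher posterior can be sketched in a few lines and cross-checked against the vectorized precision given in the excerpt. The function name, the synthetic data, and the hyperparameter values below are assumptions made for illustration.

```python
import numpy as np


def teacher_gibbs_posterior(X, Y, lam, beta_t, nu):
    """Return (W_bar, Sigma_T) for the matrix-normal Gibbs posterior."""
    d = X.shape[0]
    scale = beta_t / nu**2
    # Sigma_T = (lam * I_d + (beta_T / nu^2) * X X^T)^{-1}
    sigma_t = np.linalg.inv(lam * np.eye(d) + scale * (X @ X.T))
    # W_bar = (beta_T / nu^2) * Y X^T Sigma_T
    w_bar = scale * (Y @ X.T @ sigma_t)
    return w_bar, sigma_t


rng = np.random.default_rng(1)
k, d, n = 2, 3, 50
lam, beta_t, nu = 1.0, 1.0, 0.5
W_star = rng.standard_normal((k, d))
X = rng.standard_normal((d, n))
Y = W_star @ X + nu * rng.standard_normal((k, n))

w_bar, sigma_t = teacher_gibbs_posterior(X, Y, lam, beta_t, nu)

# Cross-check in vectorized form: precision * vec(W_bar) = (beta_T/nu^2) vec(Y X^T).
precision = lam * np.eye(k * d) + (beta_t / nu**2) * np.kron(X @ X.T, np.eye(k))
linear_term = (beta_t / nu**2) * (Y @ X.T).flatten(order="F")
w_bar_vec = np.linalg.solve(precision, linear_term)
matches_vectorized = bool(np.allclose(w_bar_vec, w_bar.flatten(order="F")))
print(matches_vectorized)
```

Working with the $d \times d$ matrix $\Sigma_T$ instead of the $kd \times kd$ precision is the practical payoff of the Kronecker structure: the row covariance is the identity, so only the column covariance needs inverting.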
- [43] Pseudo-data generation: Sample $W_T \sim q_T(\cdot \mid D_n)$ and generate pseudo labels through the same noisy channel: $\hat Y \mid (W_T, X) \sim \mathcal{MN}(W_T X, I_k, \nu^2 I_n)$, then define $\hat D_n = (X, \hat Y)$.
- [44] Student capacity constraint via a rank bottleneck: Let the student parameter be $\Theta \in \mathbb{R}^{k \times d}$. Introduce the rank-$\kappa$ map $M^\star(W_T, X)$ as a best rank-$\kappa$ approximation in prediction space: $M^\star(W_T, X) \in \arg\min_{M:\ \mathrm{rank}(M) = \kappa} \|W_T X - W_T M X\|_F^2$. Define a local Gaussian student conditional kernel $q_S(\Theta \mid W_T, X) = \mathcal{MN}(W_T M^\star(W_T, X), I_k, \Sigma_S)$, $\Sigma_S \succ 0$.
- [45] Process-level KL and its two terms: Define the distillation divergence $K_n := \mathrm{KL}(\Delta_{\hat D_n, f_S} \,\|\, \Delta_{D_n, f_T})$. Using the KL chain rule on the dataset-model pair, $K_n = \underbrace{\mathrm{KL}(\Delta_{\hat D_n} \,\|\, \Delta_{D_n})}_{\text{Dataset shift}} + \underbrace{\mathbb{E}_{\hat D_n}\!\left[\mathrm{KL}(\Delta_{f_S \mid \hat D_n} \,\|\, \Delta_{f_T \mid D_n})\right]}_{\text{Algorithm shift}}$. (27) We now bound the two terms.
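The chain-rule split in (27) holds for any pair of joint laws, which a toy discrete example makes visible. In the sketch below the two joint tables over (dataset, model) are made-up numbers standing in for the student and teacher processes; only the decomposition itself comes from the text.

```python
import numpy as np


def kl(p, q):
    """KL divergence between discrete distributions (natural log)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


# Rows index the dataset value, columns index the model value.
joint_student = np.array([[0.10, 0.20, 0.10],
                          [0.25, 0.05, 0.30]])
joint_teacher = np.array([[0.15, 0.15, 0.20],
                          [0.20, 0.10, 0.20]])

# Dataset shift: KL between the dataset marginals.
marg_student = joint_student.sum(axis=1)
marg_teacher = joint_teacher.sum(axis=1)
dataset_shift = kl(marg_student, marg_teacher)

# Algorithm shift: KL between model conditionals, averaged under the student's datasets.
cond_student = joint_student / marg_student[:, None]
cond_teacher = joint_teacher / marg_teacher[:, None]
algorithm_shift = sum(
    marg_student[i] * kl(cond_student[i], cond_teacher[i])
    for i in range(joint_student.shape[0])
)

k_n = kl(joint_student.ravel(), joint_teacher.ravel())
print(k_n, dataset_shift + algorithm_shift)
```

Both pieces are non-negative, so bounding each separately, as the appendix goes on to do, gives a valid bound on $K_n$.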
- [46] Dataset shift bound and bias-variance decomposition: a) Step 1. Condition on $(X, D_n)$ and use convexity of KL: Given $X$ and $D_n$, $\hat Y \mid (X, D_n)$ is a mixture over $W_T \mid D_n$: $\hat Y \mid (X, D_n) \sim \int q_T(W_T \mid D_n)\, \mathcal{MN}(W_T X, I_k, \nu^2 I_n)\, dW_T$. The real label law (given $X$) is $\mathcal{MN}(W^\star X, I_k, \nu^2 I_n)$. By convexity of KL in its first argument, $\mathrm{KL}(\hat Y \mid X, D_n \,\|\, Y \mid X) \le \mathbb{E}_{W_T \mid D_n}\big[\mathrm{KL}\big(\mathcal{MN}(W_T X, I_k, \nu$...
- [47] Algorithm shift bound and the rank-bottleneck decomposition: Define the (expected) algorithm-shift term $\mathrm{KL}_{\mathrm{alg}} := \mathbb{E}_{\hat D_n}\!\left[\mathrm{KL}(\Delta_{f_S \mid \hat D_n} \,\|\, \Delta_{f_T \mid D_n})\right]$. Since the student kernel is conditionally Gaussian given latent $W_T$ (and $X$), $q_S(\Theta \mid \hat D_n)$ is generally a mixture over $W_T$. By convexity of KL in the first argument, conditioning and then averaging yields the reducti...
- [48] Final compact decomposition of $K_n$: Combine (27), (33), and (39): $K_n \le \mathbb{E}_{D_n}\Big[\underbrace{\tfrac{1}{2\nu^2}\big(\mathrm{Bias}(D_n) + k\,\mathrm{Var}(D_n)\big)}_{\text{Teacher prediction error}} + \underbrace{\mathrm{Apx}(D_n)}_{\text{Student capacity / rank bottleneck}} + \underbrace{\mathrm{Cov}(\Sigma_S, \Sigma_T)}_{\text{Geometry mismatch}} + \underbrace{\mathrm{Spread}(D_n)}_{\text{Posterior spread}}\Big]$. This yields an interpretable checklist: improve teacher bias/variance to tighten dataset shif...
- [49] Definitions: Let $U$ be a random perturbation, independent of all other randomness, uniformly distributed on the Euclidean ball $\{u : \|u\| \le \rho\}$. For any dataset $D$ and model $f$, define empirical and population sharpness: $S_D(f) := \mathbb{E}_U[L_D(f + U)] - L_D(f)$, (40) $S_P(f) := \mathbb{E}_U[L_P(f + U)] - L_P(f)$. Define the perturbed generalization gap $h_U(D, f) := \mathbb{E}_U[h(D, f + U)] = \mathbb{E}_U[L_P(f + U) - L_D(f + U)]$. (41)
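The sharpness in (40) is an average over a uniform ball, so it can be estimated by Monte Carlo. The sketch below does this for a hypothetical quadratic loss, where the exact value $\tfrac{1}{2}\,\mathrm{tr}(H)\,\rho^2/(d+2)$ is available for comparison. The loss, the Hessian `H`, the sample count, and the function names are all illustrative assumptions rather than anything specified in the paper.

```python
import numpy as np


def sample_uniform_ball(rng, num_samples, dim, rho):
    """Draw points uniformly from the Euclidean ball of radius rho."""
    directions = rng.standard_normal((num_samples, dim))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # Radius with density proportional to r^(dim-1) gives uniform volume coverage.
    radii = rho * rng.random(num_samples) ** (1.0 / dim)
    return directions * radii[:, None]


def sharpness(loss_fn, f, rho, num_samples=200_000, seed=0):
    """Monte Carlo estimate of E_U[loss(f + U)] - loss(f)."""
    rng = np.random.default_rng(seed)
    perturbations = sample_uniform_ball(rng, num_samples, f.shape[0], rho)
    return float(np.mean(loss_fn(f + perturbations)) - loss_fn(f[None, :])[0])


H = np.diag([1.0, 2.0, 4.0])  # hypothetical Hessian of a quadratic loss


def quadratic_loss(points):
    """Evaluate 0.5 * x^T H x for each row x of points."""
    return 0.5 * np.einsum("ni,ij,nj->n", points, H, points)


f = np.zeros(3)
rho = 0.5
estimate = sharpness(quadratic_loss, f, rho)
exact = 0.5 * np.trace(H) * rho**2 / (f.shape[0] + 2)
curvature_bound = 0.5 * np.max(np.diag(H)) * rho**2
print(estimate, exact, curvature_bound)
```

The last quantity is the $\tfrac{1}{2}\tau_{\mathrm{op}}\rho^2$ ceiling from the curvature lemma quoted further down, with $\tau_{\mathrm{op}}$ taken as the largest eigenvalue of `H`; the estimate sits well below it because the ball average only sees the mean curvature.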
- [50] Statement: Theorem 12 (Sharpness-aware distillation generalization bound). Assume: (i) (Local optimality in expectation) under $\Delta_{\hat D_n, f_S}$, $\mathbb{E}[L_P(f_S)] \le \mathbb{E}[\mathbb{E}_U[L_P(f_S + U)]]$. (42) (ii) Under the teacher process $\Delta_{D_n, f_T}$, both $h_U(D_n, f_T)$ and $S_{D_n}(f_T)$ are sub-Gaussian with proxies $\sigma_u^2$ and $\nu^2$, respectively. Then $\mathrm{gen}_S \le \mathrm{gen}_T + \mathbb{E}_{\Delta_{D_n, f_T}}[S_P(f_T)] + (\sigma_u + \nu)\sqrt{2K_n}$. (43)
- [51] Proof. By definition, $\mathrm{gen}_S = \mathbb{E}_{\Delta_{\hat D_n, f_S}}\!\left[L_P(f_S) - L_{\hat D_n}(f_S)\right]$. Using (42), $\mathrm{gen}_S \le \mathbb{E}_{\Delta_{\hat D_n, f_S}}\!\left[\mathbb{E}_U[L_P(f_S + U)] - L_{\hat D_n}(f_S)\right]$. Add and subtract $\mathbb{E}_U[L_{\hat D_n}(f_S + U)]$ inside the expectation: $\mathrm{gen}_S \le \mathbb{E}_{\Delta_{\hat D_n, f_S}}\!\left[\mathbb{E}_U\big(L_P(f_S + U) - L_{\hat D_n}(f_S + U)\big)\right] + \mathbb{E}_{\Delta_{\hat D_n, f_S}}\!\left[\mathbb{E}_U\big(L_{\hat D_n}(f_S + U)\big) - L_{\hat D_n}(f_S)\right]$. Recognize the two terms using (41) and (40): $\mathrm{gen}_S \le \mathbb{E}_{\Delta_{\hat D_n, f_S}}[h_U(\hat D_n, f_S)] + \mathbb{E}_{\Delta_{\hat D}}$...
- [52] Assumptions used: We use the following conditions, matching the main text: (A1) Bounded loss: $\ell(\cdot\,; z) \in [a, b]$. (A2) Parameter stability: for neighboring datasets $D, D^{(i)}$, $\|A_T(D) - A_T(D^{(i)})\| \le \kappa / n$. (A3) Global Lipschitz: for all $z$ and all $f, f'$, $|\ell(f; z) - \ell(f'; z)| \le L \|f - f'\|$. (A4) Local regularity on $B(f_T, 2\rho)$: for all $z$, $\ell(\cdot\,; z)$ is $\alpha$-smooth and $\|\nabla \ell(f_T; z)\| \le g_0$. Th...
- [53] Bounded differences for the (unperturbed) teacher gap: Lemma 13 (Bounded differences for $h(D_n, f_T)$ under (A1)–(A3)). Let $f_T = A_T(D_n)$ and $f_T' = A_T(D_n^{(i)})$ for neighboring datasets. Then $|h(D_n, f_T) - h(D_n^{(i)}, f_T')| \le \frac{2\kappa L + (b - a)}{n}$. (48) Consequently, $h(D_n, f_T)$ is sub-Gaussian with proxy $\sigma_0 = \frac{1}{\sqrt{n}}\left(\kappa L + \frac{b - a}{2}\right)$. Proof. Write $h(D, f) = L_P(f) - L_D(f)$. Then |h(D...
- [54] Bounded differences for the perturbed gap $h_U$: Lemma 14 (Proxy for $\sigma_u(\rho)$ under (A1), (A2), (A4)). Under (A1), (A2), (A4), the perturbed gap $h_U(D_n, f_T)$ is sub-Gaussian with proxy $\sigma_u(\rho) = \frac{1}{\sqrt{n}}\left(\kappa(g_0 + \alpha\rho) + \frac{b - a}{2}\right)$. Proof. The proof repeats Lemma 13, replacing the global Lipschitz constant $L$ by the local Lipschitz scale $(g_0 + \alpha\rho)$ valid on the perturbation region. Step ...
- [55] Bounded differences for empirical sharpness and the proxy $\nu(\rho)$: Lemma 15 (Proxy for $\nu(\rho)$ under (A2), (A4)). Under (A2) and (A4), assume moreover that for any neighboring datasets $D, D^{(i)}$, if we set $f := A_T(D)$, $f' := A_T(D^{(i)})$, then the $\rho$-neighborhood of the line segment joining $f$ and $f'$ lies inside the local region on which the Hessian bound in (A4) is valid. Th...
- [56] Bounding the population sharpness by curvature: Lemma 16 (Population sharpness bound under (A5′)). Assume $\sup_{\|v\| \le \rho} \|\nabla^2 L_P(f_T + v)\|_{\mathrm{op}} \le \tau_{\mathrm{op}}$. Then $\mathbb{E}[S_P(f_T)] \le \frac{1}{2}\tau_{\mathrm{op}}\rho^2$. Proof. Recall $S_P(f_T) = \mathbb{E}_U\!\left[L_P(f_T + U) - L_P(f_T)\right]$. By Taylor’s theorem with integral remainder, $L_P(f_T + u) - L_P(f_T) = \langle \nabla L_P(f_T), u \rangle + \int_0^1 (1 - t)\, u^\top \nabla^2 L_P(f_T + t u)\, u \, dt$. Taking expectation ov...
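- [57] Baseline and sharpness-aware bounds: Define $B_{\mathrm{std}} := \mathrm{gen}_T + \sigma_0\sqrt{2K_n}$ and $B_{\mathrm{sh}}(\rho) := \mathrm{gen}_T + \mathbb{E}[S_P(f_T)] + (\sigma_u(\rho) + \nu(\rho))\sqrt{2K_n}$. We seek conditions under which $B_{\mathrm{sh}}(\rho) < B_{\mathrm{std}}$.
- [58] Proof. A sufficient condition for $B_{\mathrm{sh}}(\rho) < B_{\mathrm{std}}$ is $\mathbb{E}[S_P(f_T)] < \big(\sigma_0 - \sigma_u(\rho) - \nu(\rho)\big)\sqrt{2K_n}$. By Lemma 16, $\mathbb{E}[S_P(f_T)] \le \frac{1}{2}\tau_{\mathrm{op}}\rho^2$. Moreover, using the standard proxy $\sigma_0 = \frac{1}{\sqrt{n}}\left(\kappa L + \frac{b - a}{2}\right)$, the local proxy $\sigma_u(\rho) \le \frac{1}{\sqrt{n}}\left(\kappa(g_0 + \alpha\rho) + \frac{b - a}{2}\right)$, and Lemma 15, $\nu(\rho) = \frac{1}{\sqrt{n}}\left(\frac{\alpha\kappa}{2}\rho + g_0\rho + \frac{\alpha}{2}\rho^2\right)$, we obtain the lower bound $\sigma_0 - \sigma_u(\rho) - \nu(\rho) \ge \frac{1}{\sqrt{n}}\left(A_0 - A_1\rho - A_2\rho^2\right)$, where $A_0 := \kappa($...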
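The comparison between the baseline and sharpness-aware bounds can be explored numerically with the proxies quoted in these excerpts. The sketch below is hedged in two ways: the grouping of terms in $\nu(\rho)$ is read from a garbled extraction, and $\mathbb{E}[S_P(f_T)]$ is replaced by its Lemma 16 ceiling $\tfrac{1}{2}\tau_{\mathrm{op}}\rho^2$, so the second returned value is an upper estimate of $B_{\mathrm{sh}}(\rho)$. All function names and numeric inputs are hypothetical.

```python
import math


def sigma0(n, kappa, lip, a, b):
    """Baseline proxy: (kappa * L + (b - a) / 2) / sqrt(n)."""
    return (kappa * lip + (b - a) / 2.0) / math.sqrt(n)


def sigma_u(n, kappa, g0, alpha, rho, a, b):
    """Local proxy: (kappa * (g0 + alpha * rho) + (b - a) / 2) / sqrt(n)."""
    return (kappa * (g0 + alpha * rho) + (b - a) / 2.0) / math.sqrt(n)


def nu_proxy(n, kappa, g0, alpha, rho):
    """Sharpness proxy as read from the excerpt (term grouping is an assumption)."""
    return (alpha * kappa * rho / 2.0 + g0 * rho + alpha * rho**2 / 2.0) / math.sqrt(n)


def compare_bounds(gen_t, k_n, tau_op, n, kappa, lip, g0, alpha, rho, a=0.0, b=1.0):
    """Return (B_std, upper estimate of B_sh(rho))."""
    root = math.sqrt(2.0 * k_n)
    b_std = gen_t + sigma0(n, kappa, lip, a, b) * root
    b_sh = (
        gen_t
        + 0.5 * tau_op * rho**2
        + (sigma_u(n, kappa, g0, alpha, rho, a, b) + nu_proxy(n, kappa, g0, alpha, rho)) * root
    )
    return b_std, b_sh


# Hypothetical setting: local gradient scale g0 much smaller than the global Lipschitz L.
b_std, b_sh = compare_bounds(
    gen_t=0.1, k_n=0.5, tau_op=1.0, n=10_000,
    kappa=1.0, lip=5.0, g0=0.5, alpha=1.0, rho=0.1,
)
print(b_std, b_sh, b_sh < b_std)
```

In this setting the sharpness-aware value is smaller because the gap $\kappa(L - g_0)$ between global and local Lipschitz scales outweighs the added curvature term; shrinking that gap or enlarging $\rho$ reverses the ordering.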