
arxiv: 2605.13143 · v2 · pith:U3E6Y3MTnew · submitted 2026-05-13 · 💻 cs.IT · cs.LG · math.IT

On the Generalization of Knowledge Distillation: An Information-Theoretic View

Pith reviewed 2026-05-19 17:58 UTC · model grok-4.3

keywords knowledge distillation · generalization bounds · Kullback-Leibler divergence · stochastic processes · algorithmic stability · loss sharpness · information theory

The pith

Modeling teacher and student training as coupled processes yields generalization bounds via their KL divergence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper models teacher and student training as coupled stochastic processes and defines a distillation divergence as the Kullback-Leibler divergence between their stochastic kernels. Using this measure, it derives an upper bound on the student's generalization error in terms of the teacher's gap under a sub-Gaussian assumption via algorithmic stability, along with a lower bound under a central condition that depends more directly on the divergence. The work also presents a loss-sharpness-aware bound showing that local flatness in the teacher can tighten the result, and decomposes the divergence explicitly in a linear Gaussian setting into bias, variance, and rank-bottleneck components.

Core claim

By treating teacher and student training as coupled stochastic processes, the authors introduce the distillation divergence as the Kullback-Leibler divergence between the corresponding stochastic kernels. This quantity allows derivation of an upper generalization bound for the student relative to the teacher's gap under sub-Gaussian assumptions through algorithmic stability, and a lower bound under a central condition with sharper dependence on the divergence. A loss-sharpness-aware refinement shows that the teacher's local flatness strictly improves the bound, while a linear Gaussian case study decomposes the divergence into interpretable bias, variance, and rank-bottleneck costs.

What carries the argument

The distillation divergence, defined as the Kullback-Leibler divergence between the stochastic kernels of the teacher and student training processes, quantifies the difference between the two processes and transfers generalization properties from teacher to student.

If this is right

  • If the distillation divergence remains small, the student's generalization gap stays close to the teacher's gap.
  • The lower bound implies that large distillation divergence prevents the student from generalizing much better than the teacher.
  • Incorporating the teacher's loss sharpness yields a strictly tighter bound when the teacher is locally flat.
  • In linear Gaussian models the divergence breaks into bias, variance, and rank costs that can guide choices such as model architecture or training rank.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Minimizing the distillation divergence during training could serve as a practical objective that improves distillation results beyond standard soft-label losses.
  • The coupled-process view may extend to other transfer settings by defining analogous divergences between source and target training kernels.
  • Testing the bounds on nonlinear networks would reveal whether the linear Gaussian decomposition offers useful design rules outside the analyzed case.

Load-bearing premise

Teacher and student training can be modeled as coupled stochastic processes whose kernels admit a well-defined KL divergence that can be bounded or decomposed in the stated ways.

What would settle it

An experiment or calculation showing that the student's generalization error exceeds the derived upper bound even when the distillation divergence is small and the sub-Gaussian assumption holds.

Original abstract

Knowledge distillation is widely used to improve generalization in practice, yet its theoretical understanding remains elusive. In the standard distillation setting, a teacher model provides soft predictions to guide the training of a student model. We model teacher and student training as coupled stochastic processes and introduce a distillation divergence, defined as the Kullback-Leibler divergence between these two stochastic kernels. Within this framework, we derive two generalization bounds for the student model relative to the teacher's generalization gap: an upper bound under a sub-Gaussian assumption via algorithmic stability, and a lower bound under a central condition with sharper dependence on the distillation divergence. We further develop a loss-sharpness-aware bound with an explicit tightness regime, showing that the teacher's local flatness can strictly tighten the bound. Additionally, in a linear Gaussian case study, the distillation divergence admits an interpretable decomposition into bias, variance, and rank-bottleneck costs, yielding practical guidance for distillation design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper models teacher and student training in knowledge distillation as coupled stochastic processes and defines a distillation divergence as the KL divergence between their stochastic kernels. It derives an upper bound on the student's generalization gap (relative to the teacher's) under a sub-Gaussian assumption via algorithmic stability, a lower bound under a central condition with sharper dependence on the divergence, a loss-sharpness-aware bound with an explicit tightness regime for the teacher's local flatness, and an interpretable bias-variance-rank decomposition of the divergence in a linear-Gaussian case study.

Significance. If the coupled-process modeling and bounds hold with the stated assumptions, the work provides a useful information-theoretic lens on distillation that explicitly ties the divergence to generalization gaps and offers practical design guidance via the linear-case decomposition. The combination of stability-based upper bounds, central-condition lower bounds, and the sharpness tightness regime is a constructive contribution to KD theory.

major comments (3)
  1. [Framework section] The central construction defines the distillation divergence via KL between stochastic kernels of coupled teacher-student processes, which presupposes a joint law over simultaneous training dynamics. This modeling choice is load-bearing for both the upper and lower bounds, yet standard KD practice fixes a pre-trained teacher and trains the student independently; the manuscript should clarify whether the bounds extend to the sequential regime or require additional approximation arguments.
  2. [Generalization bounds] Upper bound via algorithmic stability: The sub-Gaussian assumption and stability parameter are invoked to relate the student gap to the teacher gap with explicit dependence on the distillation divergence. The paper should verify or discuss how these assumptions are satisfied for typical neural-network losses in distillation, as violation would weaken the claimed explicit dependence.
  3. [Linear-Gaussian case study] The bias-variance-rank decomposition of the distillation divergence is presented as yielding practical guidance. The derivation should be shown to follow directly from the KL definition of the divergence without post-hoc parameter fitting, and the regime of validity (e.g., when the rank-bottleneck term dominates) should be stated explicitly.
minor comments (2)
  1. [Abstract] The phrase 'sharper dependence on the distillation divergence' for the lower bound could be made more precise by indicating the functional form of the dependence if space permits.
  2. [Notation and definitions] Ensure consistent use of symbols for the stochastic kernels and the distillation divergence across the framework and bound derivations to avoid reader confusion.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback and positive assessment of the paper's significance. We address each major comment below and plan to incorporate revisions to clarify the modeling choices and strengthen the discussions.

Point-by-point responses
  1. Referee: [Framework section] The central construction defines the distillation divergence via KL between stochastic kernels of coupled teacher-student processes, which presupposes a joint law over simultaneous training dynamics. This modeling choice is load-bearing for both the upper and lower bounds, yet standard KD practice fixes a pre-trained teacher and trains the student independently; the manuscript should clarify whether the bounds extend to the sequential regime or require additional approximation arguments.

    Authors: We agree that clarifying this point is important. Our framework models the processes as coupled to rigorously define the distillation divergence and derive the bounds. In the standard sequential setting, the pre-trained teacher can be viewed as having a fixed stochastic kernel, and the joint law can be approximated by the product of the teacher's converged distribution and the student's training process. We will add a paragraph in the Framework section discussing this approximation and how the bounds apply with minor modifications. revision: yes

  2. Referee: [Generalization bounds] Upper bound via algorithmic stability: The sub-Gaussian assumption and stability parameter are invoked to relate the student gap to the teacher gap with explicit dependence on the distillation divergence. The paper should verify or discuss how these assumptions are satisfied for typical neural-network losses in distillation, as violation would weaken the claimed explicit dependence.

    Authors: The sub-Gaussian assumption is a common technical condition in algorithmic stability analyses and is satisfied for losses with bounded range or sub-Gaussian tails, which can be ensured in distillation by using temperature scaling to control the softness of the predictions. We will include a discussion in the relevant section on how this assumption holds for typical KD losses like the KL divergence between teacher and student outputs, assuming bounded logits or appropriate regularization. revision: yes

  3. Referee: [Linear-Gaussian case study] The bias-variance-rank decomposition of the distillation divergence is presented as yielding practical guidance. The derivation should be shown to follow directly from the KL definition of the divergence without post-hoc parameter fitting, and the regime of validity (e.g., when the rank-bottleneck term dominates) should be stated explicitly.

    Authors: The decomposition is obtained directly by applying the closed-form KL divergence formula for multivariate Gaussians to the stochastic kernels defined in our framework, resulting in separate terms for bias (mean shift), variance (covariance mismatch), and rank (dimensionality reduction effect). There is no post-hoc fitting involved. We will explicitly delineate the validity regime in the case study, specifying that the rank-bottleneck dominates in low-rank student models or when the teacher's feature space has higher effective rank. revision: yes
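The closed-form Gaussian KL that this response refers to can be sketched as follows. This is only the generic textbook formula, split into a mean-shift part and a covariance-mismatch part; the function name `gaussian_kl_terms` and the toy inputs are illustrative assumptions, and the rank-bottleneck term of the paper's decomposition is not reproduced here.

```python
import numpy as np

def gaussian_kl_terms(m0, S0, m1, S1):
    """Split KL(N(m0, S0) || N(m1, S1)) into a mean part and a covariance part.

    Generic closed form for non-degenerate Gaussians; the names are illustrative.
    """
    d = m0.shape[0]
    S1_inv = np.linalg.inv(S1)
    diff = m1 - m0
    # Mean-shift contribution (a "bias"-like term).
    mean_term = 0.5 * diff @ S1_inv @ diff
    # Covariance-mismatch contribution (a "variance"-like term).
    _, logdet0 = np.linalg.slogdet(S0)
    _, logdet1 = np.linalg.slogdet(S1)
    cov_term = 0.5 * (np.trace(S1_inv @ S0) - d + logdet1 - logdet0)
    return float(mean_term), float(cov_term)

# Toy inputs (assumed values, not taken from the paper).
m0, S0 = np.array([0.0, 0.0]), np.eye(2)
m1, S1 = np.array([1.0, 0.0]), 2.0 * np.eye(2)
mean_term, cov_term = gaussian_kl_terms(m0, S0, m1, S1)
total_kl = mean_term + cov_term
```

When the two covariances coincide the second term vanishes and only the mean shift remains, which is the equal-covariance special case the appendix also relies on.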

Circularity Check

0 steps flagged

Framework and bounds are self-contained; no reduction to inputs by construction

Full rationale

The paper defines a distillation divergence via KL between kernels of coupled teacher-student stochastic processes, then derives generalization bounds (upper via sub-Gaussian + stability; lower via central condition) that explicitly depend on this quantity and the teacher's gap. This is a standard modeling choice followed by derivation under stated assumptions, not a self-definitional loop or fitted input renamed as prediction. The linear-Gaussian decomposition is an analysis of the defined divergence rather than a tautology. No load-bearing self-citations, uniqueness theorems, or ansatzes smuggled via prior work are indicated. The derivation chain remains independent of the target results.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The framework rests on modeling teacher and student as coupled stochastic processes and on two domain assumptions for the bounds; the divergence itself is an invented quantity whose properties are derived rather than measured.

axioms (2)
  • domain assumption: sub-Gaussian assumption on the loss for the algorithmic-stability upper bound
    Invoked to obtain the upper bound via stability arguments.
  • domain assumption: central condition for the lower bound
    Required to obtain sharper dependence on the distillation divergence.
invented entities (1)
  • distillation divergence (no independent evidence)
    Purpose: quantify the difference between teacher and student stochastic kernels via KL divergence.
    Defined as the KL between the two kernels; used as the central quantity in all bounds.

pith-pipeline@v0.9.0 · 5687 in / 1479 out tokens · 35082 ms · 2026-05-19T17:58:38.186019+00:00 · methodology



V. APPENDIX

A. Preliminary Tools

Donsker–Varadhan change of measure inequality:

Lemma 6 (Donsker–Varadhan inequality). Let $P, Q$ be probability measures on the same measurable space with $Q \ll P$. For any measurable function $g$ with $\mathbb{E}_P[e^{g}] < \infty$,

$$\mathbb{E}_Q[g] \le \mathrm{KL}(Q \,\|\, P) + \log \mathbb{E}_P[e^{g}]. \tag{10}$$

Proof. Define the Radon–Nikodym derivative $r := \frac{dQ}{dP}$. Then $\mathbb{E}_P[r] = 1$ and $\mathrm{KL}(Q \,\|\, P) = \mathbb{E}_Q\!\left[\log \frac{dQ}{dP}\right] = \mathbb{E}_P[r \log r]$. By Jensen's inequa...

A standard sub-Gaussian mgf bound:

Lemma 7 (Sub-Gaussian mgf bound). Let $X$ be $\sigma^2$-sub-Gaussian, meaning $\log \mathbb{E}[\exp(\lambda(X - \mathbb{E}[X]))] \le \frac{\lambda^2 \sigma^2}{2}$ for all $\lambda \in \mathbb{R}$. Then for any $\lambda \in \mathbb{R}$,

$$\log \mathbb{E}[e^{\lambda X}] \le \lambda\, \mathbb{E}[X] + \frac{\lambda^2 \sigma^2}{2}. \tag{11}$$

Proof. By definition, $\mathbb{E}[e^{\lambda X}] = \mathbb{E}\!\left[e^{\lambda (X - \mathbb{E}[X])}\right] \cdot e^{\lambda \mathbb{E}[X]}$. Taking logs and applying the sub-Gaussian condition gives (11).

B. Proof of Theorem 1

Statement: Theorem 8 (Distillation generalization upper bound). Assume that $h(D_n, f_T)$ is $\sigma^2$-sub-Gaussian under $\Delta_{D_n, f_T}$. Then

$$\mathrm{gen}_S \le \mathrm{gen}_T + \sigma \sqrt{2 K_n}. \tag{12}$$
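A toy numeric reading of the upper bound may help. The sketch below assumes, purely for illustration, that the gap $h$ is Gaussian with the same variance $\sigma^2$ under both the teacher and the student process, so that the KL between the two laws of $h$ has a closed form; the function names and numbers are hypothetical and not taken from the paper.

```python
import math

def upper_bound(gen_t, sigma, k_n):
    """Right-hand side of the bound: gen_T + sigma * sqrt(2 * K_n)."""
    return gen_t + sigma * math.sqrt(2.0 * k_n)

def kl_equal_variance_gaussians(mu_q, mu_p, sigma):
    """KL(N(mu_q, sigma^2) || N(mu_p, sigma^2)) for a shared variance."""
    return (mu_q - mu_p) ** 2 / (2.0 * sigma ** 2)

# Assumed toy values: teacher gap 0.10, student gap 0.25, proxy sigma 0.5.
gen_t, gen_s, sigma = 0.10, 0.25, 0.5
k_n = kl_equal_variance_gaussians(gen_s, gen_t, sigma)
bound = upper_bound(gen_t, sigma, k_n)
```

In this special case the bound evaluates to `gen_t + abs(gen_s - gen_t)`, so it is attained with equality when the student gap exceeds the teacher gap, which suggests the $\sqrt{K_n}$ dependence cannot be improved without further assumptions in this toy setting.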

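Proof. Fix any $\lambda > 0$ and choose in Lemma 6

$$P = \Delta_{D_n, f_T}, \qquad Q = \Delta_{\hat D_n, f_S}, \qquad g(d, f) = \lambda\, h(d, f).$$

Then (10) gives

$$\mathbb{E}_{\Delta_{\hat D_n, f_S}}\!\left[\lambda h(\hat D_n, f_S)\right] \le \mathrm{KL}\!\left(\Delta_{\hat D_n, f_S} \,\|\, \Delta_{D_n, f_T}\right) + \log \mathbb{E}_{\Delta_{D_n, f_T}}\!\left[e^{\lambda h(D_n, f_T)}\right].$$

By definitions, the left side equals $\lambda\, \mathrm{gen}_S$ and the KL term equals $K_n$, so

$$\lambda\, \mathrm{gen}_S \le K_n + \log \mathbb{E}_{\Delta_{D_n, f_T}}\!\left[e^{\lambda h(D_n, f_T)}\right]. \tag{13}$$

Since $h(D_n, f_T)$ is $\sigma^2$-sub-Gaussian with mean $\mathrm{gen}_T$, L...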
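The excerpt stops before the last step of the proof. One standard way to finish, offered as a sketch under the stated sub-Gaussian assumption rather than as the paper's own wording, is to insert the mgf bound (11) into (13) and optimize over $\lambda > 0$ (assuming $K_n > 0$; the case $K_n = 0$ follows by letting $\lambda \to 0$):

```latex
\lambda\,\mathrm{gen}_S \;\le\; K_n + \lambda\,\mathrm{gen}_T + \frac{\lambda^2\sigma^2}{2}
\quad\Longrightarrow\quad
\mathrm{gen}_S \;\le\; \mathrm{gen}_T + \frac{K_n}{\lambda} + \frac{\lambda\sigma^2}{2},
\\[4pt]
\lambda^\star = \frac{\sqrt{2K_n}}{\sigma}
\quad\Longrightarrow\quad
\frac{K_n}{\lambda^\star} + \frac{\lambda^\star\sigma^2}{2}
= \sigma\sqrt{\tfrac{K_n}{2}} + \sigma\sqrt{\tfrac{K_n}{2}}
= \sigma\sqrt{2K_n}.
```

This recovers the form of (12).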

Statement: Proposition 9 (Sub-Gaussianity via stability). Assume the loss is bounded: $\ell(f, z) \in [a, b]$ for all $f, z$. Assume the teacher algorithm $A_T$ is $\beta$-uniformly stable: for any neighboring datasets $D_n$ and $D_n^{(i)}$ differing in one example,

$$\sup_{z \in \mathcal{Z}} \left| \ell(A_T(D_n), z) - \ell(A_T(D_n^{(i)}), z) \right| \le \beta.$$

Let $H(D_n) := L_P(A_T(D_n)) - L_{D_n}(A_T(D_n))$. Then $H(D_n)$ satisfies bounded differen...

Proof. Let $D_n = (z_1, \ldots, z_n)$ and let $D_n^{(i)} = (z_1, \ldots, z_{i-1}, z_i', z_{i+1}, \ldots, z_n)$ be a neighboring dataset. Define $f := A_T(D_n)$ and $f' := A_T(D_n^{(i)})$. Then

$$H(D_n) = \mathbb{E}_{Z \sim P}[\ell(f, Z)] - \frac{1}{n} \sum_{j=1}^{n} \ell(f, z_j), \qquad H(D_n^{(i)}) = \mathbb{E}_{Z \sim P}[\ell(f', Z)] - \frac{1}{n} \sum_{j=1}^{n} \ell(f', z_j^{(i)}),$$

where $z_j^{(i)} = z_j$ for $j \ne i$ and $z_i^{(i)} = z_i'$. By uniform stability and taking expecta...

Central condition definition:

Definition 2 ($(\eta, c)$-central condition). A random variable $X$ satisfies the $(\eta, c)$-central condition under $P$ if $\eta > 0$, $0 < c \le 1$, and

$$\log \mathbb{E}_P[e^{-\eta X}] \le -c\,\eta\, \mathbb{E}_P[X]. \tag{22}$$

Statement: Theorem 10 (Distillation generalization lower bound). Assume $h(D_n, f_T)$ satisfies the $(\eta, c)$-central condition under $\Delta_{D_n, f_T}$. Then

$$\mathrm{gen}_S \ge c \cdot \mathrm{gen}_T - \frac{1}{\eta} K_n. \tag{23}$$

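Proof. Apply Lemma 6 with

$$P = \Delta_{D_n, f_T}, \qquad Q = \Delta_{\hat D_n, f_S}, \qquad g(d, f) = -\eta\, h(d, f).$$

Then

$$\mathbb{E}_{\Delta_{\hat D_n, f_S}}\!\left[-\eta\, h(\hat D_n, f_S)\right] \le K_n + \log \mathbb{E}_{\Delta_{D_n, f_T}}\!\left[e^{-\eta h(D_n, f_T)}\right]. \tag{24}$$

By assumption, $h(D_n, f_T)$ satisfies (22) with $X = h(D_n, f_T)$ and $P = \Delta_{D_n, f_T}$:

$$\log \mathbb{E}_{\Delta_{D_n, f_T}}\!\left[e^{-\eta h(D_n, f_T)}\right] \le -c\,\eta\, \mathbb{E}_{\Delta_{D_n, f_T}}[h(D_n, f_T)] = -c\,\eta\, \mathrm{gen}_T. \tag{25}$$

Substitute (25) into (24): $-\eta\, \mathbb{E}_{\Delta_{\hat D_n, f_S}}[h(\hat D_n, f_S)] \le K_n$ ...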
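The excerpt is cut off during the substitution. A plausible completion, given here as a sketch that follows directly from (24) and (25) and not as a quotation of the paper, is:

```latex
-\eta\,\mathrm{gen}_S \;\le\; K_n - c\,\eta\,\mathrm{gen}_T
\quad\Longrightarrow\quad
\mathrm{gen}_S \;\ge\; c\,\mathrm{gen}_T - \frac{1}{\eta}\,K_n ,
```

where the second step divides by $-\eta < 0$ and therefore flips the inequality. The divergence enters linearly here, compared with the $\sqrt{K_n}$ dependence in the upper bound (12).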

Sub-Gaussianity implies a central condition for small $\eta$:

Remark 2 (Deriving a valid $(\eta, c)$ from sub-Gaussianity). Assume $X$ is $\sigma^2$-sub-Gaussian with mean $\mu = \mathbb{E}[X] > 0$. Then for any $\eta > 0$,

$$\log \mathbb{E}[e^{-\eta X}] \le -\eta \mu + \frac{\eta^2 \sigma^2}{2} = -\eta \mu \left(1 - \frac{\eta \sigma^2}{2\mu}\right).$$

Thus $X$ satisfies the $(\eta, c)$-central condition with $c \le 1 - \frac{\eta \sigma^2}{2\mu}$, provided that $0 < \eta < \frac{2\mu}{\sigma^2}$.

E. Linear Gaussian Case Study (Detailed KL Decomposition)

Matrix normal definition and basic identities:

Definition 3 (Matrix normal distribution). A random matrix $A \in \mathbb{R}^{k \times n}$ follows a matrix normal distribution $A \sim \mathcal{MN}(M, U, V)$ if $\mathrm{vec}(A) \sim \mathcal{N}(\mathrm{vec}(M), V \otimes U)$, where $M \in \mathbb{R}^{k \times n}$, $U \in \mathbb{R}^{k \times k}$, $V \in \mathbb{R}^{n \times n}$.

We use the vectorization identity $\mathrm{vec}(WX) = (X^\top \otimes I_k)\, \mathrm{vec}(W)$, valid for $W \in \mathbb{R}^{k \times d}$ and $X \in \mathbb{R}^{d \times n}$. For Gaussians with the same covariance, we use...
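The vectorization identity is easy to check numerically. The sketch below assumes the usual column-stacking convention for `vec`; the helper name and the random test matrices are illustrative only.

```python
import numpy as np

def vec(a):
    """Column-stacking vectorization (Fortran order)."""
    return a.reshape(-1, order="F")

rng = np.random.default_rng(0)
k, d, n = 3, 4, 5
W = rng.standard_normal((k, d))
X = rng.standard_normal((d, n))

# vec(W X) on the left, (X^T kron I_k) vec(W) on the right.
lhs = vec(W @ X)
rhs = np.kron(X.T, np.eye(k)) @ vec(W)
identity_holds = bool(np.allclose(lhs, rhs))
```

With row-major flattening (NumPy's default) the Kronecker factors would appear in the opposite order, so the `order="F"` argument matters.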

Generative model: Collect features and labels column-wise into $X \in \mathbb{R}^{d \times n}$ and $Y \in \mathbb{R}^{k \times n}$ so that $D_n = (X, Y)$. Assume a noisy linear label channel

$$Y \mid X \sim \mathcal{MN}(W^\star X, I_k, \nu^2 I_n),$$

where $W^\star \in \mathbb{R}^{k \times d}$ is the ground-truth linear map and $\nu > 0$ is the noise level.

Teacher as a Gibbs learner (closed form): Let the teacher parameter be $W \in \mathbb{R}^{k \times d}$ with prior $p_0(W) = \mathcal{MN}(0, I_k, \lambda^{-1} I_d)$. Given $D_n = (X, Y)$, define the Gibbs posterior with inverse temperature $\beta_T$:

$$q_T(W \mid D_n) \propto p_0(W) \exp\!\left(-\frac{\beta_T}{2\nu^2} \|Y - WX\|_F^2\right).$$

Lemma 11 (Closed form of $q_T$). The posterior is matrix normal:

$$q_T(W \mid D_n) = \mathcal{MN}(\bar W_T, I_k, \Sigma_T), \qquad \Sigma_T = \left(\lambda I_d + \frac{\beta_T}{\nu^2} XX^\top\right)^{-1}, \ \ldots$$

The prior implies $w \sim \mathcal{N}(0, \lambda^{-1} I_{kd})$. Hence the (unnormalized) log density is quadratic in $w$ with precision

$$\lambda I_{kd} + \frac{\beta_T}{\nu^2} B_X^\top B_X = \lambda I_{kd} + \frac{\beta_T}{\nu^2} (XX^\top \otimes I_k),$$

so the covariance is $\left(\lambda I_d + \frac{\beta_T}{\nu^2} XX^\top\right)^{-1} \otimes I_k = \Sigma_T \otimes I_k$. The mean is the corresponding linear term mapped back to matrix form, yielding

$$\bar W_T = \frac{\beta_T}{\nu^2}\, Y X^\top \Sigma_T.$$
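The closed form for the teacher posterior can be sketched numerically. The function name `teacher_posterior`, the dimensions, and the synthetic data below are assumptions made for illustration; the check at the end confirms that the stated mean is a stationary point of the negative log-density $\frac{\lambda}{2}\|W\|_F^2 + \frac{\beta_T}{2\nu^2}\|Y - WX\|_F^2$.

```python
import numpy as np

def teacher_posterior(X, Y, lam, beta_t, nu):
    """Return (W_bar, Sigma_T) for the matrix-normal Gibbs posterior."""
    d = X.shape[0]
    precision = lam * np.eye(d) + (beta_t / nu**2) * (X @ X.T)
    sigma_t = np.linalg.inv(precision)
    w_bar = (beta_t / nu**2) * (Y @ X.T @ sigma_t)
    return w_bar, sigma_t

# Synthetic data from the noisy linear channel (all values are assumed).
rng = np.random.default_rng(0)
k, d, n = 2, 3, 50
lam, beta_t, nu = 1.0, 1.0, 0.5
W_star = rng.standard_normal((k, d))
X = rng.standard_normal((d, n))
Y = W_star @ X + nu * rng.standard_normal((k, n))

w_bar, sigma_t = teacher_posterior(X, Y, lam, beta_t, nu)

# Gradient of the negative log-density at W_bar; it should vanish.
grad_at_mean = lam * w_bar - (beta_t / nu**2) * (Y - w_bar @ X) @ X.T
```

For `beta_t = 1` the mean coincides with a ridge-regression estimate with penalty `lam * nu**2`, which is one way to read the role of the prior precision.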

Pseudo-data generation: Sample $W_T \sim q_T(\cdot \mid D_n)$ and generate pseudo labels through the same noisy channel:

$$\hat Y \mid (W_T, X) \sim \mathcal{MN}(W_T X, I_k, \nu^2 I_n),$$

then define $\hat D_n = (X, \hat Y)$.

Student capacity constraint via a rank bottleneck: Let the student parameter be $\Theta \in \mathbb{R}^{k \times d}$. Introduce the rank-$\kappa$ map $M^\star(W_T, X)$ as a best rank-$\kappa$ approximation in prediction space:

$$M^\star(W_T, X) \in \arg\min_{M:\ \mathrm{rank}(M) = \kappa} \|W_T X - W_T M X\|_F^2.$$

Define a local Gaussian student conditional kernel

$$q_S(\Theta \mid W_T, X) = \mathcal{MN}(W_T M^\star(W_T, X), I_k, \Sigma_S), \qquad \Sigma_S \succ 0.$$

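Process-level KL and its two terms: Define the distillation divergence $K_n := \mathrm{KL}(\Delta_{\hat D_n, f_S} \,\|\, \Delta_{D_n, f_T})$. Using the KL chain rule on the dataset-model pair,

$$K_n = \underbrace{\mathrm{KL}(\Delta_{\hat D_n} \,\|\, \Delta_{D_n})}_{\text{Dataset shift}} + \underbrace{\mathbb{E}_{\hat D_n}\!\left[\mathrm{KL}(\Delta_{f_S \mid \hat D_n} \,\|\, \Delta_{f_T \mid D_n})\right]}_{\text{Algorithm shift}}. \tag{27}$$

We now bound the two terms.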

Dataset shift bound and bias-variance decomposition:

a) Step 1. Condition on $(X, D_n)$ and use convexity of KL: Given $X$ and $D_n$, $\hat Y \mid (X, D_n)$ is a mixture over $W_T \mid D_n$:

$$\hat Y \mid (X, D_n) \sim \int q_T(W_T \mid D_n)\, \mathcal{MN}(W_T X, I_k, \nu^2 I_n)\, dW_T.$$

The real label law (given $X$) is $\mathcal{MN}(W^\star X, I_k, \nu^2 I_n)$. By convexity of KL in its first argument, $\mathrm{KL}(\hat Y \mid X, D_n \,\|\, Y \mid X) \le \mathbb{E}_{W_T \mid D_n}\big[\mathrm{KL}\big(\mathcal{MN}(W_T X, I_k, \nu \ldots$

Algorithm shift bound and the rank-bottleneck decomposition: Define the (expected) algorithm-shift term

$$\mathrm{KL}_{\mathrm{alg}} := \mathbb{E}_{\hat D_n}\!\left[\mathrm{KL}(\Delta_{f_S \mid \hat D_n} \,\|\, \Delta_{f_T \mid D_n})\right].$$

Since the student kernel is conditionally Gaussian given latent $W_T$ (and $X$), $q_S(\Theta \mid \hat D_n)$ is generally a mixture over $W_T$. By convexity of KL in the first argument, conditioning and then averaging yields the reducti...

Final compact decomposition of $K_n$: Combine (27), (33), and (39):

$$K_n \le \mathbb{E}_{D_n}\Big[\underbrace{\tfrac{1}{2\nu^2}\big(\mathrm{Bias}(D_n) + k\,\mathrm{Var}(D_n)\big)}_{\text{Teacher prediction error}} + \underbrace{\mathrm{Apx}(D_n)}_{\text{Student capacity / rank bottleneck}} + \underbrace{\mathrm{Cov}(\Sigma_S, \Sigma_T)}_{\text{Geometry mismatch}} + \underbrace{\mathrm{Spread}(D_n)}_{\text{Posterior spread}}\Big].$$

This yields an interpretable checklist: improve teacher bias/variance to tighten dataset shif...

Definitions: Let $U$ be a random perturbation, independent of all other randomness, uniformly distributed on the Euclidean ball $\{u : \|u\| \le \rho\}$. For any dataset $D$ and model $f$, define empirical and population sharpness:

$$S_D(f) := \mathbb{E}_U[L_D(f + U)] - L_D(f), \tag{40}$$

$$S_P(f) := \mathbb{E}_U[L_P(f + U)] - L_P(f).$$

Define the perturbed generalization gap

$$h_U(D, f) := \mathbb{E}_U[h(D, f + U)] = \mathbb{E}_U[L_P(f + U) - L_D(f + U)]. \tag{41}$$
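The sharpness in (40) can be estimated by Monte Carlo. The sketch below uses a hypothetical quadratic loss $L(f) = \tfrac12 f^\top H f$ evaluated at its minimizer, which is an assumption chosen so that a closed form, $\tfrac{\rho^2}{2(d+2)}\,\mathrm{tr}(H)$, is available for comparison; none of the names or values come from the paper.

```python
import numpy as np

def sample_uniform_ball(rng, num, dim, rho):
    """Draw `num` points uniformly from the Euclidean ball of radius rho."""
    directions = rng.standard_normal((num, dim))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    radii = rho * rng.random(num) ** (1.0 / dim)
    return directions * radii[:, None]

def quadratic_loss(F, H):
    """Hypothetical loss 0.5 * f^T H f, evaluated on each row of F."""
    return 0.5 * np.einsum("ni,ij,nj->n", F, H, F)

def sharpness(f, H, rho, rng, num=200_000):
    """Monte Carlo estimate of E_U[L(f + U)] - L(f)."""
    U = sample_uniform_ball(rng, num, f.shape[0], rho)
    return float(quadratic_loss(f + U, H).mean() - quadratic_loss(f[None, :], H)[0])

rng = np.random.default_rng(0)
H = np.diag([1.0, 2.0, 3.0])   # assumed Hessian, largest eigenvalue 3
f = np.zeros(3)                # the minimizer of the quadratic
rho = 0.1

estimate = sharpness(f, H, rho, rng)
closed_form = 0.5 * np.trace(H) * rho**2 / (f.shape[0] + 2)
curvature_bound = 0.5 * 3.0 * rho**2   # (1/2) * tau_op * rho^2
```

The estimate sits well below the curvature bound $\tfrac12 \tau_{\mathrm{op}} \rho^2$ of Lemma 16 because a uniform perturbation spreads over all directions, not only the sharpest one.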

Statement: Theorem 12 (Sharpness-aware distillation generalization bound). Assume:

  • (i) (Local optimality in expectation) Under $\Delta_{\hat D_n, f_S}$,
$$\mathbb{E}[L_P(f_S)] \le \mathbb{E}\big[\mathbb{E}_U[L_P(f_S + U)]\big]. \tag{42}$$
  • (ii) Under the teacher process $\Delta_{D_n, f_T}$, both $h_U(D_n, f_T)$ and $S_{D_n}(f_T)$ are sub-Gaussian with proxies $\sigma_u^2$ and $\nu^2$, respectively.

Then

$$\mathrm{gen}_S \le \mathrm{gen}_T + \mathbb{E}_{\Delta_{D_n, f_T}}[S_P(f_T)] + (\sigma_u + \nu) \sqrt{2 K_n}. \tag{43}$$

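Proof. By definition, $\mathrm{gen}_S = \mathbb{E}_{\Delta_{\hat D_n, f_S}}\big[L_P(f_S) - L_{\hat D_n}(f_S)\big]$. Using (42),

$$\mathrm{gen}_S \le \mathbb{E}_{\Delta_{\hat D_n, f_S}}\big[\mathbb{E}_U[L_P(f_S + U)] - L_{\hat D_n}(f_S)\big].$$

Add and subtract $\mathbb{E}_U[L_{\hat D_n}(f_S + U)]$ inside the expectation:

$$\mathrm{gen}_S \le \mathbb{E}_{\Delta_{\hat D_n, f_S}}\big[\mathbb{E}_U\big(L_P(f_S + U) - L_{\hat D_n}(f_S + U)\big)\big] + \mathbb{E}_{\Delta_{\hat D_n, f_S}}\big[\mathbb{E}_U\big(L_{\hat D_n}(f_S + U)\big) - L_{\hat D_n}(f_S)\big].$$

Recognize the two terms using (41) and (40): $\mathrm{gen}_S \le \mathbb{E}_{\Delta_{\hat D_n, f_S}}\big[h_U(\hat D_n, f_S)\big] + \ldots$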

Assumptions used: We use the following conditions, matching the main text:

  • (A1) Bounded loss: $\ell(\cdot; z) \in [a, b]$.
  • (A2) Parameter stability: for neighboring datasets $D, D^{(i)}$, $\|A_T(D) - A_T(D^{(i)})\| \le \kappa / n$.
  • (A3) Global Lipschitz: for all $z$ and all $f, f'$, $|\ell(f; z) - \ell(f'; z)| \le L \|f - f'\|$.
  • (A4) Local regularity on $B(f_T, 2\rho)$: for all $z$, $\ell(\cdot; z)$ is $\alpha$-smooth and $\|\nabla \ell(f_T; z)\| \le g_0$.

Th...

Bounded differences for the (unperturbed) teacher gap:

Lemma 13 (Bounded differences for $h(D_n, f_T)$ under (A1)–(A3)). Let $f_T = A_T(D_n)$ and $f_T' = A_T(D_n^{(i)})$ for neighboring datasets. Then

$$\big|h(D_n, f_T) - h(D_n^{(i)}, f_T')\big| \le \frac{2\kappa L + (b - a)}{n}. \tag{48}$$

Consequently, $h(D_n, f_T)$ is sub-Gaussian with proxy $\sigma_0 = \frac{1}{\sqrt{n}}\left(\kappa L + \frac{b - a}{2}\right)$.

Proof. Write $h(D, f) = L_P(f) - L_D(f)$. Then $|h(D$...

Bounded differences for the perturbed gap $h_U$:

Lemma 14 (Proxy for $\sigma_u(\rho)$ under (A1), (A2), (A4)). Under (A1), (A2), (A4), the perturbed gap $h_U(D_n, f_T)$ is sub-Gaussian with proxy

$$\sigma_u(\rho) = \frac{1}{\sqrt{n}}\left(\kappa (g_0 + \alpha \rho) + \frac{b - a}{2}\right).$$

Proof. The proof repeats Lemma 13, replacing the global Lipschitz constant $L$ by the local Lipschitz scale $(g_0 + \alpha \rho)$ valid on the perturbation region. Step ...

Bounded differences for empirical sharpness and the proxy $\nu(\rho)$:

Lemma 15 (Proxy for $\nu(\rho)$ under (A2), (A4)). Under (A2) and (A4), assume moreover that for any neighboring datasets $D, D^{(i)}$, if we set $f := A_T(D)$ and $f' := A_T(D^{(i)})$, then the $\rho$-neighborhood of the line segment joining $f$ and $f'$ lies inside the local region on which the Hessian bound in (A4) is valid. Th...

Bounding the population sharpness by curvature:

Lemma 16 (Population sharpness bound under (A5′)). Assume $\sup_{\|v\| \le \rho} \|\nabla^2 L_P(f_T + v)\|_{\mathrm{op}} \le \tau_{\mathrm{op}}$. Then

$$\mathbb{E}[S_P(f_T)] \le \tfrac{1}{2} \tau_{\mathrm{op}} \rho^2.$$

Proof. Recall $S_P(f_T) = \mathbb{E}_U\big[L_P(f_T + U) - L_P(f_T)\big]$. By Taylor's theorem with integral remainder,

$$L_P(f_T + u) - L_P(f_T) = \langle \nabla L_P(f_T), u \rangle + \int_0^1 (1 - t)\, u^\top \nabla^2 L_P(f_T + t u)\, u \, dt.$$

Taking expectation ov...

Baseline and sharpness-aware bounds: Define

$$B_{\mathrm{std}} := \mathrm{gen}_T + \sigma_0 \sqrt{2 K_n}, \qquad B_{\mathrm{sh}}(\rho) := \mathrm{gen}_T + \mathbb{E}[S_P(f_T)] + (\sigma_u(\rho) + \nu(\rho)) \sqrt{2 K_n}.$$

We seek conditions under which $B_{\mathrm{sh}}(\rho) < B_{\mathrm{std}}$.
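A small sketch makes the comparison concrete. It treats the proxies $\sigma_0$, $\sigma_u(\rho)$, $\nu(\rho)$ and the expected population sharpness as given numbers; the function names and all numeric values are hypothetical, chosen only to illustrate when the sharpness-aware bound is the smaller of the two.

```python
import math

def b_std(gen_t, sigma0, k_n):
    """Baseline bound: gen_T + sigma_0 * sqrt(2 K_n)."""
    return gen_t + sigma0 * math.sqrt(2.0 * k_n)

def b_sh(gen_t, pop_sharpness, sigma_u, nu, k_n):
    """Sharpness-aware bound: gen_T + E[S_P] + (sigma_u + nu) * sqrt(2 K_n)."""
    return gen_t + pop_sharpness + (sigma_u + nu) * math.sqrt(2.0 * k_n)

def sharpness_bound_is_tighter(pop_sharpness, sigma0, sigma_u, nu, k_n):
    """Equivalent condition: E[S_P] < (sigma_0 - sigma_u - nu) * sqrt(2 K_n)."""
    return pop_sharpness < (sigma0 - sigma_u - nu) * math.sqrt(2.0 * k_n)

# Assumed toy values: a flat teacher with small local proxies.
gen_t, k_n = 0.05, 0.5
sigma0, sigma_u, nu = 0.30, 0.10, 0.05
flat = 0.02    # small expected population sharpness
sharp = 0.20   # large expected population sharpness

flat_is_tighter = sharpness_bound_is_tighter(flat, sigma0, sigma_u, nu, k_n)
sharp_is_tighter = sharpness_bound_is_tighter(sharp, sigma0, sigma_u, nu, k_n)
```

With these numbers the flat teacher gives `b_sh < b_std`, while the sharp one does not. The gain also grows with $K_n$, since the saving $(\sigma_0 - \sigma_u - \nu)\sqrt{2K_n}$ must outweigh the additive sharpness penalty.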

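Proof. A sufficient condition for $B_{\mathrm{sh}}(\rho) < B_{\mathrm{std}}$ is

$$\mathbb{E}[S_P(f_T)] < \big(\sigma_0 - \sigma_u(\rho) - \nu(\rho)\big) \sqrt{2 K_n}.$$

By Lemma 16, $\mathbb{E}[S_P(f_T)] \le \tfrac{1}{2} \tau_{\mathrm{op}} \rho^2$. Moreover, using the standard proxy $\sigma_0 = \frac{1}{\sqrt{n}}\left(\kappa L + \frac{b - a}{2}\right)$, the local proxy $\sigma_u(\rho) \le \frac{1}{\sqrt{n}}\left(\kappa (g_0 + \alpha \rho) + \frac{b - a}{2}\right)$, and Lemma 15, $\nu(\rho) = \frac{1}{\sqrt{n}}\left(\frac{\alpha \kappa}{2} \rho + g_0 \rho + \frac{\alpha}{2} \rho^2\right)$, we obtain the lower bound

$$\sigma_0 - \sigma_u(\rho) - \nu(\rho) \ge \frac{1}{\sqrt{n}}\left(A_0 - A_1 \rho - A_2 \rho^2\right),$$

where $A_0 := \kappa($...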