pith. machine review for the scientific record.

arxiv: 2604.25110 · v2 · submitted 2026-04-28 · 💻 cs.LG · cs.AI

Recognition: unknown

Knowledge Distillation Must Account for What It Loses

Wenshuo Wang

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 03:27 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords knowledge distillation · model compression · evaluation metrics · capability preservation · lossy projection · distillation losses · position paper

The pith

Distillation often lets students match teacher task scores while losing the capabilities that make those scores reliable.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Knowledge distillation compresses large teacher models into smaller students for practical use. The paper contends that judging success solely by retained task performance misses whether the student still behaves like the teacher in the ways that matter for reliability. It reframes distillation as a lossy projection that can match selected observables while dropping other properties. Evidence from existing work is organized into a taxonomy of recurring off-metric losses that are measurable yet rarely reported. The author proposes scenario-specific preservation targets and a Distillation Loss Statement to make explicit what is kept, what is discarded, and why the losses are tolerable.
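The paper does not fix a concrete format for the Distillation Loss Statement here. As an editorial illustration only, a minimal sketch of how such a statement could be captured as structured data; the field names and example values are invented for this page, not taken from the paper.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MeasuredLoss:
    """One quantified off-metric loss (all fields illustrative)."""
    capability: str       # e.g. "calibration", "out-of-distribution robustness"
    metric: str           # how the loss was measured
    teacher_value: float
    student_value: float

@dataclass
class DistillationLossStatement:
    """Hypothetical shape for the proposed statement: what was preserved,
    what was lost, and why the remaining losses are judged acceptable."""
    scenario: str                        # deployment scenario the student targets
    preservation_targets: List[str]      # capabilities that must survive distillation
    preserved: List[str]                 # capabilities verified as retained
    measured_losses: List[MeasuredLoss]  # off-metric losses that were quantified
    justification: str                   # why the remaining losses are tolerable

# Invented example values, for shape only.
example = DistillationLossStatement(
    scenario="on-device image classification",
    preservation_targets=["top-1 accuracy", "confidence calibration"],
    preserved=["top-1 accuracy within 0.5 points of the teacher"],
    measured_losses=[
        MeasuredLoss("calibration", "expected calibration error", 0.021, 0.048),
    ],
    justification="confidence scores are not surfaced to users in this deployment",
)
```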

Core claim

The paper claims that current evaluation assumes retained task scores imply retained teacher capabilities, whereas reframing distillation as a lossy projection shows students can match selected observables without preserving the capabilities that make teacher behavior reliable. Existing studies already contain concrete, recurring, measurable off-metric losses that go unaccounted for when only retention is reported.

What carries the argument

Reframing knowledge distillation as a lossy projection, together with a taxonomy of off-metric distillation losses and the proposed Distillation Loss Statement that reports preserved elements, lost elements, and acceptable remaining losses.

If this is right

  • Evaluations will need to check preservation of specific teacher capabilities beyond headline task metrics.
  • Different deployment scenarios will require distinct preservation targets rather than uniform score matching.
  • A Distillation Loss Statement will document what was kept, what was lost, and the justification for remaining losses.
  • Studies will shift from reporting only retained performance to also quantifying and accepting off-metric losses.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Benchmark suites could add capability probes that are independent of the original training task to expose hidden losses.
  • In regulated domains such as healthcare or autonomous systems, the statement could become part of model release documentation.
  • The same logic may apply to other compression methods like pruning or quantization where performance metrics can mask behavioral drift.

Load-bearing premise

That current evaluation treats retained task scores as evidence of preserved teacher capabilities, and that off-metric losses are concrete enough to be identified and measured in practice.

What would settle it

A controlled distillation experiment in which students achieve equivalent task scores to the teacher yet show no measurable differences on capability tests for robustness, calibration, or out-of-distribution behavior would weaken the claim.
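As a rough illustration of what such capability tests could measure, the sketch below (not from the paper) compares a teacher and a student on the headline metric, accuracy, and on one off-metric property, expected calibration error; the probability arrays and labels are placeholders for whatever models and evaluation set an experiment would use.

```python
import numpy as np

def expected_calibration_error(probs: np.ndarray, labels: np.ndarray, n_bins: int = 10) -> float:
    """Binned ECE: average gap between confidence and accuracy, weighted by bin size."""
    confidences = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    correct = (predictions == labels).astype(float)
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - confidences[in_bin].mean())
    return float(ece)

def compare_teacher_student(teacher_probs, student_probs, labels):
    """Report the headline metric (accuracy) next to an off-metric property (calibration)."""
    report = {}
    for name, probs in (("teacher", teacher_probs), ("student", student_probs)):
        report[name] = {
            "accuracy": float((probs.argmax(axis=1) == labels).mean()),
            "ece": expected_calibration_error(probs, labels),
        }
    return report
```

Equal accuracies paired with a clearly worse student ECE would be exactly the matched-score, lost-capability pattern the paper describes.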

read the original abstract

This position paper argues that knowledge distillation must account for what it loses: student models should be judged not only by retained task scores, but by whether they preserve the teacher capabilities that make those scores reliable. This matters because distillation is increasingly used to turn large teacher models into deployable students, yet headline metrics can obscure losses in the capabilities that make teacher behavior reliable. Conceptually, we show that current evaluation often assumes retained task scores imply retained teacher capabilities. Reframing distillation as a lossy projection exposes this flaw: students may match selected teacher observables without preserving the capabilities that make them reliable. We then synthesize existing evidence into a taxonomy of off-metric distillation losses, showing that such losses are concrete, recurring, and measurable, yet often unaccounted for when studies report what students retain rather than what they lose. To make the position actionable, we propose scenario-specific preservation targets and a Distillation Loss Statement that reports what was preserved, what was lost, and why the remaining losses are acceptable. The goal is not lossless distillation, but accountable distillation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. This position paper argues that knowledge distillation must account for what it loses: student models should be judged not only by retained task scores, but by whether they preserve the teacher capabilities that make those scores reliable. It reframes distillation as a lossy projection to expose the flaw in assuming retained task scores imply retained capabilities, synthesizes existing evidence into a taxonomy of off-metric distillation losses (showing they are concrete, recurring, and measurable yet often unaccounted for), and proposes scenario-specific preservation targets along with a Distillation Loss Statement that reports what was preserved, what was lost, and why remaining losses are acceptable.

Significance. If the position holds, this work could shift evaluation norms in distillation research toward more accountable reporting of capability losses, particularly for deployed student models in reliability-sensitive settings. The synthesis of prior evidence into a structured taxonomy and the introduction of concrete tools (preservation targets and the Distillation Loss Statement) provide a practical framework that builds directly on existing literature without introducing new parameters or ungrounded entities.

major comments (2)
  1. [Proposal for preservation targets and Distillation Loss Statement] The proposal for scenario-specific preservation targets and the Distillation Loss Statement is central to the claim of actionability. The manuscript does not supply a template, example format, or worked illustration of the Statement (e.g., what fields it would contain or how it would be populated for a concrete distillation scenario); such an illustration is load-bearing, since readers need it to assess the proposal's feasibility.
  2. [Taxonomy of off-metric distillation losses] The taxonomy of off-metric distillation losses asserts that such losses 'are concrete, recurring, and measurable, yet often unaccounted for.' Because this synthesis underpins the reframing and the call for change, the manuscript should include at least one specific citation or brief summary per category that demonstrates an observed loss in prior work that was omitted from standard task-score reporting.
minor comments (1)
  1. [Abstract] The abstract introduces the term 'Distillation Loss Statement' without a one-sentence definition or parenthetical gloss; a brief clarification on first use would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of our position paper and the recommendation for minor revision. We address each major comment below, agreeing to incorporate concrete additions that strengthen the actionability and evidentiary support of our proposals without altering the core arguments.

read point-by-point responses
  1. Referee: [Proposal for preservation targets and Distillation Loss Statement] The proposal for scenario-specific preservation targets and the Distillation Loss Statement is central to the claim of actionability. The manuscript does not supply a template, example format, or worked illustration of the Statement (e.g., what fields it would contain or how it would be populated for a concrete distillation scenario); such an illustration is load-bearing, since readers need it to assess the proposal's feasibility.

    Authors: We agree that an explicit template and worked example are necessary to demonstrate feasibility. In the revised manuscript, we will add a new subsection providing a clear template for the Distillation Loss Statement with fields including Scenario Description, Preservation Targets, Measured Losses (with methods), and Justification for Acceptability of Remaining Losses. We will populate this template with a worked illustration drawn from a standard distillation scenario in the literature (e.g., distilling a vision transformer for image classification), showing how the fields would be completed based on patterns from existing studies. This addition will be placed in the section on making the position actionable. revision: yes

  2. Referee: [Taxonomy of off-metric distillation losses] The taxonomy of off-metric distillation losses asserts that such losses 'are concrete, recurring, and measurable, yet often unaccounted for.' Because this synthesis underpins the reframing and the call for change, the manuscript should include at least one specific citation or brief summary per category that demonstrates an observed loss in prior work that was omitted from standard task-score reporting.

    Authors: We acknowledge the value of grounding each taxonomy category with specific evidence. We will revise the taxonomy section to include, for every loss category, at least one citation to prior work accompanied by a brief summary of the observed off-metric loss that was not reported via standard task scores. These citations and summaries will be selected from the existing literature synthesized in the paper, ensuring the additions remain within the scope of a position paper and do not require new experiments. This will directly support the claim that such losses are recurring yet unaccounted for. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

This is a position paper that reframes knowledge distillation conceptually as a lossy projection and synthesizes existing literature into a taxonomy of off-metric losses, without introducing equations, derivations, fitted parameters, or quantitative predictions. The central claims rest on references to prior external evidence rather than internal self-citations, self-definitions, or renamings that reduce to the paper's own inputs by construction. No load-bearing step equates a claimed result to a fitted input or to the authors' own prior work in a circular manner; the argument is checked against external evidence rather than against its own constructions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central position rests on the domain assumption that distillation is inherently lossy in unmeasured capabilities and that current metrics obscure this. It introduces the Distillation Loss Statement as a new reporting construct without independent prior evidence.

axioms (1)
  • domain assumption Current distillation evaluation assumes retained task scores imply retained teacher capabilities
    Explicitly stated in the abstract as the flaw being addressed.
invented entities (1)
  • Distillation Loss Statement no independent evidence
    purpose: A reporting format that documents preserved capabilities, lost capabilities, and justification for acceptable losses
    Proposed as a new actionable tool in the abstract with no reference to prior existence or validation.

pith-pipeline@v0.9.0 · 5474 in / 1250 out tokens · 60462 ms · 2026-05-08T03:27:12.282950+00:00 · methodology

discussion (0)

