arxiv: 2606.03052 · v1 · pith:S6IXPUS6new · submitted 2026-06-02 · 💻 cs.LG

What Do Students Learn? A Feature-Level Analysis of Dark Knowledge

Pith reviewed 2026-06-28 11:19 UTC · model grok-4.3

classification 💻 cs.LG
keywords: knowledge distillation · self-distillation · feature learning · dark knowledge · confusion matrix · CIFAR-100 · ResNet

The pith

Effective knowledge distillation prunes low-frequency sample-specific features, and a model's dataset-level confusion matrix encodes analogous dark knowledge for teacher-free self-distillation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how student models acquire features during knowledge distillation by applying the Interaction Tensor framework to track feature interactions. It finds that successful distillation regularizes learning by removing low-frequency, sample-specific features and favoring a smaller set of reusable ones across the dataset. The authors observe that the confusion matrix computed over the entire training set carries structural information similar to the teacher's dark knowledge. From this they derive Confusion Distillation, a self-distillation procedure that treats the student's own evolving confusion patterns as dynamic soft targets. On CIFAR-100 the method matches or exceeds prior self-distillation baselines while remaining computationally lighter than standard teacher-student distillation.
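The summary describes Confusion Distillation only at a high level. As a minimal sketch, assuming the soft target for a sample of class c is row c of the row-normalized training-set confusion matrix, and that it is mixed with the one-hot label by a hypothetical weight `alpha`, the idea could look like the NumPy code below. The function names, `alpha`, and the smoothing constant `eps` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def confusion_soft_targets(labels, preds, num_classes, eps=1e-8):
    """Row-normalized confusion matrix: row c is the distribution of
    predicted classes over all training samples whose true class is c."""
    counts = np.zeros((num_classes, num_classes), dtype=np.float64)
    np.add.at(counts, (labels, preds), 1.0)  # accumulate (true, predicted) pairs
    counts += eps  # assumption: keeps rows of unseen classes well defined
    return counts / counts.sum(axis=1, keepdims=True)

def cd_loss(logits, labels, soft_targets, alpha=0.5):
    """Cross-entropy against a mix of one-hot labels and confusion rows.
    `alpha` is a hypothetical mixing weight, not a value from the paper."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # stable log-softmax
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    one_hot = np.eye(soft_targets.shape[0])[labels]
    targets = (1.0 - alpha) * one_hot + alpha * soft_targets[labels]
    return float(-(targets * log_probs).sum(axis=1).mean())

# Toy predictions (made up for illustration): 6 samples, 3 classes.
labels = np.array([0, 0, 1, 2, 2, 2])
preds = np.array([0, 1, 1, 2, 0, 2])
targets = confusion_soft_targets(labels, preds, num_classes=3)
loss = cd_loss(np.zeros((6, 3)), labels, targets)
```

In a training loop the matrix would presumably be recomputed from the model's current training-set predictions, for example once per epoch, so that the targets evolve with the model; the update schedule is an assumption here.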

Core claim

Knowledge distillation functions as a regularizer that prunes low-frequency sample-specific features and encourages reliance on compact reusable features; the dataset-level confusion matrix contains structural information analogous to a teacher's dark knowledge, which can be exploited directly as dynamic soft targets in a teacher-free self-distillation procedure called Confusion Distillation.

What carries the argument

The Interaction Tensor framework for decomposing feature learning, together with the dataset-level confusion matrix used as evolving soft targets in Confusion Distillation.

If this is right

  • Students succeed in distillation by discarding low-frequency sample-specific features in favor of a compact reusable feature set.
  • The confusion matrix computed over the training set can substitute for teacher logits as soft targets.
  • Confusion Distillation achieves competitive accuracy on ResNet-34 and ResNet-50 for CIFAR-100 while avoiding a separate teacher model.
  • The method outperforms CS-KD and PS-KD by 1.2 percentage points on the same benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Tracking confusion evolution during training may supply a general signal for regularization in other self-supervised or semi-supervised settings.
  • The pruning effect on low-frequency features could be tested by measuring feature reuse frequency before and after applying the method.
  • Because no external teacher is required, the approach may scale to larger models where training a separate teacher is prohibitive.
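The feature-reuse measurement suggested in the second bullet can be sketched as follows, assuming a binary presence matrix whose entry `presence[i, k]` is 1 when feature k is active on data point i. How such a matrix is obtained from the Interaction Tensor is outside this sketch, and `low_freq_threshold` is a hypothetical cutoff for what counts as a low-frequency feature.

```python
import numpy as np

def feature_frequency_stats(presence, low_freq_threshold=2):
    """presence: (num_points, num_features) binary array, 1 where a feature
    is active on a data point. Returns the sorted per-feature frequencies,
    the per-point active-feature counts, and the share of used features
    whose frequency is at or below the threshold."""
    presence = np.asarray(presence, dtype=bool)
    freq = presence.sum(axis=0)        # number of data points per feature
    per_point = presence.sum(axis=1)   # number of active features per data point
    used = freq > 0                    # ignore features that never activate
    if used.any():
        low_share = float((freq[used] <= low_freq_threshold).mean())
    else:
        low_share = 0.0
    return np.sort(freq), per_point, low_share

# Toy presence matrix (made up for illustration): 4 data points, 5 features.
presence = np.array([
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 1, 0],
])
sorted_freq, per_point, low_share = feature_frequency_stats(
    presence, low_freq_threshold=1
)
```

Comparing `low_share` for a baseline model and a distilled model would be one way to check whether the low-frequency tail shrinks.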

Load-bearing premise

The dataset-level confusion matrix contains structural information analogous to the teacher's dark knowledge.

What would settle it

Running Confusion Distillation on a dataset where the confusion matrix shows no correlation with teacher logit patterns and measuring whether accuracy remains at or below standard cross-entropy training or other self-distillation baselines.
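This test depends on how the correlation between the confusion matrix and teacher logit patterns is quantified, which the page does not specify. One possible measure, offered as an assumption: average the teacher's softmax outputs over the samples of each true class to get a class-by-class matrix, then compare its off-diagonal entries with those of the student's confusion matrix by cosine similarity. The helper names below are hypothetical.

```python
import numpy as np

def class_mean_teacher_probs(teacher_probs, labels, num_classes):
    """Average teacher softmax outputs over the samples of each true class."""
    out = np.zeros((num_classes, num_classes))
    for c in range(num_classes):
        members = labels == c
        if members.any():  # leave the row at zero for classes with no samples
            out[c] = teacher_probs[members].mean(axis=0)
    return out

def offdiag_cosine(a, b):
    """Cosine similarity between the off-diagonal entries of two square
    matrices. Returns 0.0 if either off-diagonal part is all zeros."""
    mask = ~np.eye(a.shape[0], dtype=bool)
    x, y = a[mask], b[mask]
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return float(x @ y / denom) if denom > 0 else 0.0

# Toy inputs (made up for illustration): 4 samples, 3 classes.
teacher_probs = np.array([
    [0.7, 0.2, 0.1],
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.0],
    [0.1, 0.1, 0.8],
])
labels = np.array([0, 0, 1, 2])
teacher_matrix = class_mean_teacher_probs(teacher_probs, labels, num_classes=3)
student_confusion = np.array([
    [0.8, 0.2, 0.0],
    [0.1, 0.9, 0.0],
    [0.0, 0.1, 0.9],
])
similarity = offdiag_cosine(teacher_matrix, student_confusion)
```

The diagonal is masked out because it mostly reflects accuracy; the off-diagonal entries are where class-to-class similarity structure would show up.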

Figures

Figures reproduced from arXiv: 2606.03052 by Seungu Kang, Songkuk Kim.

Figure 1: (a) Feature frequency distributions for baseline, student, and teacher models. Features are sorted in ascending order based on the number of data points in which each feature appears. (b) Feature frequency distributions of commonly activated features shared by baseline, student, and teacher models.
… a large number of local, data-dependent features. This observation aligns with the theoretical finding that de…
Figure 2: A 2D kernel density estimation plot showing the relationship between …
Figure 3: (a) Feature frequency distributions for baseline, CD models. (b) Comparison of the number of activated features per data point for baseline, CD models.
4.3 Effects of Confusion Distillation. To examine whether Confusion Distillation (CD) produces effects similar to those of conventional Knowledge Distillation (KD), we analyze how the CD-trained model learns and utilizes features using the Interaction Tenso…
Original abstract

Knowledge Distillation (KD) is a powerful tool for model compression, yet the precise mechanisms by which student models acquire feature representations remain underexplored. In this work, we analyze student feature learning using the Interaction Tensor framework. Our analysis reveals that effective KD acts as a regularizer that prunes low-frequency, sample-specific features, encouraging the student to rely on a compact set of highly reusable features. Crucially, we observe that the dataset-level confusion matrix contains structural information analogous to the teacher's "Dark Knowledge." Leveraging this insight, we propose Confusion Distillation (CD), a teacher-free self-distillation method that utilizes the model's own evolving confusion patterns as dynamic soft targets. CD achieves competitive performance on ResNet-34 and ResNet-50 for CIFAR-100, outperforming existing self-distillation methods like CS-KD and PS-KD by 1.2% while offering a computationally efficient alternative to standard KD.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper analyzes student feature learning in knowledge distillation via the Interaction Tensor framework, claiming that effective KD regularizes by pruning low-frequency sample-specific features in favor of reusable ones. It asserts that the dataset-level confusion matrix encodes structural information analogous to a teacher's dark knowledge. This leads to the proposal of Confusion Distillation (CD), a teacher-free self-distillation method that uses the model's evolving confusion patterns as dynamic soft targets. CD is reported to achieve competitive accuracy on ResNet-34 and ResNet-50 for CIFAR-100, outperforming CS-KD and PS-KD by 1.2%.

Significance. If the analogy between confusion matrices and dark knowledge is substantiated and the performance claims hold under rigorous controls, the work could advance understanding of self-distillation mechanisms and provide a computationally lightweight alternative to standard KD. The application of the Interaction Tensor framework for feature-level analysis would be a strength if it yields falsifiable, reproducible insights into feature pruning.

major comments (2)
  1. [§3] §3 (Interaction Tensor analysis): the central assertion that the dataset-level confusion matrix contains structural information analogous to the teacher's dark knowledge lacks explicit quantitative support, such as cosine similarity, eigenvalue comparisons, or direct matrix alignment between student confusion patterns and those of a converged teacher model.
  2. [§5] §5 (CD method and experiments): the reported 1.2% gain over CS-KD and PS-KD on CIFAR-100 is load-bearing for the contribution, yet no ablation isolates the off-diagonal confusion structure from generic label smoothing or self-regularization effects; without this, the method risks reducing to standard regularization whose advantage may not persist under matched hyperparameters.
minor comments (2)
  1. [Experiments section] Table reporting CIFAR-100 results: include standard deviations across multiple runs and the exact number of trials to allow assessment of statistical significance for the 1.2% margin.
  2. [§2] Notation for the Interaction Tensor: clarify whether it is a new construction or drawn from prior work, and provide a self-contained definition or reference in the main text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below, agreeing that additional quantitative support and ablations will strengthen the claims, and we will incorporate these in the revision.

Point-by-point responses
  1. Referee: [§3] §3 (Interaction Tensor analysis): the central assertion that the dataset-level confusion matrix contains structural information analogous to the teacher's dark knowledge lacks explicit quantitative support, such as cosine similarity, eigenvalue comparisons, or direct matrix alignment between student confusion patterns and those of a converged teacher model.

    Authors: We agree that the current presentation relies primarily on the Interaction Tensor analysis and downstream performance of CD for support. To provide explicit quantitative backing, the revised manuscript will include cosine similarity scores and eigenvalue spectrum comparisons between evolving student confusion matrices and those from a converged teacher model trained via standard KD. These metrics will be reported across training epochs on CIFAR-100.
    revision: yes

  2. Referee: [§5] §5 (CD method and experiments): the reported 1.2% gain over CS-KD and PS-KD on CIFAR-100 is load-bearing for the contribution, yet no ablation isolates the off-diagonal confusion structure from generic label smoothing or self-regularization effects; without this, the method risks reducing to standard regularization whose advantage may not persist under matched hyperparameters.

    Authors: This concern is well-taken, as the current experiments do not fully disentangle the structured off-diagonal contributions. In revision, we will add ablations using (i) only the diagonal of the confusion matrix (dynamic label smoothing) and (ii) off-diagonal elements with row-wise shuffling to preserve marginals but remove structure. All variants will use identical hyperparameter search budgets and be compared directly to CS-KD and PS-KD on the same ResNet-34/50 CIFAR-100 splits.
    revision: yes
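The two ablation targets named in the second response can be sketched as below. This is one reading of the description, not the authors' code. It assumes a row-normalized confusion matrix `conf`, that "only the diagonal" means keeping each row's diagonal entry and spreading the remaining mass uniformly over the other classes, and that "row-wise shuffling" means permuting the off-diagonal entries within each row.

```python
import numpy as np

def diagonal_only_targets(conf):
    """Keep each row's diagonal entry and spread the remaining mass
    uniformly over the other classes (a label-smoothing-like target)."""
    k = conf.shape[0]
    diag = np.diag(conf)
    out = np.repeat(((1.0 - diag) / (k - 1))[:, None], k, axis=1)
    np.fill_diagonal(out, diag)
    return out

def shuffle_offdiag_rows(conf, rng):
    """Permute the off-diagonal entries within each row, so the diagonal
    and the row sums are unchanged but class-pair structure is destroyed."""
    out = conf.copy()
    k = conf.shape[0]
    for c in range(k):
        idx = np.array([j for j in range(k) if j != c])
        out[c, idx] = conf[c, rng.permutation(idx)]
    return out

# Toy row-normalized confusion matrix (made up for illustration).
conf = np.array([
    [0.70, 0.20, 0.10, 0.00],
    [0.05, 0.80, 0.10, 0.05],
    [0.30, 0.00, 0.60, 0.10],
    [0.00, 0.10, 0.10, 0.80],
])
rng = np.random.default_rng(0)
diag_targets = diagonal_only_targets(conf)
shuffled_targets = shuffle_offdiag_rows(conf, rng)
```

If the full confusion targets outperform both variants under matched hyperparameters, that would point to the off-diagonal structure, rather than generic smoothing, as the source of the gain.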

Circularity Check

0 steps flagged

No significant circularity; derivation rests on empirical observation and experimental validation

full rationale

The paper's chain proceeds from Interaction Tensor analysis of student features under KD, to the observation that a dataset-level confusion matrix encodes reusable structure analogous to dark knowledge, to the proposal of Confusion Distillation using the model's own evolving predictions as soft targets, with performance measured on CIFAR-100. None of these steps reduce by construction to fitted parameters renamed as predictions, self-definitional equations, or load-bearing self-citations; the analogy is presented as an empirical finding rather than a tautology, and the 1.2% gain is an external experimental outcome. The method is therefore self-contained against the listed circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no details provided on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5686 in / 1112 out tokens · 28430 ms · 2026-06-28T11:19:31.486785+00:00 · methodology

Reference graph

Works this paper leans on

28 extracted references · 2 canonical work pages · 1 internal anchor

  1. [1]

    Y. Jiang, C. Baek, and J. Z. Kolter, On the Joint Interaction of Models, Data, and Features, in International Conference on Learning Representations (ICLR), 2024

  2. [2]

    G. Hinton, O. Vinyals, and J. Dean, Distilling the Knowledge in a Neural Network, in Proceedings of the NIPS Deep Learning and Representation Learning Workshop, 2015

  3. [3]

    L. Yuan, F. E. H. Tay, G. Li, T. Wang, and J. Feng, Revisiting Knowledge Distillation via Label Smoothing Regularization, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020

  4. [4]

    H. Zhou, L. Song, J. Chen, Y. Zhou, G. Wang, J. Yuan, and Q. Zhang, Rethinking Soft Labels for Knowledge Distillation: A Bias-Variance Tradeoff Perspective, in International Conference on Learning Representations (ICLR), 2021

  5. [5]

    K. He, X. Zhang, S. Ren, and J. Sun, Deep Residual Learning for Image Recognition, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016

  6. [6]

    S. Yun, J. Park, K. Lee, and J. Shin, Regularizing Class-wise Predictions via Self-knowledge Distillation, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020

  7. [7]

    K. Kim, B. Ji, D. Yoon, and S. Hwang, Self-Knowledge Distillation with Progressive Refinement of Targets, in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021

  8. [8]

    A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, FitNets: Hints for Thin Deep Nets, in International Conference on Learning Representations (ICLR), 2015

  9. [9]

    S. Zagoruyko and N. Komodakis, Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer, in International Conference on Learning Representations (ICLR), 2017

  10. [10]

    J. Kim, S. Park, and N. Kwak, Paraphrasing Complex Network: Network Compression via Factor Transfer, in Advances in Neural Information Processing Systems (NeurIPS), 2018

  11. [11]

    Y. Tian, D. Krishnan, and P. Isola, Contrastive Representation Distillation, in International Conference on Learning Representations (ICLR), 2020

  12. [12]

    W. Park, D. Kim, Y. Lu, and M. Cho, Relational Knowledge Distillation, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019

  13. [13]

    T. Furlanello, Z. Lipton, M. Tschannen, L. Itti, and A. Anandkumar, Born Again Neural Networks, in International Conference on Machine Learning (ICML), 2018

  14. [14]

    Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu, Deep Mutual Learning, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018

  15. [15]

    L. Zhang, J. Song, A. Gao, J. Chen, C. Bao, and K. Ma, Be Your Own Teacher: Improve the Performance of Convolutional Neural Networks via Self Distillation, in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019

  16. [16]

    M. Raghu, J. Gilmer, J. Yosinski, and J. Sohl-Dickstein, SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability, in Advances in Neural Information Processing Systems (NeurIPS), 2017

  17. [17]

    S. Kornblith, M. Norouzi, H. Lee, and G. Hinton, Similarity of Neural Network Representations Revisited, in International Conference on Machine Learning (ICML), 2019

  18. [18]

    T. Xu and C. Liu, Data-Distortion Guided Self-Distillation for Deep Neural Networks, in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2019

  19. [19]

    Z. Yang, A. Zeng, Z. Li, T. Zhang, C. Yuan, and Y. Li, From Knowledge Distillation to Self-Knowledge Distillation: A Unified Approach with Normalized Loss and Customized Soft Labels, in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023

  20. [20]

    S. Koço and C. Capponi, On multi-class classification through the minimization of the confusion matrix norm, in Asian Conference on Machine Learning (ACML), 2013

  21. [21]

    L. Yan, B. Zhong, and K.-K. Ma, Confusion-Aware Convolutional Neural Network for Image Classification, in International Conference on Neural Information Processing (ICONIP), 2019

  22. [22]

    N. Tsoi, K. Candon, D. Li, Y. Milkessa, and M. Vázquez, Bridging the Gap: Unifying the Training and Evaluation of Neural Network Binary Classifiers, in Advances in Neural Information Processing Systems (NeurIPS), 2022

  23. [23]

    D. Han, N. Moniz, and N. V. Chawla, AnyLoss: Transforming Classification Metrics into Loss Functions, in Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2024

  24. [24]

    P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, arXiv preprint arXiv:1706.02677, 2017

  25. [25]

    J. Tang, R. Shivanna, Z. Zhao, D. Lin, A. Singh, E. H. Chi, and S. Jain, Understanding and Improving Knowledge Distillation, arXiv preprint arXiv:2002.03532, 2020

  26. [26]

    R. Müller, S. Kornblith, and G. Hinton, When Does Label Smoothing Help?, in Advances in Neural Information Processing Systems (NeurIPS), 2019

  27. [27]

    V. Feldman, Does Learning Require Memorization? A Short Tale about a Long Tail, in Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing (STOC), 2020

  28. [28]

    A. Tarvainen and H. Valpola, Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results, in Advances in Neural Information Processing Systems (NeurIPS), 2017