pith. machine review for the scientific record.

arxiv: 2605.07191 · v1 · submitted 2026-05-08 · 💻 cs.CV · cs.LG

Recognition: no theorem link

Attention Transfer Is Not Universally Effective for Vision Transformers

Authors on Pith · no claims yet

Pith reviewed 2026-05-11 02:41 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords attention transfer · vision transformers · ViT · knowledge distillation · architectural mismatch · pre-trained models · attention patterns · model transfer

The pith

Attention transfer in Vision Transformers succeeds only when student architecture matches the teacher.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

A prior result claimed that transferring only attention patterns from a pre-trained teacher Vision Transformer could fully recover the benefits of its weights for any standard student ViT. This paper tests the claim on 20 teachers from 11 ViT families and finds success in only 7 families, with the other 4 failing by as much as 5.1% below even a no-transfer baseline. The failures hold across model sizes, longer training, different datasets, and out-of-distribution tests. Controlled experiments localize the issue to the attention-routing channel and show that adding the teacher's native components to the randomly initialized student reverses every failure, while those same components give no gain in ordinary from-scratch training. The work concludes that attention patterns are sufficient only when the student's architecture matches the teacher's.
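The mechanism under test can be sketched in miniature: attention transfer trains a student so that its row-stochastic attention maps match the teacher's, here via the plain cross-entropy transfer loss the paper uses as its default. This is an illustrative numpy toy with hypothetical shapes, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_map(q, k):
    # q, k: (tokens, dim) -> (tokens, tokens) row-stochastic attention map.
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d), axis=-1)

def attention_distillation_loss(student_attn, teacher_attn, eps=1e-12):
    # Per-row cross-entropy of the student's attention against the teacher's;
    # minimized exactly when the student reproduces the teacher's rows.
    return float(-(teacher_attn * np.log(student_attn + eps)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
teacher = attention_map(rng.normal(size=(4, 8)), rng.normal(size=(4, 8)))
student = attention_map(rng.normal(size=(4, 8)), rng.normal(size=(4, 8)))

# A perfect match attains the teacher's own entropy floor; any mismatch adds KL.
assert attention_distillation_loss(teacher, teacher) <= attention_distillation_loss(student, teacher)
```

The paper's point is precisely that driving this loss to its floor is not enough: the matched patterns can still be non-functional for a mismatched student architecture.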

Core claim

Attention Transfer is not universally effective for Vision Transformers. It recovers full pre-training benefits for some families yet produces consistent failures, up to 5.1% below the from-scratch baseline, for others. These failures are family-consistent, persist under varied training conditions, and localize to the attention-routing channel. The root cause is architectural mismatch: transferred attention patterns remain non-functional for the student unless its structure is augmented with the teacher's native components. Adding those components alone reverses all failures, while they provide no benefit in standard from-scratch training, confirming that attention is sufficient only when the student's architecture matches the teacher's.

What carries the argument

Architectural mismatch between the pre-trained teacher Vision Transformer and the standard student, which renders transferred attention patterns non-functional unless the student's structure incorporates the teacher's specific components.
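Concretely, the "native-arch student" in the paper is the standard student plus only the teacher-native components it was missing, still randomly initialized. As a hedged illustration only: register tokens are one such component for some ViT families (per the registers line of work the paper cites); the exact per-family components are not reproduced here.

```python
import numpy as np

def augment_with_registers(patch_tokens, num_registers, rng):
    # Prepend randomly initialized register tokens to a ViT token sequence.
    # Hypothetical sketch: registers stand in for a "teacher-native component"
    # the standard student lacks. No pre-trained weights are copied.
    dim = patch_tokens.shape[-1]
    registers = rng.normal(scale=0.02, size=(num_registers, dim))
    return np.concatenate([registers, patch_tokens], axis=0)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(196, 384))   # e.g. 14x14 patches at ViT-S width
augmented = augment_with_registers(tokens, num_registers=4, rng=rng)
assert augmented.shape == (200, 384)
```

The paper's control is the interesting part: this kind of augmentation alone gives no from-scratch gain, yet it is what makes the transferred attention usable.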

If this is right

  • Attention transfer cannot be applied indiscriminately across different ViT families without first checking architectural compatibility.
  • Adding only the teacher's mismatched components to the student restores the full benefit of transferred attention patterns.
  • The performance gap from mismatch remains even with extended training, alternate datasets, or out-of-distribution evaluation.
  • Attention patterns alone do not capture the transferable value of ViT pre-training; supporting architectural elements must also align.
  • Future distillation work must treat architecture as a first-class requirement rather than assuming attention is the sole transferable element.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Distillation pipelines for transformers will need explicit architecture-alignment steps to avoid silent performance losses.
  • Similar mismatch effects may appear when distilling between language-model families that differ in attention or feed-forward design.
  • Model repositories could improve transfer success by publishing component-level descriptions that allow students to adopt teacher-specific blocks.
  • The finding invites experiments that optimize the student architecture jointly with the transfer objective rather than fixing it in advance.

Load-bearing premise

That the observed transfer failures are caused by architectural mismatch rather than untested factors such as optimization dynamics or subtle pre-training data differences.

What would settle it

A test in which a student is given exactly the teacher's architecture while every other variable stays fixed and attention transfer still fails to recover full performance.

Figures

Figures reproduced from arXiv: 2605.07191 by Chen Gong, Gabriel James Goenawan, Hongyuan Zhu, Huaiyuan Qin, Muli Yang, Peng Hu, Xi Peng.

Figure 1
Figure 1. Attention Transfer is not universally effective. We begin by evaluating Attention Transfer [8] on the broader benchmark of 20 pre-trained teachers from 11 well-known ViT families under the default ImageNet-1K setup. Each point represents one set of pre-trained weights; the axes report ∆ Top-1 accuracy (%) relative to the No-Transfer baseline under two transfer methods: Attention Copy (x-axis) and Attentio… view at source ↗
Figure 2
Figure 2. Extended training robustness. We extend the Attention Distillation training up to 100 epochs and report the gap relative to No-Transfer at every 10 epochs. The same success/failure split remains consistent with the results in … view at source ↗
Figure 3
Figure 3. Layer-wise decomposition. We transfer attention from the top or bottom k layers of DINOv2-S under both Attention Copy and Attention Distillation. In contrast to the monotonic gains in [8], no layer subset reproduces success-family gains, and harmful transfer appears across every transfer depth for this failure teacher. Component-Level Decomposition. We first ask whether the family-specific split is drive… view at source ↗
Figure 4
Figure 4. Architectural compatibility rescues ineffective Attention Transfer. For teachers from the failure families, we report ∆ accuracy (%) relative to the corresponding No-Transfer baselines for both Attention Copy and Attention Distillation. The native-arch student is augmented by adding only the teacher-native components missing from the standard student in a randomly initialized state, with no pre-trained wei… view at source ↗
Figure 5
Figure 5. Layer-wise teacher-to-student KL divergence. We plot per-layer KL divergence between the pre-trained teacher’s attention maps and the transferred student’s attention maps on ImageNet-1K under Attention Distillation. A clear spike of the standard-arch student at late layers can be observed, while the native-arch student consistently stays near-zero. This shows Attention Transfer can match teacher attention … view at source ↗
Figure 6
Figure 6. Ablation on behaviors of MSE loss vs. default CE loss. We compare behaviors of different transfer losses in Attention Distillation with varying λ. The same family-level trends can be observed under the λ-sweep. For success teachers, both losses yield monotonic gains with increasing λ, while for failure teachers, both losses cause larger drops with increasing λ. The native-architecture student remains consisten… view at source ↗
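The diagnostic behind Figure 5 is easy to reproduce in miniature: one KL divergence value per layer between teacher and student attention maps, where a late-layer spike flags where transferred attention stops matching. A self-contained sketch with toy maps; the shapes and data are illustrative, not the paper's.

```python
import numpy as np

def rowwise_kl(p, q, eps=1e-12):
    # Mean KL(p_row || q_row) over rows of two row-stochastic attention maps.
    p, q = p + eps, q + eps
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean())

def layerwise_attention_kl(teacher_layers, student_layers):
    # One divergence per layer, matching the per-layer curves of Figure 5.
    return [rowwise_kl(t, s) for t, s in zip(teacher_layers, student_layers)]

def random_attention(rng, tokens=8):
    # Toy row-stochastic map via softmax over random logits.
    logits = rng.normal(size=(tokens, tokens))
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
teacher = [random_attention(rng) for _ in range(12)]
student = [random_attention(rng) for _ in range(12)]
kls = layerwise_attention_kl(teacher, student)
assert len(kls) == 12 and all(k >= 0.0 for k in kls)
```

On this reading, the paper's key observation is that a near-zero curve (a good attention match) is compatible with failed transfer, which is what pushes the explanation toward architecture rather than optimization.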
read the original abstract

A recent work shows that Attention Transfer, which transfers only the attention patterns from a pre-trained teacher Vision Transformer (ViT) to a randomly initialized standard student ViT, is sufficient to recover the full benefit of the teacher's pre-trained weights. We revisit this finding on a comprehensive benchmark of 20 teachers from 11 well-known ViT families and reveal that Attention Transfer is not universally effective. While 7 families transfer successfully, 4 consistently fail, falling up to 5.1% below the from-scratch no-transfer baseline. Further results demonstrate that this failure is family-consistent across model sizes, and persists under extended training durations, different transfer datasets, and out-of-distribution evaluations. Controlled analyses then consistently localize the problem to the attention-routing channel, indicating that the key issue is not whether the student can match the teacher's attention patterns, but whether the matched patterns remain functional for the student. Crucially, we identify architectural mismatch between the pre-trained teacher and the standard student as the primary mechanism. By adding only the teacher's native architectural components to the student in a randomly initialized state, we completely reverse the failure for all 4 families. Notably, these components alone do not improve from-scratch training, confirming that they specifically unlock the usability of the teacher's attention. We further systematically show that this failure is not explained by the inadequate choice of transfer loss or by differences in pre-training recipes. Our findings refine the prevailing understanding of attention in ViT representations: attention is sufficient *only* when the student architecture matches the teacher.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper claims that Attention Transfer (AT) from pre-trained ViT teachers to randomly initialized standard student ViTs is not universally effective. Across a benchmark of 20 teachers from 11 ViT families, AT succeeds for 7 families but fails consistently for 4, yielding up to 5.1% worse accuracy than the from-scratch baseline. The failures persist across model scales, training lengths, transfer datasets, and OOD evaluations. Controlled experiments localize the issue to architectural mismatch in the attention-routing channel; augmenting the student with the teacher's native components (in random initialization) fully reverses the negative transfer for all failing families, while those components confer no benefit in from-scratch training. The authors rule out transfer-loss choice and pre-training recipe differences as explanations, concluding that attention patterns are sufficient only when student and teacher architectures match.

Significance. If the central empirical claim holds, the work meaningfully refines the prevailing view of attention transfer in ViTs by demonstrating that matched attention patterns are not automatically functional for the student. The broad benchmark (20 teachers, 11 families), consistency checks across scales/training regimes/datasets/OOD, and the clean reversal experiment when architectural components are restored constitute strong evidence. The finding that the added components improve transfer but not from-scratch training isolates the mechanism and has direct implications for distillation and transfer-learning practice in vision transformers.

major comments (1)
  1. [§4.3, §5.2] §4.3 and §5.2: the reversal experiment (adding teacher's native components to the student) is load-bearing for the architectural-mismatch claim. While the paper shows these components confer no from-scratch gain, it would strengthen the result to report the exact performance delta (with standard deviation over seeds) between the augmented-student AT run and the original failing AT run for each of the 4 families.
minor comments (3)
  1. [Table 2] Table 2: the caption should explicitly state whether the reported accuracies are means over multiple random seeds and, if so, include the number of seeds and standard deviations.
  2. [§3.2] §3.2: the definition of 'standard student ViT' could be made more precise by listing the exact architectural differences (e.g., patch embedding, positional encoding, MLP structure) relative to each teacher family.
  3. [Figure 4] Figure 4: the OOD evaluation plots would benefit from an additional panel or annotation showing the from-scratch baseline for direct visual comparison with the AT curves.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive evaluation and the constructive suggestion regarding the reversal experiment. We agree that providing the precise performance deltas with standard deviations will strengthen the clarity and impact of our architectural-mismatch claim.

read point-by-point responses
  1. Referee: [§4.3, §5.2] §4.3 and §5.2: the reversal experiment (adding teacher's native components to the student) is load-bearing for the architectural-mismatch claim. While the paper shows these components confer no from-scratch gain, it would strengthen the result to report the exact performance delta (with standard deviation over seeds) between the augmented-student AT run and the original failing AT run for each of the 4 families.

    Authors: We thank the referee for this helpful suggestion. In the revised manuscript, we will add the exact performance deltas (with standard deviations computed over three random seeds) between the augmented-student Attention Transfer runs and the original failing Attention Transfer runs for each of the four families. These values will be incorporated into the text and/or tables of §4.3 and §5.2 to provide a precise quantification of the reversal effect. (revision: yes)

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's central claims rest on direct empirical measurements from independent training runs across 20 teachers in 11 ViT families, with controlled ablations isolating architectural mismatch as the cause of transfer failure. No derivations, equations, or first-principles predictions are present that reduce to self-defined inputs, fitted parameters renamed as outputs, or self-citation chains. All results (negative transfer for 4 families, reversal upon component addition, persistence across scales/datasets) are falsifiable via replication and do not rely on any ansatz or uniqueness theorem imported from prior author work. This is standard self-contained empirical research.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on empirical observations from controlled training experiments rather than new theoretical axioms or invented entities; standard training assumptions such as convergence of SGD are invoked but not load-bearing for the mismatch conclusion.

free parameters (1)
  • transfer loss weighting and training hyperparameters
    Standard choices in knowledge distillation experiments; the paper explicitly tests that the failure is not explained by loss choice.
axioms (1)
  • domain assumption: Attention patterns extracted from a pre-trained teacher can be imposed on a student via a distillation loss
    This is the foundational assumption of the attention transfer method being tested.

pith-pipeline@v0.9.0 · 5593 in / 1258 out tokens · 47845 ms · 2026-05-11T02:41:50.578500+00:00 · methodology


Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 4 internal anchors

  1. [1]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. InICLR, 2021. 1, 3, 6, 9

  2. [2]

    DINOv2: Learning robust visual features without supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. TMLR, 2024. 1, 2, 3, 6, 9, 13

  3. [3]

    DINOv3

    Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104, 2025. 9

  4. [4]

    Learning Transferable Visual Models from Natural Language Supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning Transferable Visual Models from Natural Language Supervision. InICML, pages 8748–8763, 2021. 1, 2, 3, 6, 9, 13

  5. [5]

    Teaching matters: Investigating the role of supervision in vision transformers

    Matthew Walmer, Saksham Suri, Kamal Gupta, and Abhinav Shrivastava. Teaching matters: Investigating the role of supervision in vision transformers. InCVPR, pages 7486–7496, 2023. 1, 9

  6. [6]

    What do self-supervised vision transformers learn?

    Namuk Park, Wonjae Kim, Byeongho Heo, Taekyung Kim, and Sangdoo Yun. What do self-supervised vision transformers learn? In ICLR, 2023.

  7. [7]

    Rosetta neurons: Mining the common units in a model zoo

    Amil Dravid, Yossi Gandelsman, Alexei A Efros, and Assaf Shocher. Rosetta neurons: Mining the common units in a model zoo. InICCV, pages 1934–1943, 2023. 1, 9

  8. [8]

    On the surprising effectiveness of attention transfer for vision transformers

    Alexander C Li, Yuandong Tian, Beidi Chen, Deepak Pathak, and Xinlei Chen. On the surprising effectiveness of attention transfer for vision transformers. InNeurIPS, pages 113963–113990, 2024. 1, 2, 3, 4, 5, 6, 8, 9, 13, 16

  9. [9]

    Training data-efficient image transformers & distillation through attention

    Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. InICML, pages 10347–10357,

  10. [10]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InICCV, pages 9650–9660, 2021. 3, 9, 13

  11. [11]

    An empirical study of training self-supervised vision transformers

    Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In ICCV, pages 9640–9649, 2021. 3, 9, 13

  12. [12]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. InCVPR, pages 16000–16009, 2022. 3, 9, 13

  13. [13]

    ibot: Image bert pre-training with online tokenizer

    Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. ibot: Image bert pre-training with online tokenizer. InICLR, 2022. 3, 9, 13

  14. [14]

    BEiT v2: Masked image modeling with vector-quantized visual tokenizers

    Zhiliang Peng, Li Dong, Hangbo Bao, Qixiang Ye, and Furu Wei. BEiT v2: Masked image modeling with vector-quantized visual tokenizers. arXiv preprint arXiv:2208.06366, 2022. 2, 3, 6, 9, 13

  15. [15]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025. 3, 9, 13

  16. [16]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. InICCV, pages 4015–4026,

  17. [17]

    Vision transformers need registers

    Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. InICLR, 2024. 2, 3, 6, 9, 13, 15

  18. [18]

    Imagenet: A Large-scale Hierarchical Image Database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A Large-scale Hierarchical Image Database. InCVPR, pages 248–255, 2009. 4, 13

  19. [19]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InICLR, 2019. 4, 14

  20. [20]

    The inaturalist species classification and detection dataset

    Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The inaturalist species classification and detection dataset. InCVPR, pages 8769–8778, 2018. 5

  21. [21]

    The many faces of robustness: A critical analysis of out-of-distribution generalization

    Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. InICCV, pages 8340–8349, 2021. 5

  22. [22]

    Natural adversarial examples

    Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. InCVPR, pages 15262–15271, 2021. 5

  23. [23]

    Learning robust global representations by penalizing local predictive power

    Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. Learning robust global representations by penalizing local predictive power. InNeurIPS, 2019. 5

  24. [24]

    Do imagenet classifiers generalize to imagenet?

    Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet classifiers generalize to imagenet? In ICML, pages 5389–5400, 2019. 5

  25. [25]

    Transformer feed-forward layers are key-value memories

    Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. InEMNLP, pages 5484–5495, 2021. 5

  26. [26]

    Going deeper with image transformers

    Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going deeper with image transformers. InICCV, pages 32–42, 2021. 6, 9

  27. [27]

    Three things everyone should know about vision transformers

    Hugo Touvron, Matthieu Cord, Alaaeldin El-Nouby, Jakob Verbeek, and Hervé Jégou. Three things everyone should know about vision transformers. InECCV, pages 497–515, 2022

  28. [28]

    Deit iii: Revenge of the vit

    Hugo Touvron, Matthieu Cord, and Hervé Jégou. Deit iii: Revenge of the vit. InECCV, pages 516–533,

  29. [29]

    Momentum contrast for unsupervised visual representation learning

    Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. InCVPR, pages 9729–9738, 2020. 9

  30. [30]

    Improved Baselines with Momentum Contrastive Learning

    Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning.arXiv preprint arXiv:2003.04297, 2020. 9

  31. [31]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. InICML, pages 4904–4916, 2021. 9

  32. [32]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InICCV, pages 11975–11986, 2023. 9

  33. [33]

    BEiT: BERT pre-training of image transformers

    Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. BEiT: BERT pre-training of image transformers. In ICLR, 2022. 9

  34. [34]

    SimMIM: A simple framework for masked image modeling

    Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Jianmin Bao, Zhuliang Yao, Qi Dai, and Han Hu. SimMIM: A simple framework for masked image modeling. InCVPR, pages 9653–9663, 2022. 9

  35. [35]

    SAM 2: Segment anything in images and videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollar, and Christoph Feichtenhofer. SAM 2: Segment anything in images and videos. In ICLR, 2025. 9

  36. [36]

    SAM 3: Segment anything with concepts

    Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris Coll- Vinent, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Z...

  37. [37]

    Beyond instance consistency: Investigating view diversity in self-supervised learning

    Huaiyuan Qin, Muli Yang, Siyuan Hu, Peng Hu, Yu Zhang, Chen Gong, and Hongyuan Zhu. Beyond instance consistency: Investigating view diversity in self-supervised learning. TMLR, 2025. 9

  38. [38]

    Rotary position embedding for vision transformer

    Byeongho Heo, Song Park, Dongyoon Han, and Sangdoo Yun. Rotary position embedding for vision transformer. InECCV, pages 289–305, 2024. 9

  39. [39]

    Vision transformers don’t need trained registers

    Nick Jiang, Aran Nayebi, Aaditya Bansal, Charlotte Chong, Edward Hu, Carl Vondrick, Yash Goyal, Erik Boehnke, and Kyle Mahowald. Vision transformers don’t need trained registers. In NeurIPS, 2025. 9

  40. [40]

    Efficient streaming language models with attention sinks

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. InICLR, 2024. 9, 15

  41. [41]

    Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer

    Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. InICLR, 2017. 9

  42. [42]

    MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers

    Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. InNeurIPS, pages 5776–5788, 2020. 9

  43. [43]

    Attention distillation: Self-supervised vision transformer students need more guidance

    Kai Wang, Fei Yang, and Joost van de Weijer. Attention distillation: Self-supervised vision transformer students need more guidance. InBMVC, 2022. 9

  44. [44]

    Mimetic initialization of self-attention layers

    Asher Trockman and J. Zico Kolter. Mimetic initialization of self-attention layers. In ICML, pages 34456–34468, 2023. 9

  45. [45]

    ScaleKD: Strong vision transformers could be excellent teachers

    Jiawei Fan, Chao Li, Xiaolong Liu, and Anbang Yao. ScaleKD: Strong vision transformers could be excellent teachers. InNeurIPS, pages 63290–63315, 2024. 9

  46. [46]

    Mimetic initialization of MLPs

    Asher Trockman and J. Zico Kolter. Mimetic initialization of MLPs. arXiv preprint arXiv:2602.07156, 2026.

  47. [47]

    Data efficient any transformer-to-mamba distillation via attention bridge

    Penghao Wang, Yuhao Zhou, Mengxuan Wu, Panpan Zhang, Zhangyang Wang, and Kai Wang. Data efficient any transformer-to-mamba distillation via attention bridge. arXiv preprint arXiv:2510.19266, 2025.

  48. [48]

    From low-rank features to encoding mismatch: Rethinking feature distillation in vision transformers

    Huiyuan Tian, Bonan Xu, Shijian Li, and Xin Jin. From low-rank features to encoding mismatch: Rethinking feature distillation in vision transformers. arXiv preprint arXiv:2511.15572, 2025.

  49. [49]

    Revisiting cross-architecture distillation: Adaptive dual-teacher transfer for lightweight video models

    Ying Peng, Hongsen Ye, Changxin Huang, Xiping Hu, Jian Chen, and Runhao Zeng. Revisiting cross-architecture distillation: Adaptive dual-teacher transfer for lightweight video models. In AAAI, pages 8367–8375, 2026.

  50. [50]

    Perspective-aware teaching: Adapting knowledge for heterogeneous distillation

    Jhe-Hao Lin, Yi Yao, Chan-Feng Hsu, Hong-Xia Xie, Hong-Han Shuai, and Wen-Huang Cheng. Perspective-aware teaching: Adapting knowledge for heterogeneous distillation. InICCV, pages 4178–4187,

  51. [51]

    Initializing models with larger ones

    Zhiqiu Xu, Yanjie Chen, Kirill Vishniakov, Yida Yin, Zhiqiang Shen, Trevor Darrell, Lingjie Liu, and Zhuang Liu. Initializing models with larger ones. In ICLR, 2024. 9