Attention Transfer Is Not Universally Effective for Vision Transformers
Pith reviewed 2026-05-11 02:41 UTC · model grok-4.3
The pith
Attention transfer in Vision Transformers succeeds only when the student architecture matches the teacher's.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Attention Transfer is not universally effective for Vision Transformers. It recovers the full pre-training benefit for some families yet fails consistently for others, falling up to 5.1% below the from-scratch baseline. These failures are family-consistent, persist under varied training conditions, and localize to the attention-routing channel. The root cause is architectural mismatch: transferred attention patterns remain non-functional for the student unless its structure is augmented with the teacher's native components. Adding those components alone reverses all failures, while they provide no benefit in standard from-scratch training, confirming that attention is sufficient only when the student architecture matches the teacher.
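The claim does not enumerate the teacher-native components here; register tokens (extra learnable tokens, see Darcet et al. in the reference graph below) are one plausible example of a component a standard student lacks. A minimal sketch of augmenting a student this way, assuming a timm-style ViT that exposes `patch_embed` and `blocks`; the class, defaults, and choice of registers are ours, not the paper's:

```python
import torch
import torch.nn as nn

class StudentWithRegisters(nn.Module):
    """A standard ViT student augmented with extra learnable tokens.

    Registers are a hypothetical example of a teacher-native component;
    the components that actually matter may differ per ViT family.
    """

    def __init__(self, vit, embed_dim=768, num_registers=4):
        super().__init__()
        self.vit = vit  # assumed timm-style: exposes .patch_embed and .blocks
        self.registers = nn.Parameter(torch.zeros(1, num_registers, embed_dim))
        nn.init.trunc_normal_(self.registers, std=0.02)  # randomly initialized, not copied

    def forward(self, x):
        tokens = self.vit.patch_embed(x)            # (B, N, D); cls/pos handling omitted
        regs = self.registers.expand(tokens.shape[0], -1, -1)
        tokens = torch.cat([regs, tokens], dim=1)   # prepend register tokens
        for blk in self.vit.blocks:
            tokens = blk(tokens)
        return tokens[:, self.registers.shape[1]:]  # drop registers at the output
```

Consistent with the paper's control, the added parameters are randomly initialized rather than copied from the teacher; only the student's structure changes.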
What carries the argument
Architectural mismatch between the pre-trained teacher Vision Transformer and the standard student, which renders transferred attention patterns non-functional unless the student's structure incorporates the teacher's specific components.
If this is right
- Attention transfer cannot be applied indiscriminately across different ViT families without first checking architectural compatibility.
- Adding only the teacher's mismatched components to the student restores the full benefit of transferred attention patterns.
- The performance gap from mismatch remains even with extended training, alternate datasets, or out-of-distribution evaluation.
- Attention patterns alone do not capture the transferable value of ViT pre-training; supporting architectural elements must also align.
- Future distillation work must treat architecture as a first-class requirement rather than assuming attention is the sole transferable element.
Where Pith is reading between the lines
- Distillation pipelines for transformers will need explicit architecture-alignment steps to avoid silent performance losses.
- Similar mismatch effects may appear when distilling between language-model families that differ in attention or feed-forward design.
- Model repositories could improve transfer success by publishing component-level descriptions that allow students to adopt teacher-specific blocks.
- The finding invites experiments that optimize the student architecture jointly with the transfer objective rather than fixing it in advance.
Load-bearing premise
That the observed transfer failures are caused by architectural mismatch rather than untested factors such as optimization dynamics or subtle pre-training data differences.
What would settle it
A test in which a student is given exactly the teacher's architecture while every other variable stays fixed and attention transfer still fails to recover full performance.
Original abstract
A recent work shows that Attention Transfer, which transfers only the attention patterns from a pre-trained teacher Vision Transformer (ViT) to a randomly initialized standard student ViT, is sufficient to recover the full benefit of the teacher's pre-trained weights. We revisit this finding on a comprehensive benchmark of 20 teachers from 11 well-known ViT families and reveal that Attention Transfer is not universally effective. While 7 families transfer successfully, 4 consistently fail, falling up to 5.1% below the from-scratch no-transfer baseline. Further results demonstrate that this failure is family-consistent across model sizes, and persists under extended training durations, different transfer datasets, and out-of-distribution evaluations. Controlled analyses then consistently localize the problem to the attention-routing channel, indicating that the key issue is not whether the student can match the teacher's attention patterns, but whether the matched patterns remain functional for the student. Crucially, we identify architectural mismatch between the pre-trained teacher and the standard student as the primary mechanism. By adding only the teacher's native architectural components to the student in a randomly initialized state, we completely reverse the failure for all 4 families. Notably, these components alone do not improve from-scratch training, confirming that they specifically unlock the usability of the teacher's attention. We further systematically show that this failure is not explained by the inadequate choice of transfer loss or by differences in pre-training recipes. Our findings refine the prevailing understanding of attention in ViT representations: attention is sufficient only when the student architecture matches the teacher.
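For readers new to the setup, the transfer operation amounts to a per-layer matching loss on attention maps. A minimal sketch in PyTorch, assuming a KL-type objective over row-stochastic attention maps; the paper's exact loss, layer selection, and weighting may differ:

```python
import torch
import torch.nn.functional as F

def attention_transfer_loss(teacher_attn, student_attn, eps=1e-8):
    """Average per-layer KL(teacher || student) over attention maps.

    teacher_attn, student_attn: lists with one tensor per layer, each of
    shape (batch, heads, tokens, tokens), rows summing to 1. The teacher's
    maps are targets; gradients flow only through the student's maps.
    """
    total = 0.0
    for t, s in zip(teacher_attn, student_attn):
        # kl_div expects log-probabilities as input and probabilities as target;
        # "batchmean" sums over heads/tokens and averages over the batch.
        total = total + F.kl_div((s + eps).log(), t.detach(), reduction="batchmean")
    return total / len(teacher_attn)

# Hypothetical usage inside a training step:
#   loss = task_loss + lam * attention_transfer_loss(t_maps, s_maps)
# where lam is the transfer-loss weight listed in the ledger below.
```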
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that Attention Transfer (AT) from pre-trained ViT teachers to randomly initialized standard student ViTs is not universally effective. Across a benchmark of 20 teachers from 11 ViT families, AT succeeds for 7 families but fails consistently for 4, yielding up to 5.1% worse accuracy than the from-scratch baseline. The failures persist across model scales, training lengths, transfer datasets, and OOD evaluations. Controlled experiments localize the issue to architectural mismatch in the attention-routing channel; augmenting the student with the teacher's native components (in random initialization) fully reverses the negative transfer for all failing families, while those components confer no benefit in from-scratch training. The authors rule out transfer-loss choice and pre-training recipe differences as explanations, concluding that attention patterns are sufficient only when student and teacher architectures match.
Significance. If the central empirical claim holds, the work meaningfully refines the prevailing view of attention transfer in ViTs by demonstrating that matched attention patterns are not automatically functional for the student. The broad benchmark (20 teachers, 11 families), consistency checks across scales/training regimes/datasets/OOD, and the clean reversal experiment when architectural components are restored constitute strong evidence. The finding that the added components improve transfer but not from-scratch training isolates the mechanism and has direct implications for distillation and transfer-learning practice in vision transformers.
major comments (1)
- [§4.3, §5.2] The reversal experiment (adding the teacher's native components to the student) is load-bearing for the architectural-mismatch claim. While the paper shows these components confer no from-scratch gain, the result would be strengthened by reporting the exact performance delta (with standard deviation over seeds) between the augmented-student AT run and the original failing AT run for each of the 4 families.
minor comments (3)
- [Table 2] The caption should explicitly state whether the reported accuracies are means over multiple random seeds and, if so, give the number of seeds and standard deviations.
- [§3.2] The definition of 'standard student ViT' could be made more precise by listing the exact architectural differences (e.g., patch embedding, positional encoding, MLP structure) relative to each teacher family.
- [Figure 4] The OOD evaluation plots would benefit from an additional panel or annotation showing the from-scratch baseline for direct visual comparison with the AT curves.
Simulated Author's Rebuttal
We thank the referee for the positive evaluation and the constructive suggestion regarding the reversal experiment. We agree that providing the precise performance deltas with standard deviations will strengthen the clarity and impact of our architectural-mismatch claim.
Point-by-point responses
- Referee: [§4.3, §5.2] The reversal experiment (adding the teacher's native components to the student) is load-bearing for the architectural-mismatch claim. While the paper shows these components confer no from-scratch gain, the result would be strengthened by reporting the exact performance delta (with standard deviation over seeds) between the augmented-student AT run and the original failing AT run for each of the 4 families.
Authors: We thank the referee for this helpful suggestion. In the revised manuscript, we will add the exact performance deltas (with standard deviations computed over three random seeds) between the augmented-student Attention Transfer runs and the original failing Attention Transfer runs for each of the four families. These values will be incorporated into the text and tables of §4.3 and §5.2 to quantify the reversal effect precisely.
Revision: yes
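For concreteness, the statistic being promised is a paired per-family delta over seeds; a minimal sketch (function name and shapes are ours, not the authors'):

```python
import numpy as np

def reversal_delta(aug_acc, fail_acc):
    """Mean and seed-level standard deviation of the paired delta between
    augmented-student AT accuracies and failing AT accuracies.

    aug_acc, fail_acc: accuracy arrays of shape (n_seeds,) for one ViT family.
    """
    delta = np.asarray(aug_acc) - np.asarray(fail_acc)
    return float(delta.mean()), float(delta.std(ddof=1))
```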
Circularity Check
No significant circularity
Full rationale
The paper's central claims rest on direct empirical measurements from independent training runs across 20 teachers in 11 ViT families, with controlled ablations isolating architectural mismatch as the cause of transfer failure. No derivations, equations, or first-principles predictions are present that reduce to self-defined inputs, fitted parameters renamed as outputs, or self-citation chains. All results (negative transfer for 4 families, reversal upon component addition, persistence across scales/datasets) are falsifiable via replication and do not rely on any ansatz or uniqueness theorem imported from prior author work. This is standard self-contained empirical research.
Axiom & Free-Parameter Ledger
free parameters (1)
- Transfer loss weighting and training hyperparameters
axioms (1)
- Domain assumption: attention patterns extracted from a pre-trained teacher can be imposed on a student via a distillation loss.
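Read together, the ledger corresponds to a composite training objective of roughly the following form (notation ours, not the paper's): the student minimizes its task loss plus a weighted attention-matching term,

$$
\mathcal{L}_{\text{total}} \;=\; \mathcal{L}_{\text{task}} \;+\; \lambda \,\frac{1}{L}\sum_{l=1}^{L} \mathrm{KL}\!\left(A^{(l)}_{\text{teacher}} \,\Big\|\, A^{(l)}_{\text{student}}\right),
$$

where $A^{(l)}$ denotes a layer's attention maps and $\lambda$ is the transfer-loss weighting recorded above as a free parameter.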
Reference graph
Works this paper leans on
- [1] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In ICLR, 2021.
- [2] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. TMLR, 2024.
- [3] Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104, 2025.
- [4] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning Transferable Visual Models from Natural Language Supervision. In ICML, pages 8748–8763, 2021.
- [5] Matthew Walmer, Saksham Suri, Kamal Gupta, and Abhinav Shrivastava. Teaching matters: Investigating the role of supervision in vision transformers. In CVPR, pages 7486–7496, 2023.
- [6] Namuk Park, Wonjae Kim, Byeongho Heo, Taekyung Kim, and Sangdoo Yun. What do self-supervised vision transformers learn? In ICLR, 2023.
- [7] Amil Dravid, Yossi Gandelsman, Alexei A. Efros, and Assaf Shocher. Rosetta neurons: Mining the common units in a model zoo. In ICCV, pages 1934–1943, 2023.
- [8] Alexander C. Li, Yuandong Tian, Beidi Chen, Deepak Pathak, and Xinlei Chen. On the surprising effectiveness of attention transfer for vision transformers. In NeurIPS, pages 113963–113990, 2024.
- [9] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In ICML, pages 10347–10357, 2021.
- [10] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, pages 9650–9660, 2021.
- [11] Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In ICCV, pages 9640–9649, 2021.
- [12] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, pages 16000–16009, 2022.
- [13] Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. iBOT: Image BERT pre-training with online tokenizer. In ICLR, 2022.
- [14] Zhiliang Peng, Li Dong, Hangbo Bao, Qixiang Ye, and Furu Wei. BEiT v2: Masked image modeling with vector-quantized visual tokenizers. arXiv preprint arXiv:2208.06366, 2022.
- [15] Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025.
- [16] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, et al. Segment anything. In ICCV, pages 4015–4026, 2023.
- [17] Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. In ICLR, 2024.
- [18] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A Large-scale Hierarchical Image Database. In CVPR, pages 248–255, 2009.
- [19] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.
- [20] Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The iNaturalist species classification and detection dataset. In CVPR, pages 8769–8778, 2018.
- [21] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In ICCV, pages 8340–8349, 2021.
- [22] Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In CVPR, pages 15262–15271, 2021.
- [23] Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P. Xing. Learning robust global representations by penalizing local predictive power. In NeurIPS, 2019.
- [24] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do ImageNet classifiers generalize to ImageNet? In ICML, pages 5389–5400, 2019.
- [25] Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. In EMNLP, pages 5484–5495, 2021.
- [26] Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going deeper with image transformers. In ICCV, pages 32–42, 2021.
- [27] Hugo Touvron, Matthieu Cord, Alaaeldin El-Nouby, Jakob Verbeek, and Hervé Jégou. Three things everyone should know about vision transformers. In ECCV, pages 497–515, 2022.
- [28] Hugo Touvron, Matthieu Cord, and Hervé Jégou. DeiT III: Revenge of the ViT. In ECCV, pages 516–533, 2022.
- [29] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, pages 9729–9738, 2020.
- [30] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020.
- [31] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, pages 4904–4916, 2021.
- [32] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In ICCV, pages 11975–11986, 2023.
- [33] Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. BEiT: BERT pre-training of image transformers. In ICLR, 2022.
- [34] Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Jianmin Bao, Zhuliang Yao, Qi Dai, and Han Hu. SimMIM: A simple framework for masked image modeling. In CVPR, pages 9653–9663, 2022.
- [35] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollar, and Christoph Feichtenhofer. SAM 2: Segment anything in images and videos. In ICLR, 2025.
- [36] Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris Coll-Vinent, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, et al. SAM 3: Segment anything with concepts, 2026.
- [37] Huaiyuan Qin, Muli Yang, Siyuan Hu, Peng Hu, Yu Zhang, Chen Gong, and Hongyuan Zhu. Beyond instance consistency: Investigating view diversity in self-supervised learning. TMLR, 2025.
- [38] Byeongho Heo, Song Park, Dongyoon Han, and Sangdoo Yun. Rotary position embedding for vision transformer. In ECCV, pages 289–305, 2024.
- [39] Nick Jiang, Aran Nayebi, Aaditya Bansal, Charlotte Chong, Edward Hu, Carl Vondrick, Kyle Mahowald, Yash Goyal, and Erik Boehnke. Vision transformers don't need trained registers. In NeurIPS, 2025.
- [40] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In ICLR, 2024.
- [41] Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In ICLR, 2017.
- [42] Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. In NeurIPS, pages 5776–5788, 2020.
- [43] Kai Wang, Fei Yang, and Joost van de Weijer. Attention distillation: Self-supervised vision transformer students need more guidance. In BMVC, 2022.
- [44] Asher Trockman and J. Zico Kolter. Mimetic initialization of self-attention layers. In ICML, pages 34456–34468, 2023.
- [45] Jiawei Fan, Chao Li, Xiaolong Liu, and Anbang Yao. ScaleKD: Strong vision transformers could be excellent teachers. In NeurIPS, pages 63290–63315, 2024.
- [46] Asher Trockman and J. Zico Kolter. Mimetic initialization of MLPs. arXiv preprint arXiv:2602.07156, 2026.
- [47] Penghao Wang, Yuhao Zhou, Mengxuan Wu, Panpan Zhang, Zhangyang Wang, and Kai Wang. Data efficient any transformer-to-Mamba distillation via attention bridge. arXiv preprint arXiv:2510.19266, 2025.
- [48] Huiyuan Tian, Bonan Xu, Shijian Li, and Xin Jin. From low-rank features to encoding mismatch: Rethinking feature distillation in vision transformers. arXiv preprint arXiv:2511.15572, 2025.
- [49] Ying Peng, Hongsen Ye, Changxin Huang, Xiping Hu, Jian Chen, and Runhao Zeng. Revisiting cross-architecture distillation: Adaptive dual-teacher transfer for lightweight video models. In AAAI, pages 8367–8375, 2026.
- [50] Jhe-Hao Lin, Yi Yao, Chan-Feng Hsu, Hong-Xia Xie, Hong-Han Shuai, and Wen-Huang Cheng. Perspective-aware teaching: Adapting knowledge for heterogeneous distillation. In ICCV, pages 4178–4187.
- [51] Zhiqiu Xu, Yanjie Chen, Kirill Vishniakov, Yida Yin, Zhiqiang Shen, Trevor Darrell, Lingjie Liu, and Zhuang Liu. Initializing models with larger ones. In ICLR, 2024.