pith. machine review for the scientific record.

arxiv: 2605.07191 · v1 · submitted 2026-05-08 · 💻 cs.CV · cs.LG

Recognition: no theorem link

Attention Transfer Is Not Universally Effective for Vision Transformers

Authors on Pith · no claims yet

Pith reviewed 2026-05-11 02:41 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords attention transfer · vision transformers · ViT · knowledge distillation · architectural mismatch · pre-trained models · attention patterns · model transfer

The pith

Attention transfer in Vision Transformers succeeds only when student architecture matches the teacher.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

A prior result claimed that transferring only attention patterns from a pre-trained teacher Vision Transformer could fully recover the benefits of its weights for any standard student ViT. This paper tests the claim on 20 teachers from 11 ViT families and finds success in only 7 families, with the other 4 failing by as much as 5.1% below even a no-transfer baseline. The failures hold across model sizes, longer training, different datasets, and out-of-distribution tests. Controlled experiments localize the issue to the attention-routing channel and show that adding the teacher's native components to the randomly initialized student reverses every failure, while those same components give no gain in ordinary from-scratch training. The work concludes that attention patterns are sufficient only when the student's architecture matches the teacher's.
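The mechanism under test can be sketched in miniature: attention transfer trains a student so that its row-stochastic attention maps match the teacher's, here via the plain cross-entropy transfer loss the paper uses as its default. This is an illustrative numpy toy with hypothetical shapes, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_map(q, k):
    # q, k: (tokens, dim) -> (tokens, tokens) row-stochastic attention map.
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d), axis=-1)

def attention_distillation_loss(student_attn, teacher_attn, eps=1e-12):
    # Per-row cross-entropy of the student's attention against the teacher's;
    # minimized exactly when the student reproduces the teacher's rows.
    return float(-(teacher_attn * np.log(student_attn + eps)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
teacher = attention_map(rng.normal(size=(4, 8)), rng.normal(size=(4, 8)))
student = attention_map(rng.normal(size=(4, 8)), rng.normal(size=(4, 8)))

# A perfect match attains the teacher's own entropy floor; any mismatch adds KL.
assert attention_distillation_loss(teacher, teacher) <= attention_distillation_loss(student, teacher)
```

The paper's point is precisely that driving this loss to its floor is not enough: the matched patterns can still be non-functional for a mismatched student architecture.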

Core claim

Attention Transfer is not universally effective for Vision Transformers. It recovers full pre-training benefits for some families yet produces consistent failures, up to 5.1% below the from-scratch baseline, for others. These failures are family-consistent, persist under varied training conditions, and localize to the attention-routing channel. The root cause is architectural mismatch: transferred attention patterns remain non-functional for the student unless its structure is augmented with the teacher's native components. Adding those components alone reverses all failures, while they provide no benefit in standard from-scratch training, confirming that attention is sufficient only when the student's architecture matches the teacher's.

What carries the argument

Architectural mismatch between the pre-trained teacher Vision Transformer and the standard student, which renders transferred attention patterns non-functional unless the student's structure incorporates the teacher's specific components.
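Concretely, the "native-arch student" in the paper is the standard student plus only the teacher-native components it was missing, still randomly initialized. As a hedged illustration only: register tokens are one such component for some ViT families (per the registers line of work the paper cites); the exact per-family components are not reproduced here.

```python
import numpy as np

def augment_with_registers(patch_tokens, num_registers, rng):
    # Prepend randomly initialized register tokens to a ViT token sequence.
    # Hypothetical sketch: registers stand in for a "teacher-native component"
    # the standard student lacks. No pre-trained weights are copied.
    dim = patch_tokens.shape[-1]
    registers = rng.normal(scale=0.02, size=(num_registers, dim))
    return np.concatenate([registers, patch_tokens], axis=0)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(196, 384))   # e.g. 14x14 patches at ViT-S width
augmented = augment_with_registers(tokens, num_registers=4, rng=rng)
assert augmented.shape == (200, 384)
```

The paper's control is the interesting part: this kind of augmentation alone gives no from-scratch gain, yet it is what makes the transferred attention usable.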

If this is right

  • Attention transfer cannot be applied indiscriminately across different ViT families without first checking architectural compatibility.
  • Adding only the teacher's mismatched components to the student restores the full benefit of transferred attention patterns.
  • The performance gap from mismatch remains even with extended training, alternate datasets, or out-of-distribution evaluation.
  • Attention patterns alone do not capture the transferable value of ViT pre-training; supporting architectural elements must also align.
  • Future distillation work must treat architecture as a first-class requirement rather than assuming attention is the sole transferable element.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Distillation pipelines for transformers will need explicit architecture-alignment steps to avoid silent performance losses.
  • Similar mismatch effects may appear when distilling between language-model families that differ in attention or feed-forward design.
  • Model repositories could improve transfer success by publishing component-level descriptions that allow students to adopt teacher-specific blocks.
  • The finding invites experiments that optimize the student architecture jointly with the transfer objective rather than fixing it in advance.

Load-bearing premise

That the observed transfer failures are caused by architectural mismatch rather than untested factors such as optimization dynamics or subtle pre-training data differences.

What would settle it

A test in which a student is given exactly the teacher's architecture while every other variable stays fixed and attention transfer still fails to recover full performance.

Figures

Figures reproduced from arXiv: 2605.07191 by Chen Gong, Gabriel James Goenawan, Hongyuan Zhu, Huaiyuan Qin, Muli Yang, Peng Hu, Xi Peng.

Figure 1
Figure 1. Attention Transfer is not universally effective. We begin by evaluating Attention Transfer [8] on the broader benchmark of 20 pre-trained teachers from 11 well-known ViT families under the default ImageNet-1K setup. Each point represents one set of pre-trained weights; the axes report ∆ Top-1 accuracy (%) relative to the No-Transfer baseline under two transfer methods: Attention Copy (x-axis) and Attentio… view at source ↗
Figure 2
Figure 2. Extended training robustness. We extend the Attention Distillation training up to 100 epochs and report the gap relative to No-Transfer at every 10 epochs. The same success/failure split remains consistent with the results in … view at source ↗
Figure 3
Figure 3. Layer-wise decomposition. We transfer attention from the top or bottom k layers of DINOv2-S under both Attention Copy and Attention Distillation. In contrast to the monotonic gains in [8], no layer subset reproduces success-family gains, and harmful transfer appears across every transfer depth for this failure teacher. Component-Level Decomposition. We first ask whether the family-specific split is drive… view at source ↗
Figure 4
Figure 4. Architectural compatibility rescues ineffective Attention Transfer. For teachers from the failure families, we report ∆ accuracy (%) relative to the corresponding No-Transfer baselines for both Attention Copy and Attention Distillation. The native-arch student is augmented by adding only the teacher-native components missing from the standard student in a randomly initialized state, with no pre-trained wei… view at source ↗
Figure 5
Figure 5. Layer-wise teacher-to-student KL divergence. We plot per-layer KL divergence between the pre-trained teacher’s attention maps and the transferred student’s attention maps on ImageNet-1K under Attention Distillation. A clear spike of the standard-arch student at late layers can be observed, while the native-arch student consistently stays near-zero. This shows Attention Transfer can match teacher attention … view at source ↗
Figure 6
Figure 6. Ablation on behaviors of MSE loss vs. default CE loss. We compare behaviors of different transfer losses in Attention Distillation with varying λ. The same family-level trends can be observed under the λ-sweep. For success teachers, both losses yield monotonic gains with increasing λ, while for failure teachers, both losses cause larger drops with increasing λ. The native-architecture student remains consisten… view at source ↗
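The diagnostic behind Figure 5 is easy to reproduce in miniature: one KL divergence value per layer between teacher and student attention maps, where a late-layer spike flags where transferred attention stops matching. A self-contained sketch with toy maps; the shapes and data are illustrative, not the paper's.

```python
import numpy as np

def rowwise_kl(p, q, eps=1e-12):
    # Mean KL(p_row || q_row) over rows of two row-stochastic attention maps.
    p, q = p + eps, q + eps
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean())

def layerwise_attention_kl(teacher_layers, student_layers):
    # One divergence per layer, matching the per-layer curves of Figure 5.
    return [rowwise_kl(t, s) for t, s in zip(teacher_layers, student_layers)]

def random_attention(rng, tokens=8):
    # Toy row-stochastic map via softmax over random logits.
    logits = rng.normal(size=(tokens, tokens))
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
teacher = [random_attention(rng) for _ in range(12)]
student = [random_attention(rng) for _ in range(12)]
kls = layerwise_attention_kl(teacher, student)
assert len(kls) == 12 and all(k >= 0.0 for k in kls)
```

On this reading, the paper's key observation is that a near-zero curve (a good attention match) is compatible with failed transfer, which is what pushes the explanation toward architecture rather than optimization.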
read the original abstract

A recent work shows that Attention Transfer, which transfers only the attention patterns from a pre-trained teacher Vision Transformer (ViT) to a randomly initialized standard student ViT, is sufficient to recover the full benefit of the teacher's pre-trained weights. We revisit this finding on a comprehensive benchmark of 20 teachers from 11 well-known ViT families and reveal that Attention Transfer is not universally effective. While 7 families transfer successfully, 4 consistently fail, falling up to 5.1% below the from-scratch no-transfer baseline. Further results demonstrate that this failure is family-consistent across model sizes, and persists under extended training durations, different transfer datasets, and out-of-distribution evaluations. Controlled analyses then consistently localize the problem to the attention-routing channel, indicating that the key issue is not whether the student can match the teacher's attention patterns, but whether the matched patterns remain functional for the student. Crucially, we identify architectural mismatch between the pre-trained teacher and the standard student as the primary mechanism. By adding only the teacher's native architectural components to the student in a randomly initialized state, we completely reverse the failure for all 4 families. Notably, these components alone do not improve from-scratch training, confirming that they specifically unlock the usability of the teacher's attention. We further systematically show that this failure is not explained by the inadequate choice of transfer loss or by differences in pre-training recipes. Our findings refine the prevailing understanding of attention in ViT representations: attention is sufficient *only* when the student architecture matches the teacher.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper claims that Attention Transfer (AT) from pre-trained ViT teachers to randomly initialized standard student ViTs is not universally effective. Across a benchmark of 20 teachers from 11 ViT families, AT succeeds for 7 families but fails consistently for 4, yielding up to 5.1% worse accuracy than the from-scratch baseline. The failures persist across model scales, training lengths, transfer datasets, and OOD evaluations. Controlled experiments localize the issue to architectural mismatch in the attention-routing channel; augmenting the student with the teacher's native components (in random initialization) fully reverses the negative transfer for all failing families, while those components confer no benefit in from-scratch training. The authors rule out transfer-loss choice and pre-training recipe differences as explanations, concluding that attention patterns are sufficient only when student and teacher architectures match.

Significance. If the central empirical claim holds, the work meaningfully refines the prevailing view of attention transfer in ViTs by demonstrating that matched attention patterns are not automatically functional for the student. The broad benchmark (20 teachers, 11 families), consistency checks across scales/training regimes/datasets/OOD, and the clean reversal experiment when architectural components are restored constitute strong evidence. The finding that the added components improve transfer but not from-scratch training isolates the mechanism and has direct implications for distillation and transfer-learning practice in vision transformers.

major comments (1)
  1. [§4.3, §5.2] §4.3 and §5.2: the reversal experiment (adding teacher's native components to the student) is load-bearing for the architectural-mismatch claim. While the paper shows these components confer no from-scratch gain, it would strengthen the result to report the exact performance delta (with standard deviation over seeds) between the augmented-student AT run and the original failing AT run for each of the 4 families.
minor comments (3)
  1. [Table 2] Table 2: the caption should explicitly state whether the reported accuracies are means over multiple random seeds and, if so, include the number of seeds and standard deviations.
  2. [§3.2] §3.2: the definition of 'standard student ViT' could be made more precise by listing the exact architectural differences (e.g., patch embedding, positional encoding, MLP structure) relative to each teacher family.
  3. [Figure 4] Figure 4: the OOD evaluation plots would benefit from an additional panel or annotation showing the from-scratch baseline for direct visual comparison with the AT curves.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive evaluation and the constructive suggestion regarding the reversal experiment. We agree that providing the precise performance deltas with standard deviations will strengthen the clarity and impact of our architectural-mismatch claim.

read point-by-point responses
  1. Referee: [§4.3, §5.2] §4.3 and §5.2: the reversal experiment (adding teacher's native components to the student) is load-bearing for the architectural-mismatch claim. While the paper shows these components confer no from-scratch gain, it would strengthen the result to report the exact performance delta (with standard deviation over seeds) between the augmented-student AT run and the original failing AT run for each of the 4 families.

    Authors: We thank the referee for this helpful suggestion. In the revised manuscript, we will add the exact performance deltas (with standard deviations computed over three random seeds) between the augmented-student Attention Transfer runs and the original failing Attention Transfer runs for each of the four families. These values will be incorporated into the text and/or tables of §4.3 and §5.2 to provide a precise quantification of the reversal effect. (revision: yes)

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's central claims rest on direct empirical measurements from independent training runs across 20 teachers in 11 ViT families, with controlled ablations isolating architectural mismatch as the cause of transfer failure. No derivations, equations, or first-principles predictions are present that reduce to self-defined inputs, fitted parameters renamed as outputs, or self-citation chains. All results (negative transfer for 4 families, reversal upon component addition, persistence across scales/datasets) are falsifiable via replication and do not rely on any ansatz or uniqueness theorem imported from prior author work. This is standard self-contained empirical research.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on empirical observations from controlled training experiments rather than new theoretical axioms or invented entities; standard training assumptions such as convergence of SGD are invoked but not load-bearing for the mismatch conclusion.

free parameters (1)
  • transfer loss weighting and training hyperparameters
    Standard choices in knowledge distillation experiments; the paper explicitly tests that the failure is not explained by loss choice.
axioms (1)
  • domain assumption: Attention patterns extracted from a pre-trained teacher can be imposed on a student via a distillation loss
    This is the foundational assumption of the attention transfer method being tested.

pith-pipeline@v0.9.0 · 5593 in / 1258 out tokens · 47845 ms · 2026-05-11T02:41:50.578500+00:00 · methodology


Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 4 internal anchors

  1. [1]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. InICLR, 2021. 1, 3, 6, 9

  2. [2]

    DINOv2: Learning robust visual features without supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. TMLR, 2024. 1, 2, 3, 6, 9, 13

  3. [3]

    DINOv3

    Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104, 2025. 9

  4. [4]

    Learning Transferable Visual Models from Natural Language Supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning Transferable Visual Models from Natural Language Supervision. InICML, pages 8748–8763, 2021. 1, 2, 3, 6, 9, 13

  5. [5]

    Teaching matters: Investigating the role of supervision in vision transformers

    Matthew Walmer, Saksham Suri, Kamal Gupta, and Abhinav Shrivastava. Teaching matters: Investigating the role of supervision in vision transformers. InCVPR, pages 7486–7496, 2023. 1, 9

  6. [6]

    What do self-supervised vision transformers learn?

    Namuk Park, Wonjae Kim, Byeongho Heo, Taekyung Kim, and Sangdoo Yun. What do self-supervised vision transformers learn? In ICLR, 2023.

  7. [7]

    Rosetta neurons: Mining the common units in a model zoo

    Amil Dravid, Yossi Gandelsman, Alexei A Efros, and Assaf Shocher. Rosetta neurons: Mining the common units in a model zoo. InICCV, pages 1934–1943, 2023. 1, 9

  8. [8]

    On the surprising effectiveness of attention transfer for vision transformers

    Alexander C Li, Yuandong Tian, Beidi Chen, Deepak Pathak, and Xinlei Chen. On the surprising effectiveness of attention transfer for vision transformers. InNeurIPS, pages 113963–113990, 2024. 1, 2, 3, 4, 5, 6, 8, 9, 13, 16

  9. [9]

    Training data-efficient image transformers & distillation through attention

    Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. InICML, pages 10347–10357,

  10. [10]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InICCV, pages 9650–9660, 2021. 3, 9, 13

  11. [11]

    An empirical study of training self-supervised vision transformers

    Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In ICCV, pages 9640–9649, 2021. 3, 9, 13

  12. [12]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. InCVPR, pages 16000–16009, 2022. 3, 9, 13

  13. [13]

    ibot: Image bert pre-training with online tokenizer

    Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. ibot: Image bert pre-training with online tokenizer. InICLR, 2022. 3, 9, 13

  14. [14]

    BEiT v2: Masked image modeling with vector-quantized visual tokenizers

    Zhiliang Peng, Li Dong, Hangbo Bao, Qixiang Ye, and Furu Wei. BEiT v2: Masked image modeling with vector-quantized visual tokenizers. arXiv preprint arXiv:2208.06366, 2022. 2, 3, 6, 9, 13

  15. [15]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025. 3, 9, 13

  16. [16]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. InICCV, pages 4015–4026,

  17. [17]

    Vision transformers need registers

    Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. InICLR, 2024. 2, 3, 6, 9, 13, 15

  18. [18]

    Imagenet: A Large-scale Hierarchical Image Database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A Large-scale Hierarchical Image Database. InCVPR, pages 248–255, 2009. 4, 13

  19. [19]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InICLR, 2019. 4, 14

  20. [20]

    The inaturalist species classification and detection dataset

    Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The inaturalist species classification and detection dataset. InCVPR, pages 8769–8778, 2018. 5

  21. [21]

    The many faces of robustness: A critical analysis of out-of-distribution generalization

    Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. InICCV, pages 8340–8349, 2021. 5

  22. [22]

    Natural adversarial examples

    Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. InCVPR, pages 15262–15271, 2021. 5

  23. [23]

    Learning robust global representations by penalizing local predictive power

    Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. Learning robust global representations by penalizing local predictive power. InNeurIPS, 2019. 5

  24. [24]

    Do imagenet classifiers generalize to imagenet?

    Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet classifiers generalize to imagenet? In ICML, pages 5389–5400, 2019. 5

  25. [25]

    Transformer feed-forward layers are key-value memories

    Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. InEMNLP, pages 5484–5495, 2021. 5

  26. [26]

    Going deeper with image transformers

    Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going deeper with image transformers. InICCV, pages 32–42, 2021. 6, 9

  27. [27]

    Three things everyone should know about vision transformers

    Hugo Touvron, Matthieu Cord, Alaaeldin El-Nouby, Jakob Verbeek, and Hervé Jégou. Three things everyone should know about vision transformers. InECCV, pages 497–515, 2022

  28. [28]

    Deit iii: Revenge of the vit

    Hugo Touvron, Matthieu Cord, and Hervé Jégou. Deit iii: Revenge of the vit. InECCV, pages 516–533,

  29. [29]

    Momentum contrast for unsupervised visual representation learning

    Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. InCVPR, pages 9729–9738, 2020. 9

  30. [30]

    Improved Baselines with Momentum Contrastive Learning

    Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning.arXiv preprint arXiv:2003.04297, 2020. 9

  31. [31]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. InICML, pages 4904–4916, 2021. 9

  32. [32]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InICCV, pages 11975–11986, 2023. 9

  33. [33]

    BEiT: BERT pre-training of image transformers

    Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. BEiT: BERT pre-training of image transformers. In ICLR, 2022. 9

  34. [34]

    SimMIM: A simple framework for masked image modeling

    Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Jianmin Bao, Zhuliang Yao, Qi Dai, and Han Hu. SimMIM: A simple framework for masked image modeling. InCVPR, pages 9653–9663, 2022. 9

  35. [35]

    SAM 2: Segment anything in images and videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollar, and Christoph Feichtenhofer. SAM 2: Segment anything in images and videos. In ICLR, 2025. 9

  36. [36]

    SAM 3: Segment anything with concepts

    Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris Coll- Vinent, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Z...

  37. [37]

    Beyond instance consistency: Investigating view diversity in self-supervised learning

    Huaiyuan Qin, Muli Yang, Siyuan Hu, Peng Hu, Yu Zhang, Chen Gong, and Hongyuan Zhu. Beyond instance consistency: Investigating view diversity in self-supervised learning. TMLR, 2025. 9

  38. [38]

    Rotary position embedding for vision transformer

    Byeongho Heo, Song Park, Dongyoon Han, and Sangdoo Yun. Rotary position embedding for vision transformer. InECCV, pages 289–305, 2024. 9

  39. [39]

    Vision transformers don’t need trained registers

    Nick Jiang, Aran Nayebi, Aaditya Bansal, Charlotte Chong, Edward Hu, Carl Vondrick, Yash Goyal, Erik Boehnke, and Kyle Mahowald. Vision transformers don’t need trained registers. In NeurIPS, 2025. 9

  40. [40]

    Efficient streaming language models with attention sinks

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. InICLR, 2024. 9, 15

  41. [41]

    Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer

    Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. InICLR, 2017. 9

  42. [42]

    MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers

    Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. InNeurIPS, pages 5776–5788, 2020. 9

  43. [43]

    Attention distillation: Self-supervised vision transformer students need more guidance

    Kai Wang, Fei Yang, and Joost van de Weijer. Attention distillation: Self-supervised vision transformer students need more guidance. InBMVC, 2022. 9

  44. [44]

    Mimetic initialization of self-attention layers

    Asher Trockman and J. Zico Kolter. Mimetic initialization of self-attention layers. In ICML, pages 34456–34468, 2023. 9

  45. [45]

    ScaleKD: Strong vision transformers could be excellent teachers

    Jiawei Fan, Chao Li, Xiaolong Liu, and Anbang Yao. ScaleKD: Strong vision transformers could be excellent teachers. InNeurIPS, pages 63290–63315, 2024. 9

  46. [46]

    Mimetic initialization of MLPs

    Asher Trockman and J. Zico Kolter. Mimetic initialization of MLPs. arXiv preprint arXiv:2602.07156, 2026.

  47. [47]

    Data efficient any transformer-to-mamba distillation via attention bridge

    Penghao Wang, Yuhao Zhou, Mengxuan Wu, Panpan Zhang, Zhangyang Wang, and Kai Wang. Data efficient any transformer-to-mamba distillation via attention bridge. arXiv preprint arXiv:2510.19266, 2025.

  48. [48]

    From low-rank features to encoding mismatch: Rethinking feature distillation in vision transformers

    Huiyuan Tian, Bonan Xu, Shijian Li, and Xin Jin. From low-rank features to encoding mismatch: Rethinking feature distillation in vision transformers. arXiv preprint arXiv:2511.15572, 2025.

  49. [49]

    Revisiting cross-architecture distillation: Adaptive dual-teacher transfer for lightweight video models

    Ying Peng, Hongsen Ye, Changxin Huang, Xiping Hu, Jian Chen, and Runhao Zeng. Revisiting cross-architecture distillation: Adaptive dual-teacher transfer for lightweight video models. In AAAI, pages 8367–8375, 2026.

  50. [50]

    Perspective-aware teaching: Adapting knowledge for heterogeneous distillation

    Jhe-Hao Lin, Yi Yao, Chan-Feng Hsu, Hong-Xia Xie, Hong-Han Shuai, and Wen-Huang Cheng. Perspective-aware teaching: Adapting knowledge for heterogeneous distillation. InICCV, pages 4178–4187,

  51. [51]

    Initializing models with larger ones

    Zhiqiu Xu, Yanjie Chen, Kirill Vishniakov, Yida Yin, Zhiqiang Shen, Trevor Darrell, Lingjie Liu, and Zhuang Liu. Initializing models with larger ones. In ICLR, 2024. 9