pith. sign in

arxiv: 2605.28018 · v1 · pith:OQVNPCXZnew · submitted 2026-05-27 · 💻 cs.CV

Dual-branch Distilled Transformer for Efficient Asymmetric UAV Tracking

Pith reviewed 2026-06-29 13:14 UTC · model grok-4.3

classification 💻 cs.CV
keywords UAV trackingknowledge distillationtransformerasymmetric trackingreal-time trackingfeature distillationtarget localizationdual-branch
0
0 comments X

The pith

A teacher-guided dual-branch distillation strategy enables a lightweight student transformer to achieve accurate UAV tracking at real-time speeds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to solve the trade-off in UAV tracking where simplifying the model backbone for speed weakens feature representation in complex scenes. It proposes EATrack, which uses a dual-branch distillation from a teacher model: one branch focuses on spatially guided feature learning to strengthen target representations, and the other on prediction-level transfer for better localization. Additionally, a fine-grained target-aware strategy and a temporal adaptation module help with appearance changes and long-term robustness. This matters for practical UAV applications that require both precision and efficiency on limited hardware.

Core claim

EATrack centers on a teacher-guided dual-branch distillation strategy that transfers strong target representations and accurate localization capabilities from a heavy teacher model to a lightweight student backbone, supplemented by target-aware distillation and temporal adaptation, resulting in a favorable accuracy-speed balance on five UAV benchmarks.

What carries the argument

Teacher-guided dual-branch distillation strategy that performs spatially focused feature-level distillation and prediction-level distillation to compensate for the student's simplified representations.

If this is right

  • The lightweight student model gains enhanced feature expressiveness for complex scenarios.
  • Prediction-level distillation improves spatial localization accuracy.
  • The fine-grained target-aware distillation increases robustness to appearance variations.
  • The temporal adaptation module boosts performance over time during inference.
  • EATrack demonstrates a good accuracy-speed trade-off across multiple UAV tracking benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar distillation approaches could be applied to other real-time vision tasks like object detection on edge devices.
  • The method might reduce the need for powerful onboard computers in drones, enabling longer flight times.
  • Testing on additional benchmarks with extreme conditions could reveal limits of the compensation strategy.

Load-bearing premise

That the spatially focused feature-level and prediction-level distillations can transfer enough of the teacher's target modeling ability to overcome the feature weakening from the simplified student backbone.

What would settle it

Running EATrack on the five UAV benchmarks and finding that its accuracy falls significantly below the teacher's or other state-of-the-art methods while only marginally improving speed, or that ablating the distillation branches causes large performance drops.

Figures

Figures reproduced from arXiv: 2605.28018 by Bineng Zhong, Hongtao Yang, Qihua Liang, Shuxiang Song, Xiantao Hu, Yaozong Zheng, Yuanliang Xue.

Figure 1
Figure 1. Figure 1: Impact of teacher tracker distillation. (a) Tracking re [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Framework of the proposed EATrack. The left part shows our novel teacher-guided training strategy, where an [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: AUC scores of different attributes on UAV123. [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative comparisons of our tracker against other four [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Comparison of Attention Maps from the Dual-Branch [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
read the original abstract

Given the real-time demands of UAV tracking, many methods simplify the backbone to reduce computation, but this often weakens feature representation and degrades performance in complex scenarios. To alleviate this issue, we propose EATrack, an efficient and asymmetric UAV tracking framework centered around a teacher-guided dual-branch distillation strategy that enhances the feature expressiveness of the lightweight student model. Specifically, EATrack investigates two complementary perspectives of knowledge transfer: spatially focused feature-level distillation that compensates for weakened representations by guiding the student to learn strong target representations, and prediction-level distillation that enhances spatial localization by learning the teacher's capability for accurate target localization. Furthermore, to enhance robustness against appearance variations, we introduce a fine-grained target-aware distillation strategy that selectively transfers the teacher's target modeling capacity to the student. A temporal adaptation module is incorporated at inference to enhance robustness over time. Experiments on five UAV benchmarks demonstrate that EATrack achieves a favorable balance between accuracy and speed. Code: https://github.com/GXNU-ZhongLab/EATrack

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript presents EATrack, an efficient asymmetric UAV tracking framework based on a teacher-guided dual-branch distillation strategy. The approach aims to enhance the feature expressiveness of a lightweight student model through spatially focused feature-level distillation, prediction-level distillation, and a fine-grained target-aware distillation strategy, along with a temporal adaptation module. It claims to achieve a favorable balance between accuracy and speed on five UAV benchmarks.

Significance. If the results hold, this could advance practical real-time UAV tracking by mitigating performance loss from backbone simplification via targeted distillation. The linked code repository supports reproducibility.

major comments (2)
  1. Abstract: The abstract states experimental results on five benchmarks but provides no details on implementation, baselines, error bars, or ablation studies, preventing verification of the central claim that EATrack achieves a favorable accuracy-speed balance.
  2. Abstract: The load-bearing assumption that spatially focused feature-level distillation, prediction-level distillation, and the fine-grained target-aware strategy together transfer the teacher's target modeling capacity to compensate for the deliberately weakened student backbone lacks any quantitative isolation of the transfer effect (e.g., via ablation comparing student with/without distillation components).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their comments on the abstract. We address each point below, noting that the full manuscript contains the requested details in the experimental sections.

read point-by-point responses
  1. Referee: Abstract: The abstract states experimental results on five benchmarks but provides no details on implementation, baselines, error bars, or ablation studies, preventing verification of the central claim that EATrack achieves a favorable accuracy-speed balance.

    Authors: The abstract provides a concise summary of the method and overall results, consistent with standard practice for conference papers where space is limited. Implementation details, baseline comparisons, error bars where applicable, and ablation studies are fully reported in Sections 4 and 5 of the manuscript, including results across the five UAV benchmarks and component-wise analysis. This structure allows verification of the accuracy-speed claims without overloading the abstract. revision: no

  2. Referee: Abstract: The load-bearing assumption that spatially focused feature-level distillation, prediction-level distillation, and the fine-grained target-aware strategy together transfer the teacher's target modeling capacity to compensate for the deliberately weakened student backbone lacks any quantitative isolation of the transfer effect (e.g., via ablation comparing student with/without distillation components).

    Authors: The manuscript includes quantitative ablations in Section 5.3 that isolate the contribution of each distillation component. These experiments compare the student model with and without the spatially focused feature-level distillation, prediction-level distillation, and fine-grained target-aware distillation, demonstrating incremental gains in tracking accuracy that support the knowledge transfer claim. The abstract summarizes the approach while the main text provides the supporting evidence. revision: no

Circularity Check

0 steps flagged

No circularity: method is an independent empirical framework with no self-referential derivations

full rationale

The paper presents EATrack as a proposed architecture combining dual-branch distillation, spatially focused feature-level and prediction-level transfer, a fine-grained target-aware strategy, and a temporal adaptation module. No equations, parameter fits, or derivation chains are described in the provided text that reduce a claimed prediction or result back to the inputs by construction. The central claims rest on experimental validation across five UAV benchmarks rather than on any self-definitional mapping, fitted-input renaming, or load-bearing self-citation chain. The distillation strategies are presented as design choices whose effectiveness is tested externally, not derived tautologically from the student backbone simplification itself. This is the standard non-circular case for an applied tracking paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no information on free parameters, axioms, or invented entities; ledger is empty by necessity.

pith-pipeline@v0.9.1-grok · 5719 in / 985 out tokens · 36742 ms · 2026-06-29T13:14:52.691069+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

56 extracted references · 5 canonical work pages · 4 internal anchors

  1. [1]

    Fully-convolutional siamese networks for object tracking

    Luca Bertinetto, Jack Valmadre, Joao F Henriques, Andrea Vedaldi, and Philip HS Torr. Fully-convolutional siamese networks for object tracking. InComputer vision–ECCV 2016 workshops: Amsterdam, the Netherlands, October 8- 10 and 15-16, 2016, proceedings, part II 14, pages 850–865. Springer, 2016. 2

  2. [2]

    Hift: Hierarchical feature transformer for aerial tracking

    Ziang Cao, Changhong Fu, Junjie Ye, Bowen Li, and Yiming Li. Hift: Hierarchical feature transformer for aerial tracking. InProceedings of the IEEE/CVF international conference on computer vision, pages 15457–15466, 2021. 6

  3. [3]

    Tctrack: Temporal contexts for aerial tracking

    Ziang Cao, Ziyuan Huang, Liang Pan, Shiwei Zhang, Zi- wei Liu, and Changhong Fu. Tctrack: Temporal contexts for aerial tracking. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14798– 14808, 2022. 6

  4. [4]

    Towards real-world visual tracking with temporal contexts.IEEE Transactions on Pattern Anal- ysis and Machine Intelligence, 45(12):15834–15849, 2023

    Ziang Cao, Ziyuan Huang, Liang Pan, Shiwei Zhang, Ziwei Liu, and Changhong Fu. Towards real-world visual tracking with temporal contexts.IEEE Transactions on Pattern Anal- ysis and Machine Intelligence, 45(12):15834–15849, 2023. 2, 6

  5. [5]

    Transformer tracking

    Xin Chen, Bin Yan, Jiawen Zhu, Dong Wang, Xiaoyun Yang, and Huchuan Lu. Transformer tracking. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8126–8135, 2021. 5

  6. [6]

    Mixformer: End-to-end tracking with iterative mixed atten- tion

    Yutao Cui, Cheng Jiang, Limin Wang, and Gangshan Wu. Mixformer: End-to-end tracking with iterative mixed atten- tion. InProceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, pages 13608–13618,

  7. [7]

    Learning spatially regularized correlation filters for visual tracking

    Martin Danelljan, Gustav Hager, Fahad Shahbaz Khan, and Michael Felsberg. Learning spatially regularized correlation filters for visual tracking. InProceedings of the IEEE inter- national conference on computer vision, pages 4310–4318,

  8. [8]

    Discriminative scale space tracking.IEEE transactions on pattern analysis and machine intelligence, 39(8):1561–1575, 2016

    Martin Danelljan, Gustav H ¨ager, Fahad Shahbaz Khan, and Michael Felsberg. Discriminative scale space tracking.IEEE transactions on pattern analysis and machine intelligence, 39(8):1561–1575, 2016. 2

  9. [9]

    Discriminative scale space tracking.IEEE transactions on pattern analysis and machine intelligence, 39(8):1561–1575, 2016

    Martin Danelljan, Gustav H ¨ager, Fahad Shahbaz Khan, and Michael Felsberg. Discriminative scale space tracking.IEEE transactions on pattern analysis and machine intelligence, 39(8):1561–1575, 2016. 6

  10. [10]

    Eco: Efficient convolution operators for tracking

    Martin Danelljan, Goutam Bhat, Fahad Shahbaz Khan, and Michael Felsberg. Eco: Efficient convolution operators for tracking. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 6638–6646,

  11. [11]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, et al. An image is worth 16x16 words: Trans- formers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020. 5

  12. [12]

    The unmanned aerial vehicle benchmark: Object detection and tracking

    Dawei Du, Yuankai Qi, Hongyang Yu, Yifan Yang, Kaiwen Duan, Guorong Li, Weigang Zhang, Qingming Huang, and Qi Tian. The unmanned aerial vehicle benchmark: Object detection and tracking. InProceedings of the European con- ference on computer vision (ECCV), pages 370–386, 2018. 5

  13. [13]

    Lasot: A high-quality benchmark for large-scale single ob- ject tracking

    Heng Fan, Liting Lin, Fan Yang, Peng Chu, Ge Deng, Sijia Yu, Hexin Bai, Yong Xu, Chunyuan Liao, and Haibin Ling. Lasot: A high-quality benchmark for large-scale single ob- ject tracking. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5374–5383,

  14. [14]

    Siamese anchor proposal network for high-speed aerial tracking

    Changhong Fu, Ziang Cao, Yiming Li, Junjie Ye, and Chen Feng. Siamese anchor proposal network for high-speed aerial tracking. In2021 IEEE international conference on robotics and automation (ICRA), pages 510–516. IEEE,

  15. [15]

    Progressive representation learning for real-time uav tracking

    Changhong Fu, Xiang Lei, Haobo Zuo, Liangliang Yao, Guangze Zheng, and Jia Pan. Progressive representation learning for real-time uav tracking. In2024 IEEE/RSJ In- ternational Conference on Intelligent Robots and Systems (IROS), pages 5072–5079. IEEE, 2024. 6

  16. [16]

    Unmanned aerial vehicles (uavs) and artificial intelligence revolution- izing wildlife monitoring and conservation.Sensors, 16(1): 97, 2016

    Luis F Gonzalez, Glen A Montes, Eduard Puig, Sandra John- son, Kerrie Mengersen, and Kevin J Gaston. Unmanned aerial vehicles (uavs) and artificial intelligence revolution- izing wildlife monitoring and conservation.Sensors, 16(1): 97, 2016. 1

  17. [17]

    Henriques, Rui Caseiro, Pedro Martins, and Jorge Batista

    Jo ˜ao F. Henriques, Rui Caseiro, Pedro Martins, and Jorge Batista. High-speed tracking with kernelized correlation fil- ters.IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3):583–596, 2015. 2, 6

  18. [18]

    A comprehensive overhaul of feature distillation

    Byeongho Heo, Jeesoo Kim, Sangdoo Yun, Hyojin Park, No- jun Kwak, and Jin Young Choi. A comprehensive overhaul of feature distillation. InProceedings of the IEEE/CVF inter- national conference on computer vision, pages 1921–1930,

  19. [19]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distill- ing the knowledge in a neural network.arXiv preprint arXiv:1503.02531, 2015. 3

  20. [20]

    Transformer track- ing via frequency fusion.IEEE Transactions on Circuits and Systems for Video Technology, 34(2):1020–1031, 2023

    Xiantao Hu, Bineng Zhong, Qihua Liang, Shengping Zhang, Ning Li, Xianxian Li, and Rongrong Ji. Transformer track- ing via frequency fusion.IEEE Transactions on Circuits and Systems for Video Technology, 34(2):1020–1031, 2023. 2

  21. [21]

    Toward modalities correlation for rgb-t tracking.IEEE Transactions on Circuits and Systems for Video Technology, 34(10):9102–9111, 2024

    Xiantao Hu, Bineng Zhong, Qihua Liang, Shengping Zhang, Ning Li, and Xianxian Li. Toward modalities correlation for rgb-t tracking.IEEE Transactions on Circuits and Systems for Video Technology, 34(10):9102–9111, 2024. 2

  22. [22]

    Exploiting multimodal spatial-temporal patterns for video object tracking

    Xiantao Hu, Ying Tai, Xu Zhao, Chen Zhao, Zhenyu Zhang, Jun Li, Bineng Zhong, and Jian Yang. Exploiting multimodal spatial-temporal patterns for video object tracking. InPro- ceedings of the AAAI Conference on Artificial Intelligence, pages 3581–3589, 2025. 3

  23. [23]

    Adaptive perception for unified visual multi-modal object tracking.IEEE Trans- actions on Artificial Intelligence, 2025

    Xiantao Hu, Bineng Zhong, Qihua Liang, Liangtao Shi, Zhiyi Mo, Ying Tai, and Jian Yang. Adaptive perception for unified visual multi-modal object tracking.IEEE Trans- actions on Artificial Intelligence, 2025

  24. [24]

    Cur- riculum adaptation for one-stream rgb–t tracking.Pattern Recognition, page 113494, 2026

    Xiantao Hu, Fansheng Zeng, Bineng Zhong, Zhangyong Tang, Wenxuan Fang, Jun Li, Ying Tai, and Jian Yang. Cur- riculum adaptation for one-stream rgb–t tracking.Pattern Recognition, page 113494, 2026. 3

  25. [25]

    Got-10k: A large high-diversity benchmark for generic object tracking in the wild.IEEE transactions on pattern analysis and machine intelligence, 43(5):1562–1577, 2019

    Lianghua Huang, Xin Zhao, and Kaiqi Huang. Got-10k: A large high-diversity benchmark for generic object tracking in the wild.IEEE transactions on pattern analysis and machine intelligence, 43(5):1562–1577, 2019. 5

  26. [26]

    Learning aberrance repressed correlation filters for real-time uav tracking

    Ziyuan Huang, Changhong Fu, Yiming Li, Fuling Lin, and Peng Lu. Learning aberrance repressed correlation filters for real-time uav tracking. InProceedings of the IEEE/CVF international conference on computer vision, pages 2891– 2900, 2019. 6

  27. [27]

    Uav traffic patrolling via road detection and tracking in anony- mous aerial video frames.Journal of Intelligent & Robotic Systems, 95:675–690, 2019

    M ¨ucahit Karaduman, Ahmet Cınar, and Haluk Eren. Uav traffic patrolling via road detection and tracking in anony- mous aerial video frames.Journal of Intelligent & Robotic Systems, 95:675–690, 2019. 1

  28. [28]

    Learning spatial-temporal regularized cor- relation filters for visual tracking

    Feng Li, Cheng Tian, Wangmeng Zuo, Lei Zhang, and Ming- Hsuan Yang. Learning spatial-temporal regularized cor- relation filters for visual tracking. InProceedings of the IEEE conference on computer vision and pattern recogni- tion, pages 4904–4913, 2018. 6

  29. [29]

    Cadtrack: Learning contextual aggregation with deformable alignment for robust rgbt tracking

    Hao Li, Yuhao Wang, Xiantao Hu, Wenning Hao, Pingping Zhang, Dong Wang, and Huchuan Lu. Cadtrack: Learning contextual aggregation with deformable alignment for robust rgbt tracking. InProceedings of the AAAI Conference on Artificial Intelligence, pages 6109–6117, 2026. 3

  30. [30]

    Visual object tracking for un- manned aerial vehicles: A benchmark and new motion mod- els

    Siyi Li and Dit-Yan Yeung. Visual object tracking for un- manned aerial vehicles: A benchmark and new motion mod- els. InProceedings of the AAAI conference on artificial in- telligence, 2017. 5

  31. [31]

    Learning residue-aware correlation filters and refining scale for real-time uav tracking.Pattern Recognition, 127:108614,

    Shuiwang Li, Yuting Liu, Qijun Zhao, and Ziliang Feng. Learning residue-aware correlation filters and refining scale for real-time uav tracking.Pattern Recognition, 127:108614,

  32. [32]

    Adaptive and background-aware vision transformer for real-time uav tracking

    Shuiwang Li, Yangxiang Yang, Dan Zeng, and Xucheng Wang. Adaptive and background-aware vision transformer for real-time uav tracking. InProceedings of the IEEE/CVF international conference on computer vision, pages 13989– 14000, 2023. 2, 3

  33. [34]

    Autotrack: Towards high-performance visual tracking for uav with automatic spatio-temporal regulariza- tion

    Yiming Li, Changhong Fu, Fangqiang Ding, Ziyuan Huang, and Geng Lu. Autotrack: Towards high-performance visual tracking for uav with automatic spatio-temporal regulariza- tion. InProceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, pages 11923–11932,

  34. [35]

    Learning adaptive and view-invariant vision transformer for real-time uav tracking

    Yongxin Li, Mengyuan Liu, You Wu, Xucheng Wang, Xi- angyang Yang, and Shuiwang Li. Learning adaptive and view-invariant vision transformer for real-time uav tracking. InForty-first International Conference on Machine Learn- ing, 2024. 2, 3, 6

  35. [36]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014. 5

  36. [37]

    Focal loss for dense object detection

    Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Doll´ar. Focal loss for dense object detection. InPro- ceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017. 5

  37. [38]

    Towards the develop- ment of a gas sensor system for monitoring pollutant gases in the low troposphere using small unmanned aerial vehicles

    Jairo Malaver Rojas, Felipe Gonzalez, Nunzio Motta, Alessandro Depari, and Peter Corke. Towards the develop- ment of a gas sensor system for monitoring pollutant gases in the low troposphere using small unmanned aerial vehicles. In 2012 Workshop on Robotics for Environmental Monitoring, pages 1–3, 2012. 1

  38. [39]

    A benchmark and simulator for uav tracking

    Matthias Mueller, Neil Smith, and Bernard Ghanem. A benchmark and simulator for uav tracking. InEuropean con- ference on computer vision, pages 445–461. Springer, 2016. 5, 8

  39. [40]

    Trackingnet: A large-scale dataset and benchmark for object tracking in the wild

    Matthias Muller, Adel Bibi, Silvio Giancola, Salman Al- subaihi, and Bernard Ghanem. Trackingnet: A large-scale dataset and benchmark for object tracking in the wild. In Proceedings of the European conference on computer vision (ECCV), pages 300–317, 2018. 5

  40. [41]

    Generalized in- tersection over union: A metric and a loss for bounding box regression

    Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized in- tersection over union: A metric and a loss for bounding box regression. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 658–666,

  41. [42]

    FitNets: Hints for Thin Deep Nets

    Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fit- nets: Hints for thin deep nets. arxiv 2014.arXiv preprint arXiv:1412.6550, 2014. 3

  42. [43]

    Mamba adapter: Effi- cient multi-modal fusion for vision-language tracking.IEEE Transactions on Circuits and Systems for Video Technology,

    Liangtao Shi, Bineng Zhong, Qihua Liang, Xiantao Hu, Zhiyi Mo, and Shuxiang Song. Mamba adapter: Effi- cient multi-modal fusion for vision-language tracking.IEEE Transactions on Circuits and Systems for Video Technology,

  43. [44]

    Training data-efficient image transformers & distillation through at- tention

    Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herv ´e J´egou. Training data-efficient image transformers & distillation through at- tention. InInternational conference on machine learning, pages 10347–10357. PMLR, 2021. 5

  44. [45]

    Multi-cue correlation filters for ro- bust visual tracking

    Ning Wang, Wengang Zhou, Qi Tian, Richang Hong, Meng Wang, and Houqiang Li. Multi-cue correlation filters for ro- bust visual tracking. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4844– 4853, 2018. 6

  45. [46]

    Rank-based filter pruning for real-time uav tracking

    Xucheng Wang, Dan Zeng, Qijun Zhao, and Shuiwang Li. Rank-based filter pruning for real-time uav tracking. In2022 IEEE International Conference on Multimedia and Expo (ICME), pages 01–06. IEEE, 2022. 6

  46. [47]

    Litetrack: Layer pruning with asynchronous feature extrac- tion for lightweight and efficient visual tracking

    Qingmao Wei, Bi Zeng, Jianqi Liu, Li He, and Guotian Zeng. Litetrack: Layer pruning with asynchronous feature extrac- tion for lightweight and efficient visual tracking. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 4968–4975. IEEE, 2024. 6

  47. [48]

    Learning an adaptive and view-invariant vi- sion transformer for real-time uav tracking.IEEE Transac- tions on Circuits and Systems for Video Technology, pages 1–1, 2025

    You Wu, Yongxin Li, Mengyuan Liu, Xucheng Wang, Xi- angyang Yang, Hengzhou Ye, Dan Zeng, Qijun Zhao, and Shuiwang Li. Learning an adaptive and view-invariant vi- sion transformer for real-time uav tracking.IEEE Transac- tions on Circuits and Systems for Video Technology, pages 1–1, 2025. 6

  48. [49]

    Learning occlusion-robust vision transformers for real-time uav track- ing

    You Wu, Xucheng Wang, Xiangyang Yang, Mengyuan Liu, Dan Zeng, Hengzhou Ye, and Shuiwang Li. Learning occlusion-robust vision transformers for real-time uav track- ing. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 17103–17113, 2025. 2, 3, 6

  49. [50]

    Similarity- guided layer-adaptive vision transformer for uav tracking

    Chaocan Xue, Bineng Zhong, Qihua Liang, Yaozong Zheng, Ning Li, Yuanliang Xue, and Shuxiang Song. Similarity- guided layer-adaptive vision transformer for uav tracking. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 6730–6740, 2025. 2, 3, 6

  50. [51]

    Motion-aware object tracking via motion and geometry-aware cues

    Hongtao Yang, Bineng Zhong, Qihua Liang, Xiantao Hu, Yufei Tan, Haiying Xia, and Shuxiang Song. Motion-aware object tracking via motion and geometry-aware cues. InPro- ceedings of the AAAI Conference on Artificial Intelligence, pages 11604–11612, 2026. 3

  51. [52]

    Sgdvit: Saliency-guided dy- namic vision transformer for uav tracking.arXiv preprint arXiv:2303.04378, 2023

    Liangliang Yao, Changhong Fu, Sihang Li, Guangze Zheng, and Junjie Ye. Sgdvit: Saliency-guided dy- namic vision transformer for uav tracking.arXiv preprint arXiv:2303.04378, 2023. 6

  52. [53]

    Joint feature learning and relation modeling for tracking: A one-stream framework

    Botao Ye, Hong Chang, Bingpeng Ma, Shiguang Shan, and Xilin Chen. Joint feature learning and relation modeling for tracking: A one-stream framework. InEuropean conference on computer vision, pages 341–357. Springer, 2022. 3, 5

  53. [54]

    Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer

    Sergey Zagoruyko and Nikos Komodakis. Paying more at- tention to attention: Improving the performance of convolu- tional neural networks via attention transfer.arXiv preprint arXiv:1612.03928, 2016. 3

  54. [55]

    Explicit context reasoning with supervision for visual tracking

    Fansheng Zeng, Bineng Zhong, Haiying Xia, Yufei Tan, Xi- antao Hu, Liangtao Shi, and Shuxiang Song. Explicit context reasoning with supervision for visual tracking. InProceed- ings of the 33rd ACM International Conference on Multime- dia, pages 8067–8076, 2025. 3

  55. [56]

    Visdrone-vdt2018: The vision meets drone video detection and tracking challenge results

    Pengfei Zhu, Longyin Wen, Dawei Du, Xiao Bian, Haibin Ling, Qinghua Hu, Haotian Wu, Qinqin Nie, Hao Cheng, Chenfeng Liu, et al. Visdrone-vdt2018: The vision meets drone video detection and tracking challenge results. In Proceedings of the European conference on computer vision (ECCV) workshops, pages 0–0, 2018. 5

  56. [57]

    Adversarial blur-deblur network for ro- bust uav tracking.IEEE Robotics and Automation Letters, 8 (2):1101–1108, 2023

    Haobo Zuo, Changhong Fu, Sihang Li, Kunhan Lu, Yiming Li, and Chen Feng. Adversarial blur-deblur network for ro- bust uav tracking.IEEE Robotics and Automation Letters, 8 (2):1101–1108, 2023. 6