pith. machine review for the scientific record.

arxiv: 2604.21453 · v1 · submitted 2026-04-23 · 💻 cs.CV

Recognition: unknown

Instance-level Visual Active Tracking with Occlusion-Aware Planning

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 21:56 UTC · model grok-4.3

classification 💻 cs.CV
keywords visual active tracking · occlusion recovery · instance prototypes · conditional diffusion · drone navigation · Kalman filter · DINO features

The pith

OA-VAT uses DINOv3 prototypes and a conditional diffusion planner to track specific targets in 3D despite distractors and occlusions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents OA-VAT as a pipeline that controls a camera to follow one designated target through three-dimensional space while ignoring similar-looking objects and recovering from temporary occlusions. It identifies two persistent failures in prior visual active tracking work: weak discrimination of the true target and lack of forward planning when the target disappears from view. The solution builds fixed instance prototypes from augmented views, refines them online while a confidence-aware Kalman filter stabilizes the motion estimate, and generates avoidance paths via a diffusion model trained on a dedicated dataset of 20,000 planning scenarios. Reported results show improved success rates in simulation, on real-world datasets, and in onboard drone flights, suggesting the method can support continuous tracking in cluttered environments.
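As a concrete reading of the prototype step, here is a minimal sketch under the assumption that initialization reduces to averaging unit-normalized features of augmented reference views; `extract_features` stands in for a frozen DINOv3 encoder and is not the authors' API.

```python
# Hypothetical sketch of offline instance-prototype initialization: average
# L2-normalized features of augmented views of the reference crop.
import numpy as np

rng = np.random.default_rng(0)

def extract_features(view: np.ndarray) -> np.ndarray:
    """Stand-in for a frozen DINOv3 encoder; returns a D-dim embedding."""
    return rng.standard_normal(768)  # placeholder, not real features

def build_prototype(reference: np.ndarray, n_augments: int = 16) -> np.ndarray:
    feats = []
    for _ in range(n_augments):
        view = reference  # real pipeline: crop/flip/color-jitter the reference
        f = extract_features(view)
        feats.append(f / np.linalg.norm(f))       # unit-normalize each view
    proto = np.mean(feats, axis=0)
    return proto / np.linalg.norm(proto)          # normalized mean prototype

prototype = build_prototype(np.zeros((224, 224, 3)))
print(prototype.shape)  # (768,)
```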

Core claim

OA-VAT is a unified three-module system. The first module initializes discriminative instance prototypes offline by aggregating multi-view augmented features from DINOv3. The second module enhances those prototypes online while applying a confidence-aware Kalman filter to maintain stable tracking under changing appearance and motion. The third module trains a conditional diffusion planner on the new Planning-20k dataset to output obstacle-avoiding trajectories that enable occlusion recovery. Together these components deliver a 0.93 average success rate on UnrealCV, 90.8 percent collision avoidance on real-world datasets, and 81.6 percent tracking success on a DJI Tello drone, running at 35 frames per second on an RTX 3090.
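The confidence-aware Kalman filter is not specified beyond its name in this review; one common instantiation scales the measurement noise by inverse detection confidence, sketched below for a one-dimensional constant-velocity state. This is an assumed formulation, not the paper's.

```python
# Hedged sketch of a confidence-aware Kalman update: low detector confidence
# inflates measurement noise R, so uncertain detections pull the state less.
import numpy as np

F = np.array([[1.0, 1.0], [0.0, 1.0]])   # state transition: [pos, vel]
H = np.array([[1.0, 0.0]])               # we observe position only
Q = 0.01 * np.eye(2)                     # process noise
R0 = 1.0                                 # base measurement noise

x = np.array([0.0, 0.0])                 # state estimate
P = np.eye(2)                            # state covariance

def kf_step(x, P, z, confidence):
    # predict
    x = F @ x
    P = F @ P @ F.T + Q
    # confidence-aware update: R grows as confidence drops
    R = R0 / max(confidence, 1e-3)
    S = H @ P @ H.T + R                  # innovation covariance (scalar here)
    K = P @ H.T / S                      # Kalman gain
    x = x + (K * (z - H @ x)).ravel()
    P = (np.eye(2) - K @ H) @ P
    return x, P

x, P = kf_step(x, P, z=1.2, confidence=0.9)
```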

What carries the argument

The three-module OA-VAT pipeline, in which the occlusion-aware conditional diffusion planner produces safe trajectories while the prototype tracker supplies instance-level discrimination.
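As a reading of what "conditional diffusion planner" means operationally, here is a minimal DDPM-style reverse-process sketch that denoises a random waypoint sequence into a trajectory; `eps_model` is a placeholder for the trained, occlusion-conditioned noise predictor, whose actual architecture and conditioning the review does not specify.

```python
# Illustrative-only sketch of conditional diffusion planning: reverse
# diffusion turns Gaussian noise into a waypoint sequence, guided by a
# (here dummy) conditional noise predictor.
import numpy as np

rng = np.random.default_rng(0)
T = 50                                    # diffusion steps
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def eps_model(traj, t, cond):
    """Placeholder noise predictor; a real planner is a trained network."""
    return 0.1 * traj                     # dummy prediction, not learned

cond = {"target_last_seen": np.array([5.0, 2.0]), "obstacles": []}
traj = rng.standard_normal((16, 2))       # 16 waypoints in the plane

for t in reversed(range(T)):              # reverse diffusion: noise -> plan
    eps = eps_model(traj, t, cond)
    coef = betas[t] / np.sqrt(1.0 - alpha_bar[t])
    mean = (traj - coef * eps) / np.sqrt(alphas[t])
    noise = rng.standard_normal(traj.shape) if t > 0 else 0.0
    traj = mean + np.sqrt(betas[t]) * noise

print(traj[:3])                           # first denoised waypoints
```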

Load-bearing premise

The conditional diffusion planner trained on Planning-20k will generalize to occlusions and obstacles in unseen real-world settings.

What would settle it

Run the system on a new physical scene containing obstacle layouts and occlusion durations absent from Planning-20k and measure whether success rate falls more than 10 percent below the reported 90.8 percent CAR.
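One way to score that test, treating "10 percent" as ten percentage points: with n trials in the new scene, check whether the observed collision-avoidance rate is credibly below the threshold. The trial count and outcome below are illustrative, not data.

```python
# Binomial test of whether a rerun's CAR falls below (90.8% - 10 points).
from scipy.stats import binomtest

n_trials, n_success = 100, 75             # hypothetical outcome of the rerun
threshold = 0.908 - 0.10                  # ten-point drop below reported CAR

result = binomtest(n_success, n_trials, p=threshold, alternative="less")
print(f"observed CAR={n_success / n_trials:.3f}, p={result.pvalue:.3f}")
# small p => success rate credibly below the threshold, failing the test
```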

Figures

Figures reproduced from arXiv: 2604.21453 by Hao Gao, Haowei Sun, Jinwu Hu, Kai Zhou, Mingkui Tan, Qixiang Ye, Shiteng Zhang, Xutao Wen.

Figure 1. Comparison of existing VAT methods and OA-VAT.
Figure 2. Overview of Occlusion-Aware-VAT (OA-VAT). (a) Given a reference image, OA-VAT first constructs an instance-aware prototype…
Figure 3. Overview of the proposed Occlusion-Aware Trajectory Planner. The planner denoises a random trajectory into a feasible recovery…
Figure 4. Ablation study on the instance-aware offline prototype initialization module. In (a) and (b), each point represents a prototype…
Figure 5. Results on real-world images of the Car8 video in the DTB70 [24] dataset.
Figure 7. Experiment results on the real drone DJI Tello.
Figure 8. Failure cases of the baseline method FAn…
Figure 9. OA-VAT accurately detects the target against distractors.
read the original abstract

Visual Active Tracking (VAT) aims to control cameras to follow a target in 3D space, which is critical for applications like drone navigation and security surveillance. However, it faces two key bottlenecks in real-world deployment: confusion from visually similar distractors caused by insufficient instance-level discrimination and severe failure under occlusions due to the absence of active planning. To address these, we propose OA-VAT, a unified pipeline with three complementary modules. First, a training-free Instance-Aware Offline Prototype Initialization aggregates multi-view augmented features via DINOv3 to construct discriminative instance prototypes, mitigating distractor confusion. Second, an Online Prototype Enhancement Tracker enhances prototypes online and integrates a confidence-aware Kalman filter for stable tracking under appearance and motion changes. Third, an Occlusion-Aware Trajectory Planner, trained on our new Planning-20k dataset, uses conditional diffusion to generate obstacle-avoiding paths for occlusion recovery. Experiments demonstrate OA-VAT achieves 0.93 average SR on UnrealCV (+2.2% vs. SOTA TrackVLA), 90.8% average CAR on real-world datasets (+12.1% vs. SOTA GC-VAT), and 81.6% TSR on a DJI Tello drone. Running at 35 FPS on an RTX 3090, it delivers robust, real-time performance for practical deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes OA-VAT, a unified pipeline for instance-level visual active tracking (VAT) comprising three modules: (1) a training-free Instance-Aware Offline Prototype Initialization that aggregates multi-view features from DINOv3 to build discriminative prototypes against distractors, (2) an Online Prototype Enhancement Tracker that updates prototypes and employs a confidence-aware Kalman filter for robustness to appearance/motion changes, and (3) an Occlusion-Aware Trajectory Planner trained via conditional diffusion on the new Planning-20k dataset to generate obstacle-avoiding paths. The central empirical claims are 0.93 average success rate (SR) on UnrealCV (+2.2% over TrackVLA), 90.8% average collision avoidance rate (CAR) on real-world datasets (+12.1% over GC-VAT), and 81.6% tracking success rate (TSR) on a DJI Tello drone, all at 35 FPS on RTX 3090.

Significance. If the reported gains and real-world transfer hold, the work would meaningfully advance practical VAT for drones and surveillance by jointly addressing instance discrimination and active occlusion recovery, with the Planning-20k dataset and conditional diffusion planner as potentially reusable contributions. The training-free nature of the first two modules and real-time hardware validation are notable strengths that could support deployment if generalization is better substantiated.

major comments (2)
  1. [Experiments] Experiments section (and abstract performance claims): the reported metrics (0.93 SR, 90.8% CAR, 81.6% TSR) and SOTA comparisons lack any description of experimental protocols, data splits, number of trials, error bars, or statistical tests, which directly undermines assessment of whether the +2.2% and +12.1% margins are reliable or reproducible.
  2. [Occlusion-Aware Trajectory Planner] Occlusion-Aware Trajectory Planner and Planning-20k dataset sections: the central claim that the conditional diffusion planner enables reliable occlusion recovery in real deployment rests on unexamined sim-to-real transfer; no domain-randomization details, out-of-distribution failure analysis, or explicit comparison of occlusion/obstacle distributions between Planning-20k and the real-world test sets are provided, making the 90.8% CAR and 81.6% TSR results load-bearing but insufficiently supported.
minor comments (2)
  1. [Introduction] The abstract and introduction would benefit from a brief diagram or pseudocode overview of the three-module pipeline to clarify data flow between prototype initialization, online tracking, and planning; a hedged sketch of that data flow follows these comments.
  2. [Online Prototype Enhancement Tracker] Notation for the Kalman filter confidence weighting and the conditional diffusion conditioning variables should be defined explicitly in the methods section rather than left implicit.
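For concreteness, here is the pseudocode-style sketch of the three-module data flow referenced in the first minor comment, under the staging described in the pith; every function here is an illustrative stub, not the authors' API.

```python
# Sketch of the OA-VAT data flow: offline prototype initialization ->
# online prototype-enhanced tracking -> diffusion planning under occlusion.
from dataclasses import dataclass, field

@dataclass
class State:
    prototype: object = None
    kf: dict = field(default_factory=lambda: {"pos": 0.0})
    occlusion_threshold: float = 0.5

def build_prototype(ref):               # Module 1: offline, training-free
    return "prototype(DINOv3 multi-view features)"

def match_prototype(frame, proto):      # Module 2: instance-level matching
    return {"bbox": (0, 0, 10, 10)}, 0.9

def enhance_prototype(proto, det, conf):
    return proto                        # online refinement (stub)

def kalman_update(kf, det, conf):
    return det                          # confidence-aware KF (stub)

def plan_recovery(frame, target):       # Module 3: diffusion planner (stub)
    return ["waypoint_1", "waypoint_2"]

def oa_vat_step(frame, state):
    if state.prototype is None:
        state.prototype = build_prototype("reference_image")
    det, conf = match_prototype(frame, state.prototype)
    state.prototype = enhance_prototype(state.prototype, det, conf)
    target = kalman_update(state.kf, det, conf)
    if conf < state.occlusion_threshold:            # target likely occluded
        return plan_recovery(frame, target)         # recovery trajectory
    return target                                   # normal tracking control

print(oa_vat_step("frame_0", State()))
```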

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have carefully addressed each major comment below and revised the manuscript to improve experimental transparency and analysis of sim-to-real transfer.

read point-by-point responses
  1. Referee: [Experiments] Experiments section (and abstract performance claims): the reported metrics (0.93 SR, 90.8% CAR, 81.6% TSR) and SOTA comparisons lack any description of experimental protocols, data splits, number of trials, error bars, or statistical tests, which directly undermines assessment of whether the +2.2% and +12.1% margins are reliable or reproducible.

    Authors: We agree that the original manuscript did not provide sufficient detail on experimental protocols, which limits reproducibility assessment. In the revised version, we have expanded Section 4 (Experiments) with a dedicated subsection on experimental setup. This includes: explicit data splits for the Planning-20k dataset and evaluation benchmarks; the number of trials (100 independent episodes per scenario across UnrealCV and real-world tests); error bars as standard deviations computed over 5 random seeds; and paired t-test results confirming statistical significance of the reported gains (p < 0.05 for both the SR and CAR improvements). These additions substantiate the reliability of the margins without altering the core results. A sketch of this significance protocol appears after these responses. revision: yes

  2. Referee: [Occlusion-Aware Trajectory Planner] Occlusion-Aware Trajectory Planner and Planning-20k dataset sections: the central claim that the conditional diffusion planner enables reliable occlusion recovery in real deployment rests on unexamined sim-to-real transfer; no domain-randomization details, out-of-distribution failure analysis, or explicit comparison of occlusion/obstacle distributions between Planning-20k and the real-world test sets are provided, making the 90.8% CAR and 81.6% TSR results load-bearing but insufficiently supported.

    Authors: We acknowledge that the original manuscript provided insufficient analysis of sim-to-real transfer for the Occlusion-Aware Trajectory Planner. We have revised the manuscript by adding a new subsection (4.5) that details: the domain randomization strategies used during diffusion model training (randomized obstacle densities, lighting conditions, and occlusion durations); quantitative comparisons of occlusion and obstacle distributions between Planning-20k and the real DJI Tello test sets (via histograms and summary statistics on obstacle count and occlusion length); and an out-of-distribution failure analysis with representative failure cases and their frequency. The real-world CAR and TSR results remain as empirical evidence of transfer, but we have added explicit caveats on the remaining domain gap and future directions for further bridging it. A sketch of such a distribution check appears after these responses. revision: yes
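The significance protocol described in the first response can be made concrete with a minimal sketch: per-seed scores for two methods compared with a paired t-test. The arrays are placeholders, not the paper's measurements.

```python
# Minimal sketch of the claimed protocol: mean±std over seeds plus a paired
# t-test. Scores below are illustrative placeholders, not reported results.
import numpy as np
from scipy.stats import ttest_rel

ours = np.array([0.93, 0.92, 0.94, 0.93, 0.92])      # 5 seeds (placeholder)
baseline = np.array([0.91, 0.90, 0.92, 0.91, 0.90])  # 5 seeds (placeholder)

t_stat, p_value = ttest_rel(ours, baseline)
print(f"ours = {ours.mean():.3f} ± {ours.std(ddof=1):.3f}, p = {p_value:.4f}")
```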
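The distribution comparison in the second response admits an equally small sketch: a two-sample Kolmogorov–Smirnov test on occlusion durations from simulation versus the real test set. The samples are synthetic placeholders, not Planning-20k or Tello statistics.

```python
# Minimal sketch of a sim-vs-real distribution check on occlusion durations.
# Both samples are synthetic placeholders drawn here for illustration.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
sim_occlusion_s = rng.gamma(shape=2.0, scale=0.8, size=1000)   # stand-in: sim
real_occlusion_s = rng.gamma(shape=2.2, scale=0.9, size=200)   # stand-in: real

stat, p_value = ks_2samp(sim_occlusion_s, real_occlusion_s)
print(f"KS statistic = {stat:.3f}, p = {p_value:.3f}")
# A small p-value would flag a sim-to-real distribution gap worth reporting.
```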

Circularity Check

0 steps flagged

No significant circularity; empirical results independent of inputs.

full rationale

The paper presents a modular system (training-free prototype initialization via DINOv3, online Kalman-enhanced tracking, and conditional diffusion planner trained on the new Planning-20k dataset) whose performance is measured via direct comparisons to external SOTA baselines on UnrealCV, real-world datasets, and DJI Tello drone trials. No mathematical derivation chain, self-definitional equations, fitted-input predictions, or load-bearing self-citations appear in the abstract or described pipeline. All reported gains (SR, CAR, TSR) are empirical and falsifiable against independent benchmarks, keeping the central claims self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The approach rests on standard computer vision assumptions about feature discriminability and the effectiveness of diffusion models for path planning, plus newly introduced components whose performance is asserted via experiments.

free parameters (1)
  • Hyperparameters of the conditional diffusion model and Kalman filter
    Likely tuned during development of the planner and tracker but not specified in the abstract.
axioms (1)
  • domain assumption: DINOv3 features enable reliable multi-view instance discrimination without task-specific training
    Invoked in the Instance-Aware Offline Prototype Initialization module.
invented entities (2)
  • OA-VAT unified pipeline · no independent evidence
    purpose: Integrates prototype initialization, online tracking, and occlusion-aware planning
    New system architecture proposed in the paper.
  • Planning-20k dataset · no independent evidence
    purpose: Training data for the conditional diffusion trajectory planner
    Newly introduced dataset for this work.

pith-pipeline@v0.9.0 · 5562 in / 1416 out tokens · 63263 ms · 2026-05-09T21:56:27.998375+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

61 extracted references · 3 canonical work pages · 2 internal anchors

  1. [1]

    Visual tracking with online multiple instance learning

    Boris Babenko, Ming-Hsuan Yang, and Serge Belongie. Visual tracking with online multiple instance learning. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 983–990. IEEE, 2009.

  2. [2]

    Fully-convolutional siamese networks for object tracking

    Luca Bertinetto, Jack Valmadre, Joao F. Henriques, Andrea Vedaldi, and Philip H. S. Torr. Fully-convolutional siamese networks for object tracking. In Computer Vision – ECCV 2016 Workshops, Proceedings, Part II, pages 850–865. Springer, 2016.

  3. [3]

    Simple online and realtime tracking

    Alex Bewley, Zongyuan Ge, Lionel Ott, Fabio Ramos, and Ben Upcroft. Simple online and realtime tracking. In 2016 IEEE International Conference on Image Processing (ICIP), pages 3464–3468. IEEE, 2016.

  4. [4]

    Learning discriminative model prediction for tracking

    Goutam Bhat, Martin Danelljan, Luc Van Gool, and Radu Timofte. Learning discriminative model prediction for tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6182–6191, 2019.

  5. [5]

    SeqTrack: Sequence to sequence learning for visual object tracking

    Xin Chen, Houwen Peng, Dong Wang, Huchuan Lu, and Han Hu. SeqTrack: Sequence to sequence learning for visual object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14572–14581, 2023.

  6. [6]

    YOLO-World: Real-time open-vocabulary object detection

    Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, and Ying Shan. YOLO-World: Real-time open-vocabulary object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

  7. [7]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research.

  8. [8]

    Describing textures in the wild

    M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.

  9. [9]

    DJI Tello

    Da-Jiang Innovations. DJI Tello. https://store.dji.com/product/tello, 2025.

  10. [10]

    DJITelloPy

    damiafuentes. DJITelloPy. https://github.com/damiafuentes/DJITelloPy, 2025.

  11. [11]

    Stable and consistent object tracking: An active vision approach

    Dibyendu Kumar Das, Mouli Laha, Somajyoti Majumder, and Dipnarayan Ray. Stable and consistent object tracking: An active vision approach. In Advanced Computational and Communication Paradigms: Proceedings of International Conference on ICACCP 2017, Volume 2, pages 299–308. Springer, 2018.

  12. [12]

    Enhancing continuous control of mobile robots for end-to-end visual active tracking

    Alessandro Devo, Alberto Dionigi, and Gabriele Costante. Enhancing continuous control of mobile robots for end-to-end visual active tracking. Robotics and Autonomous Systems, 142:103799, 2021.

  13. [13]

    D-VAT: End-to-end visual active tracking for micro aerial vehicles

    Alberto Dionigi, Simone Felicioni, Mirko Leomanni, and Gabriele Costante. D-VAT: End-to-end visual active tracking for micro aerial vehicles. IEEE Robotics and Automation Letters, 9(6):5046–5053, 2024.

  14. [14]

    The unmanned aerial vehicle benchmark: Object detection and tracking

    Dawei Du, Yuankai Qi, Hongyang Yu, Yifan Yang, Kaiwen Duan, Guorong Li, Weigang Zhang, Qingming Huang, and Qi Tian. The unmanned aerial vehicle benchmark: Object detection and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), pages 370–386, 2018.

  15. [15]

    A review of quadrotor: An underactuated mechanical system

    Bara J. Emran and Homayoun Najjaran. A review of quadrotor: An underactuated mechanical system. Annual Reviews in Control, 46:165–180, 2018.

  16. [16]

    LaSOT: A high-quality benchmark for large-scale single object tracking

    Heng Fan, Liting Lin, Fan Yang, Peng Chu, Ge Deng, Sijia Yu, Hexin Bai, Yong Xu, Chunyuan Liao, and Haibin Ling. LaSOT: A high-quality benchmark for large-scale single object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5374–5383, 2019.

  17. [17]

    Aerial vision-and-dialog navigation

    Yue Fan, Winson Chen, Tongzhou Jiang, Chun Zhou, Yi Zhang, and Xin Eric Wang. Aerial vision-and-dialog navigation. In Findings of the Association for Computational Linguistics: ACL 2023, pages 3043–3061, Toronto, Canada. Association for Computational Linguistics, 2023.

  19. [19]

    A formal basis for the heuristic determination of minimum cost paths

    Peter E. Hart, Nils J. Nilsson, and Bertram Raphael. A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, 4(2):100–107, 1968.

  20. [20]

    A new approach to linear filtering and prediction problems

    Rudolph Emil Kalman. A new approach to linear filtering and prediction problems. 1960.

  21. [21]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.

  22. [22]

    A novel performance evaluation methodology for single-target trackers

    Matej Kristan, Jiri Matas, Aleš Leonardis, Tomáš Vojíř, Roman Pflugfelder, Gustavo Fernandez, Georg Nebehay, Fatih Porikli, and Luka Čehovin. A novel performance evaluation methodology for single-target trackers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(11):2137–2155, 2016.

  23. [23]

    High performance visual tracking with siamese region proposal network

    Bo Li, Junjie Yan, Wei Wu, Zheng Zhu, and Xiaolin Hu. High performance visual tracking with Siamese region proposal network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8971–8980, 2018.

  24. [24]

    SiamRPN++: Evolution of siamese visual tracking with very deep networks

    Bo Li, Wei Wu, Qiang Wang, Fangyi Zhang, Junliang Xing, and Junjie Yan. SiamRPN++: Evolution of Siamese visual tracking with very deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4282–4291, 2019.

  25. [25]

    Visual object tracking for unmanned aerial vehicles: A benchmark and new motion models

    Siyi Li and Dit-Yan Yeung. Visual object tracking for unmanned aerial vehicles: A benchmark and new motion models. In Proceedings of the AAAI Conference on Artificial Intelligence, 2017.

  26. [26]

    AerialVLN: Vision-and-language navigation for UAVs

    Shubo Liu, Hongsheng Zhang, Yuankai Qi, Peng Wang, Yanning Zhang, and Qi Wu. AerialVLN: Vision-and-language navigation for UAVs. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15384–15394, 2023.

  27. [27]

    Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. In European Conference on Computer Vision, pages 38–55. Springer, 2024.

  28. [28]

    End-to-end active object tracking and its real-world deployment via reinforcement learning

    Wenhan Luo, Peng Sun, Fangwei Zhong, Wei Liu, Tong Zhang, and Yizhou Wang. End-to-end active object tracking and its real-world deployment via reinforcement learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(6):1317–1332, 2019.

  29. [29]

    Follow Anything: Open-set detection, tracking, and following in real-time

    Alaa Maalouf, Ninad Jadhav, Krishna Murthy Jatavallabhula, Makram Chahine, Daniel M. Vogt, Robert J. Wood, Antonio Torralba, and Daniela Rus. Follow Anything: Open-set detection, tracking, and following in real-time. IEEE Robotics and Automation Letters, 9(4):3283–3290, 2024.

  30. [30]

    Directional stability of automatically steered bodies

    Nicolas Minorsky. Directional stability of automatically steered bodies. Journal of the American Society for Naval Engineers, 34(2):280–309, 1922.

  31. [31]

    Improved denoising diffusion probabilistic models

    Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162–8171. PMLR, 2021.

  32. [32]

    DINOv2

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang-Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Hervé Jégou, Julien Mairal, et al. DINOv2: Learning robust visual features without supervision.

  33. [33]

    Fast-Tracker 2.0: Improving autonomy of aerial tracking with active vision and human location regression

    Neng Pan, Ruibin Zhang, Tiankai Yang, Can Cui, Chao Xu, and Fei Gao. Fast-Tracker 2.0: Improving autonomy of aerial tracking with active vision and human location regression. IET Cyber-Systems and Robotics, 3(4):292–301, 2021.

  34. [34]

    UnrealCV: Virtual worlds for computer vision

    Weichao Qiu, Fangwei Zhong, Yi Zhang, Siyuan Qiao, Zihao Xiao, Tae Soo Kim, and Yizhou Wang. UnrealCV: Virtual worlds for computer vision. In Proceedings of the 25th ACM International Conference on Multimedia, pages 1221–1224, 2017.

  35. [35]

    SAM 2: Segment anything in images and videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. SAM 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024.

  36. [36]

    U-Net: Convolutional networks for biomedical image segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.

  37. [37]

    An autonomous drone for search and rescue in forests using airborne optical sectioning

    David C. Schedl, Indrajit Kurmi, and Oliver Bimber. An autonomous drone for search and rescue in forests using airborne optical sectioning. Science Robotics, 6, 2021.

  38. [38]

    Large scale real-world multi person tracking

    Bing Shuai, Alessandro Bergamo, Uta Buechler, Andrew Berneshawi, Alyssa Boden, and Joe Tighe. Large scale real-world multi person tracking. In European Conference on Computer Vision. Springer, 2022.

  39. [39]

    DINOv3

    Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104, 2025.

  40. [40]

    Visual tracking: An experimental survey

    Arnold W. M. Smeulders, Dung M. Chu, Rita Cucchiara, Simone Calderara, Afshin Dehghan, and Mubarak Shah. Visual tracking: An experimental survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1442–1468, 2013.

  41. [41]

    Open-world drone active tracking with goal-centered rewards

    Haowei Sun, Jinwu Hu, Zhirui Zhang, Haoyuan Tian, Xinze Xie, Yufeng Wang, Xiaohua Xie, Yun Lin, Zhuliang Yu, and Mingkui Tan. Open-world drone active tracking with goal-centered rewards. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

  42. [42]

    YOLOE: Real-time seeing anything

    Ao Wang, Lihao Liu, Hui Chen, Zijia Lin, Jungong Han, and Guiguang Ding. YOLOE: Real-time seeing anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 24591–24602, 2025.

  43. [43]

    Fast online object tracking and segmentation: A unifying approach

    Qiang Wang, Li Zhang, Luca Bertinetto, Weiming Hu, and Philip H. S. Torr. Fast online object tracking and segmentation: A unifying approach. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1328–1338, 2019.

  44. [44]

    TrackVLA: Embodied visual tracking in the wild

    Shaoan Wang, Jiazhao Zhang, Minghan Li, Jiahang Liu, Anqi Li, Kui Wu, Fangwei Zhong, Junzhi Yu, Zhizheng Zhang, and He Wang. TrackVLA: Embodied visual tracking in the wild. arXiv preprint arXiv:2505.23189, 2025.

  45. [45]

    Towards realistic UAV vision-language navigation: Platform, benchmark, and methodology

    Xiangyu Wang, Donglin Yang, Ziqin Wang, Hohin Kwan, Jinyu Chen, Wenjun Wu, Hongsheng Li, Yue Liao, and Si Liu. Towards realistic UAV vision-language navigation: Platform, benchmark, and methodology, 2024.

  46. [46]

    Detection, tracking, and counting meets drones in crowds: A benchmark

    Longyin Wen, Dawei Du, Pengfei Zhu, Qinghua Hu, Qilong Wang, Liefeng Bo, and Siwei Lyu. Detection, tracking, and counting meets drones in crowds: A benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7812–7821, 2021.

  47. [47]

    Learning occlusion-robust vision transformers for real-time UAV tracking

    You Wu, Xucheng Wang, Xiangyang Yang, Mengyuan Liu, Dan Zeng, Hengzhou Ye, and Shuiwang Li. Learning occlusion-robust vision transformers for real-time UAV tracking. In CVPR, 2025.

  48. [48]

    Multi-UAV cooperative system for search and rescue based on YOLOv5

    Linjie Xing, Xiaoyan Fan, Yaxin Dong, Zenghui Xiong, Lin Xing, Yang Yang, Haicheng Bai, and Chengjiang Zhou. Multi-UAV cooperative system for search and rescue based on YOLOv5. International Journal of Disaster Risk Reduction, 76:102972, 2022.

  49. [49]

    Joint feature learning and relation modeling for tracking: A one-stream framework

    Botao Ye, Hong Chang, Bingpeng Ma, Shiguang Shan, and Xilin Chen. Joint feature learning and relation modeling for tracking: A one-stream framework. In European Conference on Computer Vision, pages 341–357. Springer, 2022.

  50. [50]

    From poses to identity: Training-free person re-identification via feature centralization

    Chao Yuan, Guiwei Zhang, Changxiao Ma, Tianyi Zhang, and Guanglin Niu. From poses to identity: Training-free person re-identification via feature centralization. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24409–24418, 2025.

  51. [51]

    Multimodal pretrained knowledge for real-world object navigation

    Hui Yuan, Yan Huang, Naigong Yu, Dongbo Zhang, Zetao Du, Ziqi Liu, and Kun Zhang. Multimodal pretrained knowledge for real-world object navigation. Machine Intelligence Research, 22(4):713–729, 2025.

  52. [52]

    A novel UAV path planning approach: Heuristic crossing search and rescue optimization algorithm

    Chaoqun Zhang, Wenjuan Zhou, Weidong Qin, and Weidong Tang. A novel UAV path planning approach: Heuristic crossing search and rescue optimization algorithm. Expert Systems with Applications, 215:119243, 2023.

  53. [53]

    AD-VAT: An asymmetric dueling mechanism for learning visual active tracking

    Fangwei Zhong, Peng Sun, Wenhan Luo, Tingyun Yan, and Yizhou Wang. AD-VAT: An asymmetric dueling mechanism for learning visual active tracking. In International Conference on Learning Representations, 2019.

  54. [54]

    AD-VAT+: An asymmetric dueling mechanism for learning and understanding visual active tracking

    Fangwei Zhong, Peng Sun, Wenhan Luo, Tingyun Yan, and Yizhou Wang. AD-VAT+: An asymmetric dueling mechanism for learning and understanding visual active tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(5):1467–1482, 2019.

  55. [55]

    Towards distraction-robust active visual tracking

    Fangwei Zhong, Peng Sun, Wenhan Luo, Tingyun Yan, and Yizhou Wang. Towards distraction-robust active visual tracking. In International Conference on Machine Learning, pages 12782–12792. PMLR, 2021.

  56. [56]

    RSPT: Reconstruct surroundings and predict trajectory for generalizable active object tracking

    Fangwei Zhong, Xiao Bi, Yudi Zhang, Wei Zhang, and Yizhou Wang. RSPT: Reconstruct surroundings and predict trajectory for generalizable active object tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 3705–3714, 2023.

  57. [57]

    Empowering embodied visual tracking with visual foundation models and offline RL

    Fangwei Zhong, Kui Wu, Hai Ci, Churan Wang, and Hao Chen. Empowering embodied visual tracking with visual foundation models and offline RL. In European Conference on Computer Vision, pages 139–155. Springer, 2024.

  58. [58]

    Zero-shot skeleton-based action recognition with prototype-guided feature alignment

    Kai Zhou, Shuhai Zhang, Zeng You, Jinwu Hu, Mingkui Tan, and Fei Liu. Zero-shot skeleton-based action recognition with prototype-guided feature alignment. IEEE Transactions on Image Processing, 34:4602–4617, 2025.

  59. [59]

    CurML: A curriculum machine learning library

    Yuwei Zhou, Hong Chen, Zirui Pan, Chuanhao Yan, Fanqi Lin, Xin Wang, and Wenwu Zhu. CurML: A curriculum machine learning library. In Proceedings of the 30th ACM International Conference on Multimedia, pages 7359–7363, 2022.

  60. [60]

    CurBench: Curriculum learning benchmark

    Yuwei Zhou, Zirui Pan, Xin Wang, Hong Chen, Haoyang Li, Yanwen Huang, Zhixiao Xiong, Fangzhou Xiong, Peiyang Xu, Wenwu Zhu, et al. CurBench: Curriculum learning benchmark. In Forty-first International Conference on Machine Learning, 2024.

  61. [61]

    Detection and tracking meet drones challenge

    Pengfei Zhu, Longyin Wen, Dawei Du, Xiao Bian, Heng Fan, Qinghua Hu, and Haibin Ling. Detection and tracking meet drones challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11):7380–7399, 2022.
    Pengfei Zhu, Longyin Wen, Dawei Du, Xiao Bian, Heng Fan, Qinghua Hu, and Haibin Ling. Detection and tracking meet drones challenge.IEEE Transactions on Pattern Anal- ysis and Machine Intelligence, 44(11):7380–7399, 2022. 1 We organize the supplementary materials as follows. Section A reviews related work on visual active tracking. Section B presents the c...