DeTrack: A Benchmark and Altitude-Aware Dual World Model for Drone-embodied Tracking
Pith reviewed 2026-05-20 14:37 UTC · model grok-4.3
The pith
Drone tracking improves when dual world models imagine future states at different altitudes to balance visibility and safety.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that AaDWorlds, an altitude-aware dual world model framework, alleviates the intrinsic altitude-mediated contradiction between target visibility and flight safety and thereby improves closed-loop tracking performance across all evaluation metrics on the DeTrack benchmark. The framework pairs an altitude-aware perception module with dual world models that imagine future states under both high- and low-altitude regimes, and it combines those imagined states with pseudo altitude-aware observations.
What carries the argument
Altitude-aware dual world models that generate imagined future states under high- and low-altitude regimes to resolve visibility-safety trade-offs.
Load-bearing premise
The dual world models can generate sufficiently accurate imagined future states under high- and low-altitude regimes that, when combined with pseudo altitude-aware observations, resolve the visibility-safety trade-off in closed-loop control.
What would settle it
Running the closed-loop controller on DeTrack scenes where the world models produce inaccurate future-state predictions for either altitude regime, and measuring whether tracking metrics still improve over a single-model baseline or instead stall or degrade.
Original abstract
Aerial object tracking has broad applications in public safety, emergency rescue, wildlife monitoring, and related fields. However, existing aerial tracking benchmarks are mainly based on passive 2D video sequences captured from fixed camera locations or predefined flight paths, where drones are treated as passive cameras rather than embodied agents that actively perceive, interact, and control their motion in dynamic 3D scenes. In this paper, we define a new drone-embodied tracking task, termed DeTrack, which requires a drone to track a target in interactive 3D environments using online egocentric observations and active flight control in a closed loop. We build a large-scale benchmark containing 11,368 target trajectories across diverse scenes, rendering conditions, semantic regions, and moving distractors, together with evaluation metrics for target visibility, tracking accuracy, and trajectory success. We further propose AaDWorlds, an altitude-aware dual world model framework for drone-embodied tracking. AaDWorlds consists of an altitude-aware perception module and dual world models that imagine future states under both high- and low-altitude regimes. By combining pseudo altitude-aware observations and imagined future states, AaDWorlds alleviates the intrinsic altitude-mediated contradiction between target visibility and flight safety. Experiments on the DeTrack benchmark demonstrate that AaDWorlds improves closed-loop tracking performance across all evaluation metrics.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DeTrack, a new benchmark for drone-embodied tracking consisting of 11,368 target trajectories in interactive 3D environments with diverse scenes, rendering conditions, and distractors, along with metrics for visibility, accuracy, and success. It proposes AaDWorlds, an altitude-aware dual world model framework comprising an altitude-aware perception module and dual world models that generate imagined future states in high- and low-altitude regimes; these are combined with pseudo altitude-aware observations to resolve the visibility-safety trade-off in closed-loop control. Experiments on DeTrack are reported to show improvements across all evaluation metrics.
Significance. If the central results hold with proper validation, the work is significant for shifting aerial tracking research from passive 2D video benchmarks to embodied agent settings with active 3D control. The large-scale, diverse DeTrack benchmark fills a clear gap, and the altitude-aware dual world model approach offers a concrete mechanism for handling altitude-mediated contradictions in drone perception and planning. Credit is due for the scale of the benchmark and the explicit framing of the visibility-safety trade-off.
major comments (2)
- [Experiments] Experiments section: The claim that AaDWorlds improves closed-loop tracking performance across all metrics is presented only via aggregate results; the manuscript supplies no separate quantitative validation of dual world model fidelity (e.g., per-timestep prediction error, state reconstruction loss) nor an ablation that replaces imagined future states with ground-truth rollouts. Without these, it remains possible that observed gains derive entirely from the altitude-aware perception module and pseudo-observation mechanism, rendering the dual-world-model component non-load-bearing for the central claim.
- [Method] Method, §3.2 (Dual World Models): The description of how the high- and low-altitude world models are trained and how their imagined states are fused with pseudo observations lacks explicit training objectives, loss terms, or fidelity metrics. This detail is required to evaluate whether the models generate sufficiently accurate future states to resolve the visibility-safety trade-off under the regimes described.
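The per-timestep prediction error requested in the first major comment could be computed along the following lines. This is a minimal sketch, not the paper's evaluation code: the function name `per_timestep_error`, the nested-list state layout, and the choice of Euclidean distance are all assumptions made for illustration, and the numbers are placeholders rather than results.

```python
import math
from typing import List, Sequence

State = Sequence[float]
Rollout = Sequence[State]


def per_timestep_error(predicted: Sequence[Rollout],
                       actual: Sequence[Rollout]) -> List[float]:
    """Mean Euclidean error at each rollout step, averaged over rollouts.

    predicted[i][t] is the imagined state of rollout i at step t, and
    actual[i][t] is the state observed at the same step (hypothetical layout).
    """
    if not predicted or len(predicted) != len(actual):
        raise ValueError("need the same non-zero number of rollouts")
    horizon = len(predicted[0])
    errors = []
    for t in range(horizon):
        # Sum the distance between imagined and observed state at step t.
        step_total = sum(
            math.dist(pred_rollout[t], true_rollout[t])
            for pred_rollout, true_rollout in zip(predicted, actual)
        )
        errors.append(step_total / len(predicted))
    return errors


# Placeholder rollouts for one altitude regime (not data from the paper).
low_pred = [[[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]]
low_true = [[[0.0, 0.0], [1.0, 0.0], [5.0, 4.0]]]
low_errors = per_timestep_error(low_pred, low_true)
print(low_errors)
```

Reporting this curve separately for the high- and low-altitude models would show whether either model drifts as the imagination horizon grows.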
minor comments (2)
- [Abstract] Abstract: While the abstract correctly summarizes the contributions, it would benefit from at least one concrete quantitative improvement (e.g., percentage gain on a primary metric) to allow readers to gauge effect size without reading the full results.
- [Results] Figure captions and tables: Several result tables and figures would be clearer if they explicitly listed the baselines used and whether statistical significance was computed across multiple runs.
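The multi-run reporting suggested in the last minor comment can be sketched as follows. This is an illustrative example using only the Python standard library: the helper names `summarize_runs` and `paired_t_statistic` are hypothetical, the paired t-statistic is one possible choice of test rather than one the paper specifies, and the scores are placeholders rather than reported results.

```python
import math
import statistics
from typing import Sequence, Tuple


def summarize_runs(scores: Sequence[float]) -> Tuple[float, float]:
    """Return the mean and sample standard deviation over runs (seeds)."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std


def paired_t_statistic(method: Sequence[float],
                       baseline: Sequence[float]) -> float:
    """Paired t-statistic for per-seed differences (method minus baseline)."""
    if len(method) != len(baseline) or len(method) < 2:
        raise ValueError("need at least two paired runs")
    diffs = [m - b for m, b in zip(method, baseline)]
    spread = statistics.stdev(diffs)
    if spread == 0:
        raise ValueError("differences have zero variance")
    return statistics.mean(diffs) / (spread / math.sqrt(len(diffs)))


# Placeholder per-seed scores, one entry per run with a shared seed.
method_scores = [3, 5, 7]
baseline_scores = [1, 2, 3]
mean, std = summarize_runs(method_scores)
t_stat = paired_t_statistic(method_scores, baseline_scores)
print(mean, std, t_stat)
```

The statistic would then be compared against a t distribution with one fewer degree of freedom than the number of runs; pairing by seed is assumed here because both methods would be evaluated on the same trajectories.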
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment point by point below, with clear indications of the revisions we will implement to strengthen the manuscript.
- Referee: [Experiments] Experiments section: The claim that AaDWorlds improves closed-loop tracking performance across all metrics is presented only via aggregate results; the manuscript supplies no separate quantitative validation of dual world model fidelity (e.g., per-timestep prediction error, state reconstruction loss) nor an ablation that replaces imagined future states with ground-truth rollouts. Without these, it remains possible that observed gains derive entirely from the altitude-aware perception module and pseudo-observation mechanism, rendering the dual-world-model component non-load-bearing for the central claim.
  Authors: We acknowledge that the current experiments section emphasizes aggregate closed-loop metrics. To isolate the contribution of the dual world models, we will add quantitative fidelity evaluations, including per-timestep prediction error and state reconstruction loss for the high- and low-altitude models. We will also include an ablation that substitutes ground-truth rollouts for the imagined states while keeping the altitude-aware perception and pseudo-observation components fixed. These additions will be reported in a revised Experiments section with corresponding tables and analysis.
  Revision: yes
- Referee: [Method] Method, §3.2 (Dual World Models): The description of how the high- and low-altitude world models are trained and how their imagined states are fused with pseudo observations lacks explicit training objectives, loss terms, or fidelity metrics. This detail is required to evaluate whether the models generate sufficiently accurate future states to resolve the visibility-safety trade-off under the regimes described.
  Authors: We agree that the training details in §3.2 require expansion for reproducibility and evaluation. In the revised manuscript we will explicitly list the training objectives and loss terms (reconstruction, prediction, and regularization losses) used for each world model. We will also report fidelity metrics such as multi-step prediction accuracy under high- and low-altitude regimes. The fusion procedure with pseudo altitude-aware observations will be described with additional equations and pseudocode to clarify how imagined states are combined at each control step.
  Revision: yes
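For orientation, the three loss terms named in the second response could be combined into a per-regime objective of the following generic form. This is a hedged sketch only: the weighted-sum structure, the weights \lambda_{\mathrm{pred}} and \lambda_{\mathrm{reg}}, and the regime index k are assumptions introduced here for illustration, and neither the abstract nor the rebuttal states the actual objective.

```latex
% Hypothetical per-regime training objective; the weights are assumed.
\mathcal{L}^{(k)}
  = \mathcal{L}^{(k)}_{\mathrm{rec}}
  + \lambda_{\mathrm{pred}}\,\mathcal{L}^{(k)}_{\mathrm{pred}}
  + \lambda_{\mathrm{reg}}\,\mathcal{L}^{(k)}_{\mathrm{reg}},
  \qquad k \in \{\mathrm{high}, \mathrm{low}\}
```

Writing the objective once per regime makes explicit that the high- and low-altitude models could be trained with different weights or on different data, which is the kind of detail the referee asks the revised §3.2 to state.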
Circularity Check
No significant circularity; empirical claims rest on independent benchmark evaluation
The paper introduces the new DeTrack task and benchmark with 11,368 trajectories and defines AaDWorlds as an altitude-aware perception module plus dual world models that generate imagined future states. The central claim is that combining pseudo altitude-aware observations with these imagined states improves closed-loop metrics (visibility, accuracy, success) on the benchmark. No equations, fitted parameters, self-citations, or ansatzes are described that would reduce any reported prediction or result to the inputs by construction; the improvements are presented as outcomes of the proposed architecture evaluated on held-out trajectories, rendering the derivation self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Drones must actively perceive, interact, and control motion in dynamic 3D scenes using online egocentric observations.
invented entities (1)
- Altitude-aware dual world models: no independent evidence