pith. sign in

arxiv: 2605.17451 · v1 · pith:RN2A2Y2Unew · submitted 2026-05-17 · 💻 cs.CV

DeTrack: A Benchmark and Altitude-Aware Dual World Model for Drone-embodied Tracking

Pith reviewed 2026-05-20 14:37 UTC · model grok-4.3

classification 💻 cs.CV
keywords drone trackingaerial trackingworld modelsembodied trackingbenchmarkaltitude-aware perceptionclosed-loop controlDeTrack
0
0 comments X

The pith

Drone tracking improves when dual world models imagine future states at different altitudes to balance visibility and safety.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper defines a new drone-embodied tracking task called DeTrack that requires a drone to actively control its flight and use egocentric observations to follow a target in interactive 3D scenes. To enable evaluation, the authors release a benchmark of 11,368 trajectories spanning diverse environments, lighting, regions, and distractors, along with metrics for visibility, accuracy, and trajectory success. They introduce AaDWorlds, which pairs an altitude-aware perception module with two separate world models that predict future states under high-altitude and low-altitude conditions. Combining these imagined states with pseudo altitude-aware observations lets the system trade off between keeping the target in view and maintaining safe flight distances. Experiments on the benchmark show consistent gains in all closed-loop tracking metrics.

Core claim

The paper claims that an altitude-aware dual world model framework called AaDWorlds, built from an altitude-aware perception module and dual world models that imagine future states under both high- and low-altitude regimes, when combined with pseudo altitude-aware observations, alleviates the intrinsic altitude-mediated contradiction between target visibility and flight safety and thereby improves closed-loop tracking performance across all evaluation metrics on the DeTrack benchmark.

What carries the argument

Altitude-aware dual world models that generate imagined future states under high- and low-altitude regimes to resolve visibility-safety trade-offs.

Load-bearing premise

The dual world models can generate sufficiently accurate imagined future states under high- and low-altitude regimes that, when combined with pseudo altitude-aware observations, resolve the visibility-safety trade-off in closed-loop control.

What would settle it

Running the closed-loop controller on DeTrack scenes where the world models produce inaccurate future-state predictions for either altitude regime and measuring whether tracking metrics fail to improve or degrade relative to a single-model baseline.

Figures

Figures reproduced from arXiv: 2605.17451 by Chenglong Li, Feng Chen, Guyue Hu, Haoming Liu, Jin Tang, Siyuan Song.

Figure 1
Figure 1. Figure 1: Paradigm comparison between passive aerial tracking and drone [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview statistics and typical challenges of the proposed drone-embodied tracking (DeTrack) benchmark. [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Sample illustration of target trajectories in the DeTrack benchmark. [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Illustration of scene-style and rendering-condition diversity in the De [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
Figure 7
Figure 7. Figure 7: Geometry-level spatial statistics of the DeTrack trajectory library. (a) [PITH_FULL_IMAGE:figures/full_fig_p005_7.png] view at source ↗
Figure 6
Figure 6. Figure 6: Basic statistics of target trajectory library in the DeTrack benchmark. [PITH_FULL_IMAGE:figures/full_fig_p005_6.png] view at source ↗
Figure 8
Figure 8. Figure 8: Overview pipeline of the proposed AaDWorlds framework for drone-embodied tracking. The framework consists of a reinforcement drone-embodied [PITH_FULL_IMAGE:figures/full_fig_p006_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Detailed structure of the proposed Dual World Models (DWM). DWM consists of a High-Altitude World Model and a Low-Altitude World Model, [PITH_FULL_IMAGE:figures/full_fig_p007_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Detailed structure of the proposed Altitude-aware Perception (AaP) module. AaP takes the real egocentric observation together with prophet high [PITH_FULL_IMAGE:figures/full_fig_p008_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Qualitative visualization of target visibility and flight safety under four representative scenes. (a) The overview of target trajectory. (b)-(f) The results [PITH_FULL_IMAGE:figures/full_fig_p010_11.png] view at source ↗
read the original abstract

Aerial object tracking has broad applications in public safety, emergency rescue, wildlife monitoring, and related fields. However, existing aerial tracking benchmarks are mainly based on passive 2D video sequences captured from fixed camera locations or predefined flight paths, where drones are treated as passive cameras rather than embodied agents that actively perceive, interact, and control their motion in dynamic 3D scenes. In this paper, we define a new drone-embodied tracking task, termed DeTrack, which requires a drone to track a target in interactive 3D environments using online egocentric observations and active flight control in a closed loop. We build a large-scale benchmark containing 11,368 target trajectories across diverse scenes, rendering conditions, semantic regions, and moving distractors, together with evaluation metrics for target visibility, tracking accuracy, and trajectory success. We further propose AaDWorlds, an altitude-aware dual world model framework for drone-embodied tracking. AaDWorlds consists of an altitude-aware perception module and dual world models that imagine future states under both high- and low-altitude regimes. By combining pseudo altitude-aware observations and imagined future states, AaDWorlds alleviates the intrinsic altitude-mediated contradiction between target visibility and flight safety. Experiments on the DeTrack benchmark demonstrate that AaDWorlds improves closed-loop tracking performance across all evaluation metrics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces DeTrack, a new benchmark for drone-embodied tracking consisting of 11,368 target trajectories in interactive 3D environments with diverse scenes, rendering conditions, and distractors, along with metrics for visibility, accuracy, and success. It proposes AaDWorlds, an altitude-aware dual world model framework comprising an altitude-aware perception module and dual world models that generate imagined future states in high- and low-altitude regimes; these are combined with pseudo altitude-aware observations to resolve the visibility-safety trade-off in closed-loop control. Experiments on DeTrack are reported to show improvements across all evaluation metrics.

Significance. If the central results hold with proper validation, the work is significant for shifting aerial tracking research from passive 2D video benchmarks to embodied agent settings with active 3D control. The large-scale, diverse DeTrack benchmark fills a clear gap, and the altitude-aware dual world model approach offers a concrete mechanism for handling altitude-mediated contradictions in drone perception and planning. Credit is due for the scale of the benchmark and the explicit framing of the visibility-safety trade-off.

major comments (2)
  1. [Experiments] Experiments section: The claim that AaDWorlds improves closed-loop tracking performance across all metrics is presented only via aggregate results; the manuscript supplies no separate quantitative validation of dual world model fidelity (e.g., per-timestep prediction error, state reconstruction loss) nor an ablation that replaces imagined future states with ground-truth rollouts. Without these, it remains possible that observed gains derive entirely from the altitude-aware perception module and pseudo-observation mechanism, rendering the dual-world-model component non-load-bearing for the central claim.
  2. [Method] Method, §3.2 (Dual World Models): The description of how the high- and low-altitude world models are trained and how their imagined states are fused with pseudo observations lacks explicit training objectives, loss terms, or fidelity metrics. This detail is required to evaluate whether the models generate sufficiently accurate future states to resolve the visibility-safety trade-off under the regimes described.
minor comments (2)
  1. [Abstract] Abstract: While the abstract correctly summarizes the contributions, it would benefit from at least one concrete quantitative improvement (e.g., percentage gain on a primary metric) to allow readers to gauge effect size without reading the full results.
  2. [Results] Figure captions and tables: Several result tables and figures would be clearer if they explicitly listed the baselines used and whether statistical significance was computed across multiple runs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment point by point below, with clear indications of the revisions we will implement to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: The claim that AaDWorlds improves closed-loop tracking performance across all metrics is presented only via aggregate results; the manuscript supplies no separate quantitative validation of dual world model fidelity (e.g., per-timestep prediction error, state reconstruction loss) nor an ablation that replaces imagined future states with ground-truth rollouts. Without these, it remains possible that observed gains derive entirely from the altitude-aware perception module and pseudo-observation mechanism, rendering the dual-world-model component non-load-bearing for the central claim.

    Authors: We acknowledge that the current experiments section emphasizes aggregate closed-loop metrics. To isolate the contribution of the dual world models, we will add quantitative fidelity evaluations, including per-timestep prediction error and state reconstruction loss for the high- and low-altitude models. We will also include an ablation that substitutes ground-truth rollouts for the imagined states while keeping the altitude-aware perception and pseudo-observation components fixed. These additions will be reported in a revised Experiments section with corresponding tables and analysis. revision: yes

  2. Referee: [Method] Method, §3.2 (Dual World Models): The description of how the high- and low-altitude world models are trained and how their imagined states are fused with pseudo observations lacks explicit training objectives, loss terms, or fidelity metrics. This detail is required to evaluate whether the models generate sufficiently accurate future states to resolve the visibility-safety trade-off under the regimes described.

    Authors: We agree that the training details in §3.2 require expansion for reproducibility and evaluation. In the revised manuscript we will explicitly list the training objectives and loss terms (reconstruction, prediction, and regularization losses) used for each world model. We will also report fidelity metrics such as multi-step prediction accuracy under high- and low-altitude regimes. The fusion procedure with pseudo altitude-aware observations will be described with additional equations and pseudocode to clarify how imagined states are combined at each control step. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on independent benchmark evaluation

full rationale

The paper introduces the new DeTrack task and benchmark with 11,368 trajectories and defines AaDWorlds as an altitude-aware perception module plus dual world models that generate imagined future states. The central claim is that combining pseudo altitude-aware observations with these imagined states improves closed-loop metrics (visibility, accuracy, success) on the benchmark. No equations, fitted parameters, self-citations, or ansatzes are described that would reduce any reported prediction or result to the inputs by construction; the improvements are presented as outcomes of the proposed architecture evaluated on held-out trajectories, rendering the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Review performed on abstract only; ledger therefore limited to high-level assumptions stated or implied in the text.

axioms (1)
  • domain assumption Drones must actively perceive, interact, and control motion in dynamic 3D scenes using online egocentric observations.
    Central premise used to define the embodied tracking task and to contrast it with passive benchmarks.
invented entities (1)
  • Altitude-aware dual world models no independent evidence
    purpose: To imagine future states under high- and low-altitude regimes and thereby resolve the visibility-safety trade-off.
    New component introduced in AaDWorlds; no independent evidence outside the paper is described.

pith-pipeline@v0.9.0 · 5786 in / 1376 out tokens · 83488 ms · 2026-05-20T14:37:28.150164+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 3 internal anchors

  1. [1]

    Deep learning for UA V- based object detection and tracking: A survey,

    X. Wu, W. Li, D. Hong, R. Tao, and Q. Du, “Deep learning for UA V- based object detection and tracking: A survey,”IEEE Geoscience and Remote Sensing Magazine, vol. 10, no. 4, pp. 91–124, 2022

  2. [2]

    Drone deep reinforcement learning: A review,

    A. T. Azar, A. Koubaa, N. Ali Mohamed, H. A. Ibrahim, Z. F. Ibrahim, M. Kazim, A. Ammar, B. Benjdira, A. M. Khamis, I. A. Hameed, and G. Casalino, “Drone deep reinforcement learning: A review,” Electronics, vol. 10, no. 9, p. 999, 2021

  3. [3]

    Reinforcement learning-based drone simulators: Survey, practice, and challenge,

    J. H. Chan, K. Liu, Y . Chen, A. S. M. S. Sagar, and Y . Kim, “Reinforcement learning-based drone simulators: Survey, practice, and challenge,”Artificial Intelligence Review, vol. 57, p. 281, 2024

  4. [4]

    Monkeytrail: A scalable video-based method for tracking macaque movement trajectory in daily living cages,

    M.-S. Liu, J.-Q. Gao, G.-Y . Hu, G.-F. Hao, T.-Z. Jiang, C. Zhang, and S. Yu, “Monkeytrail: A scalable video-based method for tracking macaque movement trajectory in daily living cages,”Zoological Re- search, vol. 43, no. 3, pp. 343–351, 2022

  5. [5]

    UA V123: A benchmark and simulator for UA V tracking,

    M. Mueller, N. Smith, and B. Ghanem, “UA V123: A benchmark and simulator for UA V tracking,” inEuropean Conference on Computer Vision (ECCV), 2016, pp. 445–461

  6. [6]

    The unmanned aerial vehicle benchmark: Object detection and tracking,

    D. Du, Y . Qi, H. Yu, Y . Yang, K. Duan, G. Li, W. Zhang, Q. Huang, and Q. Tian, “The unmanned aerial vehicle benchmark: Object detection and tracking,” inEuropean Conference on Computer Vision (ECCV), 2018, pp. 370–386

  7. [7]

    Vision Meets Drones: A Challenge

    P. Zhu, L. Wen, X. Bian, H. Ling, and Q. Hu, “Vision meets drones: A challenge,”arXiv preprint arXiv:1804.07437, 2018

  8. [8]

    VisDrone- VDT2018: The vision meets drone video detection and tracking chal- lenge results,

    P. Zhu, L. Wen, D. Du, X. Bian, H. Ling, Q. Huet al., “VisDrone- VDT2018: The vision meets drone video detection and tracking chal- lenge results,” inComputer Vision – ECCV 2018 Workshops, 2019, pp. 496–518

  9. [9]

    High-speed tracking with kernelized correlation filters,

    J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, “High-speed tracking with kernelized correlation filters,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 3, pp. 583–596, 2015

  10. [10]

    Fully-convolutional siamese networks for object tracking,

    L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. S. Torr, “Fully-convolutional siamese networks for object tracking,” inEuropean Conference on Computer Vision (ECCV) Workshops, 2016

  11. [11]

    SiamRPN++: Evolution of siamese visual tracking with very deep networks,

    B. Li, W. Wu, Q. Wang, F. Zhang, J. Xing, and J. Yan, “SiamRPN++: Evolution of siamese visual tracking with very deep networks,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 4282–4291

  12. [12]

    TransT: Transformer tracking,

    X. Chen, B. Yan, J. Zhu, D. Wang, X. Yang, and H. Lu, “TransT: Transformer tracking,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021

  13. [13]

    STARK: Learning spatio- temporal transformer for visual tracking,

    B. Yan, H. Peng, J. Fu, D. Wang, and H. Lu, “STARK: Learning spatio- temporal transformer for visual tracking,” inIEEE/CVF International Conference on Computer Vision (ICCV), 2021

  14. [14]

    Joint feature learning and relation modeling for tracking: A one-stream framework,

    B. Ye, H. Chang, B. Ma, S. Shan, and X. Chen, “Joint feature learning and relation modeling for tracking: A one-stream framework,” inEuropean Conference on Computer Vision (ECCV), 2022

  15. [15]

    Detection and tracking meet drones challenge,

    P. Zhu, L. Wen, X. Bian, H. Ling, and Q. Hu, “Detection and tracking meet drones challenge,”arXiv preprint arXiv:2001.06303, 2020

  16. [16]

    A UA V to UA V tracking benchmark,

    Y . Wang, Z. Huang, R. Lagani `ere, H. Zhang, and L. Ding, “A UA V to UA V tracking benchmark,”Knowledge-Based Systems, vol. 261, p. 110197, 2023

  17. [17]

    Vision-based anti-UA V detection and tracking,

    Y . Zhao, D. Wang, H. Lu, Y . Wang, X. Zhang, and X. Li, “Vision-based anti-UA V detection and tracking,”IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 12, pp. 23 639–23 652, 2022

  18. [18]

    Missingness- aware prompting for modality-missing rgbt tracking,

    G. Hu, Z. Wang, C. Li, D. Yuan, B. He, and J. Tang, “Missingness- aware prompting for modality-missing rgbt tracking,”Journal of King Saud University Computer and Information Sciences, vol. 37, no. 6, pp. 1–17, 2025, art. no. 128

  19. [19]

    End-to-end active object tracking via reinforcement learning,

    W. Luo, P. Sun, F. Zhong, W. Liu, T. Zhang, and Y . Wang, “End-to-end active object tracking via reinforcement learning,” inProceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 80, 2018, pp. 3286–3295

  20. [20]

    UA V dynamic object tracking with lightweight deep vision reinforcement learning,

    H. Nguyen, S. Thudumu, H. Du, K. Mouzakis, and R. Vasa, “UA V dynamic object tracking with lightweight deep vision reinforcement learning,”Algorithms, vol. 16, no. 5, p. 227, 2023

  21. [21]

    Deep reinforcement learning for UA V navigation through massive MIMO technique,

    H. Huang, Y . Yang, H. Wang, Z. Ding, H. Sari, and F. Adachi, “Deep reinforcement learning for UA V navigation through massive MIMO technique,”arXiv preprint arXiv:1901.10832, 2019

  22. [22]

    Deep reinforcement learning for UA V target search and continuous tracking in complex environments with gaussian process regression and prior policy embedding,

    Z. Feng, X. Na, S. Hai, Q. Sun, and J. Shi, “Deep reinforcement learning for UA V target search and continuous tracking in complex environments with gaussian process regression and prior policy embedding,”Electron- ics, vol. 14, no. 7, p. 1330, 2025

  23. [23]

    Habitat: A platform for embodied AI research,

    M. Savva, A. Kadian, O. Maksymets, Y . Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V . Koltun, J. Malik, D. Parikh, and D. Batra, “Habitat: A platform for embodied AI research,” inIEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 9339–9347

  24. [24]

    AI2-THOR: An Interactive 3D Environment for Visual AI

    E. Kolve, R. Mottaghi, D. Gordon, Y . Zhu, A. Gupta, and A. Farhadi, “AI2-THOR: An interactive 3d environment for visual AI,”arXiv preprint arXiv:1712.05474, 2017

  25. [25]

    Gibson Env: Real-world perception for embodied agents,

    F. Xia, A. R. Zamir, Z. He, A. Sax, J. Malik, and S. Savarese, “Gibson Env: Real-world perception for embodied agents,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018

  26. [27]

    Air learning: A deep reinforcement learning gym for autonomous aerial robot visual navigation,

    S. Krishnan, B. Boroujerdian, W. Fu, A. Faust, and V . J. Reddi, “Air learning: A deep reinforcement learning gym for autonomous aerial robot visual navigation,”Machine Learning, vol. 110, no. 9, pp. 2501– 2540, 2021

  27. [28]

    Receding horizon “next-best-view

    A. Bircher, M. Kamel, K. Alexis, H. Oleynikova, and R. Siegwart, “Receding horizon “next-best-view” planner for 3d exploration,” inIEEE International Conference on Robotics and Automation (ICRA), 2016

  28. [29]

    A survey on coverage path planning for robotics,

    E. Galceran and M. Carreras, “A survey on coverage path planning for robotics,”Robotics and Autonomous Systems, vol. 61, no. 12, pp. 1258– 1276, 2013

  29. [30]

    Unreal Engine 4.27 Documentation,

    Epic Games, “Unreal Engine 4.27 Documentation,” https: //dev.epicgames.com/documentation/en-us/unreal-engine?application version=4.27, 2021, official documentation

  30. [31]

    Microsoft Research, “AirSim,” https://microsoft.github.io/AirSim/, 2021, official documentation

  31. [32]

    Airsim: High-fidelity visual and physical simulation for autonomous vehicles,

    S. Shah, D. Dey, C. Lovett, and A. Kapoor, “Airsim: High-fidelity visual and physical simulation for autonomous vehicles,” inField and Service Robotics, 2018, pp. 621–635

  32. [33]

    Auto-encoding variational bayes,

    D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in International Conference on Learning Representations, 2014

  33. [34]

    Visual object tracking using adaptive correlation filters,

    D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y . M. Lui, “Visual object tracking using adaptive correlation filters,” inIEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 2544–2550

  34. [35]

    Actor-critic algorithms,

    V . R. Konda and J. N. Tsitsiklis, “Actor-critic algorithms,” inAdvances in Neural Information Processing Systems, 2000

  35. [36]

    Proximal Policy Optimization Algorithms

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Prox- imal policy optimization algorithms,”arXiv preprint arXiv:1707.06347, 2017

  36. [37]

    Track- ingnet: A large-scale dataset and benchmark for object tracking in the wild,

    M. Muller, A. Bibi, S. Giancola, S. Alsubaihi, and B. Ghanem, “Track- ingnet: A large-scale dataset and benchmark for object tracking in the wild,” inEuropean Conference on Computer Vision (ECCV), 2018, pp. 300–317