4DVLT: Dynamic Scene Understanding with Worldline-Centered Vision-Language Tracking

Boxue Yang; Chaoyue Li; Haoyang Wu; Linfeng Zhang; Rui Qian; Shengyao Zhou

arxiv: 2606.22631 · v1 · pith:63SWGGR6new · submitted 2026-06-21 · 💻 cs.CV

Pith reviewed 2026-06-26 10:49 UTC · model grok-4.3

classification 💻 cs.CV

keywords: 4D dynamic scene understanding · vision-language tracking · worldline-centered modeling · instruction-conditioned tracking · multi-view video · object-centric state graph · target grounding accuracy

The pith

Worldline-centered modeling improves target grounding accuracy by 19.62 points on Instruct-4D

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the 4DVLT task to ground language instructions to persistent worldlines in 4D multi-view video, where a worldline binds object identity, metric 3D motion, and 2D projections over time. It creates the Instruct-4D benchmark with 129.4K question-answer pairs across 851 scenes and nine query types to test this capability. The 4DTrack method models the problem as graph-conditioned worldline inference using an object-centric 4D state graph with metric-guided routing and bidirectional decoding. On the benchmark, 4DTrack with a 9B model achieves 62.68 top-1 accuracy and exceeds the best adapted baseline by 19.62 points, indicating that maintaining consistent worldlines aids both grounding and motion recovery.
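To make the worldline idea concrete, here is a minimal sketch of a data structure that binds one identity to timestamped metric 3D positions and per-camera 2D boxes. The class names, field names, and box layout are hypothetical and chosen only for illustration; the paper's actual representation is not described on this page.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixels.
Box2D = Tuple[float, float, float, float]


@dataclass
class WorldlineState:
    """One timestamped sample of a worldline (hypothetical layout)."""
    t: float                              # timestamp in seconds
    position: Tuple[float, float, float]  # metric 3D position (x, y, z) in metres
    boxes_2d: Dict[str, Box2D] = field(default_factory=dict)  # camera id -> 2D box


@dataclass
class Worldline:
    """A persistent identity plus its time-ordered states."""
    identity: str
    states: List[WorldlineState] = field(default_factory=list)

    def add(self, state: WorldlineState) -> None:
        # Keep states sorted by time so lookups see a consistent ordering.
        self.states.append(state)
        self.states.sort(key=lambda s: s.t)

    def at(self, t: float) -> WorldlineState:
        # Return the stored state whose timestamp is nearest to t.
        if not self.states:
            raise ValueError("empty worldline")
        return min(self.states, key=lambda s: abs(s.t - t))

    def visible_cameras(self, t: float) -> List[str]:
        # Cameras that have a 2D box for the state nearest to t.
        return sorted(self.at(t).boxes_2d)
```

The point of the sketch is that identity, 3D motion, and 2D projections live in one object, so any answer to an instruction can be read off a single consistent record rather than stitched together from separate 2D and 3D outputs.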

Core claim

By centering vision-language tracking on worldlines that persist across time in fully observed multi-view video, the approach allows instruction-conditioned 4D dynamic scene understanding that preserves metric topology and identity continuity, unlike prior methods limited to fragmented 2D or 3D outputs. The 4DTrack implementation demonstrates this by reaching 62.68 TGA_Top1 and outperforming the best adapted baseline by 19.62 points on the Instruct-4D benchmark.
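The two reported numbers imply a baseline score that the page never states directly. The short calculation below derives it, assuming the 19.62 points are an absolute difference on the same TGA_Top1 scale; the baseline value and the relative gain are derived here, not quoted from the paper.

```python
def implied_baseline(score: float, margin: float) -> float:
    """Baseline score implied by a reported score and an absolute margin."""
    return round(score - margin, 2)


four_d_track = 62.68  # reported TGA_Top1 for 4DTrack-Qwen3.5-9B
margin = 19.62        # reported gain over the best adapted VLT baseline

# Derived values (assumption: the margin is in absolute points).
baseline = implied_baseline(four_d_track, margin)
relative_gain_pct = round(100.0 * margin / baseline, 1)
```

Under that assumption the best adapted baseline sits near 43.06, so the gain is large in relative terms as well, which is why the referee's request for baseline-adaptation details matters.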

What carries the argument

Graph-conditioned worldline inference through an object-centric 4D state graph, metric-guided routing, bidirectional decoding, and kinematic calibration.

If this is right

Improved target grounding accuracy for language instructions in dynamic 4D scenes.
Better quality of recovered worldlines that align with actual 3D motion.
Effective handling of reasoning-oriented queries involving temporal and spatial relations.
Applicability to both synthetic and captured scenes in the benchmark.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar worldline structures could help in extending 4D understanding to settings with fewer camera views.
Combining this with larger multimodal models may enhance reasoning while keeping metric accuracy.
Worldline inference might support downstream tasks like 4D scene editing or future prediction.

Load-bearing premise

The Instruct-4D benchmark provides a faithful test of instruction-conditioned 4D dynamic scene understanding that generalizes beyond its 851 scenes.

What would settle it

Evaluating the method on additional real-world multi-view video datasets with language queries outside the current benchmark distribution.

Figures

Figures reproduced from arXiv: 2606.22631 by Boxue Yang, Chaoyue Li, Haoyang Wu, Linfeng Zhang, Rui Qian, Shengyao Zhou.

**Figure 1.** From existing paradigms to 4DVLT. LMMs lack metric tracking and conventional VLT is limited to a single 2D or 3D view; 4DVLT instead recovers a calibrated multi-view worldline from video and an instruction. The right panel summarizes 4DTrack, while the lower panels show Instruct-4D and benchmark-level performance.

**Figure 2.** Instruct-4D at a glance. (a) Nine instruction types; (b) geometry processing, structured-clue extraction, and verified instruction generation from nuScenes and WildTrack; (c) EgoWL/AlloWL statistics (129.4K instructions and 64.7K trajectories in total); and (d) grounding-oriented (TGA_Top1, TGA) and worldline-oriented (WQS, CTQ) metrics.

**Figure 3.** Overview of 4DTrack. State, query, and geometry tokens form a 4D graph that is contracted by metric-guided routing, decoded bidirectionally with kinematic-prior beam search, and aligned into a unified 3D trajectory and synchronized multi-view 2D boxes.

**Figure 4.** Per-query effects of 4DTrack, macro-averaged across EgoWL and AlloWL. The top row reports TGA_Top1 (higher is better) and the bottom row ADE3D (lower is better); the two columns separate the query families. Blue/orange bars denote 4DTrack and gray bars the matched backbone without it.
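Figure 4 reports ADE3D, but this page does not define it. Assuming it follows the conventional average displacement error (the mean Euclidean distance between predicted and ground-truth 3D positions at matched timestamps), a minimal sketch looks like this; the function name and the matched-timestamp assumption are illustrative, not taken from the paper.

```python
import math
from typing import Sequence, Tuple

Point3D = Tuple[float, float, float]


def ade_3d(pred: Sequence[Point3D], gt: Sequence[Point3D]) -> float:
    """Mean Euclidean distance between matched 3D points (conventional ADE)."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("trajectories must be non-empty and equally long")
    # Assumption: pred[i] and gt[i] refer to the same timestamp.
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)
```

Lower values mean the recovered trajectory stays closer to the true metric path, which matches the "lower is better" reading in the caption.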

**Figure 5.** 3D Volume Geometry (AlloWL). The query selects the pedestrian ranked 14th by 3D bounding-box volume at t=44.0 s. Columns show sampled timestamps; rows show 3D boxes in Camera 6 and synchronized 2D boxes in Cameras 1 and 7. Green dashed boxes denote ground truth, red boxes denote 4DTrack, and the remaining colors follow the embedded legend.

**Figure 6.** Absolute 3D Position (AlloWL). The query grounds a pedestrian by a metric offset from the world origin at t=2.5 s. Columns show sampled timestamps; rows show 3D boxes in Camera 5 and synchronized 2D boxes in Cameras 3, 7, and 1. Box colors follow the embedded legend.

**Figure 7.** Disambiguation (AlloWL). The query distinguishes two similarly dressed pedestrians only 2.39 m apart using clothing and metric-position cues. Columns show sampled timestamps; rows show 3D boxes in Camera 2 and synchronized 2D boxes in Cameras 1, 6, and 3. Box colors follow the embedded legend. Query (truncated in the source): "At t=119.0s, two pedestrians are only 2.39m apart. One is a pedestrian in a black jacket blue jeans at (8.25, 13.43, 0.91) and the ot…"

**Figure 8.** Kinematic Shift (EgoWL). The query identifies a pedestrian through a sudden acceleration near t=11.4 s. Columns follow the target across changing ego-camera views, with 3D boxes above and 2D boxes below. Box colors follow the embedded legend.

**Figure 9.** Relative 3D Proximity (AlloWL). The query selects the pedestrian ranked fourth closest to Camera 5 at t=142.0 s. Columns show sampled timestamps; rows show 3D boxes in Camera 2 and synchronized 2D boxes in Cameras 5, 1, and 3. Box colors follow the embedded legend. Query: "At t=142.0s, track the pedestrian in a red jacket blue jeans that is the 4th closest to camera 5 and output their complete trajectory."

**Figure 10.** Reverse Reasoning (EgoWL). The query anchors a car at its state at t=16.0 s and asks for the preceding 16-second trajectory. Columns trace the target backward through changing ego-camera views, with 3D boxes above and 2D boxes below. Box colors follow the embedded legend.

**Figure 11.** Spatiotemporal Anchor (AlloWL). The query links a right-side entry event at t=190.5 s to a later metric position at t=195.5 s. Columns show sampled timestamps; rows show 3D boxes in Camera 2 and synchronized 2D boxes in Cameras 1, 3, and 6. Box colors follow the embedded legend. Query (truncated in the source): "Track the pedestrian in a black jacket black pants who entered the scene from the right at t=190.5s and was located at (4.80, 1…"

**Figure 12.** Motion Residual (AlloWL). The query identifies a pedestrian from its residual displacement over t=171.0–179.5 s. Columns show sampled timestamps; rows show 3D boxes in Camera 1 and synchronized 2D boxes in Cameras 2, 6, and 3. Box colors follow the embedded legend.

**Figure 13.** Trajectory Shape (EgoWL). The query selects the car whose trajectory moves overall northward during t=0.0–6.5 s. Columns follow the target across changing ego-camera views, with 3D boxes above and 2D boxes below. Box colors follow the embedded legend. Query: "Track the car who moved overall northward from t=0.0s to t=6.5s and output their complete trajectory."
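Several of the query types in the figures above reduce to simple geometric predicates once metric 3D positions are available. The sketch below shows two of them against ground-truth-style data: ranking entities by distance to a camera (as in the Relative 3D Proximity query) and testing whether a track moves "overall northward" (as in the Trajectory Shape query). The function names, the convention that +y points north, and the 0.5 m displacement threshold are all assumptions made for illustration; this is not how 4DTrack itself resolves queries.

```python
import math
from typing import Dict, List, Tuple

Point3D = Tuple[float, float, float]


def kth_closest(positions: Dict[str, Point3D], camera: Point3D, k: int) -> str:
    """Return the id of the k-th closest entity to the camera (k is 1-based)."""
    if not 1 <= k <= len(positions):
        raise ValueError("k out of range")
    # Sort entity ids by Euclidean distance from the camera centre.
    ranked = sorted(positions, key=lambda eid: math.dist(positions[eid], camera))
    return ranked[k - 1]


def moves_overall_north(track: List[Point3D], min_disp: float = 0.5) -> bool:
    """True if the net displacement is mainly along +y (assumed north)."""
    if len(track) < 2:
        raise ValueError("need at least two points")
    dx = track[-1][0] - track[0][0]
    dy = track[-1][1] - track[0][1]
    # Require a minimum northward displacement that dominates the east-west one.
    return dy >= min_disp and abs(dy) >= abs(dx)
```

Predicates like these only select the target at the anchor time; the task still requires returning the full trajectory and synchronized 2D boxes for that identity.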

Original abstract

4D dynamic scene understanding requires grounding language to a persistent worldline that binds identity, metric 3D motion, and synchronized multi-view 2D projections. Existing paradigms capture only part of this structure: large multimodal models reason over rich visual evidence but rarely preserve metric topology, while vision-language tracking remains tied to fragmented 2D or 3D outputs and local continuation. We therefore introduce 4DVLT, a worldline-centered task for instruction-conditioned 4D dynamic scene understanding in fully observed multi-view video, and Instruct-4D, a benchmark with 129.4K question-answer pairs, 64.7K target entities, 851 scenes, and 9 reasoning-oriented query types. To address this setting, we present 4DTrack, which casts instruction-conditioned tracking as graph-conditioned worldline inference through an object-centric 4D state graph, metric-guided routing, bidirectional decoding, and kinematic calibration. On Instruct-4D, 4DTrack-Qwen3.5-9B reaches 62.68 TGA_Top1 and surpasses the best adapted VLT baseline by 19.62 points. These results show that worldline-centered modeling improves both target grounding and recovered worldline quality. The project page is available at https://github.com/mikubaka88/4DVLT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper defines a new 4DVLT task and Instruct-4D benchmark for language-to-4D-worldline grounding, shows a 19-point gain on its own data with 4DTrack, but leaves benchmark representativeness and baseline details untested.

The main thing here is a new task that ties language instructions to persistent metric 4D object paths across multi-view video, plus a benchmark with 129k QA pairs and nine query types, and a model called 4DTrack that uses an object-centric graph, metric routing, bidirectional decoding, and kinematic calibration. It reports 62.68 TGA_Top1, beating the best adapted baseline by 19.62 points on Instruct-4D.

What stands out is the explicit framing around worldlines that keep identity, 3D motion, and 2D projections together. The architecture choices look like a reasonable way to enforce that structure instead of treating tracking as local 2D or 3D continuation. The scale of the benchmark (851 scenes, 64k entities) is concrete and the query types target reasoning rather than simple localization.

The soft spot is that everything rests on this single new benchmark. There are no cross-dataset results, no error bars, no mention of statistical significance, and no breakdown of how the VLT baselines were adapted to the 4D setting. If the 851 scenes miss long occlusions, scale changes, or instruction distributions that appear in real robotics data, the margin could shrink. The abstract does not show external validation, so the claim that worldline-centered modeling improves recovered paths stays tied to Instruct-4D artifacts.

This is for people already working on vision-language tracking or 4D scene models who want a structured benchmark to test against. A reader interested in downstream robotics or persistent object tracking would get value from the task definition and the reported numbers, even if they later need to run their own checks.

It deserves a serious referee. The task and benchmark are fresh enough that external review on representativeness and baseline fairness would help, and the model components are specific enough to evaluate. I would send it out rather than desk reject.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces the 4DVLT task for instruction-conditioned 4D dynamic scene understanding that grounds language to persistent worldlines binding identity, metric 3D motion, and multi-view projections. It presents the Instruct-4D benchmark (129.4K QA pairs, 64.7K entities, 851 scenes, 9 query types) and the 4DTrack method, which performs graph-conditioned worldline inference via an object-centric 4D state graph, metric-guided routing, bidirectional decoding, and kinematic calibration. On Instruct-4D, 4DTrack-Qwen3.5-9B achieves 62.68 TGA_Top1 and exceeds the best adapted VLT baseline by 19.62 points, with the central claim that worldline-centered modeling improves target grounding and recovered worldline quality.

Significance. If the evaluation details are clarified, the work offers a new task formulation and benchmark that explicitly couples language instructions with metric 4D worldlines, addressing a gap between multimodal reasoning models and fragmented tracking outputs. The reported numerical gain and the open project page (https://github.com/mikubaka88/4DVLT) provide a concrete starting point for reproducible research on persistent 4D scene representations. The contribution is primarily empirical and benchmark-driven rather than theoretical.

major comments (3)

[Abstract / Experiments] Abstract and Experiments section: The central claim that worldline-centered modeling yields a 19.62-point TGA_Top1 gain rests on comparison to 'best adapted VLT baseline,' yet no description is given of the adaptation procedure (e.g., how 2D/3D VLT methods are extended to multi-view 4D queries or the 9 reasoning types). This detail is load-bearing for attributing the margin to the proposed 4D state graph rather than implementation differences.
[Experiments] Experiments section: The reported 62.68 TGA_Top1 and 19.62-point improvement are presented without error bars, standard deviations across runs, or statistical significance tests. Given that the benchmark is newly introduced and the claim concerns improved worldline quality, these statistics are required to establish that the observed margin is robust.
[Benchmark / Evaluation] Benchmark and Evaluation sections: No cross-dataset results or external validation on existing 4D or tracking benchmarks are reported. The claim that Instruct-4D constitutes a faithful test of instruction-conditioned 4D understanding therefore depends entirely on the internal diversity of the 851 scenes and 129.4K QA pairs, which is not externally corroborated.
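One inexpensive way to attach an uncertainty estimate to a single-run top-1 accuracy, in the spirit of the second major comment, is a percentile bootstrap over per-query correctness. The sketch below assumes each evaluated query yields a 0/1 outcome; the function name and defaults are hypothetical. It captures sampling variability over queries only, not variance across training seeds, so it complements rather than replaces multi-seed runs.

```python
import random
from typing import List, Sequence, Tuple


def bootstrap_accuracy_ci(
    correct: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (accuracy, lower, upper) using a percentile bootstrap."""
    if not correct:
        raise ValueError("need at least one outcome")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(correct)
    point = sum(correct) / n
    means: List[float] = []
    for _ in range(n_resamples):
        # Resample queries with replacement and recompute accuracy.
        sample = rng.choices(correct, k=n)
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return point, lower, upper
```

If the per-query outcomes of 4DTrack and the best adapted baseline were both available, running this on each (or on their paired differences) would indicate whether the reported margin is well outside query-sampling noise.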

minor comments (2)

[Abstract] The acronym TGA_Top1 is used in the abstract without an explicit definition; a brief expansion or reference to its definition in the main text would improve readability.
Figure and table captions could more explicitly link visual results to the nine query types to help readers connect qualitative examples to the quantitative claims.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful comments and the recommendation for major revision. We address each of the major comments below, providing clarifications and indicating where revisions will be made to the manuscript.

Referee: [Abstract / Experiments] Abstract and Experiments section: The central claim that worldline-centered modeling yields a 19.62-point TGA_Top1 gain rests on comparison to 'best adapted VLT baseline,' yet no description is given of the adaptation procedure (e.g., how 2D/3D VLT methods are extended to multi-view 4D queries or the 9 reasoning types). This detail is load-bearing for attributing the margin to the proposed 4D state graph rather than implementation differences.

Authors: We agree that more detail on the baseline adaptation is necessary to support the central claim. The original manuscript provided a high-level overview, but we will revise the Experiments section to include a comprehensive description of the adaptation procedure for the VLT baselines, specifying the extensions to multi-view 4D queries and the 9 reasoning types. This revision will help attribute the performance improvements to the worldline-centered modeling.
Revision: yes

Referee: [Experiments] Experiments section: The reported 62.68 TGA_Top1 and 19.62-point improvement are presented without error bars, standard deviations across runs, or statistical significance tests. Given that the benchmark is newly introduced and the claim concerns improved worldline quality, these statistics are required to establish that the observed margin is robust.

Authors: We recognize the value of error bars and statistical tests for establishing robustness, particularly for a new benchmark. Due to the significant computational resources required for training and inference on the large Instruct-4D benchmark, we conducted single-run experiments. In the revised manuscript, we will add error bars where possible by reporting results from multiple random seeds for the inference components and include a discussion of the observed margin's robustness. We will also note this as a limitation.
Revision: partial

Referee: [Benchmark / Evaluation] Benchmark and Evaluation sections: No cross-dataset results or external validation on existing 4D or tracking benchmarks are reported. The claim that Instruct-4D constitutes a faithful test of instruction-conditioned 4D understanding therefore depends entirely on the internal diversity of the 851 scenes and 129.4K QA pairs, which is not externally corroborated.

Authors: We appreciate the suggestion for external validation. However, to the best of our knowledge, there are no existing benchmarks that provide instruction-conditioned queries aligned with persistent 4D worldlines in multi-view settings. Creating such alignments for external datasets would require substantial new annotation efforts outside the scope of this paper. We will revise the Benchmark section to more explicitly justify the design of Instruct-4D based on its scale and diversity (851 scenes, 64.7K entities, 9 query types) and release the full benchmark to facilitate future external validations by the community.
Revision: no

Circularity Check

0 steps flagged

No circularity; empirical results on newly introduced benchmark are independent of internal definitions.

The manuscript introduces the 4DVLT task, Instruct-4D benchmark (129.4K QA pairs, 851 scenes, 9 query types), and 4DTrack method, then reports empirical TGA_Top1 scores and a 19.62-point gain over baselines. No equations, parameter-fitting steps, or self-citation chains are described that would reduce the performance claims to the inputs by construction. The central claim rests on direct benchmark evaluation rather than any self-definitional or fitted-input reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the domain assumption that worldlines are the appropriate binding structure for identity, metric 3D motion, and multi-view projections, plus the modeling choice that graph-conditioned inference with the listed components is sufficient; no free parameters or invented entities with independent evidence are detailed in the abstract.

axioms (1)

Domain assumption: 4D dynamic scene understanding requires grounding language to a persistent worldline that binds identity, metric 3D motion, and synchronized multi-view 2D projections.
Stated as the foundational requirement in the first sentence of the abstract.

invented entities (2)

worldline (no independent evidence)
purpose: Binds identity, metric 3D motion, and synchronized multi-view 2D projections for persistent tracking.
Introduced as the central organizing concept for the new task.

4D state graph (no independent evidence)
purpose: Enables graph-conditioned worldline inference in the 4DTrack method.
Part of the proposed model architecture.

Reference graph

Works this paper leans on

60 extracted references · 34 canonical work pages

[1]

Gpt-4 technical report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 , 2023

Pith/arXiv arXiv 2023
[2]

Flamingo: A visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeﬀ Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, ...

work page doi:10.52202/068431- 2022
[3]

Henriques, Andrea Vedaldi, and Philip H

Luca Bertinetto, Jack Valmadre, João F. Henriques, Andrea Vedaldi, and Philip H. S. Torr. Fully- convolutional siamese networks for object tracking. In Gang Hua and Hervé Jégou, editors, Computer Vision ECCV 2016 Workshops , volume 9914, pages 850–865. Springer International Publishing, 2016. ISBN 978-3-319-48880-6 978-3-319-48881-3. doi: 10.1007/978-3-31...

work page doi:10.1007/978-3-319-48881-3_56 2016
[4]

In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11618–11628. IEEE, 2020. ISBN 978-1-7281-7168-5. doi: 10.1109/CVPR426...

work page doi:10.1109/cvpr42600.2020 2020
[5]

WILDTRACK: A multi-camera HD dataset for dense unscripted pedestrian detection

Tatjana Chavdarova, Pierre Baque, Stephane Bouquet, Andrii Maksai, Cijo Jose, Timur Bagautdinov, Louis Lettry, Pascal Fua, Luc Van Gool, and Francois Fleuret. WILDTRACK: A multi-camera HD dataset for dense unscripted pedestrian detection. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5030–5039. IEEE, 2018. ISBN 978-1-5386-6...

arXiv 2018
[6]

Chang, and Matthias NieSSner

Dave Zhenyu Chen, Angel X. Chang, and Matthias NieSSner. ScanRefer: 3D object localization in RGB-D scans using natural language. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan- Michael Frahm, editors, Computer Vision ECCV 2020 , volume 12365, pages 202–221. Springer Inter- national Publishing, 2020. ISBN 978-3-030-58564-8 978-3-030-58565-5. doi: ...

work page doi:10.1007/978-3-030-58565- 2020
[7]

Transformer tracking

Xin Chen, Bin Yan, Jiawen Zhu, Dong Wang, Xiaoyun Yang, and Huchuan Lu. Transformer tracking. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages 8122–

2021
[8]

Contrastive learning for compact single image dehazing,

IEEE, 2021. ISBN 978-1-6654-4509-2. doi: 10.1109/CVPR46437.2021.00803

work page doi:10.1109/cvpr46437.2021.00803 2021
[9]

Self-Supervised Learning from Images with a Joint- Embedding Predictive Architecture

Xin Chen, Houwen Peng, Dong Wang, Huchuan Lu, and Han Hu. SeqTrack: Sequence to se- quence learning for visual object tracking. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages 14572–14581. IEEE, 2023. ISBN 979-8-3503-0129-8. doi: 10.1109/CVPR52729.2023.01400

work page doi:10.1109/cvpr52729.2023.01400 2023
[10]

Qlora: Eﬃcient ﬁnetuning of quantized llms

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Eﬃcient ﬁnetuning of quantized llms. Advances in neural information processing systems , 36:10088–10115, 2023

2023
[11]

Contrastive learning for compact single image dehazing,

Qi Feng, Vitaly Ablavsky, Qinxun Bai, and Stan Sclaroﬀ. Siamese natural language tracker: Tracking by natural language descriptions with siamese trackers. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages 5847–5856. IEEE, 2021. ISBN 978-1-6654-4509-2. doi: 10.1109/CVPR46437.2021.00579. 14

work page doi:10.1109/cvpr46437.2021.00579 2021
[12]

MemVLT: Vision-language tracking with adaptive memory-based prompts

Xiaokun Feng, Xuchen Li, Shiyu Hu, Dailing Zhang, Meiqi Wu, Jing Zhang, Xiaotang Chen, and Kaiqi Huang. MemVLT: Vision-language tracking with adaptive memory-based prompts. In Advances in Neural Information Processing Systems 37 (NeurIPS 2024) , 2024. doi: 10.52202/079017-0476

work page doi:10.52202/079017-0476 2024
[13]

The llama 3 herd of models

Aaron Grattaﬁori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783 , 2024

Pith/arXiv arXiv 2024
[14]

Kuckreja, M

Xin Gu, Heng Fan, Yan Huang, Tiejian Luo, and Libo Zhang. Context-guided spatio-temporal video grounding. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 1–13, 2024. doi: 10.1109/CVPR52733.2024.01735

work page doi:10.1109/cvpr52733.2024.01735 2024
[15]

Divert more attention to vision-language object tracking

Mingzhe Guo, Zhipeng Zhang, Liping Jing, Haibin Ling, and Heng Fan. Divert more attention to vision-language object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence , 46 (12):8600–8618, 2024. ISSN 0162-8828, 2160-9292, 1939-3539. doi: 10.1109/TPAMI.2024.3409078

work page doi:10.1109/tpami.2024.3409078 2024
[16]

Lora: Low-rank adaptation of large language models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. Iclr, 1(2):3, 2022

2022
[17]

GOT-10k: A large high-diversity benchmark for generic object tracking in the wild

Lianghua Huang, Xin Zhao, and Kaiqi Huang. GOT-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence , 43(5): 1562–1577, 2021. ISSN 0162-8828, 2160-9292, 1939-3539. doi: 10.1109/TPAMI.2019.2957464

work page doi:10.1109/tpami.2019.2957464 2021
[18]

Thinking in dynamics: How multimodal large language models perceive, track, and reason dynamics in physical 4d world

Yuzhi Huang, Kairun Wen, Rongxin Gao, Dongxuan Liu, Yibin Lou, Jie Wu, Jing Xu, Jian Zhang, Zheng Yang, Yunlong Lin, et al. Thinking in dynamics: How multimodal large language models perceive, track, and reason dynamics in physical 4d world. arXiv preprint arXiv:2603.12746 , 2026

arXiv 2026
[19]

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Guillaume Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825 , 2023

Pith/arXiv arXiv 2023
[20]

BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. Proceedings of the International Conference on Machine Learning , pages 1–13, 2023

2023
[21]

Time3d: End-to-end joint monocular 3d object detection and tracking for autonomous driving

Peixuan Li and Jieyu Jin. Time3d: End-to-end joint monocular 3d object detection and tracking for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3885–3894, 2022

2022
[22]

In: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Xiaohai Li, Bineng Zhong, Qihua Liang, Zhiyi Mo, Jian Nong, and Shuxiang Song. Dynamic updates for language adaptation in visual-language tracking. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 1–10, 2025. doi: 10.1109/CVPR52734.2025.01785

work page doi:10.1109/cvpr52734.2025.01785 2025
[23]

In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp

Xuchen Li, Xiaokun Feng, Shiyu Hu, Meiqi Wu, Dailing Zhang, Jing Zhang, and Kaiqi Huang. DTLLM-VLT: Diverse text generation for visual language tracking based on LLM. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR W) , pages 7283–7292. IEEE, 2024. ISBN 979-8-3503-6547-4. doi: 10.1109/CVPR W63382.2024.00724

work page doi:10.1109/cvpr 2024
[24]

Tracking by natural language speciﬁcation

Zhenyang Li, Ran Tao, Efstratios Gavves, Cees G M Snoek, and Arnold W M Smeulders. Tracking by natural language speciﬁcation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2017. doi: 10.1109/CVPR.2017.777

work page doi:10.1109/cvpr.2017.777 2017
[25]

In: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Xinqi Liu, Li Zhou, Zikun Zhou, Jianqiu Chen, and Zhenyu He. MambaVLT: Time-evolving multi- modal state space model for vision-language tracking. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 1–15, 2025. doi: 10.1109/CVPR52734.2025.00816. 15

work page doi:10.1109/cvpr52734.2025.00816 2025
[26]

Unifying visual and vision-language tracking via contrastive learning

Yinchao Ma, Yuyang Tang, Wenfei Yang, Tianzhu Zhang, Jinpeng Zhang, and Mengxue Kang. Unifying visual and vision-language tracking via contrastive learning. Proceedings of the AAAI Conference on Artiﬁcial Intelligence , 38(5):4107–4116, 2024. ISSN 2374-3468, 2159-5399. doi: 10.1609/aaai.v38i5.28205

work page doi:10.1609/aaai.v38i5.28205 2024
[27]

TrackingNet: A large-scale dataset and benchmark for object tracking in the wild

Matthias Müller, Adel Bibi, Silvio Giancola, Salman Alsubaihi, and Bernard Ghanem. TrackingNet: A large-scale dataset and benchmark for object tracking in the wild. In Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss, editors, Computer Vision ECCV 2018 , volume 11205, pages 310–327. Springer International Publishing, 2018. ISBN 978-...

work page doi:10.1007/978-3-030-01246-5_19 2018
[28]

Qwen3.5: Towards native multimodal agents, February 2026

Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/ blog?id=qwen3.5

2026
[29]

Kuckreja, M

Yanyan Shao, Shuting He, Qi Ye, Yuchao Feng, Wenhan Luo, and Jiming Chen. Context-aware integration of language and visual references for natural language tracking. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages 19208–19217. IEEE, 2024. ISBN 979-8-3503-5300-6. doi: 10.1109/CVPR52733.2024.01817

work page doi:10.1109/cvpr52733.2024.01817 2024
[30]

Qwen2.5-vl, January 2025

Qwen Team. Qwen2.5-vl, January 2025. URL https://qwenlm.github.io/blog/qwen2.5-vl/

2025
[31]

Vptracker: Global vision-language tracking via visual prompt and mllm

Jingchao Wang, Kaiwen Zhou, Zhijian Wu, Kunhua Ji, Dingjiang Huang, and Yefeng Zheng. Vptracker: Global vision-language tracking via visual prompt and mllm. arXiv preprint arXiv:2512.22799, 2025
[32]

Describe and attend to track: Learning natural language guided structural representation and visual attention for object tracking

Xiao Wang, Chenglong Li, Rui Yang, Tianzhu Zhang, Jin Tang, and Bin Luo. Describe and attend to track: Learning natural language guided structural representation and visual attention for object tracking. arXiv preprint arXiv:1811.10014, 2018
[33]

Towards more flexible and accurate object tracking with natural language: Algorithms and benchmark

Xiao Wang, Xiujun Shu, Zhipeng Zhang, Bo Jiang, Yaowei Wang, Yonghong Tian, and Feng Wu. Towards more flexible and accurate object tracking with natural language: Algorithms and benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13763–13773, 2021
[34]

Mono3DVLT: Monocular-video-based 3D visual language tracking

Hongkai Wei, Yang Yang, Shijie Sun, Mingtao Feng, Xiangyu Song, Qi Lei, Hongli Hu, Rong Wang, Huansheng Song, Naveed Akhtar, and Ajmal Saeed Mian. Mono3DVLT: Monocular-video-based 3D visual language tracking. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1–11, 2025. doi: 10.1109/CVPR52734.2025.01296
[35]

Mvggt: Multimodal visual geometry grounded transformer for multiview 3d referring expression segmentation

Changli Wu, Haodong Wang, Jiayi Ji, Yutian Yao, Chunsai Du, Jihua Kang, Yanwei Fu, and Liujuan Cao. Mvggt: Multimodal visual geometry grounded transformer for multiview 3d referring expression segmentation. arXiv preprint arXiv:2601.06874, 2026
[36]

OnlineRefer: A simple online baseline for referring video object segmentation

Dongming Wu, Tiancai Wang, Yuang Zhang, Xiangyu Zhang, and Jianbing Shen. OnlineRefer: A simple online baseline for referring video object segmentation. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 2749–2758. IEEE, 2023. ISBN 979-8-3503-0718-4. doi: 10.1109/ICCV51070.2023.00259
[37]

Language as queries for referring video object segmentation

Jiannan Wu, Yi Jiang, Peize Sun, Zehuan Yuan, and Ping Luo. Language as queries for referring video object segmentation. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4964–4974. IEEE, 2022. ISBN 978-1-6654-6946-3. doi: 10.1109/CVPR52688.2022.00492
[38]

Joint feature learning and relation modeling for tracking: A one-stream framework

Botao Ye, Hong Chang, Bingpeng Ma, Shiguang Shan, and Xilin Chen. Joint feature learning and relation modeling for tracking: A one-stream framework. In Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner, editors, Computer Vision – ECCV 2022, volume 13682, pages 341–357. Springer Nature Switzerland, 2022. ISBN 978-3...

doi: 10.1007/978-3-031-20047-2_20
[39]

Mllm-4d: Towards visual-based spatial-temporal intelligence

Xingyilang Yin, Chengzhengxu Li, Jiahao Chang, Chi-Man Pun, and Xiaodong Cun. Mllm-4d: Towards visual-based spatial-temporal intelligence. arXiv preprint arXiv:2603.00515, 2026
[40]

All in one: Exploring unified vision-language tracking with multi-modal alignment

Chunhui Zhang, Xin Sun, Yiqian Yang, Li Liu, Qiong Liu, Xi Zhou, and Yanfeng Wang. All in one: Exploring unified vision-language tracking with multi-modal alignment. In Proceedings of the 31st ACM International Conference on Multimedia, pages 5552–5561. ACM, 2023. ISBN 979-8-4007-0108-. doi: 10.1145/3581783.3611803
[42]

One-stream stepwise decreasing for vision-language tracking

Guangtong Zhang, Bineng Zhong, Qihua Liang, Zhiyi Mo, Ning Li, and Shuxiang Song. One-stream stepwise decreasing for vision-language tracking. IEEE Transactions on Circuits and Systems for Video Technology, 34(10):9053–9063, 2024. ISSN 1051-8215, 1558-2205. doi: 10.1109/TCSVT.2024.3395352

[43]

Aware distillation for robust vision-language tracking under linguistic sparsity

Guangtong Zhang, Bineng Zhong, Shirui Yang, Yang Wang, and Tian Bai. Aware distillation for robust vision-language tracking under linguistic sparsity. Proceedings of the AAAI Conference on Artificial Intelligence, pages 1–9, 2026. doi: 10.1609/aaai.v40i15.38237
[44]

One-stream vision-language memory network for object tracking

Huanlong Zhang, Jingchao Wang, Jianwei Zhang, Tianzhu Zhang, and Bineng Zhong. One-stream vision-language memory network for object tracking. IEEE Transactions on Multimedia, 26:1720–1730. ISSN 1520-9210, 1941-0077. doi: 10.1109/TMM.2023.3285441
[46]

From flatland to space: Teaching vision-language models to perceive and reason in 3d

Jiahui Zhang, Yurui Chen, Yueming Xu, Ze Huang, Jilin Mei, Chunhui Chen, Yanpeng Zhou, Yu-Jie Yuan, Xinyue Cai, Guowei Huang, et al. From flatland to space: Teaching vision-language models to perceive and reason in 3d. Advances in Neural Information Processing Systems, 38, 2026
[47]

Uav-track vla: Embodied aerial tracking via vision-language-action models

Qiyao Zhang, Shuhua Zheng, Jianli Sun, Chengxiang Li, Xianke Wu, Zihan Song, Zhiyong Cui, Yisheng Lv, and Yonglin Tian. Uav-track vla: Embodied aerial tracking via vision-language-action models. arXiv preprint arXiv:2604.02241, 2026
[48]

Mutr3d: A multi-camera tracking framework via 3d-to-2d queries

Tianyuan Zhang, Xuanyao Chen, Yue Wang, Yilun Wang, and Hang Zhao. Mutr3d: A multi-camera tracking framework via 3d-to-2d queries. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4537–4546, 2022
[49]

Yu Zhang, Yiming Sun, Mi Zhang, Fan Yu, Shaoxiang Chen, Yang Li, Changbo Wang, Jianke Zhu, and Steven C.H. Hoi. ChatTracker: Enhancing visual tracking via LLM-driven iterative description refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–18, 2026. ISSN 0162-8828, 2160-9292, 1939-3539. doi: 10.1109/TPAMI.2026.3674357
[50]

Llava-video: Video instruction tuning with synthetic data

Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Llava-video: Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713, 2024
[51]

Where does it exist: Spatio-temporal video grounding for multi-form sentences

Zhu Zhang, Zhou Zhao, Yang Zhao, Qi Wang, Huasheng Liu, and Lianli Gao. Where does it exist: Spatio-temporal video grounding for multi-form sentences. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10665–10674. IEEE, 2020. ISBN 978-1-7281-7168-5. doi: 10.1109/CVPR42600.2020.01068
[52]

Transformer vision-language tracking via proxy token guided cross-modal fusion

Haojie Zhao, Xiao Wang, Dong Wang, Huchuan Lu, and Xiang Ruan. Transformer vision-language tracking via proxy token guided cross-modal fusion. Pattern Recognition Letters, 168:10–16, 2023. ISSN 01678655. doi: 10.1016/j.patrec.2023.02.023
[53]

Effective local and global search for fast long-term tracking

Haojie Zhao, Bin Yan, Dong Wang, Xuesheng Qian, Xiaoyun Yang, and Huchuan Lu. Effective local and global search for fast long-term tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1):460–474, 2023. ISSN 0162-8828, 2160-9292, 1939-3539. doi: 10.1109/TPAMI.2022.3153645
[54]

Video-3D LLM: Learning position-aware video representation for 3D scene understanding

Duo Zheng, Shijia Huang, and Liwei Wang. Video-3D LLM: Learning position-aware video representation for 3D scene understanding. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1–14, 2025. doi: 10.1109/CVPR52734.2025.00841
[55]

Learning from videos for 3d world: Enhancing mllms with 3d vision geometry priors

Duo Zheng, Yanyang Li, Liwei Wang, et al. Learning from videos for 3d world: Enhancing mllms with 3d vision geometry priors. Advances in Neural Information Processing Systems, 38:20560–20586, 2026
[56]

Towards unified token learning for vision-language tracking

Yaozong Zheng, Bineng Zhong, Qihua Liang, Guorong Li, Rongrong Ji, and Xianxian Li. Towards unified token learning for vision-language tracking. IEEE Transactions on Circuits and Systems for Video Technology, pages 1–11, 2024. doi: 10.1109/TCSVT.2023.3301933
[57]

Llava-4d: Embedding spatiotemporal prompt into lmms for 4d scene understanding

Hanyu Zhou and Gim Hee Lee. Llava-4d: Embedding spatiotemporal prompt into lmms for 4d scene understanding. In The Fourteenth International Conference on Learning Representations, 2025
[58]

Joint visual grounding and tracking with natural language specification

Li Zhou, Zikun Zhou, Kaige Mao, and Zhenyu He. Joint visual grounding and tracking with natural language specification. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 23151–23160. IEEE, 2023. ISBN 979-8-3503-0129-8. doi: 10.1109/CVPR52729.2023.02217
[59]

LLaVA-3D: A simple yet effective pathway to empowering LMMs with 3D capabilities

Chenming Zhu, Tai Wang, Wenwei Zhang, Jiangmiao Pang, and Xihui Liu. LLaVA-3D: A simple yet effective pathway to empowering LMMs with 3D capabilities. Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1–18, 2025. doi: 10.1109/ICCV51701.2025.00409
[60]

Minigpt-4: Enhancing vision-language understanding with advanced large language models

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. Proceedings of the International Conference on Learning Representations, pages 1–15, 2024

Appendix

In the appendix, we provide additional method details including graph-conditioned prompt constru...
