pith. machine review for the scientific record.

arxiv: 2602.19035 · v2 · submitted 2026-02-22 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

OpenVO: Open-World Visual Odometry with Temporal Dynamics Awareness

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 20:49 UTC · model grok-4.3

classification 💻 cs.CV
keywords visual odometry · ego-motion estimation · temporal dynamics · foundation models · dashcam footage · pose regression · open-world · autonomous driving

The pith

OpenVO estimates ego-motion from monocular dashcam videos with arbitrary frame rates and unknown intrinsics by encoding temporal dynamics and using 3D priors from foundation models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents OpenVO as a visual odometry system that works on real dashcam footage where frame rates change unpredictably and cameras lack calibration. Standard approaches assume fixed frame rates like 10 or 12 Hz and known intrinsics, so they fail on irregular inputs from everyday driving. OpenVO fixes this by adding explicit temporal encoding to a two-frame pose regressor and pulling in 3D geometric information from foundation models. This produces more reliable trajectories that support building datasets from rare events and downstream 3D reconstruction tasks. Readers interested in practical robotics or mapping would care because it turns abundant but messy video into usable motion data without extra hardware.
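To make the frame-rate failure mode concrete, here is a small Python sketch (an editorial illustration, not from the paper): a regressor trained at a fixed 10 Hz implicitly predicts the displacement accrued over 0.1 s, so on 5 Hz footage the integrated trajectory comes out at roughly half the true scale.

    # Why fixed-rate pose regressors mis-scale trajectories (toy illustration).
    true_speed = 10.0             # m/s, constant for simplicity
    dt_train, dt_test = 0.1, 0.2  # trained at 10 Hz, deployed on 5 Hz footage
    n_pairs = 50

    true_step = true_speed * dt_test   # displacement that actually occurred
    pred_step = true_speed * dt_train  # what a 10 Hz-trained model emits

    print(f"true distance:       {n_pairs * true_step:.0f} m")  # 100 m
    print(f"fixed-rate estimate: {n_pairs * pred_step:.0f} m")  # 50 m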

Core claim

OpenVO is a framework for open-world visual odometry that explicitly encodes temporal dynamics information inside a two-frame pose regression network and incorporates 3D geometric priors from foundation models, allowing accurate real-world-scale ego-motion estimation from monocular footage under varying observation rates and uncalibrated cameras.

What carries the argument

A two-frame pose regression network augmented with temporal dynamics encoding and 3D geometric priors extracted from foundation models.
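A minimal PyTorch sketch of the general idea follows. The module names, the sinusoidal Δt embedding, and the toy convolutional backbone are editorial assumptions, not the paper's architecture, which additionally injects 2D-guided 3D flow and foundation-model depth priors omitted here.

    import torch
    import torch.nn as nn

    def timestep_embedding(dt: torch.Tensor, dim: int = 64) -> torch.Tensor:
        """Sinusoidal embedding of the inter-frame interval dt (seconds)."""
        half = dim // 2
        freqs = torch.exp(torch.linspace(0.0, -8.0, half, device=dt.device))
        args = dt[:, None] * freqs[None, :] * 1000.0  # scale is a free choice
        return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

    class TimeAwarePoseRegressor(nn.Module):
        """Two-frame 6-DoF pose regressor conditioned on the frame gap dt."""

        def __init__(self, feat_dim: int = 256, t_dim: int = 64):
            super().__init__()
            self.encoder = nn.Sequential(  # shared per-frame encoder (toy)
                nn.Conv2d(3, 32, 7, stride=2, padding=3), nn.ReLU(),
                nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, feat_dim),
            )
            self.head = nn.Sequential(     # fuse both frames plus dt embedding
                nn.Linear(2 * feat_dim + t_dim, 256), nn.ReLU(),
                nn.Linear(256, 6),         # 3 translation + 3 rotation params
            )

        def forward(self, img_a, img_b, dt):
            f = torch.cat([self.encoder(img_a), self.encoder(img_b),
                           timestep_embedding(dt)], dim=-1)
            return self.head(f)            # (B, 6) relative pose

    net = TimeAwarePoseRegressor()
    a, b = torch.randn(2, 3, 128, 416), torch.randn(2, 3, 128, 416)
    pose = net(a, b, torch.tensor([0.10, 0.33]))  # two different frame gaps

Conditioning the head on Δt, rather than baking a fixed rate into the weights, is what would let one network emit consistent metric displacements at 5, 10, or 30 Hz.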

If this is right

  • Delivers more than 20 percent performance gain over prior state-of-the-art methods on KITTI, nuScenes, and Argoverse 2.
  • Produces 46 to 92 percent lower error across all metrics when observation rates are allowed to vary.
  • Supports construction of trajectory datasets from rare or irregular driving events captured by consumer dashcams.
  • Enables downstream real-world 3D reconstruction without requiring camera calibration steps.
  • Generalizes to unseen frame frequencies where fixed-rate trained models degrade.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same temporal-encoding plus prior strategy might transfer to other ego-motion tasks such as visual-inertial odometry or multi-camera setups.
  • If foundation priors prove robust, training data requirements for visual odometry could shrink because less emphasis would be placed on perfectly calibrated sequences.
  • Real-time versions could be tested on embedded dashcam hardware to see whether the added temporal module fits within latency budgets for live mapping.
  • Connections to uncertainty-aware robotics: the method implicitly treats irregular timing as a form of input noise that priors can regularize.
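One way to exercise that last reading in practice: augment training with randomized frame gaps so irregular timing becomes part of the input distribution. The sampler below is a hypothetical recipe, not a procedure described in the paper.

    import random

    def sample_pair_with_jitter(frames, timestamps, max_skip: int = 4):
        """Pick a frame pair with a randomized gap, returning (img_a, img_b, dt).

        Skipping 1..max_skip frames within one video simulates the varying
        observation rates the model must tolerate at deployment time.
        """
        i = random.randrange(0, len(frames) - max_skip)
        j = i + random.randint(1, max_skip)
        return frames[i], frames[j], timestamps[j] - timestamps[i]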

Load-bearing premise

Three-dimensional geometric priors taken from foundation models remain accurate and stable enough to guide pose regression on uncalibrated dashcam images recorded at irregular frame rates.
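This premise is directly checkable wherever LiDAR is available. A minimal sketch of the check, assuming a dense predicted depth map and sparse LiDAR returns already projected into the image (both hypothetical inputs here):

    import numpy as np

    def depth_prior_metrics(pred: np.ndarray, lidar: np.ndarray):
        """Standard monocular-depth metrics against sparse LiDAR ground truth.

        pred, lidar: same-shape arrays in metres; lidar == 0 marks pixels
        with no return. Returns (AbsRel, delta < 1.25 accuracy).
        """
        mask = lidar > 0
        p, g = pred[mask], lidar[mask]
        abs_rel = float(np.mean(np.abs(p - g) / g))
        delta1 = float(np.mean(np.maximum(p / g, g / p) < 1.25))
        return abs_rel, delta1

If AbsRel on uncalibrated, irregular dashcam frames is much worse than on curated benchmarks, the load-bearing premise weakens.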

What would settle it

Run the method on a held-out set of dashcam sequences with known ground-truth poses and camera intrinsics, randomly varying the observation rate, and compare absolute trajectory error against a baseline that uses no foundation-model priors; if the gap disappears or reverses, the claim fails.
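A sketch of that protocol: a hypothetical random-rate subsampler plus a deliberately simple translation-aligned ATE (a full evaluation would use Umeyama alignment and the standard KITTI error metrics).

    import numpy as np

    def random_rate_subsample(timestamps, min_hz=2.0, max_hz=30.0, seed=0):
        """Keep frame indices so consecutive gaps follow random target rates."""
        rng = np.random.default_rng(seed)
        keep, t_next = [], timestamps[0]
        for k, t in enumerate(timestamps):
            if t >= t_next:
                keep.append(k)
                t_next = t + 1.0 / rng.uniform(min_hz, max_hz)
        return keep

    def ate_rmse(pred_xyz: np.ndarray, gt_xyz: np.ndarray) -> float:
        """Absolute trajectory error after translation-only alignment."""
        offset = gt_xyz.mean(axis=0) - pred_xyz.mean(axis=0)
        err = pred_xyz + offset - gt_xyz
        return float(np.sqrt(np.mean(np.sum(err ** 2, axis=1))))

    # Protocol: run both models on identically subsampled held-out sequences.
    #   keep = random_rate_subsample(stamps)
    #   err_with = ate_rmse(run(model_with_priors, frames, keep), gt[keep])
    #   err_wo   = ate_rmse(run(model_without_priors, frames, keep), gt[keep])
    # The claim fails if err_with >= err_wo across sequences.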

Figures

Figures reproduced from arXiv: 2602.19035 by Anh N. Nhu, Ming C. Lin, Phuc D.A. Nguyen.

Figure 1: Left: Generalized Visual Odometry provides real-world ego-motion and trajectory estimates that bridge perception and control in autonomous driving. It enables scene understanding (Driving VQA [44, 54]), simulation (Real2Sim [28, 45]), action grounding (Driving VLA [19, 27]), and precise motion feedback for low-level control [7, 21, 58]. Right: We introduce OpenVO, a generalizable visual odometry framework… view at source ↗
Figure 2: Overview of OpenVO. We propose a novel temporal-dynamics-informed, geometry-aware visual odometry system. Our method takes consecutive dashcam frames as input and extracts both temporal and geometric representations for robust ego-motion estimation. The Time-Aware Flow Encoder (Sec. 3.1) leverages a Differentiable 2D-Guided 3D Flow module and time-conditioned embeddings to model motion dynamics across varying… view at source ↗
Figure 3: Qualitative results. We present trajectory prediction results on the KITTI and nuScenes datasets. Compared to ZeroVO‡, both variants of our method — the differentiable (OpenVO-diff) and non-differentiable (OpenVO-nodiff) variants of our 2D-guided 3D flow — achieve higher trajectory prediction accuracy and consistency, surpassing the current state-of-the-art. [Embedded results table: KITTI 00–10 (10 Hz) and nuScenes (12 Hz), t_err/r_err metrics.] view at source ↗
Figure 4: Modified VectorMapNet [28]. A front-view input image is first processed by an image encoder to extract semantic and geometric features. These features are then lifted into a bird's-eye-view (BEV) representation using inverse perspective mapping, which leverages the camera's intrinsic and extrinsic parameters from OpenVO to geometrically project image features onto the ground plane. The resulting BEV features… view at source ↗
Figure 5: Qualitative results of global HD map reconstruction produced by… view at source ↗
Figure 6: Qualitative HD map reconstruction results produced by the modified monocular VectorMapNet [28]. view at source ↗
Figure 7: Qualitative results of the stereo benchmark on Argoverse 2 [55]. Each row shows one example, including the input stereo image and the reference metric depth. The stereo images in Argoverse 2 often provide low-quality or weakly constrained metric depth due to limited disparity in long-range regions and visually challenging street scenes. This degradation leads to information loss and introduces uncertainty into downstream… view at source ↗
Figure 8: Qualitative results on the KITTI [13] benchmark. Each row presents one example. The KITTI camera provides a wider field of view than most datasets, allowing it to capture a richer set of dynamic objects while still preserving its long-range odometry characteristics. [Panels: First View, Last View, RGB, Metric Depth, Extracted Trajectory.] view at source ↗
Figure 9: Qualitative results on real-world captured videos. We present two examples, each accompanied by the corresponding RGB frames and reference metric-depth images. Real-world videos commonly exhibit numerous environmental artifacts—such as noise, clutter, and dynamic elements—which pose significant challenges for generalizability and real-world performance assessment. view at source ↗
read the original abstract

We introduce OpenVO, a novel framework for Open-world Visual Odometry (VO) with temporal awareness under limited input conditions. OpenVO effectively estimates real-world-scale ego-motion from monocular dashcam footage with varying observation rates and uncalibrated cameras, enabling robust trajectory dataset construction from rare driving events recorded in dashcam. Existing VO methods are trained on fixed observation frequency (e.g., 10Hz or 12Hz), completely overlooking temporal dynamics information. Many prior methods also require calibrated cameras with known intrinsic parameters. Consequently, their performance degrades when (1) deployed under unseen observation frequencies or (2) applied to uncalibrated cameras. These significantly limit their generalizability to many downstream tasks, such as extracting trajectories from dashcam footage. To address these challenges, OpenVO (1) explicitly encodes temporal dynamics information within a two-frame pose regression framework and (2) leverages 3D geometric priors derived from foundation models. We validate our method on three major autonomous-driving benchmarks - KITTI, nuScenes, and Argoverse 2 - achieving more than 20 performance improvement over state-of-the-art approaches. Under varying observation rate settings, our method is significantly more robust, achieving 46%-92% lower errors across all metrics. These results demonstrate the versatility of OpenVO for real-world 3D reconstruction and diverse downstream applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces OpenVO, a monocular visual odometry framework for open-world settings that explicitly encodes temporal dynamics in a two-frame pose regression network and incorporates 3D geometric priors extracted from foundation models. It targets robustness to unknown camera intrinsics and arbitrary observation rates in dashcam footage, claiming >20% improvement over SOTA methods and 46-92% lower errors across metrics on KITTI, nuScenes, and Argoverse 2 under varying frame rates.

Significance. If the empirical claims hold under rigorous protocol, the work would be significant for enabling trajectory extraction from real-world uncalibrated, variable-rate dashcam data, addressing a practical gap in existing VO systems that assume fixed frequencies and calibrated cameras. The combination of temporal encoding and foundation-model priors offers a pragmatic engineering path rather than a closed-form derivation.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Experiments): the headline claims of >20% improvement and 46-92% lower errors are presented without any description of the experimental protocol, exact baselines, data splits, observation-rate sampling procedure, error bars, or statistical tests. This information is load-bearing for the central robustness claim and must be supplied before the quantitative results can be evaluated.
  2. [§3.2] §3.2 (3D Geometric Priors): no quantitative validation is provided for the accuracy of the foundation-model-derived priors (depth or point-cloud error) on the target dashcam domains versus LiDAR ground truth. Without an ablation isolating prior quality from the temporal encoder, it is impossible to determine whether the reported robustness to rate variation actually stems from the priors or from other components.
  3. [§4.3] §4.3 (Varying Observation Rate Experiments): the evaluation under arbitrary rates lacks explicit perturbation of intrinsics or domain-shift tests on uncalibrated footage. If the priors degrade under these conditions (as is common for models trained on curated datasets), the two-frame regression would receive noisy inputs, undermining the generalization claim.
minor comments (2)
  1. [§3] Notation for the temporal dynamics encoding module is introduced without a clear equation or diagram reference in the method section, making the architecture hard to reproduce from the text alone.
  2. [Abstract] The abstract states 'more than 20 performance improvement' but omits the word 'percent'; this should be corrected for precision.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and will revise the paper accordingly to strengthen the presentation of our experimental protocol and ablations.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): the headline claims of >20% improvement and 46-92% lower errors are presented without any description of the experimental protocol, exact baselines, data splits, observation-rate sampling procedure, error bars, or statistical tests. This information is load-bearing for the central robustness claim and must be supplied before the quantitative results can be evaluated.

    Authors: We agree that the experimental details are essential for rigorous evaluation. In the revised manuscript we will expand §4 with a complete protocol description, including exact data splits for KITTI/nuScenes/Argoverse 2, the full list of baselines, the precise procedure used to sample arbitrary observation rates, standard error bars, and statistical significance tests supporting the reported gains. revision: yes

  2. Referee: [§3.2] §3.2 (3D Geometric Priors): no quantitative validation is provided for the accuracy of the foundation-model-derived priors (depth or point-cloud error) on the target dashcam domains versus LiDAR ground truth. Without an ablation isolating prior quality from the temporal encoder, it is impossible to determine whether the reported robustness to rate variation actually stems from the priors or from other components.

    Authors: We acknowledge the absence of direct validation. We will add quantitative depth and point-cloud error metrics versus LiDAR ground truth on the three target datasets and include a dedicated ablation that isolates the 3D priors from the temporal encoder to clarify their individual contributions to rate robustness. revision: yes

  3. Referee: [§4.3] §4.3 (Varying Observation Rate Experiments): the evaluation under arbitrary rates lacks explicit perturbation of intrinsics or domain-shift tests on uncalibrated footage. If the priors degrade under these conditions (as is common for models trained on curated datasets), the two-frame regression would receive noisy inputs, undermining the generalization claim.

    Authors: Our benchmarks already exercise uncalibrated cameras via the foundation-model priors, yet we agree explicit stress tests are valuable. We will extend §4.3 with controlled intrinsic perturbation experiments and additional domain-shift evaluations on uncalibrated footage to demonstrate continued performance even when prior quality is degraded. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical engineering contribution without derivation chain

full rationale

The paper presents OpenVO as a neural framework combining temporal dynamics encoding in a two-frame pose regression module with 3D geometric priors extracted from foundation models. No equations, closed-form derivations, or parameter-fitting steps are described that reduce predictions to inputs by construction. Performance improvements are reported via empirical evaluation on KITTI, nuScenes, and Argoverse 2 under varying observation rates, with no self-definitional loops, fitted-input predictions, or load-bearing self-citations in the provided text. The method is self-contained as an applied architecture rather than a mathematical derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on standard visual-odometry scene assumptions plus the reliability of off-the-shelf foundation-model 3D priors; no free parameters or new invented entities are explicitly quantified in the abstract.

axioms (1)
  • domain assumption: Rigid scene and static camera motion assumptions standard to visual odometry (implicit in all monocular VO pose regression methods)
invented entities (1)
  • Temporal dynamics encoding module (no independent evidence)
    purpose: to make pose regression robust to varying observation rates; a new component introduced to address the fixed-frequency limitation

pith-pipeline@v0.9.0 · 5546 in / 1317 out tokens · 25023 ms · 2026-05-15T20:49:42.520763+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: the paper's claim is directly supported by a theorem in the formal canon.
supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: the paper appears to rely on the theorem as machinery.
contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 3 internal anchors

  1. [1] Maulana Bisyir Azhari and David Hyunchul Shim. DINO-VO: A feature-based visual odometry leveraging a visual foundation model. IEEE Robotics and Automation Letters (RA-L), 2025.
  2. [2] Sai Krishna Bashetty, Heni Ben Amor, and Georgios Fainekos. DeepCrashTest: Turning dashcam videos into virtual crash tests for automated driving systems. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 11353–11360, 2020.
  3. [3] Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias Müller. ZoeDepth: Zero-shot transfer by combining relative and metric depth. arXiv preprint arXiv:2302.12288, 2023.
  4. [4] Alexey Bochkovskiy, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan Richter, and Vladlen Koltun. Depth Pro: Sharp monocular metric depth in less than a second. In The Thirteenth International Conference on Learning Representations (ICLR), 2025.
  5. [5] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11621–11631, 2020.
  6. [6] Carlos Campos, Richard Elvira, Juan J Gómez Rodríguez, José MM Montiel, and Juan D Tardós. ORB-SLAM3: An accurate open-source library for visual, visual-inertial, and multimap SLAM. IEEE Transactions on Robotics, 37(6):1874–1890, 2021.
  7. [7] Dian Chen and Philipp Krähenbühl. Learning from all vehicles. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17222–17231, 2022.
  8. [8] Ronald Clark, Sen Wang, Hongkai Wen, Andrew Markham, and Niki Trigoni. VINet: Visual-inertial odometry as a sequence-to-sequence learning problem. In Proceedings of the AAAI Conference on Artificial Intelligence, 2017.
  9. [9] Gabriele Costante, Michele Mancini, Paolo Valigi, and Thomas A. Ciarfuglia. Exploring representation learning with CNNs for frame-to-frame ego-motion estimation. IEEE Robotics and Automation Letters (RA-L), 1(1):18–25, 2016.
  10. [10] Andrew J Davison, Ian D Reid, Nicholas D Molton, and Olivier Stasse. MonoSLAM: Real-time single camera SLAM. IEEE Transactions on Pattern Analysis and Machine Intelligence (IEEE TPAMI), 29(6):1052–1067, 2007.
  11. [11] Jakob Engel, Vladlen Koltun, and Daniel Cremers. Direct sparse odometry. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(3):611–625, 2017.
  12. [12] Christian Forster, Matia Pizzoli, and Davide Scaramuzza. SVO: Fast semi-direct monocular visual odometry. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pages 15–22. IEEE, 2014.
  13. [13] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
  14. [14] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research (IJRR), 32(11):1231–1237, 2013.
  15. [15] Annika Hagemann, Moritz Knorr, and Christoph Stiller. Deep geometry-aware camera self-calibration from video. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3438–3448, 2023.
  16. [16] Andrew Howard. Real-time stereo visual odometry for autonomous ground vehicles. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3946–3952, 2008.
  17. [17] Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3D v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
  18. [18] Jiahui Huang, Qunjie Zhou, Hesam Rabeti, Aleksandr Korovko, Huan Ling, Xuanchi Ren, Tianchang Shen, Jun Gao, Dmitry Slepichev, Chen-Hsuan Lin, Jiawei Ren, Kevin Xie, Joydeep Biswas, Laura Leal-Taixe, and Sanja Fidler. ViPE: Video pose engine for 3D geometric perception. arXiv preprint arXiv:2508.10934, 2025.
  19. [19] Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, et al. EMMA: End-to-end multimodal model for autonomous driving. arXiv preprint arXiv:2410.23262, 2024.
  20. [20] Linyi Jin, Jianming Zhang, Yannick Hold-Geoffroy, Oliver Wang, Kevin Blackburn-Matzen, Matthew Sticha, and David F Fouhey. Perspective fields for single image camera calibration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17307–17316, 2023.
  21. [21] Jinkyu Kim, Suhong Moon, Anna Rohrbach, Trevor Darrell, and John Canny. Advisable learning for self-driving vehicles by internalizing observation-to-action rules. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9661–9670, 2020.
  22. [22] Lei Lai, Zhongkai Shangguan, Jimuyang Zhang, and Eshed Ohn-Bar. XVO: Generalized visual odometry via cross-modal self-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10094–10105, 2023.
  23. [23] Lei Lai, Zekai Yin, and Eshed Ohn-Bar. ZeroVO: Visual odometry with minimal assumptions. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 17092–17102, 2025.
  24. [24] Jinwoo Lee, Hyunsung Go, Hyunjoon Lee, Sunghyun Cho, Minhyuk Sung, and Junho Kim. CTRL-C: Camera calibration transformer with line-classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 16228–16237, 2021.
  25. [25] Qi Li, Yue Wang, Yilun Wang, and Hang Zhao. HDMapNet: An online HD map construction and evaluation framework. In 2022 International Conference on Robotics and Automation (ICRA), pages 4628–4634. IEEE, 2022.
  26. [26] Ruihao Li, Sen Wang, Zhiqiang Long, and Dongbing Gu. UnDeepVO: Monocular visual odometry through unsupervised deep learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7286–7291. IEEE, 2018.
  27. [27] Yingyan Li, Shuyao Shang, Weisong Liu, Bing Zhan, Haochen Wang, Yuqi Wang, Yuntao Chen, Xiaoman Wang, Yasong An, Chufeng Tang, et al. DriveVLA-W0: World models amplify data scaling law in autonomous driving. arXiv preprint arXiv:2510.12796, 2025.
  28. [28] Yicheng Liu, Tianyuan Yuan, Yue Wang, Yilun Wang, and Hang Zhao. VectorMapNet: End-to-end vectorized HD map learning. In International Conference on Machine Learning, pages 22352–22369. PMLR, 2023.
  29. [29] Shing Yan Loo, Ali Jahani Amiri, Syamsiah Mashohor, Sai Hong Tang, and Hong Zhang. CNN-SVO: Improving the mapping in semi-direct visual odometry using single-image depth prediction. In International Conference on Robotics and Automation (ICRA), pages 5218–5223. IEEE, 2019.
  30. [30] Mark Maimone, Yang Cheng, and Larry Matthies. Two years of visual odometry on the Mars Exploration Rovers. Journal of Field Robotics, 24(3):169–186, 2007.
  31. [31] Kanti V Mardia and Peter E Jupp. Directional Statistics. John Wiley & Sons, 2009.
  32. [32] David Mohlin, Josephine Sullivan, and Gérald Bianchi. Probabilistic orientation estimation with matrix Fisher distributions. Advances in Neural Information Processing Systems, 33:4884–4893, 2020.
  33. [33] Raul Mur-Artal and Juan D Tardós. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics, 33(5):1255–1262, 2017.
  34. [34] Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 31(5):1147–1163, 2015.
  35. [35] Phuc Nguyen. HA-RDet: Hybrid anchor rotation detector for oriented object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2889–2898, 2025.
  36. [36] Phuc Nguyen, Tuan Duc Ngo, Evangelos Kalogerakis, Chuang Gan, Anh Tran, Cuong Pham, and Khoi Nguyen. Open3DIS: Open-vocabulary 3D instance segmentation with 2D mask guidance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4018–4028, 2024.
  37. [37] Phuc Nguyen, Minh Luu, Anh Tran, Cuong Pham, and Khoi Nguyen. Any3DIS: Class-agnostic 3D instance segmentation by 2D mask tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3636–3645, 2025.
  38. [38] Phuc Nguyen, Minh Luu, Anh Tran, Cuong Pham, and Khoi Nguyen. Open-ended 3D point cloud instance segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2580–2590, 2025.
  39. [39] Anh N Nhu, Sanghyun Son, and Ming Lin. Time-aware world model for adaptive prediction and control. In Forty-second International Conference on Machine Learning (ICML), 2025.
  40. [40] D. Nister, O. Naroditsky, and J. Bergen. Visual odometry. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), pages I–I, 2004.
  41. [41] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research, 2024.
  42. [42] Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc Van Gool, and Fisher Yu. UniDepth: Universal monocular metric depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10106–10116, 2024.
  43. [43] Luigi Piccinelli, Christos Sakaridis, Yung-Hsu Yang, Mattia Segu, Siyuan Li, Wim Abbeloos, and Luc Van Gool. UniDepthV2: Universal monocular metric depth estimation made simpler. arXiv preprint arXiv:2502.20110, 2025.
  44. [44] Tianwen Qian, Jingjing Chen, Linhai Zhuo, Yang Jiao, and Yu-Gang Jiang. NuScenes-QA: A multi-modal visual question answering benchmark for autonomous driving scenario. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4542–4550, 2024.
  45. [45] Anqi Shi, Yuze Cai, Xiangyu Chen, Jian Pu, Zeyu Fu, and Hong Lu. GlobalMapNet: An online framework for vectorized global HD map construction. arXiv preprint arXiv:2409.10063, 2024.
  46. [46] Frank Steinbrücker, Jürgen Sturm, and Daniel Cremers. Real-time visual odometry from dense RGB-D images. In 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pages 719–722, 2011.
  47. [47] Zachary Teed and Jia Deng. DROID-SLAM: Deep visual SLAM for monocular, stereo, and RGB-D cameras. Advances in Neural Information Processing Systems (NeurIPS), 34:16558–16569, 2021.
  48. [48] Zachary Teed, Lahav Lipson, and Jia Deng. Deep patch visual odometry. Advances in Neural Information Processing Systems (NeurIPS), 36:39033–39051, 2023.
  49. [49] Pierre Thodoroff, Wenyu Li, and Neil D Lawrence. Benchmarking real-time reinforcement learning. In NeurIPS 2021 Workshop on Pre-registration in Machine Learning, pages 26–41. PMLR, 2022.
  50. [50] Rui Wang, Martin Schworer, and Daniel Cremers. Stereo DSO: Large-scale direct sparse visual odometry with stereo cameras. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 3903–3911, 2017.
  51. [51] Sen Wang, Ronald Clark, Hongkai Wen, and Niki Trigoni. DeepVO: Towards end-to-end visual odometry with deep recurrent convolutional neural networks. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 2043–2050. IEEE, 2017.
  52. [52] Sen Wang, Ronald Clark, Hongkai Wen, and Niki Trigoni. End-to-end, sequence-to-sequence probabilistic visual odometry through deep neural networks. The International Journal of Robotics Research (IJRR), 37(4-5):513–542, 2018.
  53. [53] Wenshan Wang, Yaoyu Hu, and Sebastian Scherer. TartanVO: A generalizable learning-based VO. In Conference on Robot Learning (CoRL), pages 1761–1772. PMLR, 2021.
  54. [54] Maolin Wei, Wanzhou Liu, and Eshed Ohn-Bar. DriveQA: Passing the driving knowledge test. arXiv preprint arXiv:2508.21824, 2025.
  55. [55] Benjamin Wilson, William Qi, Tanmay Agarwal, John Lambert, Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Ratnesh Kumar, Andrew Hartnett, Jhony Kaesemodel Pontes, Deva Ramanan, Peter Carr, and James Hays. Argoverse 2: Next generation datasets for self-driving perception and forecasting. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2021.
  56. [56] Benjamin Wilson, William Qi, Tanmay Agarwal, John Lambert, Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Ratnesh Kumar, Andrew Hartnett, Jhony Kaesemodel Pontes, et al. Argoverse 2: Next generation datasets for self-driving perception and forecasting. arXiv preprint arXiv:2301.00493, 2023.
  57. [57] Yingfu Xu and Guido C. H. E. de Croon. CNN-based ego-motion estimation for fast MAV maneuvers. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 7606–7612, 2021.
  58. [58] Zhenhua Xu, Yan Bai, Yujia Zhang, Zhuoling Li, Fei Xia, Kwan-Yee K Wong, Jianqiang Wang, and Hengshuang Zhao. DriveGPT4-V2: Harnessing large language model capabilities for enhanced closed-loop autonomous driving. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 17261–17270, 2025.
  59. [59] Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth Anything: Unleashing the power of large-scale unlabeled data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10371–10381, 2024.
  60. [60] Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth Anything V2. Advances in Neural Information Processing Systems (NeurIPS), 37:21875–21911, 2024.
  61. [61] Nan Yang, Lukas von Stumberg, Rui Wang, and Daniel Cremers. D3VO: Deep depth, deep pose and deep uncertainty for monocular visual odometry. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1281–1292, 2020.
  62. [62] Wei Yin, Chi Zhang, Hao Chen, Zhipeng Cai, Gang Yu, Kaixuan Wang, Xiaozhi Chen, and Chunhua Shen. Metric3D: Towards zero-shot metric 3D prediction from a single image. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9043–9053, 2023.
  63. [63] Zhichao Yin and Jianping Shi. GeoNet: Unsupervised learning of dense depth, optical flow and camera pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1983–1992, 2018.
  64. [64] Tianyuan Yuan, Yicheng Liu, Yue Wang, Yilun Wang, and Hang Zhao. StreamMapNet: Streaming mapping network for vectorized online HD map construction. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 7356–7365, 2024.
  65. [65] Hongkai Zhang, Hong Chang, Bingpeng Ma, Naiyan Wang, and Xilin Chen. Dynamic R-CNN: Towards high quality object detection via dynamic training. In European Conference on Computer Vision, pages 260–275. Springer, 2020.
  66. [66] Shengyu Zhao, Yilun Sheng, Yue Dong, Eric I-Chao Chang, and Yan Xu. MaskFlowNet: Asymmetric feature matching with learnable occlusion mask. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  67. [67] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G Lowe. Unsupervised learning of depth and ego-motion from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1851–1858, 2017.
  68. [68] Shengjie Zhu, Abhinav Kumar, Masa Hu, and Xiaoming Liu. Tame a wild camera: In-the-wild monocular camera calibration. Advances in Neural Information Processing Systems, 36:45137–45149, 2023.