arxiv: 2508.10934 · v1 · submitted 2025-08-12 · 💻 cs.CV · cs.GR· cs.RO· eess.IV

Recognition: 1 theorem link

· Lean Theorem

ViPE: Video Pose Engine for 3D Geometric Perception

Jiahui Huang , Qunjie Zhou , Hesam Rabeti , Aleksandr Korovko , Huan Ling , Xuanchi Ren , Tianchang Shen , Jun Gao

show 7 more authors

Dmitry Slepichev Chen-Hsuan Lin Jiawei Ren Kevin Xie Joydeep Biswas Laura Leal-Taixe Sanja Fidler

Authors on Pith no claims yet

Pith reviewed 2026-05-16 16:36 UTC · model grok-4.3

classification 💻 cs.CV cs.GRcs.ROeess.IV

keywords video pose estimation3D geometric perceptioncamera intrinsicsdense depth mapsuncalibrated videosspatial AIlarge-scale annotation

0 comments

The pith

ViPE estimates camera poses and near-metric depth maps from any raw video without calibration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ViPE, a video processing engine that estimates camera intrinsics, motion, and dense near-metric depth from unconstrained videos. It handles diverse scenarios like dynamic selfies, cinematic shots, and dashcams across pinhole, wide-angle, and 360-degree panoramic cameras. This matters because acquiring consistent 3D annotations from in-the-wild videos has been a major bottleneck for spatial AI systems that rely on precise geometry. ViPE runs at 3-5 frames per second on a single GPU and has already annotated around 96 million frames from 100,000 real-world internet videos, 1 million AI-generated videos, and 2,000 panoramic videos. The engine and the annotated collection are open-sourced to support further work in 3D perception.

Core claim

ViPE is a versatile video processing engine that efficiently estimates camera intrinsics, camera motion, and dense near-metric depth maps from unconstrained raw videos. It remains robust across dynamic selfie videos, cinematic shots, and dashcams while supporting pinhole, wide-angle, and 360-degree panorama camera models. On standard benchmarks, ViPE outperforms existing uncalibrated pose estimation baselines by 18 percent on TUM sequences and 50 percent on KITTI sequences.

What carries the argument

The ViPE engine, a unified pipeline that jointly solves for intrinsics, motion, and dense depth from uncalibrated video input.

If this is right

Outperforms uncalibrated baselines by 18 percent on TUM and 50 percent on KITTI pose estimation.
Annotates approximately 96 million frames with camera poses and dense depth maps.
Supports pinhole, wide-angle, and 360-degree camera models in a single pipeline.
Runs at 3-5 frames per second on one GPU for standard resolutions.
Supplies large-scale annotated data from real internet videos, AI-generated content, and panoramas for spatial AI training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Existing video archives could be automatically converted into training resources for many 3D vision models without manual labeling.
Robotics and augmented-reality applications might gain easier access to metric-scale geometry from ordinary consumer footage.
Outputs from the engine could serve as direct supervision signals inside larger end-to-end reconstruction networks.

Load-bearing premise

The engine produces reliable near-metric depth and accurate poses on diverse in-the-wild videos without per-video calibration or ground-truth supervision.

What would settle it

Independent ground-truth measurement of camera poses and depths on a fresh collection of in-the-wild videos, checked against ViPE outputs for error rates below the reported benchmark levels.

read the original abstract

Accurate 3D geometric perception is an important prerequisite for a wide range of spatial AI systems. While state-of-the-art methods depend on large-scale training data, acquiring consistent and precise 3D annotations from in-the-wild videos remains a key challenge. In this work, we introduce ViPE, a handy and versatile video processing engine designed to bridge this gap. ViPE efficiently estimates camera intrinsics, camera motion, and dense, near-metric depth maps from unconstrained raw videos. It is robust to diverse scenarios, including dynamic selfie videos, cinematic shots, or dashcams, and supports various camera models such as pinhole, wide-angle, and 360{\deg} panoramas. We have benchmarked ViPE on multiple benchmarks. Notably, it outperforms existing uncalibrated pose estimation baselines by 18%/50% on TUM/KITTI sequences, and runs at 3-5FPS on a single GPU for standard input resolutions. We use ViPE to annotate a large-scale collection of videos. This collection includes around 100K real-world internet videos, 1M high-quality AI-generated videos, and 2K panoramic videos, totaling approximately 96M frames -- all annotated with accurate camera poses and dense depth maps. We open-source ViPE and the annotated dataset with the hope of accelerating the development of spatial AI systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ViPE bundles intrinsics, pose, and depth into one engine that beats baselines on TUM and KITTI while releasing a 96M-frame annotated set, but the accuracy on wild videos is asserted without matching validation.

read the letter

ViPE combines camera intrinsics estimation, pose tracking, and dense depth into a single pipeline that works on raw videos from different camera types. The authors demonstrate better performance than previous uncalibrated methods on TUM and KITTI, and they have used it to label nearly 100K real videos, a million AI-generated ones, and some panoramas for a total of 96 million frames. What stands out is the versatility across pinhole, wide-angle, and 360-degree inputs without extra calibration steps. The reported speed of 3-5 FPS on a GPU is practical for processing large collections. Releasing the code and the full annotated set is the real contribution here, as it directly addresses the data shortage for training spatial AI models. The weaker part is the evidence for accuracy on the unconstrained videos. All the concrete numbers come from the two benchmarks that have ground truth available. For the internet and generated videos, the paper states the annotations are accurate but does not include separate metrics, cross-checks, or examples of where it might fail under motion blur, changing illumination, or non-rigid scenes. That leaves the generalization claim as the part that would benefit from more detail in review. This paper is aimed at computer vision researchers and roboticists who need large-scale 3D video data or a ready engine for their own pipelines. Someone building a new model for depth or SLAM would find the dataset useful to try. The work shows clear thinking about the practical bottlenecks and provides reproducible artifacts, so it deserves to go through peer review rather than being rejected at the desk.

Referee Report

2 major / 2 minor

Summary. The paper introduces ViPE, a video processing engine that estimates camera intrinsics, poses, and dense near-metric depth maps from unconstrained raw videos. It claims robustness across dynamic selfie videos, cinematic shots, dashcams, and camera models including pinhole, wide-angle, and 360° panoramas. ViPE reportedly outperforms uncalibrated pose estimation baselines by 18% on TUM and 50% on KITTI sequences, runs at 3-5 FPS on a single GPU, and is used to annotate ~96M frames from 100K internet videos, 1M AI-generated videos, and 2K panoramic videos with accurate poses and depth maps. The code and annotated dataset are open-sourced.

Significance. If the performance and annotation accuracy claims hold, ViPE would provide a practical tool and large-scale resource for 3D geometric perception in spatial AI, addressing the scarcity of consistent in-the-wild annotations. The benchmark gains on standard datasets, efficiency, and support for diverse camera models are concrete strengths; open-sourcing the engine and dataset further enhances potential impact.

major comments (2)

[Large-scale annotation and results section] The central claim that ViPE annotates ~96M frames from unconstrained videos (internet, AI-generated, panoramic) with 'accurate' camera poses and dense depth maps lacks any reported quantitative metrics, error analysis, consistency checks, or failure-case evaluation on those videos. Only TUM and KITTI results with ground truth are quantified; this generalization is load-bearing for the dataset contribution and requires direct evidence.
[Method section] The method for producing near-metric depth and reliable poses without per-video calibration or ground-truth supervision is not supported by derivations, equations, or ablations in the manuscript. The abstract asserts robustness to dynamic content, lighting, and non-pinhole models, but without concrete pipeline details or assumptions, the reliability on in-the-wild data cannot be assessed.

minor comments (2)

[Abstract] The abstract uses informal phrasing such as 'handy and versatile'; replace with more formal terms like 'practical and versatile' for journal style.
[Experiments section] Benchmark comparisons should explicitly state the uncalibrated baselines used and include error metrics (e.g., absolute trajectory error) with standard deviations for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment point by point below, acknowledging where additional evidence and details are warranted, and outline the revisions we will incorporate.

read point-by-point responses

Referee: [Large-scale annotation and results section] The central claim that ViPE annotates ~96M frames from unconstrained videos (internet, AI-generated, panoramic) with 'accurate' camera poses and dense depth maps lacks any reported quantitative metrics, error analysis, consistency checks, or failure-case evaluation on those videos. Only TUM and KITTI results with ground truth are quantified; this generalization is load-bearing for the dataset contribution and requires direct evidence.

Authors: We agree that the manuscript would be strengthened by direct quantitative evidence on the large-scale annotations. The TUM and KITTI benchmarks demonstrate the core method's accuracy where ground truth exists, while the 96M-frame collection was validated through extensive visual inspection and temporal consistency checks that were not quantified in the text. In the revised manuscript we will add a dedicated subsection reporting quantitative consistency metrics (e.g., average pose drift over long sequences, inter-frame depth map variance, and reprojection error statistics) computed on a stratified sample of the annotated videos, together with a failure-case analysis. This will provide the requested direct evidence for the dataset contribution. revision: yes
Referee: [Method section] The method for producing near-metric depth and reliable poses without per-video calibration or ground-truth supervision is not supported by derivations, equations, or ablations in the manuscript. The abstract asserts robustness to dynamic content, lighting, and non-pinhole models, but without concrete pipeline details or assumptions, the reliability on in-the-wild data cannot be assessed.

Authors: We acknowledge that the current method description is high-level and would benefit from explicit technical support. The manuscript outlines the joint optimization pipeline, but we will expand the method section with the full set of equations for intrinsics estimation, pose optimization, and near-metric depth regression, including the loss terms and the key assumptions (e.g., scale anchoring via learned priors). We will also add targeted ablations that isolate performance under dynamic content, varying illumination, and non-pinhole camera models. These additions will make the reliability claims on unconstrained videos directly verifiable. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents ViPE as an engineering system for video-based pose and depth estimation, with performance claims supported by benchmarks on external datasets (TUM, KITTI) possessing independent ground truth. No mathematical derivations, equations, or parameter-fitting steps are described that reduce by construction to the inputs or to self-generated labels. The large-scale annotation of 96M frames is presented as an output application rather than a self-referential prediction, and no load-bearing self-citations, uniqueness theorems, or ansatz smuggling are invoked in the provided text. The central claims remain externally falsifiable via the cited benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The central claim rests on the unverified assumption that the engine generalizes robustly across video types.

pith-pipeline@v0.9.0 · 5611 in / 997 out tokens · 42618 ms · 2026-05-16T16:36:23.572987+00:00 · methodology

discussion (0)

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CalibAnyView: Beyond Single-View Camera Calibration in the Wild
cs.CV 2026-05 conditional novelty 8.0

A multi-view transformer predicts dense perspective fields that feed a geometric optimizer to estimate camera intrinsics and gravity from arbitrary numbers of real-world views.
TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking
cs.CV 2026-05 unverdicted novelty 8.0

TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.
MoCam: Unified Novel View Synthesis via Structured Denoising Dynamics
cs.CV 2026-05 unverdicted novelty 7.0

MoCam unifies static and dynamic novel view synthesis by temporally decoupling geometric alignment and appearance refinement within the diffusion denoising process.
MoCam: Unified Novel View Synthesis via Structured Denoising Dynamics
cs.CV 2026-05 unverdicted novelty 7.0

MoCam uses structured denoising dynamics in diffusion models to temporally decouple geometric alignment from appearance refinement, enabling unified novel view synthesis that outperforms prior methods on imperfect poi...
Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting
cs.CV 2026-04 unverdicted novelty 7.0

Reshoot-Anything trains a diffusion transformer on pseudo multi-view triplets created by cropping and warping monocular videos to achieve temporally consistent video reshooting with robust camera control on dynamic scenes.
EgoFun3D: Modeling Interactive Objects from Egocentric Videos using Function Templates
cs.CV 2026-04 unverdicted novelty 7.0

EgoFun3D creates a new task, 271-video dataset, and pipeline using function templates to model interactive 3D objects from egocentric videos for simulation.
MoRight: Motion Control Done Right
cs.CV 2026-04 unverdicted novelty 7.0

MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply ...
RealCam: Real-Time Novel-View Video Generation with Interactive Camera Control
cs.CV 2026-05 unverdicted novelty 6.0

RealCam is a causal autoregressive model for real-time camera-controlled video-to-video generation, using cross-frame in-context teacher distillation and loop-closed data augmentation to achieve high fidelity and consistency.
RADIO-ViPE: Online Tightly Coupled Multi-Modal Fusion for Open-Vocabulary Semantic SLAM in Dynamic Environments
cs.CV 2026-04 unverdicted novelty 6.0

RADIO-ViPE performs online open-vocabulary semantic SLAM directly from monocular RGB video in dynamic environments by tightly coupling vision-language embeddings from foundation models with geometric factor-graph opti...
Geometric Context Transformer for Streaming 3D Reconstruction
cs.CV 2026-04 unverdicted novelty 6.0

LingBot-Map is a streaming 3D reconstruction model built on a geometric context transformer that combines anchor context, pose-reference window, and trajectory memory to deliver accurate, drift-resistant results at 20...
From Synchrony to Sequence: Exo-to-Ego Generation via Interpolation
cs.CV 2026-04 unverdicted novelty 6.0

Interpolating exo and ego videos into a single continuous sequence lets diffusion sequence models generate more coherent first-person videos than direct conditioning, even without pose interpolation.
Lyra 2.0: Explorable Generative 3D Worlds
cs.CV 2026-04 unverdicted novelty 6.0

Lyra 2.0 produces persistent 3D-consistent video sequences for large explorable worlds by using per-frame geometry for information routing and self-augmented training to correct temporal drift.
OpenVO: Open-World Visual Odometry with Temporal Dynamics Awareness
cs.CV 2026-02 unverdicted novelty 6.0

OpenVO estimates ego-motion from monocular dashcam footage with varying observation rates and uncalibrated cameras by encoding temporal dynamics in a two-frame regression framework and using 3D priors from foundation ...
WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling
cs.CV 2025-12 unverdicted novelty 6.0

WorldPlay uses dual action representation, reconstituted context memory, and context forcing distillation to produce consistent 720p streaming video at 24 FPS for interactive world modeling.
SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer
cs.CV 2026-05 unverdicted novelty 5.0

SANA-WM is a 2.6B-parameter efficient world model that synthesizes minute-scale 720p videos with 6-DoF camera control, trained on 213K public clips in 15 days on 64 H100s and runnable on single GPUs at 36x higher thro...
WildPose: A Unified Framework for Robust Pose Estimation in the Wild
cs.CV 2026-05 unverdicted novelty 5.0

WildPose unifies feedforward 3D features from MASt3R with differentiable bundle adjustment for robust monocular pose estimation across dynamic, static, and low-ego-motion scenes.
Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory
cs.CV 2026-04 unverdicted novelty 4.0

Matrix-Game 3.0 delivers 720p real-time video generation at 40 FPS with minute-scale memory consistency by combining residual self-correction training, camera-aware memory injection, and DMD-based autoregressive disti...
World Simulation with Video Foundation Models for Physical AI
cs.CV 2025-10 unverdicted novelty 4.0

Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.

Reference graph

Works this paper leans on

91 extracted references · 91 canonical work pages · cited by 17 Pith papers · 7 internal anchors

[1]

Cosmos World Foundation Model Platform for Physical AI

N. Agarwal, A. Ali, M. Bala, Y. Balaji, E. Barker, T. Cai, P. Chattopadhyay, Y. Chen, Y. Cui, Y. Ding, et al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025. 9, 13

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

H. A. Alhaija, J. Alvarez, M. Bala, T. Cai, T. Cao, L. Cha, J. Chen, M. Chen, F. Ferroni, S. Fidler, et al. Cosmos-transfer1: Conditional world generation with adaptive multimodal control.arXiv preprint arXiv:2503.14492, 2025. 9

work page arXiv 2025
[3]

Badki, H

A. Badki, H. Su, B. Wen, and O. Gallo. L4p: Low-level 4d vision perception unified.arXiv preprint arXiv:2502.13078,

work page arXiv
[4]

ARKitScenes: A Diverse Real-World Dataset For 3D Indoor Scene Understanding Using Mobile RGB-D Data

G. Baruch, Z. Chen, A. Dehghan, T. Dimry, Y. Feigin, P. Fu, T. Gebauer, B. Joffe, D. Kurz, A. Schwartz, et al. Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data.arXiv preprint arXiv:2111.08897, 2021. 4

work page internal anchor Pith review Pith/arXiv arXiv 2021
[5]

Depth Pro: Sharp Monocular Metric Depth in Less Than a Second

A. Bochkovskii, A. Delaunoy, H. Germain, M. Santos, Y. Zhou, S. R. Richter, and V. Koltun. Depth pro: Sharp monocular metric depth in less than a second.arXiv preprint arXiv:2410.02073, 2024. 11

work page internal anchor Pith review Pith/arXiv arXiv 2024
[6]

D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part VI 12, pages 611–625. Springer, 2012. 11

work page 2012
[7]

Cabon, L

Y. Cabon, L. Stoffl, L. Antsfeld, G. Csurka, B. Chidlovskii, J. Revaud, and V. Leroy. Must3r: Multi-view network for stereo 3d reconstruction. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 1050–1060,

work page
[8]

Campos, R

C. Campos, R. Elvira, J. J. G. Rodríguez, J. M. Montiel, and J. D. Tardós. Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam.IEEE transactions on robotics, 37(6):1874–1890, 2021. 2, 3

work page 2021
[9]

S. Chen, H. Guo, S. Zhu, F. Zhang, Z. Huang, J. Feng, and B. Kang. Video depth anything: Consistent depth estimation for super-long videos.arXiv preprint arXiv:2501.12375, 2025. 3, 8

work page arXiv 2025
[10]

T.-S. Chen, A. Siarohin, W. Menapace, E. Deyneka, H.-w. Chao, B. E. Jeon, Y. Fang, H.-Y. Lee, J. Ren, M.-H. Yang, et al. Panda-70m: Captioning 70m videos with multiple cross-modality teachers. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13320–13331, 2024. 13

work page 2024
[11]

X. Chen, Y. Chen, Y. Xiu, A. Geiger, and A. Chen. Easi3r: Estimating disentangled motion from dust3r without training. arXiv preprint arXiv:2503.24391, 2025. 3

work page arXiv 2025
[12]

H. K. Cheng and A. G. Schwing. Xmem: Long-term video object segmentation with an atkinson-shiffrin memory model. InEuropean Conference on Computer Vision, pages 640–658. Springer, 2022. 7

work page 2022
[13]

Cheng, L

Y. Cheng, L. Li, Y. Xu, X. Li, Z. Yang, W. Wang, and Y. Yang. Segment and track anything. arXiv preprint arXiv:2305.06558, 2023. 7

work page arXiv 2023
[14]

G. Chou, W. Xian, G. Yang, M. Abdelfattah, B. Hariharan, N. Snavely, N. Yu, and P. Debevec. Flashdepth: Real-time streaming video depth estimation at 2k resolution.arXiv preprint arXiv:2504.07093, 2025. 3

work page arXiv 2025
[15]

W. Cong, Y. Liang, Y. Zhang, Z. Yang, Y. Wang, B. Ivanovic, M. Pavone, C. Chen, Z. Wang, and Z. Fan. E3d-bench: A benchmark for end-to-end 3d geometric foundation models.arXiv preprint arXiv:2506.01933, 2025. 2

work page arXiv 2025
[16]

T. A. Davis, J. R. Gilbert, S. I. Larimore, and E. G. Ng. Algorithm 836: Colamd, a column approximate minimum degree ordering algorithm.ACM Transactions on Mathematical Software (TOMS), 30(3):377–380, 2004. 5

work page 2004
[17]

A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse. Monoslam: Real-time single camera slam.IEEE transactions on pattern analysis and machine intelligence, 29(6):1052–1067, 2007. 2, 3

work page 2007
[18]

Duisterhof, L

B. Duisterhof, L. Zust, P. Weinzaepfel, V. Leroy, Y. Cabon, and J. Revaud. Mast3r-sfm: a fully-integrated solution for unconstrained structure-from-motion.arXiv preprint arXiv:2409.19152, 2024. 3 17 ViPE: Video Pose Engine for 3D Geometric Perception

work page arXiv 2024
[19]

Elflein, Q

S. Elflein, Q. Zhou, and L. Leal-Taixé. Light3r-sfm: Towards feed-forward structure-from-motion. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 16774–16784, 2025. 3

work page 2025
[20]

Engel, V

J. Engel, V. Koltun, and D. Cremers. Direct sparse odometry.IEEE transactions on pattern analysis and machine intelligence, 40(3):611–625, 2017. 3

work page 2017
[21]

H. Feng, J. Zhang, Q. Wang, Y. Ye, P. Yu, M. J. Black, T. Darrell, and A. Kanazawa. St4rtrack: Simultaneous 4d reconstruction and tracking in the world.arXiv preprint arXiv:2504.13152, 2025. 3

work page arXiv 2025
[22]

R. Gao, A. Holynski, P. Henzler, A. Brussee, R. Martin-Brualla, P. Srinivasan, J. T. Barron, and B. Poole. Cat3d: Create anything in 3d with multi-view diffusion models.arXiv preprint arXiv:2405.10314, 2024. 4

work page arXiv 2024
[23]

Geiger, P

A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In2012 IEEE conference on computer vision and pattern recognition, pages 3354–3361. IEEE, 2012. 4, 8, 9, 10

work page 2012
[24]

L. Goli, S. Sabour, M. Matthews, M. Brubaker, D. Lagun, A. Jacobson, D. J. Fleet, S. Saxena, and A. Tagliasacchi. Romo: Robust motion segmentation improves structure from motion.arXiv preprint arXiv:2411.18650, 2024. 6

work page arXiv 2024
[25]

Greff, F

K. Greff, F. Belletti, L. Beyer, C. Doersch, Y. Du, D. Duckworth, D. J. Fleet, D. Gnanapragasam, F. Golemo, C. Herrmann, et al. Kubric: A scalable dataset generator. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3749–3761, 2022. 4

work page 2022
[26]

Hagemann, M

A. Hagemann, M. Knorr, and C. Stiller. Deep geometry-aware camera self-calibration from video. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3438–3448, 2023. 7

work page 2023
[27]

M. Hu, W. Yin, C. Zhang, Z. Cai, X. Long, H. Chen, K. Wang, G. Yu, C. Shen, and S. Shen. Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. 3, 6

work page 2024
[28]

Huang, Z

J. Huang, Z. Gojcic, M. Atzmon, O. Litany, S. Fidler, and F. Williams. Neural kernel surface reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4369–4379, 2023. 6

work page 2023
[29]

Huang, W

N. Huang, W. Zheng, C. Xu, K. Keutzer, S. Zhang, A. Kanazawa, and Q. Wang. Segment any motion in videos.arXiv preprint arXiv:2503.22268, 2025. 6

work page arXiv 2025
[30]

Izquierdo, M

S. Izquierdo, M. Sayed, M. Firman, G. Garcia-Hernando, D. Turmukhambetov, J. Civera, O. Mac Aodha, G. Brostow, and J. Watson. Mvsanywhere: Zero-shot multi-view stereo. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 11493–11504, 2025. 4

work page 2025
[31]

Jiang, C

Z. Jiang, C. Zheng, I. Laina, D. Larlus, and A. Vedaldi. Geo4d: Leveraging video generators for geometric 4d scene reconstruction. arXiv preprint arXiv:2504.07961, 2025. 3

work page arXiv 2025
[32]

H. Jin, H. Jiang, H. Tan, K. Zhang, S. Bi, T. Zhang, F. Luan, N. Snavely, and Z. Xu. Lvsm: A large view synthesis model with minimal 3d inductive bias.arXiv preprint arXiv:2410.17242, 2024. 4

work page arXiv 2024
[33]

L. Jin, R. Tucker, Z. Li, D. Fouhey, N. Snavely, and A. Holynski. Stereo4d: Learning how things move in 3d from internet stereo videos.arXiv preprint arXiv:2412.09621, 2024. 3, 4

work page arXiv 2024
[34]

Kirillov, E

A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al. Segment anything. InProceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026,

work page
[35]

Korovko, D

A. Korovko, D. Slepichev, A. Efitorov, A. Dzhumamuratova, V. Kuznetsov, H. Rabeti, and J. Biswas. cuvslam: Cuda accelerated visual odometry.arXiv preprint arXiv:2506.04359, 2025. 3, 6

work page arXiv 2025
[36]

Leroy, Y

V. Leroy, Y. Cabon, and J. Revaud. Grounding image matching in 3d with mast3r. InEuropean Conference on Computer Vision, pages 71–91. Springer, 2024. 2, 3

work page 2024
[37]

Z. Li, R. Tucker, F. Cole, Q. Wang, L. Jin, V. Ye, A. Kanazawa, A. Holynski, and N. Snavely. Megasam: Accurate, fast and robust structure and motion from casual dynamic videos. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 10486–10496, 2025. 2, 3, 9, 10, 11 18 ViPE: Video Pose Engine for 3D Geometric Perception

work page 2025
[38]

Liang, J

H. Liang, J. Ren, A. Mirzaei, A. Torralba, Z. Liu, I. Gilitschenski, S. Fidler, C. Oztireli, H. Ling, Z. Gojcic, et al. Feed-forward bullet-time reconstruction of dynamic scenes from monocular videos.arXiv preprint arXiv:2412.03526,

work page arXiv
[39]

Z. Lin, S. Cen, D. Jiang, J. Karhade, H. Wang, C. Mitra, T. Ling, Y. Huang, S. Liu, M. Chen, et al. Towards understanding camera motions in any video.arXiv preprint arXiv:2504.15376, 2025. 4

work page arXiv 2025
[40]

Lindenberger, P.-E

P. Lindenberger, P.-E. Sarlin, and M. Pollefeys. Lightglue: Local feature matching at light speed. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 17627–17638, 2023. 10

work page 2023
[41]

S. Liu, W. Li, P. Qiao, and Y. Dou. Regist3r: Incremental registration with stereo foundation model.arXiv preprint arXiv:2504.12356, 2025. 3

work page arXiv 2025
[42]

S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, Q. Jiang, C. Li, J. Yang, H. Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. InEuropean Conference on Computer Vision, pages 38–55. Springer, 2024. 7

work page 2024
[43]

Y. Liu, S. Dong, S. Wang, Y. Yin, Y. Yang, Q. Fan, and B. Chen. Slam3r: Real-time dense scene reconstruction from monocular rgb videos. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 16651–16662,

work page
[44]

J. Lu, T. Huang, P. Li, Z. Dou, C. Lin, Z. Cui, Z. Dong, S.-K. Yeung, W. Wang, and Y. Liu. Align3r: Aligned monocular depth estimation for dynamic videos. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 22820–22830, 2025. 3

work page 2025
[45]

Y. Lu, X. Ren, J. Yang, T. Shen, Z. Wu, J. Gao, Y. Wang, S. Chen, M. Chen, S. Fidler, et al. Infinicube: Unbounded and controllable dynamic 3d driving scene generation with world-guided video models.arXiv preprint arXiv:2412.03934,

work page arXiv
[46]

B. D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. InIJCAI’81: 7th international joint conference on Artificial intelligence, volume 2, pages 674–679, 1981. 6

work page 1981
[47]

Maggio, H

D. Maggio, H. Lim, and L. Carlone. Vggt-slam: Dense rgb slam optimized on the sl (4) manifold.arXiv preprint arXiv:2505.12549, 2025. 2, 3

work page arXiv 2025
[48]

Mei and P

C. Mei and P. Rives. Single view point omnidirectional camera calibration from planar grids. InProceedings 2007 IEEE International Conference on Robotics and Automation, pages 3945–3950. IEEE, 2007. 7

work page 2007
[49]

Mur-Artal, J

R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos. Orb-slam: A versatile and accurate monocular slam system.IEEE transactions on robotics, 31(5):1147–1163, 2015. 2, 3, 9

work page 2015
[50]

Murai, E

R. Murai, E. Dexheimer, and A. J. Davison. Mast3r-slam: Real-time dense slam with 3d reconstruction priors. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 16695–16705, 2025. 2, 3, 9

work page 2025
[51]

L. Pan, D. Baráth, M. Pollefeys, and J. L. Schönberger. Global structure-from-motion revisited. InEuropean Conference on Computer Vision, pages 58–77. Springer, 2024. 3

work page 2024
[52]

Piccinelli, C

L. Piccinelli, C. Sakaridis, M. Segu, Y.-H. Yang, S. Li, W. Abbeloos, and L. Van Gool. Unik3d: Universal camera monocular 3d estimation.arXiv preprint arXiv:2503.16591, 2025. 3, 6

work page arXiv 2025
[53]

Piccinelli, C

L. Piccinelli, C. Sakaridis, Y.-H. Yang, M. Segu, S. Li, W. Abbeloos, and L. Van Gool. Unidepthv2: Universal monocular metric depth estimation made simpler.arXiv preprint arXiv:2502.20110, 2025. 3, 6, 11

work page arXiv 2025
[54]

V. A. Prisacariu, O. Kähler, S. Golodetz, M. Sapienza, T. Cavallari, P. H. Torr, and D. W. Murray. Infinitam v3: A framework for large-scale 3d reconstruction with loop closure.arXiv preprint arXiv:1708.00783, 2017. 2

work page internal anchor Pith review Pith/arXiv arXiv 2017
[55]

X. Ren, Y. Lu, T. Cao, R. Gao, S. Huang, A. Sabour, T. Shen, T. Pfaff, J. Z. Wu, R. Chen, et al. Cosmos-drive-dreams: Scalable synthetic driving data generation with world foundation models.arXiv preprint arXiv:2506.09042, 2025. 4

work page arXiv 2025
[56]

X. Ren, Y. Lu, H. Liang, Z. Wu, H. Ling, M. Chen, S. Fidler, F. Williams, and J. Huang. Scube: Instant large-scale scene reconstruction using voxsplats.Advances in Neural Information Processing Systems, 37:97670–97698, 2024. 4 19 ViPE: Video Pose Engine for 3D Geometric Perception

work page 2024
[57]

X. Ren, T. Shen, J. Huang, H. Ling, Y. Lu, M. Nimier-David, T. Müller, A. Keller, S. Fidler, and J. Gao. Gen3c: 3d- informed world-consistent video generation with precise camera control. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 6121–6132, 2025. 4, 13

work page 2025
[58]

Rockwell, J

C. Rockwell, J. Tung, T.-Y. Lin, M.-Y. Liu, D. F. Fouhey, and C.-H. Lin. Dynamic camera poses and where to find them. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12444–12455, 2025. 10, 12

work page 2025
[59]

J. L. Schönberger and J.-M. Frahm. Structure-from-motion revisited. InConference on Computer Vision and Pattern Recognition (CVPR), 2016. 2, 3

work page 2016
[60]

Schops, T

T. Schops, T. Sattler, and M. Pollefeys. Bad slam: Bundle adjusted direct rgb-d slam. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 134–144, 2019. 11

work page 2019
[61]

Shi et al

J. Shi et al. Good features to track. In1994 Proceedings of IEEE conference on computer vision and pattern recognition, pages 593–600. IEEE, 1994. 6

work page 1994
[62]

Sturm, N

J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers. A benchmark for the evaluation of rgb-d slam systems. In 2012 IEEE/RSJ international conference on intelligent robots and systems, pages 573–580. IEEE, 2012. 4, 8, 9

work page 2012
[63]

Sucar, Z

E. Sucar, Z. Lai, E. Insafutdinov, and A. Vedaldi. Dynamic point maps: A versatile representation for dynamic 3d reconstruction. arXiv preprint arXiv:2503.16318, 2025. 3

work page arXiv 2025
[64]

Z. Tang, Y. Fan, D. Wang, H. Xu, R. Ranjan, A. Schwing, and Z. Yan. Mv-dust3r+: Single-stage scene reconstruction from sparse views in 2 seconds. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5283–5293, 2025. 3

work page 2025
[65]

A. Team, H. Zhu, Y. Wang, J. Zhou, W. Chang, Y. Zhou, Z. Li, J. Chen, C. Shen, J. Pang, et al. Aether: Geometric-aware unified world modeling.arXiv preprint arXiv:2503.18945, 2025. 4

work page arXiv 2025
[66]

Teed and J

Z. Teed and J. Deng. Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras.Advances in neural information processing systems, 34:16558–16569, 2021. 4, 5, 7, 9

work page 2021
[67]

Veicht, P.-E

A. Veicht, P.-E. Sarlin, P. Lindenberger, and M. Pollefeys. Geocalib: Learning single-image calibration with geometric optimization. InEuropean Conference on Computer Vision, pages 1–20. Springer, 2024. 4, 9

work page 2024
[68]

T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025. 13

work page internal anchor Pith review Pith/arXiv arXiv 2025
[69]

Wang and L

H. Wang and L. Agapito. 3d reconstruction with spatial memory.arXiv preprint arXiv:2408.16061, 2024. 3

work page arXiv 2024
[70]

J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025. 3, 9, 11

work page 2025
[71]

J. Wang, N. Karaev, C. Rupprecht, and D. Novotny. Vggsfm: Visual geometry grounded deep structure from motion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21686–21697, 2024. 3

work page 2024
[72]

Q. Wang, W. Li, C. Mou, X. Cheng, and J. Zhang. 360dvd: Controllable panorama video generation with 360-degree video diffusion model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6913–6923, 2024. 13

work page 2024
[73]

Q. Wang, Y. Zhang, A. Holynski, A. A. Efros, and A. Kanazawa. Continuous 3d perception model with persistent state. arXiv preprint arXiv:2501.12387, 2025. 3, 11

work page arXiv 2025
[74]

S. Wang, V. Leroy, Y. Cabon, B. Chidlovskii, and J. Revaud. Dust3r: Geometric 3d vision made easy. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20697–20709, 2024. 3

work page 2024
[75]

Y. Wang, J. Zhou, H. Zhu, W. Chang, Y. Zhou, Z. Li, J. Chen, J. Pang, C. Shen, and T. He. pi3: Scalable permutation- equivariant visual geometry learning.arXiv preprint arXiv:2507.13347, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025
[76]

Z. Wang, S. Chen, L. Yang, J. Wang, Z. Zhang, H. Zhao, and Z. Zhao. Depth anything with any prior.arXiv preprint arXiv:2505.10565, 2025. 3, 8 20 ViPE: Video Pose Engine for 3D Geometric Perception

work page arXiv 2025
[77]

Wimbauer, W

F. Wimbauer, W. Chen, D. Muhle, C. Rupprecht, and D. Cremers. Anycam: Learning to recover camera poses and intrinsics from casual videos. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 16717–16727, 2025. 3

work page 2025
[78]

R. Wu, R. Gao, B. Poole, A. Trevithick, C. Zheng, J. T. Barron, and A. Holynski. Cat4d: Create anything in 4d with multi-view video diffusion models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 26057–26068, 2025. 4

work page 2025
[79]

Y. Xiao, J. Wang, N. Xue, N. Karaev, Y. Makarov, B. Kang, X. Zhu, H. Bao, Y. Shen, and X. Zhou. Spatialtrackerv2: 3d point tracking made easy.arXiv preprint arXiv:2507.12462, 2025. 3

work page arXiv 2025
[80]

T.-X. Xu, X. Gao, W. Hu, X. Li, S.-H. Zhang, and Y. Shan. Geometrycrafter: Consistent geometry estimation for open-world videos with diffusion priors.arXiv preprint arXiv:2504.01016, 2025. 3

work page arXiv 2025

Showing first 80 references.