arxiv: 2605.26519 · v2 · pith:AK2S2SDOnew · submitted 2026-05-26 · 💻 cs.CV

R³: 3D Reconstruction via Relative Regression

Pith reviewed 2026-06-29 17:53 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D reconstruction, relative regression, pose estimation, streaming reconstruction, feed-forward models, computer vision, depth estimation

The pith

R³ uses relative regression via a lightweight MLP to predict confidence-weighted constraints, removing the global coordinate frame bottleneck for long-context and streaming 3D reconstruction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that existing feed-forward geometry models are limited by their assumption of a global coordinate frame, which creates problems with arbitrary origins and growing translation magnitudes in extended sequences. It proposes relative regression as the fix, where an MLP outputs relative constraints, each weighted by a predicted confidence. These confidences then serve as the single mechanism for both weighting training losses and aggregating poses at inference time. The result supports full offline reconstruction as well as causal streaming with bounded memory. A sympathetic reader cares because this directly targets the scalability barrier that prevents current models from handling realistic long videos or live capture.
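
To make the frame argument concrete, the sketch below (an illustration, not code from the paper) compares an absolute pose with a relative one. The helper names `make_pose` and `relative_pose` and all numeric values are assumptions chosen for the example; poses are taken to be 4x4 camera-to-world matrices.

```python
import numpy as np

def make_pose(yaw_rad, t):
    """Build a 4x4 camera-to-world pose from a yaw angle and a translation."""
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    T = np.eye(4)
    T[:3, :3] = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    T[:3, 3] = t
    return T

def relative_pose(T_i, T_j):
    """Pose of camera j expressed in the frame of camera i."""
    return np.linalg.inv(T_i) @ T_j

# Two nearby cameras far from the world origin, as late in a long sequence.
T_a = make_pose(0.10, [1000.0, 250.0, 3.0])
T_b = make_pose(0.15, [1000.5, 250.2, 3.0])

# Re-express both poses in a different, arbitrary world frame G.
G = make_pose(1.2, [-40.0, 7.0, 2.0])
rel_original = relative_pose(T_a, T_b)
rel_moved = relative_pose(G @ T_a, G @ T_b)

abs_norm = np.linalg.norm(T_b[:3, 3])           # grows with distance from the origin
rel_norm = np.linalg.norm(rel_original[:3, 3])  # depends only on the local baseline
frame_invariant = np.allclose(rel_original, rel_moved, atol=1e-6)
```

In this toy setup the absolute translation norm exceeds 1000 units while the relative translation norm is about 0.54, and the relative pose is unchanged when the world frame is replaced by `G`.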

Core claim

R³ employs relative regression. We employ a lightweight MLP to predict confidence-weighted relative constraints. These confidences serve as a unified anchor: weighting losses during training and guiding pose aggregation during inference. R³ supports both full-context offline reconstruction and causal, bounded-memory streaming.
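
As a rough illustration of how a predicted confidence can weight a training loss, here is a minimal sketch. It assumes the form `c * r - alpha * log(c)` with `c = 1 + exp(raw)`, which follows the confidence-aware loss popularized by DUSt3R [66]; whether R³ uses this exact form is not stated in the text above. The function name, `alpha`, and the sample residuals are hypothetical.

```python
import numpy as np

def confidence_weighted_loss(residuals, raw_conf, alpha=0.2):
    """Mean of c * r - alpha * log(c) over pairs, with c = 1 + exp(raw_conf).

    residuals: per-pair pose errors (non-negative).
    raw_conf:  unconstrained network outputs, mapped to confidences c > 1.
    The log term stops the model from driving every confidence to its minimum.
    """
    residuals = np.asarray(residuals, dtype=float)
    conf = 1.0 + np.exp(np.asarray(raw_conf, dtype=float))
    return float(np.mean(conf * residuals - alpha * np.log(conf)))

# One reliable pair (small error) and one unreliable pair (large error).
residuals = [0.01, 0.5]

# High confidence on the reliable pair, low on the unreliable one.
loss_good = confidence_weighted_loss(residuals, raw_conf=[2.0, -2.0])
# The reverse assignment: confident about the bad pair.
loss_bad = confidence_weighted_loss(residuals, raw_conf=[-2.0, 2.0])
```

Under this assumed form, assigning high confidence to the low-error pair yields the lower loss, which is the behavior that lets confidence act as a learned measure of pair reliability.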

What carries the argument

Confidence-weighted relative constraints output by a lightweight MLP, acting as the single anchor for loss weighting in training and pose aggregation in inference.
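
The aggregation step can be pictured with a small sketch. This is an assumed fusion rule, not the paper's: candidate poses for one frame (for example, one per keyframe) are combined with a confidence-weighted mean for translation and a confidence-weighted chordal mean for rotation, projected back onto SO(3) with an SVD. The function names and sample values are hypothetical; the separate rotation and translation confidences mirror the description in the Figure 3 caption.

```python
import numpy as np

def rot_z(theta):
    """Rotation matrix for a yaw of theta radians about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def aggregate_pose_candidates(rotations, translations, rot_conf, trans_conf):
    """Fuse K candidate poses into one, weighting each by its confidence."""
    R = np.asarray(rotations, dtype=float)       # (K, 3, 3)
    t = np.asarray(translations, dtype=float)    # (K, 3)
    w_r = np.asarray(rot_conf, dtype=float)
    w_t = np.asarray(trans_conf, dtype=float)
    w_r = w_r / w_r.sum()
    w_t = w_t / w_t.sum()

    # Translation: confidence-weighted mean.
    t_fused = (w_t[:, None] * t).sum(axis=0)

    # Rotation: weighted sum of matrices, then nearest rotation via SVD.
    M = np.einsum("k,kij->ij", w_r, R)
    U, _, Vt = np.linalg.svd(M)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    R_fused = U @ D @ Vt
    return R_fused, t_fused

# Two agreeing candidates plus one low-confidence outlier.
candidates_R = [rot_z(0.10), rot_z(0.10), rot_z(0.60)]
candidates_t = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
R_fused, t_fused = aggregate_pose_candidates(
    candidates_R, candidates_t, rot_conf=[5.0, 5.0, 0.1], trans_conf=[5.0, 5.0, 0.1]
)
fused_yaw = float(np.arctan2(R_fused[1, 0], R_fused[0, 0]))
```

With these weights the outlier (yaw 0.60, x = 3.0) shifts the fused estimate only slightly, to a yaw of about 0.105 and an x of about 1.02.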

If this is right

  • Full-context offline reconstruction becomes possible without global-frame constraints.
  • Causal streaming reconstruction runs with bounded memory and no need to maintain an arbitrary temporal origin.
  • Translation magnitudes no longer grow unbounded, avoiding the scaling issues that appear in long sequences.
  • The same predicted confidences improve training stability and inference aggregation in both modes.
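
The bounded-memory claim can be sketched with a fixed-capacity keyframe store. The class below is hypothetical, and the first-in-first-out eviction policy is an assumption; the text above only states that the active keyframe bank is bounded.

```python
from collections import deque

class KeyframeBank:
    """Fixed-capacity store of (frame_id, camera_token) pairs (hypothetical)."""

    def __init__(self, capacity):
        self._items = deque(maxlen=capacity)

    def add(self, frame_id, token):
        # deque(maxlen=...) drops the oldest entry once capacity is reached.
        self._items.append((frame_id, token))

    def __len__(self):
        return len(self._items)

    def frame_ids(self):
        return [fid for fid, _ in self._items]

def stream(num_frames, capacity):
    """Simulate streaming: each new frame is paired with every active keyframe."""
    bank = KeyframeBank(capacity)
    pairs_per_frame = []
    for frame_id in range(num_frames):
        pairs_per_frame.append(len(bank))  # pairwise edges needed for this frame
        bank.add(frame_id, token=None)     # token omitted in this toy example
    return bank, pairs_per_frame

bank, pairs = stream(num_frames=1000, capacity=8)
```

However long the stream runs, the per-frame work and the stored state are capped by `capacity`, in contrast to designs that keep every past frame in context.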

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The bounded-memory streaming mode opens the door to real-time applications such as live AR or robot navigation where memory must stay fixed.
  • Relative regression may transfer to other sequential geometry tasks like video-based SLAM where global frames produce similar drift.
  • Direct comparisons of cumulative error on hour-long sequences would test whether the unified anchor fully eliminates the accumulation problem.

Load-bearing premise

The MLP can produce relative constraints and confidences accurate enough to prevent error accumulation or drift when used for streaming inference over long sequences.

What would settle it

Measuring large pose drift or reconstruction collapse on extended streaming video sequences when the relative mechanism is applied would show the central claim does not hold.
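
One way to run such a test is to track absolute trajectory error (ATE), the metric named in the Figure 4 caption. The sketch below is a simplified version that assumes rigid alignment (rotation and translation only, via the Kabsch method) and uses synthetic trajectories; monocular evaluations often also align scale, which is omitted here. The drift model is hypothetical.

```python
import numpy as np

def ate_rmse(est_xyz, gt_xyz):
    """RMSE of camera positions after the best rigid alignment of est to gt."""
    est = np.asarray(est_xyz, dtype=float)
    gt = np.asarray(gt_xyz, dtype=float)
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    H = (est - mu_e).T @ (gt - mu_g)           # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T    # rotation mapping est onto gt
    aligned = (est - mu_e) @ R.T + mu_g
    return float(np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1))))

# Synthetic ground-truth trajectory.
t = np.linspace(0.0, 10.0, 200)
gt = np.stack([t, np.sin(t), np.cos(0.5 * t)], axis=1)

# Same trajectory expressed in another frame: alignment should remove this.
yaw = 0.7
Rz = np.array([[np.cos(yaw), -np.sin(yaw), 0.0],
               [np.sin(yaw),  np.cos(yaw), 0.0],
               [0.0, 0.0, 1.0]])
est_clean = gt @ Rz.T + np.array([5.0, -2.0, 1.0])

# Synthetic error that grows along the sequence (hypothetical drift model).
drift = np.outer(np.linspace(0.0, 1.0, 200) ** 2, [0.5, 0.0, 0.0])
est_drift = est_clean + drift

ate_clean = ate_rmse(est_clean, gt)
ate_drift = ate_rmse(est_drift, gt)
```

A frame change alone leaves the ATE near zero, while accumulated drift raises it; plotting this value against sequence length is the kind of evidence that would support or undercut the streaming claim.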

Figures

Figures reproduced from arXiv: 2605.26519 by Anpei Chen, Congrong Xu, Huachen Gao, Jun Gao, Xingyu Chen, Yuliang Xiu.

Figure 1
Consistent, scalable, and efficient streaming geometry via relative pose regression. R³ reconstructs camera poses and dense geometry from unbounded video streams via feed-forward relative pose regression. It maintains local consistency, scales to ultra-long sequences with bounded memory, and runs at 20+ FPS with 372M parameters.
Figure 2
Three feed-forward pose paradigms, viewed as pose graphs. Edges denote supervised pairwise pose terms; arrowheads encode directional supervision. (a) VGGT [64] fixes the world frame to the first camera and supervises only edges from this anchor to every other camera. (b) π³ [69] regresses absolute poses in a model-chosen world frame and supervises every unordered pair with uniform weight. (c) R³ drops the…
Figure 3
Overview of R³. A causal geometry backbone extracts a single camera token from each frame. A lightweight pairwise pose head then predicts directed relative-pose edges from token pairs, along with separate rotation and translation confidences. These confidence-weighted edges are fused into a coherent trajectory, enabling streaming inference with a bounded active keyframe bank.
Figure 5
Long streaming comparison. Qualitative in-the-wild results show that R³ maintains more consistent trajectories and point-map alignments over hundreds of frames than baselines.
Figure 4
Pose accuracy scaling on long sequences. We plot ATE for ScanNet [14] and TUM-dynamics [56] as the number of input frames increases. While several streaming baselines exhibit cumulative drift or trigger out-of-memory (OOM) failures, R³ maintains stable trajectory estimation. We further test pose-only trajectory accuracy on a subset of DL3DV-Benchmark [35] (304–439 frames), which contains wider camera basel…
Figure 6
Reconstruction gallery. Qualitative reconstruction results from R³ in streaming mode across diverse indoor and outdoor scenes. The reconstructed point clouds remain geometrically coherent and visually consistent across varied scene layouts, object scales, and camera trajectories, demonstrating R³’s ability to maintain stable scene structure during online reconstruction of long sequences.
Figure 7
Figure 7 reports inference FPS and GPU memory usage under the same 7-Scenes protocol used in Sec. 4.3. Global-regression baselines (e.g., StreamVGGT) either hit OOM or slow sharply as N grows; R³ replaces this O(N²) growth with a bounded memory increase and a gentler FPS decline, consistent with the bounded keyframe bank. [Plot: FPS (↑) versus number of input views, 200 to 1000; legend includes R³, TTT3R, CUT3R, Poi…; three OOM markers.]
Figure 8
Learned confidence behaves as pair reliability. Across all pairs from the first 20 ScanNet scenes, pairs are grouped into equal-mass confidence quantile bins; the x-axis is the bin center. The solid line shows the per-bin mean pose error and the shaded band shows the within-bin dispersion, so each polyline reports the average error and spread within that confidence quantile interval. Higher predicted confi…
Figure 9
Bird’s-eye view of a long streaming sequence. Without reset, the trajectory eventually drifts
Figure 10
Causal streaming reconstructions. Distant viewpoints, occlusions, and long sequences can…
Original abstract

Recent feed-forward geometry foundation models have demonstrated impressive generalization by recovering depth and poses in a single forward pass. However, these models are typically constrained by a global coordinate frame assumption. This dependency becomes a significant bottleneck for long-context and streaming reconstruction, as it forces the network to maintain an arbitrary temporal origin and handle translation magnitudes that grow unbounded over time. Our solution, which we call $R^3$, employs relative regression. We employ a lightweight MLP to predict confidence-weighted relative constraints. These confidences serve as a unified anchor: weighting losses during training and guiding pose aggregation during inference. $R^3$ supports both full-context offline reconstruction and causal, bounded-memory streaming. Our evaluation in both offline and streaming settings validates the effectiveness of our relative mechanism. Project page: https://kevinxu02.github.io/r3-site

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes $R^3$, a feed-forward 3D reconstruction method that replaces global coordinate frame regression with relative regression. A lightweight MLP predicts confidence-weighted relative constraints; these confidences weight the training losses and, at inference, guide pose aggregation. The method is claimed to support both full-context offline reconstruction and causal streaming reconstruction with bounded memory. Evaluation in both regimes is said to validate the relative mechanism.

Significance. If the confidence-weighted aggregation demonstrably bounds drift, the unified-anchor design would be a practical contribution to long-sequence and streaming reconstruction, removing the need to regress unbounded translations. The idea of reusing the same predicted confidences for both loss weighting and inference-time aggregation is a clean architectural choice.

major comments (1)
  1. [Abstract] Abstract: the central claim that the MLP-predicted confidences prevent unbounded error growth during causal streaming aggregation is load-bearing, yet the text supplies neither the aggregation equations, a drift bound, nor an ablation isolating the confidence mechanism; without these the effectiveness statement cannot be evaluated.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. We address the single major comment below and will incorporate the requested clarifications in the revised manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that the MLP-predicted confidences prevent unbounded error growth during causal streaming aggregation is load-bearing, yet the text supplies neither the aggregation equations, a drift bound, nor an ablation isolating the confidence mechanism; without these the effectiveness statement cannot be evaluated.

    Authors: We agree that the abstract's claim about bounded error growth in causal streaming would be stronger with explicit technical support. The current manuscript describes the relative regression and the dual use of MLP-predicted confidences for loss weighting and pose aggregation, but does not present the aggregation equations, a drift analysis, or a dedicated ablation in the main text or appendix. In the revision we will (1) add the aggregation equations and a simple drift bound to Section 3, (2) include an ablation isolating the confidence weights in the streaming setting in Section 4, and (3) revise the abstract to reference these additions rather than stating the effectiveness claim without support. These changes directly address the concern.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The abstract and description present R³ as employing relative regression via a lightweight MLP for confidence-weighted constraints that act as a unified anchor, but contain no equations, derivations, self-citations, or fitted parameters renamed as predictions. No load-bearing step reduces to its own inputs by construction. The method is described at a conceptual level, and its validation is claimed to rest on external benchmarks, so the derivation shows no circularity indicators.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no free parameters, axioms, or invented entities can be identified from the given text.

pith-pipeline@v0.9.1-grok · 5680 in / 1031 out tokens · 20806 ms · 2026-06-29T17:53:41.514900+00:00 · methodology


Reference graph

Works this paper leans on

82 extracted references · 27 canonical work pages · 14 internal anchors

  1. [1]

    Map-free visual relocalization: Metric pose relative to a single image

    Eduardo Arnold, Jamie Wynn, Sara Vicente, Guillermo Garcia-Hernando, Áron Monszpart, Victor Prisacariu, Daniyar Turmukhambetov, and Eric Brachmann. Map-free visual relocalization: Metric pose relative to a single image. In European Conference on Computer Vision (ECCV), 2022

  2. [2]

    Neural rgb-d surface reconstruction

    Dejan Azinović, Ricardo Martin-Brualla, Dan B Goldman, Matthias Nießner, and Justus Thies. Neural rgb-d surface reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6290–6301, June 2022

  3. [3]

    ARKitScenes: A diverse real-world dataset for 3D indoor scene understanding using mobile RGB-D data

    Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, and Elad Shulman. ARKitScenes: A diverse real-world dataset for 3D indoor scene understanding using mobile RGB-D data. In Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2021

  4. [4]

    A naturalistic open source movie for optical flow evaluation

    Daniel J. Butler, Jonas Wulff, Garrett B. Stanley, and Michael J. Black. A naturalistic open source movie for optical flow evaluation. In European Conference on Computer Vision (ECCV), 2012

  5. [5]

    Virtual KITTI 2

    Yohann Cabon, Naila Murray, and Martin Humenberger. Virtual KITTI 2. arXiv preprint arXiv:2001.10773, 2020

  6. [6]

    MUSt3R: Multi-view network for stereo 3D reconstruction

    Yohann Cabon, Lucas Stoffl, Leonid Antsfeld, Gabriela Csurka, Boris Chidlovskii, Jérôme Revaud, and Vincent Leroy. MUSt3R: Multi-view network for stereo 3D reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  7. [7]

    ORB-SLAM3: An accurate open-source library for visual, visual-inertial, and multimap SLAM

    Carlos Campos, Richard Elvira, Juan J. Gómez Rodríguez, J. M. M. Montiel, and Juan D. Tardós. ORB-SLAM3: An accurate open-source library for visual, visual-inertial, and multimap SLAM. IEEE Transactions on Robotics, 37(6):1874–1890, 2021

  8. [8]

    Geometric Context Transformer for Streaming 3D Reconstruction

    Lin-Zhuo Chen, Jian Gao, Yihang Chen, Ka Leong Cheng, Yipengjing Sun, Liangxiao Hu, Nan Xue, Xing Zhu, Yujun Shen, Yao Yao, and Yinghao Xu. Geometric context transformer for streaming 3D reconstruction. arXiv preprint arXiv:2604.14141, 2026

  9. [9]

    Easi3R: Estimating disentangled motion from DUSt3R without training

    Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. Easi3R: Estimating disentangled motion from DUSt3R without training. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025. arXiv:2503.24391

  10. [10]

    TTT3R: 3D Reconstruction as Test-Time Training

    Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. TTT3R: 3D reconstruction as test-time training. arXiv preprint arXiv:2509.26645, 2025

  11. [11]

    Human3R: Everyone everywhere all at once

    Yue Chen, Xingyu Chen, Yuxuan Xue, Anpei Chen, Yuliang Xiu, and Gerard Pons-Moll. Human3R: Everyone everywhere all at once. In International Conference on Learning Representations (ICLR), 2026. arXiv:2510.06219

  12. [12]

    LONG3R: Long sequence streaming 3D reconstruction

    Zhuoguang Chen, Minghui Qin, Tianyuan Yuan, Zhe Liu, and Hang Zhao. LONG3R: Long sequence streaming 3D reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025. arXiv:2507.18255

  13. [13]

    LongStream: Long-sequence streaming autoregressive visual geometry

    Chong Cheng, Xianda Chen, Tao Xie, Wei Yin, Weiqiang Ren, Qian Zhang, Xiaoyang Guo, and Hao Wang. LongStream: Long-sequence streaming autoregressive visual geometry. arXiv preprint arXiv:2602.13172, 2026

  14. [14]

    ScanNet: Richly-annotated 3D reconstructions of indoor scenes

    Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017

  15. [15]

    SuperPoint: Self-supervised interest point detection and description

    Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperPoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2018

  16. [16]

    Flex Attention: A Programming Model for Generating Optimized Attention Kernels

    Juechu Dong, Boyuan Feng, Driss Guessous, Yanbo Liang, and Horace He. Flex attention: A programming model for generating optimized attention kernels. arXiv preprint arXiv:2412.05496, 2024

  17. [17]

    MASt3R-SfM: A fully-integrated solution for unconstrained structure-from-motion

    Bardienus Pieter Duisterhof, Lojze Zust, Philippe Weinzaepfel, Vincent Leroy, Yohann Cabon, and Jérôme Revaud. MASt3R-SfM: A fully-integrated solution for unconstrained structure-from-motion. In International Conference on 3D Vision (3DV), 2025

  18. [18]

    VGG-T 3: Offline feed-forward 3D reconstruction at scale

    Sven Elflein, Ruilong Li, Sérgio Agostinho, Zan Gojcic, Laura Leal-Taixé, Qunjie Zhou, and Aljosa Osep. VGG-T 3: Offline feed-forward 3D reconstruction at scale. arXiv preprint arXiv:2602.23361, 2026

  19. [19]

    Direct sparse odometry

    Jakob Engel, Vladlen Koltun, and Daniel Cremers. Direct sparse odometry. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 40(3):611–625, 2018

  20. [20]

    Accurate, dense, and robust multi-view stereopsis

    Yasutaka Furukawa and Jean Ponce. Accurate, dense, and robust multi-view stereopsis. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 32(8):1362–1376, 2010

  21. [21]

    Are we ready for autonomous driving? the KITTI vision benchmark suite

    Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the KITTI vision benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012

  22. [22]

    Emergent outlier view rejection in visual geometry grounded transformers

    Jisang Han, Sunghwan Hong, Jaewoo Jung, Wooseok Jang, Honggyu An, Qianqian Wang, Seungryong Kim, and Chen Feng. Emergent outlier view rejection in visual geometry grounded transformers. arXiv preprint arXiv:2512.04012, 2025

  23. [23]

    DeepMVS: Learning multi-view stereopsis

    Po-Han Huang, Kevin Matzen, Johannes Kopf, Narendra Ahuja, and Jia-Bin Huang. DeepMVS: Learning multi-view stereopsis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018

  24. [24]

    Pow3R: Empowering unconstrained 3D reconstruction with camera and scene priors

    Wonbong Jang, Philippe Weinzaepfel, Vincent Leroy, Lourdes Agapito, and Jérôme Revaud. Pow3R: Empowering unconstrained 3D reconstruction with camera and scene priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1071–1081, 2025

  25. [25]

    ZipMap: Linear-time stateful 3D reconstruction via test-time training

    Haian Jin, Rundi Wu, Tianyuan Zhang, Ruiqi Gao, Jonathan T. Barron, Noah Snavely, and Aleksander Hołyński. ZipMap: Linear-time stateful 3D reconstruction via test-time training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),

  26. [26]

    DynamicStereo: Consistent dynamic depth from stereo videos

    Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. DynamicStereo: Consistent dynamic depth from stereo videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

  27. [27]

    MapAnything: Universal Feed-Forward Metric 3D Reconstruction

    Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, Jonathon Luiten, Manuel Lopez-Antequera, Samuel Rota Bulò, Christian Richardt, Deva Ramanan, Sebastian Scherer, and Peter Kontschieder. MapAnything: Universal feed-forward metric 3D reconstruction....

  28. [28]

    Parallel tracking and mapping for small AR workspaces

    Georg Klein and David Murray. Parallel tracking and mapping for small AR workspaces. In IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR), 2007

  29. [29]

    STream3R: Scalable sequential 3D reconstruction with causal transformer

    Yushi Lan, Yihang Luo, Fangzhou Hong, Shangchen Zhou, Honghua Chen, Zhaoyang Lyu, Shuai Yang, Bo Dai, Chen Change Loy, and Xingang Pan. STream3R: Scalable sequential 3D reconstruction with causal transformer. arXiv preprint arXiv:2508.10893, 2025

  30. [30]

    Grounding image matching in 3D with MASt3R

    Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3D with MASt3R. In European Conference on Computer Vision (ECCV), 2024

  31. [31]

    MegaSaM: Accurate, fast, and robust structure and motion from casual dynamic videos

    Zhengqi Li, Richard Tucker, Forrester Cole, Qianqian Wang, Linyi Jin, Vickie Ye, Angjoo Kanazawa, Aleksander Hołyński, and Noah Snavely. MegaSaM: Accurate, fast, and robust structure and motion from casual dynamic videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10486–10496, 2025

  32. [32]

    WinT3R: Window-based streaming reconstruction with camera token pool

    Zizun Li, Jianjun Zhou, Yifan Wang, Haoyu Guo, Wenzheng Chang, Yang Zhou, Haoyi Zhu, Junyi Chen, Chunhua Shen, and Tong He. WinT3R: Window-based streaming reconstruction with camera token pool. In International Conference on Learning Representations (ICLR),

  33. [33]

    Depth Anything 3: Recovering the Visual Space from Any Views

    Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y. Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647, 2025

  34. [34]

    LightGlue: Local feature matching at light speed

    Philipp Lindenberger, Paul-Edouard Sarlin, and Marc Pollefeys. LightGlue: Local feature matching at light speed. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 17627–17638, 2023

  35. [35]

    DL3DV-10K: A large-scale scene dataset for deep learning-based 3D vision

    Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, Xuanmao Li, Xingpeng Sun, Rohan Ashok, Aniruddha Mukherjee, Hao Kang, Xiangrui Kong, Gang Hua, Tianyi Zhang, Bedrich Benes, and Aniket Bera. DL3DV-10K: A large-scale scene dataset for deep learning-based 3D vision. In Proceedings of the IEEE/CVF Conf...

  36. [36]

    SLAM3R: Real-time dense scene reconstruction from monocular RGB videos

    Yuzheng Liu, Siyan Dong, Shuzhe Wang, Yingda Yin, Yanchao Yang, Qingnan Fan, and Baoquan Chen. SLAM3R: Real-time dense scene reconstruction from monocular RGB videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. arXiv:2412.09401

  37. [37]

    OVGGT: O(1) Constant-Cost Streaming Visual Geometry Transformer

    Si-Yu Lu, Po-Ting Chen, Hui-Che Hsu, Sin-Ye Jhong, Wen-Huang Cheng, and Yung-Yao Chen. OVGGT: O(1) constant-cost streaming visual geometry transformer. arXiv preprint arXiv:2603.05959, 2026

  38. [38]

    VGGT-SLAM: Dense RGB SLAM Optimized on the SL(4) Manifold

    Dominic Maggio, Hyungtae Lim, and Luca Carlone. VGGT-SLAM: Dense RGB SLAM optimized on the SL(4) manifold. arXiv preprint arXiv:2505.12549, 2025

  39. [39]

    Spring: A high-resolution high-detail dataset and benchmark for scene flow, optical flow and stereo

    Lukas Mehl, Jenny Schmalfuss, Azin Jahedi, Yaroslava Nalivayko, and Andrés Bruhn. Spring: A high-resolution high-detail dataset and benchmark for scene flow, optical flow and stereo. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

  40. [40]

    ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras

    Raúl Mur-Artal and Juan D. Tardós. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics, 33(5):1255–1262, 2017

  41. [41]

    ORB-SLAM: A versatile and accurate monocular SLAM system

    Raúl Mur-Artal, J. M. M. Montiel, and Juan D. Tardós. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 31(5):1147–1163, 2015

  42. [42]

    MASt3R-SLAM: Real-time dense SLAM with 3D reconstruction priors

    Riku Murai, Eric Dexheimer, and Andrew J. Davison. MASt3R-SLAM: Real-time dense SLAM with 3D reconstruction priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. arXiv:2412.12392

  43. [43]

    DINOv2: Learning robust visual features without supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick La...

  44. [44]

    ReFusion: 3D reconstruction in dynamic environments for RGB-D cameras exploiting residuals

    Emanuele Palazzolo, Jens Behley, Philipp Lottes, Philippe Giguère, and Cyrill Stachniss. ReFusion: 3D reconstruction in dynamic environments for RGB-D cameras exploiting residuals. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019

  45. [45]

    Global structure-from-motion revisited

    Linfei Pan, Daniel Barath, Marc Pollefeys, and Johannes Lutz Schönberger. Global structure-from-motion revisited. In European Conference on Computer Vision (ECCV), 2024

  46. [46]

    Aria synthetic environments dataset

    Project Aria. Aria synthetic environments dataset. https://www.projectaria.com/datasets/ase/, 2024. Meta Reality Labs Research

  47. [47]

    Common objects in 3D: Large-scale learning and evaluation of real-life 3D category reconstruction

    Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common objects in 3D: Large-scale learning and evaluation of real-life 3D category reconstruction. In IEEE/CVF International Conference on Computer Vision (ICCV), 2021

  48. [48]

    Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding

    Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M. Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In IEEE/CVF International Conference on Computer Vision (ICCV), 2021

  49. [49]

    RobustNeRF: Ignoring distractors with robust losses

    Sara Sabour, Suhani Vora, Daniel Duckworth, Ivan Krasin, David J. Fleet, and Andrea Tagliasacchi. RobustNeRF: Ignoring distractors with robust losses. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20626–20636, June 2023

  50. [50]

    SuperGlue: Learning feature matching with graph neural networks

    Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperGlue: Learning feature matching with graph neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020

  51. [51]

    Structure-from-motion revisited

    Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016

  52. [52]

    Pixelwise view selection for unstructured multi-view stereo

    Johannes Lutz Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2016

  53. [53]

    A multi-view stereo benchmark with high-resolution images and multi-camera videos

    Thomas Schöps, Johannes L. Schönberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017

  54. [54]

    FastVGGT: Training-Free Acceleration of Visual Geometry Transformer

    You Shen, Zhipeng Zhang, Yansong Qu, Xiawu Zheng, Jiayi Ji, Shengchuan Zhang, and Liujuan Cao. FastVGGT: Training-free acceleration of visual geometry transformer. In International Conference on Learning Representations (ICLR), 2026. arXiv:2509.02560

  55. [55]

    Scene coordinate regression forests for camera relocalization in RGB-D images

    Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. Scene coordinate regression forests for camera relocalization in RGB-D images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013

  56. [56]

    A benchmark for the evaluation of RGB-D SLAM systems

    Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2012

  57. [57]

    LoFTR: Detector-free local feature matching with transformers

    Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. LoFTR: Detector-free local feature matching with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021

  58. [58]

    KV-Tracker: Real-time pose tracking with transformers

    Marwan Taher, Ignacio Alzugaray, Kirill Mazur, Xin Kong, and Andrew J. Davison. KV-Tracker: Real-time pose tracking with transformers. arXiv preprint arXiv:2512.22581, 2025

  59. [59]

    DROID-SLAM: Deep visual SLAM for monocular, stereo, and RGB-D cameras

    Zachary Teed and Jia Deng. DROID-SLAM: Deep visual SLAM for monocular, stereo, and RGB-D cameras. In Advances in Neural Information Processing Systems (NeurIPS), 2021

  60. [60]

    Deep patch visual odometry

    Zachary Teed, Lahav Lipson, and Jia Deng. Deep patch visual odometry. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  61. [61]

    AMB3R: Accurate feed-forward metric-scale 3D reconstruction with backend

    Hengyi Wang and Lourdes Agapito. AMB3R: Accurate feed-forward metric-scale 3D reconstruction with backend. arXiv preprint arXiv:2511.20343, 2025

  62. [62]

    3D reconstruction with spatial memory

    Hengyi Wang and Lourdes Agapito. 3D reconstruction with spatial memory. In International Conference on 3D Vision (3DV), 2025

  63. [63]

    VGGSfM: Visual geometry grounded deep structure from motion

    Jianyuan Wang, Nikita Karaev, Christian Rupprecht, and David Novotny. VGGSfM: Visual geometry grounded deep structure from motion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  64. [64]

    VGGT: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  65. [65]

    Continuous 3D perception model with persistent state

    Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A. Efros, and Angjoo Kanazawa. Continuous 3D perception model with persistent state. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  66. [66]

    DUSt3R: Geometric 3D vision made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D vision made easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  67. [67]

    TartanAir: A dataset to push the limits of visual SLAM

    Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu, Yuheng Qiu, Chen Wang, Yafei Hu, Ashish Kapoor, and Sebastian Scherer. TartanAir: A dataset to push the limits of visual SLAM. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020

  68. [68]

    Efficient LoFTR: Semi-dense local feature matching with sparse-like speed

    Yifan Wang, Xingyi He, Sida Peng, Dongli Tan, and Xiaowei Zhou. Efficient LoFTR: Semi-dense local feature matching with sparse-like speed. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  69. [69]

    $\pi^3$: Permutation-Equivariant Visual Geometry Learning

    Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. $\pi^3$: Permutation-equivariant visual geometry learning. In International Conference on Learning Representations (ICLR), 2026. arXiv:2507.13347

  70. [70]

    Point3R: Streaming 3D reconstruction with explicit spatial pointer memory

    Yuqi Wu, Wenzhao Zheng, Jie Zhou, and Jiwen Lu. Point3R: Streaming 3D reconstruction with explicit spatial pointer memory. In Advances in Neural Information Processing Systems (NeurIPS), 2025

  71. [71]

    RGBD objects in the wild: Scaling real-world 3D object learning from RGB-D videos

    Hongchi Xia, Yang Fu, Sifei Liu, and Xiaolong Wang. RGBD objects in the wild: Scaling real-world 3D object learning from RGB-D videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  72. [72]

    Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction

    Tao Xie, Peishan Yang, Yudong Jin, Yingfeng Cai, Wei Yin, Weiqiang Ren, Qian Zhang, Wei Hua, Sida Peng, Xiaoyang Guo, and Xiaowei Zhou. Scal3R: Scalable test-time training for large-scale 3D reconstruction. arXiv preprint arXiv:2604.08542, 2026

  73. [73]

    Fast3R: Towards 3D reconstruction of 1000+ images in one forward pass

    Jianing Yang, Alexander Sax, Kevin J. Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3R: Towards 3D reconstruction of 1000+ images in one forward pass. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  74. [74]

    MVSNet: Depth inference for unstructured multi-view stereo

    Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. MVSNet: Depth inference for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2018

  75. [75]

    ScanNet++: A high-fidelity dataset of 3D indoor scenes

    Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. ScanNet++: A high-fidelity dataset of 3D indoor scenes. In IEEE/CVF International Conference on Computer Vision (ICCV), 2023

  76. [76]

    InfiniteVGGT: Visual geometry grounded transformer for endless streams

    Shuai Yuan, Yantai Yang, Xiaotian Yang, Xupeng Zhang, Zhonghao Zhao, Lingming Zhang, and Zhipeng Zhang. InfiniteVGGT: Visual geometry grounded transformer for endless streams. arXiv preprint arXiv:2601.02281, 2026

  77. [77]

    MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion

    Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. MonST3R: A simple approach for estimating geometry in the presence of motion. In International Conference on Learning Representations (ICLR), 2025. arXiv:2410.03825

  78. [78]

    LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory

    Junyi Zhang, Charles Herrmann, Junhwa Hur, Chen Sun, Ming-Hsuan Yang, Forrester Cole, Trevor Darrell, and Deqing Sun. LoGeR: Long-context geometric reconstruction with hybrid memory. arXiv preprint arXiv:2603.03269, 2026

  79. [79]

    FLARE: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse views

    Shangzhan Zhang, Jianyuan Wang, Yinghao Xu, Nan Xue, Christian Rupprecht, Xiaowei Zhou, Yujun Shen, and Gordon Wetzstein. FLARE: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse views. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21936–21947, 2025

  80. [80]

    OmniWorld: A multi-domain and multi-modal dataset for 4D world modeling

    Yang Zhou, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Haoyu Guo, Zizun Li, Kaijing Ma, Xinyue Li, Yating Wang, Haoyi Zhu, Mingyu Liu, Dingning Liu, Jiange Yang, Zhoujie Fu, Junyi Chen, Chunhua Shen, Jiangmiao Pang, Kaipeng Zhang, and Tong He. OmniWorld: A multi-domain and multi-modal dataset for 4D world modeling. arXiv preprint arXiv:2509.12201, 2025

Showing first 80 references.