UniT: Unified Geometry Learning with Group Autoregressive Transformer

Gang Hua; Haotian Wang; Hongliang Lu; Meng Yang; Xinhu Zheng; Yusong Huang; Zhaonian Kuang

arxiv: 2605.21131 · v1 · pith:X6GM547Fnew · submitted 2026-05-20 · 💻 cs.CV


Haotian Wang, Yusong Huang, Zhaonian Kuang, Hongliang Lu, Xinhu Zheng, Meng Yang, Gang Hua

Pith reviewed 2026-05-21 04:48 UTC · model grok-4.3

classification 💻 cs.CV

keywords: unified geometry perception, group autoregressive transformer, 3D reconstruction, online perception, offline reconstruction, metric scale estimation, point maps, KV cache


The pith

A Group Autoregressive Transformer unifies online and offline 3D geometry perception in one model by processing observation groups as autoregressive units.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to replace multiple incompatible systems for 3D geometry perception with a single feed-forward model that can perform real-time online tasks, full offline reconstruction, multi-modal fusion, long-sequence handling, and metric-scale output. It does so by defining groups of sensor frames as the basic autoregressive unit and generating corresponding point maps without anchors or fixed scales. Online operation steps through single-frame groups over successive passes while offline operation consumes an entire multi-frame group at once; a queue-style cache discards early-frame memory on the fly to keep computation bounded. A scale-adaptive loss blends relative geometry constraints with a partial absolute term so that the same network transitions toward metric accuracy across scenes.

Core claim

UniT reformulates online perception, offline reconstruction, multi-modal integration, long-horizon scalability, and metric-scale estimation inside one framework. Groups of sensor observations serve as the autoregressive units; point maps are predicted in an anchor-free, scale-adaptive manner. Online mode runs multiple steps on single-frame groups while offline mode consumes a multi-frame group in one pass. A queue-style KV cache keeps memory bounded by discarding outdated early-frame dependencies, and a scale-adaptive geometry loss couples relative constraints with a partial absolute scale term to improve metric generalization.
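
As an illustration of how group size alone could switch between the two modes, the sketch below partitions frame indices into consecutive groups. The helper name `make_groups` and the consecutive, non-overlapping partition are assumptions made for this example; the summary above only states that online mode uses single-frame groups and offline mode uses one multi-frame group.

```python
from typing import List


def make_groups(num_frames: int, group_size: int) -> List[List[int]]:
    """Split frame indices 0..num_frames-1 into consecutive groups.

    Hypothetical helper: each returned group would be one autoregressive step.
    """
    if num_frames < 0 or group_size < 1:
        raise ValueError("num_frames must be >= 0 and group_size >= 1")
    return [
        list(range(start, min(start + group_size, num_frames)))
        for start in range(0, num_frames, group_size)
    ]


# Online-style use: one frame per group, so four autoregressive steps.
online_groups = make_groups(4, 1)   # [[0], [1], [2], [3]]

# Offline-style use: all frames in one group, so a single forward pass.
offline_groups = make_groups(4, 4)  # [[0, 1, 2, 3]]
```

Intermediate group sizes (for example `make_groups(5, 2)`) would give a mix of the two regimes, which is the kind of setting the page's later ablation discussion of group size refers to.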

What carries the argument

The Group Autoregressive Transformer, which treats groups of observations as the basic autoregressive unit to predict point maps while allowing group size to switch between online and offline modes.

If this is right

Online perception runs over multiple autoregressive steps using single-frame groups.
Offline reconstruction consumes an entire multi-frame group in one forward pass.
Long sequences remain feasible because the KV cache discards early-frame memory on the fly.
Metric-scale estimates improve across scenes through the scale-adaptive loss that mixes relative and partial absolute terms.
Auxiliary modalities integrate via a dedicated modal attention module without changing the core pipeline.
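
The bounded-memory point in the list above can be pictured as a fixed-capacity FIFO queue. The following is a minimal sketch, not the paper's implementation: the class name `QueueKVCache`, the choice to count capacity in groups, and the string payloads standing in for key/value tensors are all hypothetical.

```python
from collections import deque
from typing import Any, Deque, List, Tuple


class QueueKVCache:
    """Fixed-capacity FIFO cache: appending beyond capacity drops the oldest entry."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be >= 1")
        # deque(maxlen=...) discards items from the opposite end automatically.
        self._entries: Deque[Tuple[int, Any]] = deque(maxlen=capacity)

    def append(self, group_id: int, kv: Any) -> None:
        """Store the keys/values produced for one group."""
        self._entries.append((group_id, kv))

    def group_ids(self) -> List[int]:
        """Groups whose keys/values are still available to attend to."""
        return [group_id for group_id, _ in self._entries]

    def __len__(self) -> int:
        return len(self._entries)


cache = QueueKVCache(capacity=3)
for step in range(6):
    cache.append(step, f"kv-for-group-{step}")  # placeholder payload

remaining = cache.group_ids()  # [3, 4, 5]: groups 0..2 were evicted
```

Memory stays constant in the sequence length because at most `capacity` entries are ever held; whether accuracy survives the eviction is the empirical question the page raises further down.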

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

A single trained network could replace separate real-time and batch pipelines in robotics or mapping applications.
The same grouping idea might extend to other dense prediction tasks such as optical flow or semantic segmentation.
Further tests on sequences longer than the reported benchmarks would expose any hidden limits of the cache eviction rule.
If the scale-adaptive loss generalizes, similar partial-absolute terms could be added to other geometry networks to reduce the need for per-scene calibration.

Load-bearing premise

That varying group size at inference time together with the queue-style KV cache and scale-adaptive loss will keep accuracy and scalability intact across all seven tasks without task-specific retraining or post-processing.

What would settle it

Measure reconstruction accuracy on a long outdoor sequence when the model is forced to alternate between single-frame and multi-frame groups without any fine-tuning; a clear drop relative to separate task-specific baselines would falsify the unification claim.

Figures

Figures reproduced from arXiv: 2605.21131 by Gang Hua, Haotian Wang, Hongliang Lu, Meng Yang, Xinhu Zheng, Yusong Huang, Zhaonian Kuang.

**Figure 1.** We present UniT, a unified feed-forward model that reformulates a wide range of geometry perception capabilities into a single framework, covering diverse view configurations, modality combinations, metric-scale perception, and long-horizon scalability. It supports both online and offline inference over an arbitrary number of views, flexibly incorporates auxiliary modalities such as camera parameters and d…

**Figure 2.** Four representative paradigms for geometry perception: (a) CUT3R …

**Figure 3.** Architecture Overview. Image groups are first patchified into tokens using DINO [38]. These tokens are then fused with tokens encoded from optional modalities through a modal attention layer, followed by frame attention and global attention layers. In global attention, bidirectional attention operates within each group, while causal attention is applied across groups. The fused tokens are finally decoded i…
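
The masking rule described in the Figure 3 caption (bidirectional attention within a group, causal attention across groups) can be sketched as a boolean mask. This is an illustrative construction under stated assumptions: `group_causal_mask` is a hypothetical helper, `True` is taken to mean "the query token may attend to the key token", and `tokens_per_frame` is a made-up parameter for the example.

```python
from typing import List


def group_causal_mask(group_sizes: List[int], tokens_per_frame: int = 1) -> List[List[bool]]:
    """Build an attention mask where mask[i][j] is True if token i may attend to token j.

    Tokens in the same group see each other (bidirectional); tokens in a later
    group also see all earlier groups (causal across groups), never later ones.
    """
    group_of_token: List[int] = []
    for group_index, frames_in_group in enumerate(group_sizes):
        group_of_token.extend([group_index] * (frames_in_group * tokens_per_frame))

    n = len(group_of_token)
    return [
        [group_of_token[key] <= group_of_token[query] for key in range(n)]
        for query in range(n)
    ]


# Three single-frame groups: the mask reduces to an ordinary causal (lower-triangular) mask.
online_mask = group_causal_mask([1, 1, 1])

# One three-frame group: every token attends to every other token.
offline_mask = group_causal_mask([3])
```

The two extremes show why one attention layer could serve both modes: the group layout changes the mask, not the network.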

**Figure 4.** Illustration of three types of attention mask, including (a) null mask, …

**Figure 7.** Illustration of two types of normal $n_i$: (a) Regular normal, computed on local surfaces within each frame to enforce local geometric consistency; (b) Shuffled normal, constructed on randomly formed virtual surfaces across frames to encourage global consistency. Points sharing the same color denote pixels originating from the same frame.

**Figure 8.** Qualitative results on multi-view reconstruction. All point clouds are presented in their raw form, without any alignment or filtering. Point clouds …

**Figure 9.** Long-horizon perception on the NRGBD dataset. The top plot shows …

**Figure 10.** Ablation study of KV-cache queue capacity and group size for camera …

Original abstract

Recent feed-forward models have significantly advanced geometry perception for inferring dense 3D structure from sensor observations. However, its essential capabilities remain fragmented across multiple incompatible paradigms, including online perception, offline reconstruction, multi-modal integration, long-horizon scalability, and metric-scale estimation. We present UniT, a unified model built upon a novel Group Autoregressive Transformer, which reformulates these seemingly disparate capabilities within a single framework. The key idea is to treat groups of sensor observations as the basic autoregressive units and predict the corresponding point maps in an anchor-free and scale-adaptive manner. More specifically, diverse view configurations in both online and offline settings are naturally unified within a single group autoregression process. By varying the group size, online mode operates over multiple autoregressive steps with single-frame groups, whereas offline mode aggregates a multi-frame group in a single forward pass. Meanwhile, a queue-style KV caching mechanism ensures bounded autoregressive memory over long horizons. This is enabled by reducing long-range dependencies on early frames through anchor-free relational modeling, thereby allowing outdated memory to be discarded on the fly. To improve metric-scale generalization across scenes, a scale-adaptive geometry loss is further introduced within this framework. It couples relative geometric constraints with a partial absolute scale term, implicitly regularizing global scale and inducing a progressive transition from scale-invariant geometry to metric-scale solutions. Together with a dedicated modal attention module for integrating auxiliary modalities, UniT achieves state-of-the-art performance in unified geometry perception, as validated on ten benchmarks spanning seven representative tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

UniT unifies online and offline 3D geometry via group autoregression and a queue cache, but the claim that anchor-free modeling safely allows memory eviction lacks direct supporting evidence.


The main point is that this paper puts online perception and offline reconstruction into one autoregressive transformer by treating frame groups as the basic unit and switching modes through group size alone. The Group Autoregressive Transformer with anchor-free point map prediction, the scale-adaptive loss that mixes relative constraints and a partial absolute term, and the modal attention module for extra inputs are the concrete new pieces. These choices let the model run single-frame groups step by step for online use or collapse a multi-frame group into one pass for offline use, while the queue KV cache is supposed to bound memory by dropping early frames.

The unification angle is practical for robotics and AR setups that currently keep separate pipelines. The architecture description and loss formulation read cleanly and address the fragmentation problem directly. The paper reports SOTA numbers across ten benchmarks and seven tasks, which suggests the experiments were run at reasonable scale.

The soft spot is the KV cache claim. The text states that anchor-free relational modeling reduces long-range dependencies enough to evict outdated entries without accuracy loss, yet there is no ablation that measures attention weights on early versus recent frames or compares full-cache versus queue-eviction performance on extended sequences. If residual coupling remains, the online mode could degrade on long horizons without task-specific patches. The scale-adaptive loss is presented as helping metric generalization, but its effect size across scenes would need the full tables to judge.

This paper is for people working on integrated 3D perception systems rather than narrow single-task improvements. A reader who cares about autoregressive designs for geometry would get concrete implementation ideas from the group and cache mechanics. It deserves a serious referee because the unification goal is substantive and the architecture is specified enough to test, even if the dependency-reduction assumption needs tighter validation.

Referee Report

2 major / 2 minor

Summary. The paper introduces UniT, a unified model for geometry perception tasks based on a novel Group Autoregressive Transformer. It reformulates online perception, offline reconstruction, multi-modal fusion, long-horizon scalability, and metric-scale estimation into a single framework by treating groups of sensor observations as autoregressive units and predicting anchor-free, scale-adaptive point maps. Online mode uses single-frame groups over multiple steps while offline aggregates multi-frame groups in one pass; a queue-style KV cache bounds memory by discarding early frames (enabled by reduced long-range dependencies via anchor-free modeling), and a scale-adaptive loss couples relative constraints with a partial absolute scale term. A modal attention module integrates auxiliary inputs. The model claims SOTA results across ten benchmarks spanning seven tasks.

Significance. If the empirical claims hold, the work would be significant for consolidating fragmented geometry perception paradigms into one architecture, potentially simplifying real-world deployment in robotics and AR where both online and offline operation are required. The anchor-free design and scale-adaptive loss target practical challenges in long-horizon inference and metric generalization; the provision of a single model supporting variable group sizes without task-specific retraining would represent a meaningful advance over prior specialized feed-forward approaches.

major comments (2)

[Abstract / §3] Abstract and §3 (Group Autoregressive Transformer description): The central unification claim for long-horizon online inference rests on the queue-style KV cache safely evicting early-frame entries without accuracy loss, justified by anchor-free relational modeling reducing long-range dependencies. However, no ablation is presented that quantifies attention scores to early versus recent frames or directly compares full-cache versus queue-eviction performance on extended sequences across the seven tasks. This is load-bearing for the scalability and online-mode claims; residual coupling would require task-specific fixes that undermine the single-framework assertion.
[§4 / scale-adaptive loss] §4 (Experiments) and scale-adaptive loss definition: The scale-adaptive geometry loss is described as coupling relative constraints with a partial absolute scale term to induce progressive transition to metric-scale solutions, yet the exact weighting schedule, the definition of the partial absolute term, and its interaction with the group autoregression process lack sufficient detail or sensitivity analysis. This affects the metric-scale generalization claim across scenes and benchmarks.
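
To make the shape of such a loss concrete, here is a deliberately simple sketch of a relative term combined with a weighted absolute-scale term. It is not the paper's formulation, which this page does not reproduce. Everything specific is an assumption chosen for illustration: the mean point norm as the scale of a point map, an L1 relative term on scale-normalized points, a log-ratio absolute term, and the reading of "partial" as "applied only to samples that carry metric ground truth" (`has_metric_gt`).

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def mean_norm(points: Sequence[Point]) -> float:
    """Average Euclidean norm of a point set, used here as its scale (assumption)."""
    return sum(math.sqrt(x * x + y * y + z * z) for x, y, z in points) / len(points)


def scale_adaptive_loss(
    pred: Sequence[Point],
    gt: Sequence[Point],
    abs_weight: float,
    has_metric_gt: bool = True,
) -> float:
    """Relative (scale-normalized) L1 error plus a weighted absolute-scale penalty."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and the same length")
    s_pred, s_gt = mean_norm(pred), mean_norm(gt)
    if s_pred <= 0.0 or s_gt <= 0.0:
        raise ValueError("point sets must have non-zero scale")

    # Relative term: compare shapes after dividing out each point set's own scale.
    relative = sum(
        abs(p[k] / s_pred - g[k] / s_gt) for p, g in zip(pred, gt) for k in range(3)
    ) / len(pred)

    # Absolute term: penalize the scale ratio, only where metric ground truth exists.
    absolute = abs(math.log(s_pred / s_gt)) if has_metric_gt else 0.0

    return relative + abs_weight * absolute
```

With `abs_weight = 0` a uniformly rescaled prediction incurs no loss; raising the weight penalizes the scale mismatch. A sensitivity analysis of the kind the referee asks for would sweep this weight.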

minor comments (2)

[Abstract] The abstract asserts SOTA performance on ten benchmarks but does not include any quantitative numbers, error bars, or comparison tables; these should be summarized in the abstract or a dedicated results overview for immediate verifiability.
[§3] Notation for the modal attention module and the precise formulation of the group autoregressive process (e.g., how group size is varied at inference) could be clarified with a compact equation or pseudocode to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight important aspects of our claims on long-horizon scalability and the scale-adaptive loss. We address each point below and have revised the manuscript to incorporate additional details and experiments.


Referee: [Abstract / §3] Abstract and §3 (Group Autoregressive Transformer description): The central unification claim for long-horizon online inference rests on the queue-style KV cache safely evicting early-frame entries without accuracy loss, justified by anchor-free relational modeling reducing long-range dependencies. However, no ablation is presented that quantifies attention scores to early versus recent frames or directly compares full-cache versus queue-eviction performance on extended sequences across the seven tasks. This is load-bearing for the scalability and online-mode claims; residual coupling would require task-specific fixes that undermine the single-framework assertion.

Authors: We agree that direct empirical validation strengthens the long-horizon claims. Section 3 explains that anchor-free point map prediction focuses on local relational geometry, thereby attenuating long-range dependencies and permitting safe eviction of early frames via the queue-style KV cache. To address the request, the revised manuscript includes new ablations: (i) average attention scores to early versus recent frames across sequence lengths for representative tasks, and (ii) performance comparison of full-cache versus queue-eviction modes on extended sequences spanning the seven tasks. These results show negligible degradation upon eviction, supporting the unified framework without task-specific adjustments.
revision: yes
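
One way a diagnostic like ablation (i) could be computed is sketched below: given the attention weight a query assigns to each cached frame (oldest first), report the fraction that falls on the earliest frames. The function name `early_attention_mass` and the per-frame aggregation of weights are assumptions for this example; the response above does not specify how the scores are measured.

```python
from typing import Sequence


def early_attention_mass(weights_per_frame: Sequence[float], num_early: int) -> float:
    """Fraction of total attention weight assigned to the first `num_early` frames.

    `weights_per_frame` is ordered oldest to newest; weights need not sum to 1.
    """
    if not 0 <= num_early <= len(weights_per_frame):
        raise ValueError("num_early must lie between 0 and the number of frames")
    total = sum(weights_per_frame)
    if total <= 0.0:
        raise ValueError("attention weights must have positive total mass")
    return sum(weights_per_frame[:num_early]) / total


# Hypothetical weights over four cached frames, oldest first.
mass_on_two_oldest = early_attention_mass([0.05, 0.05, 0.2, 0.7], num_early=2)
```

A small value for long sequences would be consistent with the eviction argument; a large one would indicate residual dependence on early frames.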
Referee: [§4 / scale-adaptive loss] §4 (Experiments) and scale-adaptive loss definition: The scale-adaptive geometry loss is described as coupling relative constraints with a partial absolute scale term to induce progressive transition to metric-scale solutions, yet the exact weighting schedule, the definition of the partial absolute term, and its interaction with the group autoregression process lack sufficient detail or sensitivity analysis. This affects the metric-scale generalization claim across scenes and benchmarks.

Authors: We thank the referee for noting the need for greater precision. The original description in §4 states that the loss couples relative geometric constraints with a partial absolute scale term to induce progressive transition to metric-scale solutions. In the revision we have added: the exact mathematical definition of the partial absolute term, the weighting schedule (linear ramp from 0 to 1 over the first 50% of training epochs), and its per-group application within the autoregressive process. We also include a sensitivity study varying the absolute-term coefficient and reporting metric-scale errors across scenes and benchmarks, confirming improved generalization.
revision: yes
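
The weighting schedule quoted in this simulated response (a linear ramp from 0 to 1 over the first 50% of training epochs) is simple enough to write down. The sketch below assumes zero-based epoch indices and a weight held at 1 after the ramp; the function name `absolute_term_weight` is hypothetical.

```python
def absolute_term_weight(epoch: int, total_epochs: int, ramp_fraction: float = 0.5) -> float:
    """Linear ramp from 0 to 1 over the first `ramp_fraction` of training, then constant 1."""
    if total_epochs < 1 or not 0.0 < ramp_fraction <= 1.0:
        raise ValueError("total_epochs must be >= 1 and ramp_fraction in (0, 1]")
    ramp_epochs = ramp_fraction * total_epochs
    # Clamp so that negative or late epochs stay inside [0, 1].
    return min(1.0, max(0.0, epoch / ramp_epochs))


# For a 100-epoch run: 0.0 at epoch 0, 0.5 at epoch 25, 1.0 from epoch 50 onward.
schedule = [absolute_term_weight(e, 100) for e in (0, 25, 50, 80)]
```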

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained architectural design


The paper introduces UniT as a new Group Autoregressive Transformer architecture that unifies geometry perception tasks by treating sensor observation groups as autoregressive units, employing anchor-free relational modeling to enable queue-style KV caching, and adding a scale-adaptive loss that couples relative constraints with partial absolute scale. These are presented as explicit design choices and modeling decisions rather than derivations that reduce to fitted parameters, self-definitions, or prior self-citations. No equations or claims equate predictions to inputs by construction, and the unification across online/offline modes via group size variation is justified structurally without reducing to the same data fits. The central claims rest on the model's novel components and empirical validation across benchmarks, remaining independent of the input quantities.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 2 invented entities

The central claim rests on the assumption that a single autoregressive process with variable group size can replace separate online and offline pipelines without performance loss, plus the effectiveness of the partial absolute scale term in the loss.

free parameters (2)

group size
Chosen per mode (single-frame vs multi-frame) to switch between online and offline behavior.
partial absolute scale term weight
Introduced in the scale-adaptive geometry loss to regularize global scale.

axioms (1)

domain assumption: Transformer attention and KV caching can be applied to geometric point-map prediction without breaking relational consistency across frames.
Invoked when describing the queue-style KV caching and anchor-free relational modeling.

invented entities (2)

Group Autoregressive Transformer (no independent evidence)
purpose: Core architecture that treats observation groups as autoregressive units.
New model component introduced to unify the tasks.
modal attention module (no independent evidence)
purpose: Integrates auxiliary modalities such as depth or IMU.
Dedicated module for multi-modal input.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: treat groups of sensor observations as the basic autoregressive units and predict the corresponding point maps in an anchor-free and scale-adaptive manner... queue-style KV caching mechanism ensures bounded autoregressive memory... reducing long-range dependencies on early frames through anchor-free relational modeling

IndisputableMonolith/Foundation/DimensionForcing.lean · reality_from_one_distinction · unclear
Paper passage: By varying the group size, online mode operates over multiple autoregressive steps with single-frame groups, whereas offline mode aggregates a multi-frame group in a single forward pass

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

80 extracted references · 80 canonical work pages · 15 internal anchors

[1] M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi et al., “Openvla: An open-source vision-language-action model,” arXiv preprint arXiv:2406.09246, 2024.
[2] F. Arena, M. Collotta, G. Pau, and F. Termine, “An overview of augmented reality,” Computers, vol. 11, no. 2, p. 28, 2022.
[3] L. Chen, P. Wu, K. Chitta, B. Jaeger, A. Geiger, and H. Li, “End-to-end autonomous driving: Challenges and frontiers,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
[4] J. L. Schönberger and J.-M. Frahm, “Structure-from-motion revisited,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[5] C. Campos, R. Elvira, J. J. G. Rodríguez, J. M. Montiel, and J. D. Tardós, “Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam,” IEEE Transactions on Robotics, vol. 37, no. 6, pp. 1874–1890, 2021.
[6] S. Wang, V. Leroy, Y. Cabon, B. Chidlovskii, and J. Revaud, “Dust3r: Geometric 3d vision made easy,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 20697–20709.
[7] D. Zhuo, W. Zheng, J. Guo, Y. Wu, J. Zhou, and J. Lu, “Streaming 4d visual geometry transformer,” arXiv preprint arXiv:2507.11539, 2025.
[8] Y. Wang, J. Zhou, H. Zhu, W. Chang, Y. Zhou, Z. Li, J. Chen, J. Pang, C. Shen, and T. He, “$\pi^3$: Permutation-equivariant visual geometry learning,” arXiv preprint arXiv:2507.13347, 2025.
[9] W. Jang, P. Weinzaepfel, V. Leroy, L. Agapito, and J. Revaud, “Pow3r: Empowering unconstrained 3d reconstruction with camera and scene priors,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 1071–1081.
[10] X. Chen, Y. Chen, Y. Xiu, A. Geiger, and A. Chen, “Ttt3r: 3d reconstruction as test-time training,” arXiv preprint arXiv:2509.26645, 2025.
[11] H. Lin, S. Chen, J. Liew, D. Y. Chen, Z. Li, G. Shi, J. Feng, and B. Kang, “Depth anything 3: Recovering the visual space from any views,” 2025.
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (UNDER REVIEW)
[12] Q. Wang, Y. Zhang, A. Holynski, A. A. Efros, and A. Kanazawa, “Continuous 3d perception model with persistent state,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 10510–10522.
[13] J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny, “Vggt: Visual geometry grounded transformer,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 5294–5306.
[14] N. Keetha, N. Müller, J. Schönberger, L. Porzi, Y. Zhang, T. Fischer, A. Knapitsch, D. Zauss, E. Weber, N. Antunes et al., “Mapanything: Universal feed-forward metric 3d reconstruction,” arXiv preprint arXiv:2509.13414, 2025.
[15] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), 2019, pp. 4171–4186.
[16] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.
[17] Z. Li, W. Wang, H. Li, E. Xie, C. Sima, T. Lu, Q. Yu, and J. Dai, “Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
[18] S. Yuan, Y. Yang, X. Yang, X. Zhang, Z. Zhao, L. Zhang, and Z. Zhang, “Infinitevggt: Visual geometry grounded transformer for endless streams,” arXiv preprint arXiv:2601.02281, 2026.
[19] Z. Wang and D. Xu, “Flashvggt: Efficient and scalable visual geometry transformers with compressed descriptor attention,” arXiv preprint arXiv:2512.01540, 2025.
[20] Y. Shen, Z. Zhang, Y. Qu, X. Zheng, J. Ji, S. Zhang, and L. Cao, “Fastvggt: Training-free acceleration of visual geometry transformer,” arXiv preprint arXiv:2509.02560, 2025.
[21] R. Ranftl, K. Lasinger, D. Hafner, K. Schindler, and V. Koltun, “Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 3, pp. 1623–1637, 2020.
[22] R. Wang, S. Xu, Y. Dong, Y. Deng, J. Xiang, Z. Lv, G. Sun, X. Tong, and J. Yang, “Moge-2: Accurate monocular geometry with metric scale and sharp details,” arXiv preprint arXiv:2507.02546, 2025.
[23] L. Piccinelli, C. Sakaridis, Y.-H. Yang, M. Segu, S. Li, W. Abbeloos, and L. Van Gool, “Unidepthv2: Universal monocular metric depth estimation made simpler,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
[24] H. Wang, M. Yang, and N. Zheng, “G2-monodepth: A general framework of generalized depth inference from monocular rgb+x data,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 5, pp. 3753–3771, 2023.
[25] D. Nistér, O. Naroditsky, and J. Bergen, “Visual odometry,” in Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), vol. 1. IEEE, 2004, pp. I–I.
[26] W. Hu, X. Gao, X. Li, S. Zhao, X. Cun, Y. Zhang, L. Quan, and Y. Shan, “Depthcrafter: Generating consistent long depth sequences for open-world videos,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 2005–2015.
[27] D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,” Advances in Neural Information Processing Systems, vol. 27, 2014.
[28] V. Leroy, Y. Cabon, and J. Revaud, “Grounding image matching in 3d with mast3r,” in European Conference on Computer Vision. Springer, 2024, pp. 71–91.
[29] R. Murai, E. Dexheimer, and A. J. Davison, “Mast3r-slam: Real-time dense slam with 3d reconstruction priors,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 16695–16705.
[30] B. P. Duisterhof, L. Zust, P. Weinzaepfel, V. Leroy, Y. Cabon, and J. Revaud, “Mast3r-sfm: a fully-integrated solution for unconstrained structure-from-motion,” in 2025 International Conference on 3D Vision (3DV). IEEE, 2025, pp. 1–10.
[31] J. Yang, A. Sax, K. J. Liang, M. Henaff, H. Tang, A. Cao, J. Chai, F. Meier, and M. Feiszli, “Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 21924–21935.
[32] H. Wang and L. Agapito, “3d reconstruction with spatial memory,” in 2025 International Conference on 3D Vision (3DV). IEEE, 2025, pp. 78–89.
[33] Y. Wu, W. Zheng, J. Zhou, and J. Lu, “Point3r: Streaming 3d reconstruction with explicit spatial pointer memory,” arXiv preprint arXiv:2507.02863, 2025.
[34] Y. Liu, Z. Min, Z. Wang, J. Wu, T. Wang, Y. Yuan, Y. Luo, and C. Guo, “Worldmirror: Universal 3d world reconstruction with any-prior prompting,” arXiv preprint arXiv:2510.10726, 2025.
[35] H. Peng, H. Li, Y. Dai, Y. Lan, Y. Luo, T. Qi, Z. Zhang, Y. Zhan, J. Zhang, W. Xu et al., “Omnivggt: Omni-modality driven visual geometry grounded,” arXiv preprint arXiv:2511.10560, 2025.
[36]

R. Khafizov, A. Komarichev, R. Rakhimov, P. Wonka, and E. Burnaev, “G-cut3r: Guided 3d reconstruction with camera and depth prior integration,” arXiv preprint arXiv:2508.11379, 2025.

[37]

K. Deng, Z. Ti, J. Xu, J. Yang, and J. Xie, “Vggt-long: Chunk it, loop it, align it–pushing vggt’s limits on kilometer-scale long rgb sequences,” arXiv preprint arXiv:2507.16443, 2025.

[38]

M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby et al., “Dinov2: Learning robust visual features without supervision,” arXiv preprint arXiv:2304.07193, 2023.

[39]

R. Ranftl, A. Bochkovskiy, and V. Koltun, “Vision transformers for dense prediction,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 12 179–12 188.

[40]

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.

[41]

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.

[42]

H. Wang, M. Yang, X. Zheng, and G. Hua, “Scale propagation network for generalizable depth completion,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.

[43]

H. Lin, S. Peng, J. Chen, S. Peng, J. Sun, M. Liu, H. Bao, J. Feng, X. Zhou, and B. Kang, “Prompting depth anything for 4k resolution accurate metric depth estimation,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 17 070–17 080.

[44]

C. Yeshwanth, Y.-C. Liu, M. Nießner, and A. Dai, “Scannet++: A high-fidelity dataset of 3d indoor scenes,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 12–22.

[45]

G. Baruch, Z. Chen, A. Dehghan, T. Dimry, Y. Feigin, P. Fu, T. Gebauer, B. Joffe, D. Kurz, A. Schwartz et al., “Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data,” arXiv preprint arXiv:2111.08897, 2021.

[46]

A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, “Scannet: Richly-annotated 3d reconstructions of indoor scenes,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 5828–5839.

[47]

A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y. Zhang, “Matterport3d: Learning from rgb-d data in indoor environments,” arXiv preprint arXiv:1709.06158, 2017.

[48]

N. Karaev, I. Rocco, B. Graham, N. Neverova, A. Vedaldi, and C. Rupprecht, “Dynamicstereo: Consistent dynamic depth from stereo videos,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 13 229–13 239.

[49]

M. Roberts, J. Ramapuram, A. Ranjan, A. Kumar, M. A. Bautista, N. Paczan, R. Webb, and J. M. Susskind, “Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 10 912–10 922.

[50]

P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine et al., “Scalability in perception for autonomous driving: Waymo open dataset,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 2446–2454.

[51]

E. Arnold, J. Wynn, S. Vicente, G. Garcia-Hernando, A. Monszpart, V. Prisacariu, D. Turmukhambetov, and E. Brachmann, “Map-free visual relocalization: Metric pose relative to a single image,” in European Conference on Computer Vision. Springer, 2022, pp. 690–708.

[52]

Y. Cabon, N. Murray, and M. Humenberger, “Virtual kitti 2,” arXiv preprint arXiv:2001.10773, 2020.

[53]

P.-H. Huang, K. Matzen, J. Kopf, N. Ahuja, and J.-B. Huang, “Deepmvs: Learning multi-view stereopsis,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 2821–2830.

[54]

B. Van Hoorick, R. Wu, E. Ozguroglu, K. Sargent, R. Liu, P. Tokmakov, A. Dave, C. Zheng, and C. Vondrick, “Generative camera dolly: Extreme monocular dynamic novel view synthesis,” in European Conference on Computer Vision. Springer, 2024, pp. 313–331.

[55]

K. Wang and S. Shen, “Flow-motion and depth network for monocular stereo and beyond,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 3307–3314, 2020.

[56]

Y. Li, L. Jiang, L. Xu, Y. Xiangli, Z. Wang, D. Lin, and B. Dai, “Matrixcity: A large-scale city dataset for city-scale neural rendering and beyond,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3205–3215.

[57]

M. Fonder and M. V. Droogenbroeck, “Mid-air: A multi-modal dataset for extremely low altitude drone flights,” in Conference on Computer Vision and Pattern Recognition Workshop (CVPRW), June 2019.

[58]

F. Tosi, Y. Liao, C. Schmitt, and A. Geiger, “Smd-nets: Stereo mixture density networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 8942–8952.

[59]

W. Wang, D. Zhu, X. Wang, Y. Hu, Y. Qiu, C. Wang, Y. Hu, A. Kapoor, and S. Scherer, “Tartanair: A dataset to push the limits of visual slam,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 4909–4916.

[60]

Y. Zheng, A. W. Harley, B. Shen, G. Wetzstein, and L. J. Guibas, “Pointodyssey: A large-scale synthetic dataset for long-term point tracking,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 19 855–19 865.

[61]

L. Mehl, J. Schmalfuss, A. Jahedi, Y. Nalivayko, and A. Bruhn, “Spring: A high-resolution high-detail dataset and benchmark for scene flow, optical flow and stereo,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 4981–4991.

[62]

H. Xia, Y. Fu, S. Liu, and X. Wang, “Rgbd objects in the wild: Scaling real-world 3d object learning from rgb-d videos,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 22 378–22 389.

[63]

T. Wu, J. Zhang, X. Fu, Y. Wang, J. Ren, L. Pan, W. Wu, L. Yang, J. Wang, C. Qian et al., “Omniobject3d: Large-vocabulary 3d object dataset for realistic perception, reconstruction and generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 803–814.

[64]

Z. Cai, D. Ren, A. Zeng, Z. Lin, T. Yu, W. Wang, X. Fan, Y. Gao, Y. Yu, L. Pan et al., “Humman: Multi-modal 4d human dataset for versatile sensing and modeling,” in European Conference on Computer Vision. Springer, 2022, pp. 557–577.

[65]

W. Yin, Y. Liu, and C. Shen, “Virtual normal: Enforcing geometric constraints for accurate and robust depth prediction,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 10, pp. 7282–7295, 2021.

[66]

H. Wang, A. Xiao, X. Zhang, M. Yang, and S. Lu, “Pacgdc: Label-efficient generalizable depth completion with projection ambiguity and consistency,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 7709–7720.

[67]

J. Levinson, C. Esteves, K. Chen, N. Snavely, A. Kanazawa, A. Rostamizadeh, and A. Makadia, “An analysis of svd for deep rotation estimation,” Advances in Neural Information Processing Systems, vol. 33, pp. 22 554–22 565, 2020.

[68]

Y. Zhou, C. Barnes, J. Lu, J. Yang, and H. Li, “On the continuity of rotation representations in neural networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 5745–5753.

[69]

J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and A. Fitzgibbon, “Scene coordinate regression forests for camera relocalization in rgb-d images,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2013, pp. 2930–2937.

[70]

D. Azinović, R. Martin-Brualla, D. B. Goldman, M. Nießner, and J. Thies, “Neural rgb-d surface reconstruction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 6290–6301.

[71]

R. Jensen, A. Dahl, G. Vogiatzis, E. Tola, and H. Aanæs, “Large scale multi-view stereopsis evaluation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 406–413.

[72]

D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black, “A naturalistic open source movie for optical flow evaluation,” in European conference on computer vision. Springer, 2012, pp. 611–625.

[73]

J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, “A benchmark for the evaluation of rgb-d slam systems,” in 2012 IEEE/RSJ international conference on intelligent robots and systems. IEEE, 2012, pp. 573–580.

[74]

E. Palazzolo, J. Behley, P. Lottes, P. Giguere, and C. Stachniss, “Refusion: 3d reconstruction in dynamic environments for rgb-d cameras exploiting residuals,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2019, pp. 7855–7862.

[75]

T. Schops, J. L. Schonberger, S. Galliani, T. Sattler, K. Schindler, M. Pollefeys, and A. Geiger, “A multi-view stereo benchmark with high-resolution images and multi-camera videos,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 3260–3269.

[76]

A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,” The international journal of robotics research, vol. 32, no. 11, pp. 1231–1237, 2013.

[77]

N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from rgbd images,” in European conference on computer vision. Springer, 2012, pp. 746–760.

[78]

J. Zhang, C. Herrmann, J. Hur, V. Jampani, T. Darrell, F. Cole, D. Sun, and M.-H. Yang, “Monst3r: A simple approach for estimating geometry in the presence of motion,” arXiv preprint arXiv:2410.03825, 2024.

[79]

Y. Zuo, W. Yang, Z. Ma, and J. Deng, “Omni-dc: Highly robust depth completion with multiresolution depth integration,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 9287–9297.

[80]

C. Long, W. Zhang, Z. Chen, H. Wang, Y. Liu, P. Tong, Z. Cao, Z. Dong, and B. Yang, “Sparsedc: Depth completion from sparse and non-uniform inputs,” Information Fusion, vol. 110, p. 102470, 2024.

Haotian Wang (Member, IEEE) received the Ph.D. degree from the Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, Xi’an, China, in 2025...

[29] [29]

Mast3r-slam: Real- time dense slam with 3d reconstruction priors,

R. Murai, E. Dexheimer, and A. J. Davison, “Mast3r-slam: Real- time dense slam with 3d reconstruction priors,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 16 695– 16 705

work page 2025

[30] [30]

Mast3r-sfm: a fully-integrated solution for unconstrained structure-from-motion,

B. P. Duisterhof, L. Zust, P. Weinzaepfel, V . Leroy, Y . Cabon, and J. Revaud, “Mast3r-sfm: a fully-integrated solution for unconstrained structure-from-motion,” in2025 International Conference on 3D Vision (3DV). IEEE, 2025, pp. 1–10

work page 2025

[31] [31]

Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass,

J. Yang, A. Sax, K. J. Liang, M. Henaff, H. Tang, A. Cao, J. Chai, F. Meier, and M. Feiszli, “Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 21 924–21 935

work page 2025

[32] [32]

3d reconstruction with spatial memory,

H. Wang and L. Agapito, “3d reconstruction with spatial memory,” in 2025 International Conference on 3D Vision (3DV). IEEE, 2025, pp. 78–89

work page 2025

[33] [33]

Point3r: Streaming 3d reconstruction with explicit spatial pointer memory.arXiv preprint arXiv:2507.02863, 2025

Y . Wu, W. Zheng, J. Zhou, and J. Lu, “Point3r: Streaming 3d reconstruction with explicit spatial pointer memory,”arXiv preprint arXiv:2507.02863, 2025

work page arXiv 2025

[34] [34]

Worldmirror: Universal 3d world reconstruction with any-prior prompting,

Y . Liu, Z. Min, Z. Wang, J. Wu, T. Wang, Y . Yuan, Y . Luo, and C. Guo, “Worldmirror: Universal 3d world reconstruction with any-prior prompting,”arXiv preprint arXiv:2510.10726, 2025

work page arXiv 2025

[35] [35]

Omnivggt: Omni-modality driven visual geometry grounded,

H. Peng, H. Li, Y . Dai, Y . Lan, Y . Luo, T. Qi, Z. Zhang, Y . Zhan, J. Zhang, W. Xuet al., “Omnivggt: Omni-modality driven visual geometry grounded,”arXiv preprint arXiv:2511.10560, 2025

work page arXiv 2025

[36] [36]

G-cut3r: Guided 3d reconstruction with camera and depth prior inte- gration,

R. Khafizov, A. Komarichev, R. Rakhimov, P. Wonka, and E. Burnaev, “G-cut3r: Guided 3d reconstruction with camera and depth prior inte- gration,”arXiv preprint arXiv:2508.11379, 2025

work page arXiv 2025

[37] [37]

VGGT-Long: Chunk it, Loop it, Align it -- Pushing VGGT's Limits on Kilometer-scale Long RGB Sequences

K. Deng, Z. Ti, J. Xu, J. Yang, and J. Xie, “Vggt-long: Chunk it, loop it, align it–pushing vggt’s limits on kilometer-scale long rgb sequences,” arXiv preprint arXiv:2507.16443, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[38] [38]

DINOv2: Learning Robust Visual Features without Supervision

M. Oquab, T. Darcet, T. Moutakanni, H. V o, M. Szafraniec, V . Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Noubyet al., “Dinov2: Learning robust visual features without supervision,”arXiv preprint arXiv:2304.07193, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[39] [39]

Vision transformers for dense prediction,

R. Ranftl, A. Bochkovskiy, and V . Koltun, “Vision transformers for dense prediction,” inProceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 12 179–12 188

work page 2021

[40] [40]

LLaMA: Open and Efficient Foundation Language Models

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozi `ere, N. Goyal, E. Hambro, F. Azharet al., “Llama: Open and efficient foundation language models,”arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[41] [41]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,”Advances in neural information processing systems, vol. 30, 2017

work page 2017

[42] [42]

Scale propagation network for generalizable depth completion,

H. Wang, M. Yang, X. Zheng, and G. Hua, “Scale propagation network for generalizable depth completion,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

work page 2024

[43] [43]

Prompting depth anything for 4k resolution accurate metric depth estimation,

H. Lin, S. Peng, J. Chen, S. Peng, J. Sun, M. Liu, H. Bao, J. Feng, X. Zhou, and B. Kang, “Prompting depth anything for 4k resolution accurate metric depth estimation,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 17 070–17 080

work page 2025

[44] [44]

Scannet++: A high- fidelity dataset of 3d indoor scenes,

C. Yeshwanth, Y .-C. Liu, M. Nießner, and A. Dai, “Scannet++: A high- fidelity dataset of 3d indoor scenes,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 12–22

work page 2023

[45] [45]

ARKitScenes: A Diverse Real-World Dataset For 3D Indoor Scene Understanding Using Mobile RGB-D Data

G. Baruch, Z. Chen, A. Dehghan, T. Dimry, Y . Feigin, P. Fu, T. Gebauer, B. Joffe, D. Kurz, A. Schwartzet al., “Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data,” arXiv preprint arXiv:2111.08897, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[46] [46]

Scannet: Richly-annotated 3d reconstructions of indoor scenes,

A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, “Scannet: Richly-annotated 3d reconstructions of indoor scenes,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 5828–5839

work page 2017

[47] [47]

Matterport3D: Learning from RGB-D Data in Indoor Environments

A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y . Zhang, “Matterport3d: Learning from rgb-d data in indoor environments,”arXiv preprint arXiv:1709.06158, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[48] [48]

Dynamicstereo: Consistent dynamic depth from stereo videos,

N. Karaev, I. Rocco, B. Graham, N. Neverova, A. Vedaldi, and C. Rup- precht, “Dynamicstereo: Consistent dynamic depth from stereo videos,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 13 229–13 239

work page 2023

[49] [49]

Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding,

M. Roberts, J. Ramapuram, A. Ranjan, A. Kumar, M. A. Bautista, N. Paczan, R. Webb, and J. M. Susskind, “Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding,” inProceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 10 912–10 922

work page 2021

[50] [50]

Scalability in perception for autonomous driving: Waymo open dataset,

P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V . Patnaik, P. Tsui, J. Guo, Y . Zhou, Y . Chai, B. Caineet al., “Scalability in perception for autonomous driving: Waymo open dataset,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 2446–2454

work page 2020

[51] [51]

Map-free visual relocalization: Metric pose relative to a single image,

E. Arnold, J. Wynn, S. Vicente, G. Garcia-Hernando, A. Monszpart, V . Prisacariu, D. Turmukhambetov, and E. Brachmann, “Map-free visual relocalization: Metric pose relative to a single image,” inEuropean Conference on Computer Vision. Springer, 2022, pp. 690–708

work page 2022

[52] [52]

Virtual KITTI 2

Y . Cabon, N. Murray, and M. Humenberger, “Virtual kitti 2,”arXiv preprint arXiv:2001.10773, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001

[53] [53]

Deepmvs: Learning multi-view stereopsis,

P.-H. Huang, K. Matzen, J. Kopf, N. Ahuja, and J.-B. Huang, “Deepmvs: Learning multi-view stereopsis,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 2821–2830

work page 2018

[54] [54]

Generative camera dolly: Extreme monocular dynamic novel view synthesis,

B. Van Hoorick, R. Wu, E. Ozguroglu, K. Sargent, R. Liu, P. Tokmakov, A. Dave, C. Zheng, and C. V ondrick, “Generative camera dolly: Extreme monocular dynamic novel view synthesis,” inEuropean Conference on Computer Vision. Springer, 2024, pp. 313–331. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (UNDER REVIEW) 15

work page 2024

[55] [55]

Flow-motion and depth network for monocular stereo and beyond,

K. Wang and S. Shen, “Flow-motion and depth network for monocular stereo and beyond,”IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 3307–3314, 2020

work page 2020

[56] [56]

Matrixcity: A large-scale city dataset for city-scale neural rendering and beyond,

Y. Li, L. Jiang, L. Xu, Y. Xiangli, Z. Wang, D. Lin, and B. Dai, “Matrixcity: A large-scale city dataset for city-scale neural rendering and beyond,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3205–3215

work page 2023

[57] [57]

Mid-air: A multi-modal dataset for extremely low altitude drone flights,

M. Fonder and M. V. Droogenbroeck, “Mid-air: A multi-modal dataset for extremely low altitude drone flights,” in Conference on Computer Vision and Pattern Recognition Workshop (CVPRW), June 2019

work page 2019

[58] [58]

Smd-nets: Stereo mixture density networks,

F. Tosi, Y. Liao, C. Schmitt, and A. Geiger, “Smd-nets: Stereo mixture density networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 8942–8952

work page 2021

[59] [59]

Tartanair: A dataset to push the limits of visual slam,

W. Wang, D. Zhu, X. Wang, Y. Hu, Y. Qiu, C. Wang, Y. Hu, A. Kapoor, and S. Scherer, “Tartanair: A dataset to push the limits of visual slam,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 4909–4916

work page 2020

[60] [60]

Pointodyssey: A large-scale synthetic dataset for long-term point tracking,

Y. Zheng, A. W. Harley, B. Shen, G. Wetzstein, and L. J. Guibas, “Pointodyssey: A large-scale synthetic dataset for long-term point tracking,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 19 855–19 865

work page 2023

[61] [61]

Spring: A high-resolution high-detail dataset and benchmark for scene flow, optical flow and stereo,

L. Mehl, J. Schmalfuss, A. Jahedi, Y. Nalivayko, and A. Bruhn, “Spring: A high-resolution high-detail dataset and benchmark for scene flow, optical flow and stereo,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 4981–4991

work page 2023

[62] [62]

Rgbd objects in the wild: Scaling real-world 3d object learning from rgb-d videos,

H. Xia, Y. Fu, S. Liu, and X. Wang, “Rgbd objects in the wild: Scaling real-world 3d object learning from rgb-d videos,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 22 378–22 389

work page 2024

[63] [63]

Omniobject3d: Large-vocabulary 3d object dataset for realistic perception, reconstruction and generation,

T. Wu, J. Zhang, X. Fu, Y. Wang, J. Ren, L. Pan, W. Wu, L. Yang, J. Wang, C. Qian et al., “Omniobject3d: Large-vocabulary 3d object dataset for realistic perception, reconstruction and generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 803–814

work page 2023

[64] [64]

Humman: Multi-modal 4d human dataset for versatile sensing and modeling,

Z. Cai, D. Ren, A. Zeng, Z. Lin, T. Yu, W. Wang, X. Fan, Y. Gao, Y. Yu, L. Pan et al., “Humman: Multi-modal 4d human dataset for versatile sensing and modeling,” in European Conference on Computer Vision. Springer, 2022, pp. 557–577

work page 2022

[65] [65]

Virtual normal: Enforcing geometric constraints for accurate and robust depth prediction,

W. Yin, Y. Liu, and C. Shen, “Virtual normal: Enforcing geometric constraints for accurate and robust depth prediction,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 10, pp. 7282–7295, 2021

work page 2021

[66] [66]

Pacgdc: Label-efficient generalizable depth completion with projection ambiguity and consistency,

H. Wang, A. Xiao, X. Zhang, M. Yang, and S. Lu, “Pacgdc: Label-efficient generalizable depth completion with projection ambiguity and consistency,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 7709–7720

work page 2025

[67] [67]

An analysis of svd for deep rotation estimation,

J. Levinson, C. Esteves, K. Chen, N. Snavely, A. Kanazawa, A. Rostamizadeh, and A. Makadia, “An analysis of svd for deep rotation estimation,” Advances in Neural Information Processing Systems, vol. 33, pp. 22 554–22 565, 2020

work page 2020

[68] [68]

On the continuity of rotation representations in neural networks,

Y. Zhou, C. Barnes, J. Lu, J. Yang, and H. Li, “On the continuity of rotation representations in neural networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 5745–5753

work page 2019

[69] [69]

Scene coordinate regression forests for camera relocalization in rgb-d images,

J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and A. Fitzgibbon, “Scene coordinate regression forests for camera relocalization in rgb-d images,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2013, pp. 2930–2937

work page 2013

[70] [70]

Neural rgb-d surface reconstruction,

D. Azinović, R. Martin-Brualla, D. B. Goldman, M. Nießner, and J. Thies, “Neural rgb-d surface reconstruction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 6290–6301

work page 2022

[71] [71]

Large scale multi-view stereopsis evaluation,

R. Jensen, A. Dahl, G. Vogiatzis, E. Tola, and H. Aanæs, “Large scale multi-view stereopsis evaluation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 406–413

work page 2014

[72] [72]

A naturalistic open source movie for optical flow evaluation,

D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black, “A naturalistic open source movie for optical flow evaluation,” in European conference on computer vision. Springer, 2012, pp. 611–625

work page 2012

[73] [73]

A benchmark for the evaluation of rgb-d slam systems,

J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, “A benchmark for the evaluation of rgb-d slam systems,” in 2012 IEEE/RSJ international conference on intelligent robots and systems. IEEE, 2012, pp. 573–580

work page 2012

[74] [74]

Refusion: 3d reconstruction in dynamic environments for rgb-d cameras exploiting residuals,

E. Palazzolo, J. Behley, P. Lottes, P. Giguere, and C. Stachniss, “Refusion: 3d reconstruction in dynamic environments for rgb-d cameras exploiting residuals,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2019, pp. 7855–7862

work page 2019

[75] [75]

A multi-view stereo benchmark with high-resolution images and multi-camera videos,

T. Schops, J. L. Schonberger, S. Galliani, T. Sattler, K. Schindler, M. Pollefeys, and A. Geiger, “A multi-view stereo benchmark with high-resolution images and multi-camera videos,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 3260–3269

work page 2017

[76] [76]

Vision meets robotics: The kitti dataset,

A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,” The international journal of robotics research, vol. 32, no. 11, pp. 1231–1237, 2013

work page 2013

[77] [77]

Indoor segmentation and support inference from rgbd images,

N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from rgbd images,” in European conference on computer vision. Springer, 2012, pp. 746–760

work page 2012

[78] [78]

MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion

J. Zhang, C. Herrmann, J. Hur, V. Jampani, T. Darrell, F. Cole, D. Sun, and M.-H. Yang, “Monst3r: A simple approach for estimating geometry in the presence of motion,” arXiv preprint arXiv:2410.03825, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[79] [79]

Omni-dc: Highly robust depth completion with multiresolution depth integration,

Y. Zuo, W. Yang, Z. Ma, and J. Deng, “Omni-dc: Highly robust depth completion with multiresolution depth integration,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 9287–9297

work page 2025

[80] [80]

Sparsedc: Depth completion from sparse and non-uniform inputs,

C. Long, W. Zhang, Z. Chen, H. Wang, Y. Liu, P. Tong, Z. Cao, Z. Dong, and B. Yang, “Sparsedc: Depth completion from sparse and non-uniform inputs,” Information Fusion, vol. 110, p. 102470, 2024.

Haotian Wang (Member, IEEE) received the Ph.D. degree from the Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, Xi’an, China, in 2025...

work page 2024