G3T Up! Gravity Aligned Coordinate Frames Simplify Pointmap Processing

Bharath Raj Nagoor Kani; Noah Snavely

arxiv: 2605.27372 · v1 · pith:PTH3FLENnew · submitted 2026-05-26 · 💻 cs.CV

G3T Up! Gravity Aligned Coordinate Frames Simplify Pointmap Processing

Bharath Raj Nagoor Kani , Noah Snavely This is my paper

Pith reviewed 2026-06-29 18:27 UTC · model grok-4.3

classification 💻 cs.CV

keywords 3D reconstructionpointmapsgravity alignmentfeed-forward modelspose estimationcoordinate framesincremental reconstruction

0 comments

The pith

Predicting pointmaps in gravity-aligned frames rather than camera-centric ones improves 3D reconstruction by sharing a common vertical axis across views.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that feed-forward 3D reconstruction should move from camera-centric pointmap predictions to upright, gravity-aligned frames. This exploits the consistent vertical direction present in many real scenes, so that pointmaps from different viewpoints already share one axis and require less rotation to align. The authors introduce G3T, a transformer fine-tuned on gravity-aligned data that outputs both upright pointmaps and camera-to-gravity poses. They then build G3T-Long, an incremental pipeline that uses the reduced rotational freedom to produce more accurate submap-based reconstructions. A sympathetic reader would care because the change in coordinate frame is presented as a simple, data-driven lever that directly raises accuracy without altering network architecture.

Core claim

Gravity-aligned frames let pointmaps share a common vertical axis across viewpoints, which reduces the rotational degrees of freedom needed to relate them; the resulting G3T model produces accurate upright pointmaps and camera-to-gravity poses, and the G3T-Long submap pipeline that operates on these predictions delivers significantly higher reconstruction accuracy than camera-centric baselines.

What carries the argument

Gravity Grounded Geometry Transformer (G3T), a model fine-tuned to predict pointmaps and poses directly in gravity-aligned frames.

If this is right

Upright pointmaps share one vertical axis across all views, so only two rotational degrees of freedom remain when aligning them.
G3T produces both the upright pointmaps and the camera-to-gravity pose estimates needed for incremental reconstruction.
G3T-Long, the submap-based pipeline, converts the reduced rotational freedom into measurably higher final accuracy.
The same gravity-aligned output format works for any base model that can be fine-tuned on aligned 3D data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Coordinate-frame choice becomes a first-class design decision for future feed-forward 3D models, comparable to network architecture.
The method could be combined with IMU or accelerometer data to supply the gravity direction when visual cues are weak.
Scenes with strong but varying gravity (tilted buildings, sloped terrain) may require an adaptive gravity vector per submap rather than a single global direction.

Load-bearing premise

Many real-world scenes contain strong structural cues with a consistent gravity direction that can be exploited by predicting pointmaps in gravity-aligned frames.

What would settle it

Reconstruction accuracy on a test set of scenes lacking consistent gravity direction (for example, underwater footage or microgravity environments) shows no gain or a drop relative to camera-centric baselines.

Figures

Figures reproduced from arXiv: 2605.27372 by Bharath Raj Nagoor Kani, Noah Snavely.

**Figure 1.** Figure 1: Gravity Grounded Geometry Transformer (G3T) predicts pointmaps aligned with scene gravity, leveraging structural cues inherent in natural scenes. We visualize uprightness using a groundparallel grid and height-dependent color encoding. Pointmaps produced by G3T show near-constant color in ground-parallel regions (such as floors, benches etc.) indicating upright-alignment (a), whereas those produced by VGG… view at source ↗

**Figure 2.** Figure 2: Camera-aligned and gravity-aligned coordinate frames. Feedforward 3D reconstruction methods such as VGGT [43] typically predict pointmaps in a camera coordinate frame. Such pointmaps can be related to each other using a pose π(s, R, t) ∈ Sim(3) that has 7 degrees of freedom (DoF). In contrast, G3T predicts pointmaps in a gravity-aligned coordinate frame. Such pointmaps can be related to each other using a … view at source ↗

**Figure 3.** Figure 3: Model architecture. G3T builds upon VGGT with two key modifications. First, the point head outputs pointmaps in the gravity-aligned frame of the first image (X G1 = {X G1 i }). Second, we replace VGGT’s camera head with two new heads: the local camera head, whose outputs capture gravity-to-camera rotation and camera intrinsics parameters in G l = {Gl i }; and the relative camera head, which capture 1-DoF r… view at source ↗

**Figure 4.** Figure 4: G3T can consistently place pointmaps in an upright frame. For the given set of input images, we compare VGGT predictions (red) made upright using GeoCalib, and our G3T predictions (blue) with ground-truth gravity-aligned pointmaps (green). We also render a grid depicting a plane parallel to the ground. We observe that the composition of VGGT and GeoCalib often produces “slanted” pointmap predictions that a… view at source ↗

**Figure 5.** Figure 5: Visualizing failure cases. G3T struggles to resolve upright alignment in scenes with ambiguous structural cues. We illustrate this with two self-captured examples. In (a), close-up floor images cause G3T to produce a slanted pointmap. In (b), horizontally rotated images of a vertically-aligned cabinet lead to pointmaps with incorrect orientation. depthmap unprojection results additionally validate that the… view at source ↗

**Figure 6.** Figure 6: G3T can consistently place pointmaps in an upright frame. For the given set of input images, we compare VGGT predictions (red) made upright using GeoCalib, and our G3T predictions (blue) with ground-truth gravity-aligned pointmaps (green). We also render a grid depicting a plane parallel to the ground. We observe that the composition of VGGT and GeoCalib often does not produce sufficiently upright pointmap… view at source ↗

**Figure 7.** Figure 7: Visualizing gravity-aligned ground-truth created using model_orientation_aligner. We visualize perspective fields [13] using gravity-aligned data created using model_orientation_aligner for a few samples from the DL3DV, 7Scenes, NRGBD and ETH3D datasets. For all samples, we can see that the up vectors are aligned with vertical structures present in the scene, thus validating the reliability of using model_… view at source ↗

**Figure 8.** Figure 8: Effect of number of overlapping frames in submap-based incremental reconstruction. Here, we plot pose metrics (APER, APEt) and structure metrics (ACC, COMP) as a function of number of overlapping frames, averaged across the 10 sequences from Tables 3 and 4 (lower is better for all metrics). While performance generally improves with more overlapping frames, we observe that G3T has a consistent lead across a… view at source ↗

read the original abstract

Modern feed-forward 3D reconstruction methods like VGGT predict pixel-aligned pointmaps in camera-centric coordinate frames. However, this choice of coordinate frame is not always optimal. We propose instead to predict pointmaps in upright, gravity-aligned frames that exploit strong structural cues present in many real-world scenes. Unlike camera-centric frames, gravity-aligned frames share a common vertical axis across viewpoints, reducing the rotational degrees of freedom needed to relate pointmaps to one another. To this end, we introduce the Gravity Grounded Geometry Transformer (G3T), fine-tuned from existing models on gravity-aligned 3D data. G3T produces highly accurate gravity-aware predictions, including upright pointmaps and camera-to-gravity poses. We further introduce G3T-Long, a submap-based incremental 3D reconstruction pipeline that leverages the reduced rotational degrees of freedom afforded by upright frames to achieve significantly improved reconstruction accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The core idea is to predict pointmaps in gravity-aligned frames instead of camera frames to cut rotational alignment work, but the abstract gives no numbers or experiments to show it actually helps.

read the letter

The paper's main move is to switch from camera-centric pointmaps (like in VGGT) to upright, gravity-aligned ones. Because many scenes have a consistent up direction, this shared vertical axis removes one rotational degree of freedom when relating different views. They introduce G3T, a fine-tuned transformer that outputs both the pointmaps and the camera-to-gravity poses, then build G3T-Long as an incremental pipeline that uses the simpler alignment to stitch submaps.

That modeling choice is internally consistent and directly addresses a real pain point in multi-view pointmap fusion. If the gravity direction is reliable, the reduced search space for relative pose should make incremental reconstruction easier and potentially more stable.

The obvious soft spot is the complete absence of any quantitative evidence. The abstract claims "significantly improved reconstruction accuracy" for G3T-Long, yet supplies no error metrics, baselines, datasets, or even qualitative examples. Without those, it is impossible to judge whether the benefit is real or just plausible on paper. The stress-test note correctly flags that the argument is coherent on its own terms, but coherence alone does not substitute for results.

This is the kind of targeted modeling tweak that people working on feed-forward 3D reconstruction (DUSt3R, VGGT, etc.) would want to see tested. A serious referee could check whether the gravity-aligned data preparation is clean, whether the fine-tuning actually delivers the promised accuracy lift, and whether the incremental pipeline scales without drift. The work is coherent enough to deserve that review rather than a desk reject.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes predicting pixel-aligned pointmaps in upright gravity-aligned coordinate frames rather than camera-centric frames (as in VGGT) to exploit consistent vertical axes across views and thereby reduce rotational degrees of freedom in multi-view alignment. It introduces the Gravity Grounded Geometry Transformer (G3T), obtained by fine-tuning existing models on gravity-aligned 3D data, which outputs upright pointmaps together with camera-to-gravity poses, and presents G3T-Long, a submap-based incremental reconstruction pipeline that is claimed to deliver significantly improved accuracy by leveraging the reduced rotational freedom.

Significance. If the claimed accuracy gains are demonstrated, the modeling choice of gravity-aligned frames would constitute a simple, parameter-free structural prior that could improve robustness of feed-forward pointmap methods on man-made and outdoor scenes without altering network architecture or training objectives.

major comments (1)

[Abstract] Abstract: the claim that G3T-Long 'achieves significantly improved reconstruction accuracy' is load-bearing for the central contribution yet is unsupported by any quantitative results, baselines, error metrics, or experimental protocol; without such evidence the magnitude and reliability of the improvement cannot be evaluated.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback. The single major comment concerns the abstract's claim of significantly improved accuracy for G3T-Long. We address it directly below and agree that the abstract should better reflect the supporting evidence present in the manuscript.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that G3T-Long 'achieves significantly improved reconstruction accuracy' is load-bearing for the central contribution yet is unsupported by any quantitative results, baselines, error metrics, or experimental protocol; without such evidence the magnitude and reliability of the improvement cannot be evaluated.

Authors: We agree that the abstract, as currently worded, makes a strong claim without embedding the supporting numbers or protocol. The full manuscript (Sections 4 and 5) reports quantitative comparisons on standard benchmarks, including absolute trajectory error and pointmap accuracy metrics against VGGT and other baselines, with the gravity-aligned formulation yielding consistent reductions in rotational error. To make the abstract self-contained and address the concern, we will revise it to include the key quantitative improvements and a brief reference to the evaluation protocol. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces a modeling choice to predict pointmaps in gravity-aligned upright frames rather than camera-centric ones, justified by the presence of consistent gravity directions in real-world scenes. This choice directly reduces rotational degrees of freedom between views by construction of the coordinate system, but the paper does not derive this as a 'prediction' from fitted parameters or reduce any central claim to self-citation. No equations, uniqueness theorems, or ansatzes are smuggled via self-citation; the pipeline (fine-tuning on gravity-aligned data, predicting camera-to-gravity poses, and incremental reconstruction) is presented as a coherent new approach without load-bearing reductions to inputs. The abstract and described method are self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on a domain assumption about consistent gravity direction in real scenes and the effectiveness of fine-tuning; no explicit free parameters or invented physical entities are stated in the abstract.

axioms (1)

domain assumption Real-world scenes often exhibit a consistent vertical direction due to gravity that can be exploited across viewpoints
Invoked to justify why gravity-aligned frames reduce rotational degrees of freedom compared with camera-centric frames.

pith-pipeline@v0.9.1-grok · 5686 in / 1191 out tokens · 46719 ms · 2026-06-29T18:27:43.637566+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

52 extracted references · 3 canonical work pages · 3 internal anchors

[1]

Neural rgb-d surface reconstruction

Dejan Azinovi´c, Ricardo Martin-Brualla, Dan B Goldman, Matthias Nießner, and Justus Thies. Neural rgb-d surface reconstruction. InCVPR, pages 6290–6301, June 2022

2022
[2]

ARKitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile RGB-d data

Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, and Elad Shulman. ARKitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile RGB-d data. In NeurIPS Datasets and Benchmarks, 2021

2021
[3]

Deep regression on manifolds: a 3D rotation case study

Romain Brégier. Deep regression on manifolds: a 3D rotation case study. In2021 International Conference on 3D Vision (3DV), 2021

2021
[4]

Can generative video models help pose estimation? InCVPR, 2025

Ruojin Cai, Jason Y Zhang, Philipp Henzler, Zhengqi Li, Noah Snavely, and Ricardo Martin- Brualla. Can generative video models help pose estimation? InCVPR, 2025

2025
[5]

Using vanishing points for camera calibration

Bruno Caprile and Vincent Torre. Using vanishing points for camera calibration. InIJCV, 1990

1990
[6]

Easi3r: Estimating disentangled motion from dust3r without training

Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. Easi3r: Estimating disentangled motion from dust3r without training. InICCV, 2025

2025
[7]

Ttt3r: 3d reconstruc- tion as test-time training.ICLR, 2026

Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. Ttt3r: 3d reconstruc- tion as test-time training.ICLR, 2026

2026
[8]

VGGT-Long: Chunk it, Loop it, Align it -- Pushing VGGT's Limits on Kilometer-scale Long RGB Sequences

Kai Deng, Zexin Ti, Jiawei Xu, Jian Yang, and Jin Xie. Vggt-long: Chunk it, loop it, align it – pushing vggt’s limits on kilometer-scale long rgb sequences.arXiv preprint arXiv:2507.16443, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Black, Trevor Darrell, and Angjoo Kanazawa

Haiwen Feng*, Junyi Zhang*, Qianqian Wang, Yufei Ye, Pengcheng Yu, Michael J. Black, Trevor Darrell, and Angjoo Kanazawa. St4rtrack: Simultaneous 4d reconstruction and tracking in the world. InICCV, 2025

2025
[10]

Seitz, and Richard Szeliski

Yasutaka Furukawa, Brian Curless, Steven M. Seitz, and Richard Szeliski. Manhattan-world stereo. InCVPR, 2009

2009
[11]

Barron, Noah Snavely, and Aleksander Holynski

Haian Jin, Rundi Wu, Tianyuan Zhang, Ruiqi Gao, Jonathan T. Barron, Noah Snavely, and Aleksander Holynski. ZipMap: Linear-time stateful 3d reconstruction via test-time training. In CVPR, 2026

2026
[12]

Stereo4d: Learning how things move in 3d from internet stereo videos

Linyi Jin, Richard Tucker, Zhengqi Li, David Fouhey, Noah Snavely, and Aleksander Holynski. Stereo4d: Learning how things move in 3d from internet stereo videos. InCVPR, 2025

2025
[13]

Linyi Jin, Jianming Zhang, Yannick Hold-Geoffroy, Oliver Wang, Kevin Matzen, Matthew Sticha, and David F. Fouhey. Perspective fields for single image camera calibration. InCVPR, 2023. 10

2023
[14]

MapAnything: Universal feed-forward metric 3D reconstruc- tion

Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, To- bias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, Jonathon Luiten, Manuel Lopez-Antequera, Samuel Rota Bulò, Christian Richardt, Deva Ramanan, Sebastian Scherer, and Peter Kontschieder. MapAnything: Universal feed-forward metric 3D reconstruc- tion....

2026
[15]

gradslam: Dense slam meets automatic differentiation

Jatavallabhula Krishna Murthy, Soroush Saryazdi, Ganesh Iyer, and Liam Paull. gradslam: Dense slam meets automatic differentiation. InIEEE International Conference on Robotics & Automation (ICRA), 2020

2020
[16]

Closed-form solutions to minimal absolute pose problems with known vertical direction

Zuzana Kukelova, Martin Bujnak, and Tomas Pajdla. Closed-form solutions to minimal absolute pose problems with known vertical direction. InACCV, 2010

2010
[17]

Canonical surface mapping via geometric cycle consistency

Nilesh Kulkarni, Abhinav Gupta, and Shubham Tulsiani. Canonical surface mapping via geometric cycle consistency. InICCV, 2019

2019
[18]

Automatic upright adjustment of photographs with robust camera calibration

Hyunjoon Lee, Eli Shechtman, Jue Wang, and Seungyong Lee. Automatic upright adjustment of photographs with robust camera calibration. InIEEE TPAMI, 2014

2014
[19]

Grounding image matching in 3d with mast3r

Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r. InECCV, 2024

2024
[20]

Megadepth: Learning single-view depth prediction from internet photos

Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. InCVPR, 2018

2018
[21]

Depth Anything 3: Recovering the Visual Space from Any Views

Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y . Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: recovering the visual space from any views.arXiv preprint arXiv:2511.10647, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision

Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. InCVPR, pages 22160–22169, 2024

2024
[23]

Vggt-slam: Dense rgb slam optimized on the sl (4) manifold

Dominic Maggio, Hyungtae Lim, and Luca Carlone. Vggt-slam: Dense rgb slam optimized on the sl (4) manifold. InNeurIPS, 2025

2025
[24]

Gravity-aligned rotation averaging with circular regression

Linfei Pan, Marc Pollefeys, and Dániel Baráth. Gravity-aligned rotation averaging with circular regression. InECCV, 2024

2024
[25]

Schönberger, and Marc Pollefeys

Zador Pataki, Paul-Edouard Sarlin, Johannes L. Schönberger, and Marc Pollefeys. MP-SfM: Monocular Surface Priors for Robust Structure-from-Motion. InCVPR, 2025

2025
[26]

Susskind

Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M. Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. InICCV, 2021

2021
[27]

Homography based egomotion estimation with a common direction

Olivier Saurer, Pascal Vasseur, Rémi Boutteau, Cédric Demonceaux, Marc Pollefeys, and Friedrich Fraundorfer. Homography based egomotion estimation with a common direction. IEEE TPAMI, 2017

2017
[28]

Structure-from-motion revisited

Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016

2016
[29]

Pixelwise view selection for unstructured multi-view stereo

Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. InEuropean Conference on Computer Vision (ECCV), 2016

2016
[30]

Schönberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger

Thomas Schöps, Johannes L. Schönberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. InCVPR, 2017

2017
[31]

World-grounded human motion recovery via gravity-view coordinates

Zehong Shen, Huaijin Pi, Yan Xia, Zhi Cen, Sida Peng, Zechen Hu, Hujun Bao, Ruizhen Hu, and Xiaowei Zhou. World-grounded human motion recovery via gravity-view coordinates. In SIGGRAPH Asia Conference Proceedings, 2024. 11

2024
[32]

Scene coordinate regression forests for camera relocalization in rgb-d images

Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. Scene coordinate regression forests for camera relocalization in rgb-d images. In CVPR, June 2013

2013
[33]

Jefferey A. Shufelt. Performance evaluation and analysis of vanishing point detection techniques. IEEE TPAMI, 21(3):282–288, 1999

1999
[34]

Sturm, N

J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers. A benchmark for the evaluation of rgb-d slam systems. InProc. of the Int. Conf. on Intelligent Robot Systems (IROS), 2012

2012
[35]

Mv-dust3r+: Single-stage scene reconstruction from sparse views in 2 seconds

Zhenggang Tang, Yuchen Fan, Dilin Wang, Hongyu Xu, Rakesh Ranjan, Alexander Schwing, and Zhicheng Yan. Mv-dust3r+: Single-stage scene reconstruction from sparse views in 2 seconds. InCVPR, 2025

2025
[36]

Non-iterative approach for fast and accurate vanishing point detection

Jean-Philippe Tardif. Non-iterative approach for fast and accurate vanishing point detection. In ICCV, 2009

2009
[37]

Megascenes: Scene-level view synthesis at scale

Joseph Tung, Gene Chou, Ruojin Cai, Guandao Yang, Kai Zhang, Gordon Wetzstein, Bharath Hariharan, and Noah Snavely. Megascenes: Scene-level view synthesis at scale. InECCV, 2024

2024
[38]

S. Umeyama. Least-squares estimation of transformation parameters between two point patterns. IEEE TPAMI, 1991

1991
[39]

GeoCalib: Single-image Calibration with Geometric Optimization

Alexander Veicht, Paul-Edouard Sarlin, Philipp Lindenberger, and Marc Pollefeys. GeoCalib: Single-image Calibration with Geometric Optimization. InECCV, 2024

2024
[40]

PyPose: A library for robot learning with physics-based optimization

Chen Wang, Dasong Gao, Kuan Xu, Junyi Geng, Yaoyu Hu, Yuheng Qiu, Bowen Li, Fan Yang, Brady Moon, Abhinav Pandey, Aryan, Jiahe Xu, Tianhao Wu, Haonan He, Daning Huang, Zhongqiang Ren, Shibo Zhao, Taimeng Fu, Pranay Reddy, Xiao Lin, Wenshan Wang, Jingnan Shi, Rajat Talak, Kun Cao, Yi Du, Han Wang, Huai Yu, Shanzhao Wang, Siyu Chen, Ananth Kashyap, Rohan Ba...

2023
[41]

He Wang, Srinath Sridhar, Jingwei Huang, Julien Valentin, Shuran Song, and Leonidas J. Guibas. Normalized object coordinate space for category-level 6d object pose and size estimation. In CVPR, 2019

2019
[42]

3d reconstruction with spatial memory

Hengyi Wang and Lourdes Agapito. 3d reconstruction with spatial memory. InInternational Conference on 3D Vision (3DV), 2025

2025
[43]

Vggt: Visual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InCVPR, 2025

2025
[44]

Efros, and Angjoo Kanazawa

Qianqian Wang*, Yifei Zhang*, Aleksander Holynski, Alexei A. Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. InCVPR, 2025

2025
[45]

Dust3r: Geometric 3d vision made easy

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. InCVPR, 2024

2024
[46]

Tartanair: A dataset to push the limits of visual slam

Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu, Yuheng Qiu, Chen Wang, Yafei Hu, Ashish Kapoor, and Sebastian Scherer. Tartanair: A dataset to push the limits of visual slam. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020

2020
[47]

π3: Permutation-equivariant visual geometry learning.ICLR, 2026

Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. π3: Permutation-equivariant visual geometry learning.ICLR, 2026

2026
[48]

Changchang Wu, Sameer Agarwal, Brian Curless, and Steven M. Seitz. Schematic surface reconstruction. InCVPR, 2012

2012
[49]

Uprightnet: geometry-aware camera orientation estimation from single images

Wenqi Xian, Zhengqi Li, Matthew Fisher, Jonathan Eisenmann, Eli Shechtman, and Noah Snavely. Uprightnet: geometry-aware camera orientation estimation from single images. In ICCV, 2019. 12

2019
[50]

Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli

Jianing Yang, Alexander Sax, Kevin J. Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass. InCVPR, 2025

2025
[51]

Monst3r: A simple approach for estimating geometry in the presence of motion.ICLR, 2025

Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. Monst3r: A simple approach for estimating geometry in the presence of motion.ICLR, 2025

2025
[52]

LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory

Junyi Zhang, Charles Herrmann, Junhwa Hur, Chen Sun, Ming-Hsuan Yang, Forrester Cole, Trevor Darrell, and Deqing Sun. Loger: Long-context geometric reconstruction with hybrid memory.arXiv preprint arXiv:2603.03269, 2026. 13 Supplementary Material A Additional qualitative results We have attached additional qualitative results in Figure 6. We can see that ...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[1] [1]

Neural rgb-d surface reconstruction

Dejan Azinovi´c, Ricardo Martin-Brualla, Dan B Goldman, Matthias Nießner, and Justus Thies. Neural rgb-d surface reconstruction. InCVPR, pages 6290–6301, June 2022

2022

[2] [2]

ARKitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile RGB-d data

Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, and Elad Shulman. ARKitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile RGB-d data. In NeurIPS Datasets and Benchmarks, 2021

2021

[3] [3]

Deep regression on manifolds: a 3D rotation case study

Romain Brégier. Deep regression on manifolds: a 3D rotation case study. In2021 International Conference on 3D Vision (3DV), 2021

2021

[4] [4]

Can generative video models help pose estimation? InCVPR, 2025

Ruojin Cai, Jason Y Zhang, Philipp Henzler, Zhengqi Li, Noah Snavely, and Ricardo Martin- Brualla. Can generative video models help pose estimation? InCVPR, 2025

2025

[5] [5]

Using vanishing points for camera calibration

Bruno Caprile and Vincent Torre. Using vanishing points for camera calibration. InIJCV, 1990

1990

[6] [6]

Easi3r: Estimating disentangled motion from dust3r without training

Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. Easi3r: Estimating disentangled motion from dust3r without training. InICCV, 2025

2025

[7] [7]

Ttt3r: 3d reconstruc- tion as test-time training.ICLR, 2026

Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. Ttt3r: 3d reconstruc- tion as test-time training.ICLR, 2026

2026

[8] [8]

VGGT-Long: Chunk it, Loop it, Align it -- Pushing VGGT's Limits on Kilometer-scale Long RGB Sequences

Kai Deng, Zexin Ti, Jiawei Xu, Jian Yang, and Jin Xie. Vggt-long: Chunk it, loop it, align it – pushing vggt’s limits on kilometer-scale long rgb sequences.arXiv preprint arXiv:2507.16443, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[9] [9]

Black, Trevor Darrell, and Angjoo Kanazawa

Haiwen Feng*, Junyi Zhang*, Qianqian Wang, Yufei Ye, Pengcheng Yu, Michael J. Black, Trevor Darrell, and Angjoo Kanazawa. St4rtrack: Simultaneous 4d reconstruction and tracking in the world. InICCV, 2025

2025

[10] [10]

Seitz, and Richard Szeliski

Yasutaka Furukawa, Brian Curless, Steven M. Seitz, and Richard Szeliski. Manhattan-world stereo. InCVPR, 2009

2009

[11] [11]

Barron, Noah Snavely, and Aleksander Holynski

Haian Jin, Rundi Wu, Tianyuan Zhang, Ruiqi Gao, Jonathan T. Barron, Noah Snavely, and Aleksander Holynski. ZipMap: Linear-time stateful 3d reconstruction via test-time training. In CVPR, 2026

2026

[12] [12]

Stereo4d: Learning how things move in 3d from internet stereo videos

Linyi Jin, Richard Tucker, Zhengqi Li, David Fouhey, Noah Snavely, and Aleksander Holynski. Stereo4d: Learning how things move in 3d from internet stereo videos. InCVPR, 2025

2025

[13] [13]

Linyi Jin, Jianming Zhang, Yannick Hold-Geoffroy, Oliver Wang, Kevin Matzen, Matthew Sticha, and David F. Fouhey. Perspective fields for single image camera calibration. InCVPR, 2023. 10

2023

[14] [14]

MapAnything: Universal feed-forward metric 3D reconstruc- tion

Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, To- bias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, Jonathon Luiten, Manuel Lopez-Antequera, Samuel Rota Bulò, Christian Richardt, Deva Ramanan, Sebastian Scherer, and Peter Kontschieder. MapAnything: Universal feed-forward metric 3D reconstruc- tion....

2026

[15] [15]

gradslam: Dense slam meets automatic differentiation

Jatavallabhula Krishna Murthy, Soroush Saryazdi, Ganesh Iyer, and Liam Paull. gradslam: Dense slam meets automatic differentiation. InIEEE International Conference on Robotics & Automation (ICRA), 2020

2020

[16] [16]

Closed-form solutions to minimal absolute pose problems with known vertical direction

Zuzana Kukelova, Martin Bujnak, and Tomas Pajdla. Closed-form solutions to minimal absolute pose problems with known vertical direction. InACCV, 2010

2010

[17] [17]

Canonical surface mapping via geometric cycle consistency

Nilesh Kulkarni, Abhinav Gupta, and Shubham Tulsiani. Canonical surface mapping via geometric cycle consistency. InICCV, 2019

2019

[18] [18]

Automatic upright adjustment of photographs with robust camera calibration

Hyunjoon Lee, Eli Shechtman, Jue Wang, and Seungyong Lee. Automatic upright adjustment of photographs with robust camera calibration. InIEEE TPAMI, 2014

2014

[19] [19]

Grounding image matching in 3d with mast3r

Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r. InECCV, 2024

2024

[20] [20]

Megadepth: Learning single-view depth prediction from internet photos

Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. InCVPR, 2018

2018

[21] [21]

Depth Anything 3: Recovering the Visual Space from Any Views

Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y . Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: recovering the visual space from any views.arXiv preprint arXiv:2511.10647, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[22] [22]

Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision

Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. InCVPR, pages 22160–22169, 2024

2024

[23] [23]

Vggt-slam: Dense rgb slam optimized on the sl (4) manifold

Dominic Maggio, Hyungtae Lim, and Luca Carlone. Vggt-slam: Dense rgb slam optimized on the sl (4) manifold. InNeurIPS, 2025

2025

[24] [24]

Gravity-aligned rotation averaging with circular regression

Linfei Pan, Marc Pollefeys, and Dániel Baráth. Gravity-aligned rotation averaging with circular regression. InECCV, 2024

2024

[25] [25]

Schönberger, and Marc Pollefeys

Zador Pataki, Paul-Edouard Sarlin, Johannes L. Schönberger, and Marc Pollefeys. MP-SfM: Monocular Surface Priors for Robust Structure-from-Motion. InCVPR, 2025

2025

[26] [26]

Susskind

Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M. Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. InICCV, 2021

2021

[27] [27]

Homography based egomotion estimation with a common direction

Olivier Saurer, Pascal Vasseur, Rémi Boutteau, Cédric Demonceaux, Marc Pollefeys, and Friedrich Fraundorfer. Homography based egomotion estimation with a common direction. IEEE TPAMI, 2017

2017

[28] [28]

Structure-from-motion revisited

Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016

2016

[29] [29]

Pixelwise view selection for unstructured multi-view stereo

Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. InEuropean Conference on Computer Vision (ECCV), 2016

2016

[30] [30]

Schönberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger

Thomas Schöps, Johannes L. Schönberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. InCVPR, 2017

2017

[31] [31]

World-grounded human motion recovery via gravity-view coordinates

Zehong Shen, Huaijin Pi, Yan Xia, Zhi Cen, Sida Peng, Zechen Hu, Hujun Bao, Ruizhen Hu, and Xiaowei Zhou. World-grounded human motion recovery via gravity-view coordinates. In SIGGRAPH Asia Conference Proceedings, 2024. 11

2024

[32] [32]

Scene coordinate regression forests for camera relocalization in rgb-d images

Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. Scene coordinate regression forests for camera relocalization in rgb-d images. In CVPR, June 2013

2013

[33] [33]

Jefferey A. Shufelt. Performance evaluation and analysis of vanishing point detection techniques. IEEE TPAMI, 21(3):282–288, 1999

1999

[34] [34]

Sturm, N

J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers. A benchmark for the evaluation of rgb-d slam systems. InProc. of the Int. Conf. on Intelligent Robot Systems (IROS), 2012

2012

[35] [35]

Mv-dust3r+: Single-stage scene reconstruction from sparse views in 2 seconds

Zhenggang Tang, Yuchen Fan, Dilin Wang, Hongyu Xu, Rakesh Ranjan, Alexander Schwing, and Zhicheng Yan. Mv-dust3r+: Single-stage scene reconstruction from sparse views in 2 seconds. InCVPR, 2025

2025

[36] [36]

Non-iterative approach for fast and accurate vanishing point detection

Jean-Philippe Tardif. Non-iterative approach for fast and accurate vanishing point detection. In ICCV, 2009

2009

[37] [37]

Megascenes: Scene-level view synthesis at scale

Joseph Tung, Gene Chou, Ruojin Cai, Guandao Yang, Kai Zhang, Gordon Wetzstein, Bharath Hariharan, and Noah Snavely. Megascenes: Scene-level view synthesis at scale. InECCV, 2024

2024

[38] [38]

S. Umeyama. Least-squares estimation of transformation parameters between two point patterns. IEEE TPAMI, 1991

1991

[39] [39]

GeoCalib: Single-image Calibration with Geometric Optimization

Alexander Veicht, Paul-Edouard Sarlin, Philipp Lindenberger, and Marc Pollefeys. GeoCalib: Single-image Calibration with Geometric Optimization. InECCV, 2024

2024

[40] [40]

PyPose: A library for robot learning with physics-based optimization

Chen Wang, Dasong Gao, Kuan Xu, Junyi Geng, Yaoyu Hu, Yuheng Qiu, Bowen Li, Fan Yang, Brady Moon, Abhinav Pandey, Aryan, Jiahe Xu, Tianhao Wu, Haonan He, Daning Huang, Zhongqiang Ren, Shibo Zhao, Taimeng Fu, Pranay Reddy, Xiao Lin, Wenshan Wang, Jingnan Shi, Rajat Talak, Kun Cao, Yi Du, Han Wang, Huai Yu, Shanzhao Wang, Siyu Chen, Ananth Kashyap, Rohan Ba...

2023

[41] [41]

He Wang, Srinath Sridhar, Jingwei Huang, Julien Valentin, Shuran Song, and Leonidas J. Guibas. Normalized object coordinate space for category-level 6d object pose and size estimation. In CVPR, 2019

2019

[42] [42]

3d reconstruction with spatial memory

Hengyi Wang and Lourdes Agapito. 3d reconstruction with spatial memory. InInternational Conference on 3D Vision (3DV), 2025

2025

[43] [43]

Vggt: Visual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InCVPR, 2025

2025

[44] [44]

Efros, and Angjoo Kanazawa

Qianqian Wang*, Yifei Zhang*, Aleksander Holynski, Alexei A. Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. InCVPR, 2025

2025

[45] [45]

Dust3r: Geometric 3d vision made easy

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. InCVPR, 2024

2024

[46] [46]

Tartanair: A dataset to push the limits of visual slam

Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu, Yuheng Qiu, Chen Wang, Yafei Hu, Ashish Kapoor, and Sebastian Scherer. Tartanair: A dataset to push the limits of visual slam. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020

2020

[47] [47]

π3: Permutation-equivariant visual geometry learning.ICLR, 2026

Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. π3: Permutation-equivariant visual geometry learning.ICLR, 2026

2026

[48] [48]

Changchang Wu, Sameer Agarwal, Brian Curless, and Steven M. Seitz. Schematic surface reconstruction. InCVPR, 2012

2012

[49] [49]

Uprightnet: geometry-aware camera orientation estimation from single images

Wenqi Xian, Zhengqi Li, Matthew Fisher, Jonathan Eisenmann, Eli Shechtman, and Noah Snavely. Uprightnet: geometry-aware camera orientation estimation from single images. In ICCV, 2019. 12

2019

[50] [50]

Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli

Jianing Yang, Alexander Sax, Kevin J. Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass. InCVPR, 2025

2025

[51] [51]

Monst3r: A simple approach for estimating geometry in the presence of motion.ICLR, 2025

Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. Monst3r: A simple approach for estimating geometry in the presence of motion.ICLR, 2025

2025

[52] [52]

LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory

Junyi Zhang, Charles Herrmann, Junhwa Hur, Chen Sun, Ming-Hsuan Yang, Forrester Cole, Trevor Darrell, and Deqing Sun. Loger: Long-context geometric reconstruction with hybrid memory.arXiv preprint arXiv:2603.03269, 2026. 13 Supplementary Material A Additional qualitative results We have attached additional qualitative results in Figure 6. We can see that ...

work page internal anchor Pith review Pith/arXiv arXiv 2026