arxiv: 2605.27372 · v1 · pith:PTH3FLENnew · submitted 2026-05-26 · 💻 cs.CV

G3T Up! Gravity Aligned Coordinate Frames Simplify Pointmap Processing

Pith reviewed 2026-06-29 18:27 UTC · model grok-4.3

classification 💻 cs.CV
keywords: 3D reconstruction · pointmaps · gravity alignment · feed-forward models · pose estimation · coordinate frames · incremental reconstruction

The pith

Predicting pointmaps in gravity-aligned frames rather than camera-centric ones improves 3D reconstruction by sharing a common vertical axis across views.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that feed-forward 3D reconstruction should move from camera-centric pointmap predictions to upright, gravity-aligned frames. This exploits the consistent vertical direction present in many real scenes: pointmaps from different viewpoints already share one axis, so fewer rotational degrees of freedom remain to be estimated when aligning them. The authors introduce G3T, a transformer fine-tuned on gravity-aligned data that outputs both upright pointmaps and camera-to-gravity poses. They then build G3T-Long, an incremental pipeline that uses the reduced rotational freedom to produce more accurate submap-based reconstructions. A sympathetic reader would care because the change in coordinate frame is presented as a simple, data-driven lever that raises accuracy while leaving the VGGT backbone in place and modifying only its output heads (Figure 3).

Core claim

Gravity-aligned frames let pointmaps share a common vertical axis across viewpoints, which reduces the rotational degrees of freedom needed to relate them; the resulting G3T model produces accurate upright pointmaps and camera-to-gravity poses, and the G3T-Long submap pipeline that operates on these predictions delivers significantly higher reconstruction accuracy than camera-centric baselines.

What carries the argument

Gravity Grounded Geometry Transformer (G3T), a model fine-tuned to predict pointmaps and poses directly in gravity-aligned frames.
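
To make the notion of a camera-to-gravity rotation concrete, here is a minimal sketch. It is not the authors' implementation: G3T predicts upright pointmaps directly, whereas this example rotates camera-frame points after the fact. It assumes the up direction is already known as a vector in the camera frame and that the upright frame uses +z as up. The function names (`rotation_aligning`, `to_gravity_frame`) are hypothetical, and the code uses plain Python so that it runs without dependencies.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def apply(R, p):
    """Multiply a 3x3 matrix (nested lists) by a 3-vector."""
    return [dot(row, p) for row in R]

def rotation_aligning(a, b):
    """Return a rotation R such that R applied to unit(a) gives unit(b)."""
    a, b = normalize(a), normalize(b)
    c = dot(a, b)
    eye = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    if c < -1.0 + 1e-9:
        # Antiparallel vectors: rotate by pi about any axis n perpendicular
        # to a, which is R = 2 n n^T - I.
        helper = [1.0, 0.0, 0.0] if abs(a[0]) < 0.9 else [0.0, 1.0, 0.0]
        n = normalize(cross(a, helper))
        return [[2.0 * n[i] * n[j] - eye[i][j] for j in range(3)]
                for i in range(3)]
    # Rodrigues form: R = I + K + K^2 / (1 + c), with K the cross-product
    # matrix of v = a x b.
    v = cross(a, b)
    K = [[0.0, -v[2], v[1]], [v[2], 0.0, -v[0]], [-v[1], v[0], 0.0]]
    K2 = [[sum(K[i][k] * K[k][j] for k in range(3)) for j in range(3)]
          for i in range(3)]
    return [[eye[i][j] + K[i][j] + K2[i][j] / (1.0 + c) for j in range(3)]
            for i in range(3)]

def to_gravity_frame(points_cam, up_in_cam, up_axis=(0.0, 0.0, 1.0)):
    """Rotate camera-frame points so that up_in_cam maps onto up_axis."""
    R = rotation_aligning(up_in_cam, up_axis)
    return [apply(R, p) for p in points_cam], R

# Example: a camera whose up direction is roughly -y (an assumed convention).
points = [[1.0, 2.0, 3.0], [-0.5, 0.3, 4.0]]
upright_points, R_cam_to_gravity = to_gravity_frame(points, [0.1, -0.9, 0.2])
```

After this rotation, the z coordinate of each point equals its height along the up direction, which is the property Figure 1 visualizes with height-dependent color.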

If this is right

  • Upright pointmaps share one vertical axis across all views, so only one rotational degree of freedom (rotation about that axis) remains when aligning them.
  • G3T produces both the upright pointmaps and the camera-to-gravity pose estimates needed for incremental reconstruction.
  • G3T-Long, the submap-based pipeline, converts the reduced rotational freedom into measurably higher final accuracy.
  • Because G3T is described as fine-tuned from existing models, the same gravity-aligned output format plausibly carries over to other base models that can be fine-tuned on aligned 3D data.
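
The effect of the shared vertical axis on alignment can be sketched in a few lines. A general similarity transform has 7 degrees of freedom (Figure 2). If both point sets are already upright, the rotation reduces to a single yaw angle, leaving 1 scale + 1 yaw + 3 translation = 5 unknowns. This count is simple arithmetic rather than a figure quoted from the paper. The example below is an illustrative least-squares solver under those assumptions, not the G3T-Long alignment procedure; `align_upright` is a hypothetical name, and it assumes known point correspondences and +z as the shared up axis.

```python
import math

def align_upright(src, dst):
    """Least-squares scale, yaw and translation with dst_i ~ s * Rz(yaw) * src_i + t.

    src and dst are equal-length lists of corresponding [x, y, z] points,
    both expressed in upright frames whose z axis is the shared vertical.
    """
    n = len(src)
    mp = [sum(p[k] for p in src) / n for k in range(3)]  # source centroid
    mq = [sum(q[k] for q in dst) / n for k in range(3)]  # target centroid
    A = B = C = pp = 0.0
    for p, q in zip(src, dst):
        px, py, pz = (p[k] - mp[k] for k in range(3))
        qx, qy, qz = (q[k] - mq[k] for k in range(3))
        A += px * qx + py * qy   # coefficient of cos(yaw) in the correlation
        B += px * qy - py * qx   # coefficient of sin(yaw) in the correlation
        C += pz * qz             # vertical term, unaffected by yaw
        pp += px * px + py * py + pz * pz
    yaw = math.atan2(B, A)                    # maximizes A*cos + B*sin
    scale = (math.hypot(A, B) + C) / pp       # optimal scale for that yaw
    c, s = math.cos(yaw), math.sin(yaw)
    t = [mq[0] - scale * (c * mp[0] - s * mp[1]),
         mq[1] - scale * (s * mp[0] + c * mp[1]),
         mq[2] - scale * mp[2]]
    return scale, yaw, t
```

Because the yaw has a closed-form solution through `atan2`, no singular value decomposition is needed, in contrast to a full 3D rotation fit such as Umeyama's method [38].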

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Coordinate-frame choice becomes a first-class design decision for future feed-forward 3D models, comparable to network architecture.
  • The method could be combined with IMU or accelerometer data to supply the gravity direction when visual cues are weak.
  • Scenes whose dominant structures deviate from true gravity (tilted buildings, sloped terrain) may need a per-submap up-direction estimate rather than a single global direction.

Load-bearing premise

Many real-world scenes contain strong structural cues with a consistent gravity direction that can be exploited by predicting pointmaps in gravity-aligned frames.

What would settle it

Reconstruction accuracy on a test set of scenes lacking consistent gravity direction (for example, underwater footage or microgravity environments) shows no gain or a drop relative to camera-centric baselines.

Figures

Figures reproduced from arXiv: 2605.27372 by Bharath Raj Nagoor Kani, Noah Snavely.

Figure 1: Gravity Grounded Geometry Transformer (G3T) predicts pointmaps aligned with scene gravity, leveraging structural cues inherent in natural scenes. We visualize uprightness using a ground-parallel grid and height-dependent color encoding. Pointmaps produced by G3T show near-constant color in ground-parallel regions (such as floors, benches, etc.) indicating upright-alignment (a), whereas those produced by VGG…
Figure 2: Camera-aligned and gravity-aligned coordinate frames. Feedforward 3D reconstruction methods such as VGGT [43] typically predict pointmaps in a camera coordinate frame. Such pointmaps can be related to each other using a pose π(s, R, t) ∈ Sim(3) that has 7 degrees of freedom (DoF). In contrast, G3T predicts pointmaps in a gravity-aligned coordinate frame. Such pointmaps can be related to each other using a …
Figure 3: Model architecture. G3T builds upon VGGT with two key modifications. First, the point head outputs pointmaps in the gravity-aligned frame of the first image ($X^{G_1} = \{X^{G_1}_i\}$). Second, we replace VGGT’s camera head with two new heads: the local camera head, whose outputs capture gravity-to-camera rotation and camera intrinsics parameters in $G^l = \{G^l_i\}$; and the relative camera head, which captures 1-DoF r…
Figure 4: G3T can consistently place pointmaps in an upright frame. For the given set of input images, we compare VGGT predictions (red) made upright using GeoCalib, and our G3T predictions (blue) with ground-truth gravity-aligned pointmaps (green). We also render a grid depicting a plane parallel to the ground. We observe that the composition of VGGT and GeoCalib often produces “slanted” pointmap predictions that a…
Figure 5: Visualizing failure cases. G3T struggles to resolve upright alignment in scenes with ambiguous structural cues. We illustrate this with two self-captured examples. In (a), close-up floor images cause G3T to produce a slanted pointmap. In (b), horizontally rotated images of a vertically-aligned cabinet lead to pointmaps with incorrect orientation. depthmap unprojection results additionally validate that the…
Figure 6: G3T can consistently place pointmaps in an upright frame. For the given set of input images, we compare VGGT predictions (red) made upright using GeoCalib, and our G3T predictions (blue) with ground-truth gravity-aligned pointmaps (green). We also render a grid depicting a plane parallel to the ground. We observe that the composition of VGGT and GeoCalib often does not produce sufficiently upright pointmap…
Figure 7: Visualizing gravity-aligned ground-truth created using model_orientation_aligner. We visualize perspective fields [13] using gravity-aligned data created using model_orientation_aligner for a few samples from the DL3DV, 7Scenes, NRGBD and ETH3D datasets. For all samples, we can see that the up vectors are aligned with vertical structures present in the scene, thus validating the reliability of using model_…
Figure 8: Effect of number of overlapping frames in submap-based incremental reconstruction. Here, we plot pose metrics (APER, APEt) and structure metrics (ACC, COMP) as a function of number of overlapping frames, averaged across the 10 sequences from Tables 3 and 4 (lower is better for all metrics). While performance generally improves with more overlapping frames, we observe that G3T has a consistent lead across a…
Abstract

Modern feed-forward 3D reconstruction methods like VGGT predict pixel-aligned pointmaps in camera-centric coordinate frames. However, this choice of coordinate frame is not always optimal. We propose instead to predict pointmaps in upright, gravity-aligned frames that exploit strong structural cues present in many real-world scenes. Unlike camera-centric frames, gravity-aligned frames share a common vertical axis across viewpoints, reducing the rotational degrees of freedom needed to relate pointmaps to one another. To this end, we introduce the Gravity Grounded Geometry Transformer (G3T), fine-tuned from existing models on gravity-aligned 3D data. G3T produces highly accurate gravity-aware predictions, including upright pointmaps and camera-to-gravity poses. We further introduce G3T-Long, a submap-based incremental 3D reconstruction pipeline that leverages the reduced rotational degrees of freedom afforded by upright frames to achieve significantly improved reconstruction accuracy.
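
The abstract does not say how G3T-Long forms or aligns its submaps. As a rough illustration of the general idea only, the sketch below splits a frame sequence into fixed-size submaps that share a number of frames with their neighbors, which is the quantity varied in Figure 8. The function name `split_into_submaps` and the parameters `submap_size` and `overlap` are hypothetical, and the policy of letting the last submap be shorter is an assumption.

```python
def split_into_submaps(num_frames, submap_size, overlap):
    """Return lists of frame indices; consecutive lists share `overlap` frames."""
    if not 0 <= overlap < submap_size:
        raise ValueError("overlap must satisfy 0 <= overlap < submap_size")
    if num_frames <= 0:
        return []
    stride = submap_size - overlap  # new frames contributed by each submap
    submaps = []
    start = 0
    while True:
        end = min(start + submap_size, num_frames)
        submaps.append(list(range(start, end)))
        if end == num_frames:
            break  # the sequence is fully covered
        start += stride
    return submaps

# Ten frames, submaps of four frames, one shared frame between neighbors.
print(split_into_submaps(10, 4, 1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

The shared frames give each pair of neighboring submaps corresponding points, which is what an alignment step between submaps would consume.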

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes predicting pixel-aligned pointmaps in upright gravity-aligned coordinate frames rather than camera-centric frames (as in VGGT) to exploit consistent vertical axes across views and thereby reduce rotational degrees of freedom in multi-view alignment. It introduces the Gravity Grounded Geometry Transformer (G3T), obtained by fine-tuning existing models on gravity-aligned 3D data, which outputs upright pointmaps together with camera-to-gravity poses, and presents G3T-Long, a submap-based incremental reconstruction pipeline that is claimed to deliver significantly improved accuracy by leveraging the reduced rotational freedom.

Significance. If the claimed accuracy gains are demonstrated, the modeling choice of gravity-aligned frames would constitute a simple structural prior that could improve robustness of feed-forward pointmap methods on man-made and outdoor scenes, with changes confined to the output heads of an existing network.

major comments (1)
  1. [Abstract] The claim that G3T-Long 'achieves significantly improved reconstruction accuracy' is load-bearing for the central contribution yet is unsupported by any quantitative results, baselines, error metrics, or experimental protocol; without such evidence the magnitude and reliability of the improvement cannot be evaluated.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. The single major comment concerns the abstract's claim of significantly improved accuracy for G3T-Long. We address it directly below and agree that the abstract should better reflect the supporting evidence present in the manuscript.

  1. Referee: [Abstract] The claim that G3T-Long 'achieves significantly improved reconstruction accuracy' is load-bearing for the central contribution yet is unsupported by any quantitative results, baselines, error metrics, or experimental protocol; without such evidence the magnitude and reliability of the improvement cannot be evaluated.

    Authors: We agree that the abstract, as currently worded, makes a strong claim without embedding the supporting numbers or protocol. The full manuscript (Sections 4 and 5) reports quantitative comparisons on standard benchmarks, including absolute trajectory error and pointmap accuracy metrics against VGGT and other baselines, with the gravity-aligned formulation yielding consistent reductions in rotational error. To make the abstract self-contained and address the concern, we will revise it to include the key quantitative improvements and a brief reference to the evaluation protocol.

    Revision: yes
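
The rebuttal refers to rotational error without defining it on this page. One common choice, shown here as an assumption rather than the paper's stated metric, is the geodesic angle between an estimated and a ground-truth rotation matrix. The helper names `rotation_angle_deg` and `rot_z` are hypothetical.

```python
import math

def rotation_angle_deg(R_est, R_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    # trace(R_est^T R_gt) equals the sum of elementwise products.
    tr = sum(R_est[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, (tr - 1.0) / 2.0))
    return math.degrees(math.acos(cos_theta))

def rot_z(deg):
    """Rotation about the vertical z axis by `deg` degrees."""
    a = math.radians(deg)
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

# Two yaw rotations that differ by 20 degrees give an error of about 20.
error = rotation_angle_deg(rot_z(30.0), rot_z(10.0))
```

For two upright frames that differ only by yaw, this angle reduces to the absolute yaw difference, which is one way to see why a shared vertical axis makes the error easier to interpret.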

Circularity Check

0 steps flagged

No significant circularity detected

The paper introduces a modeling choice to predict pointmaps in gravity-aligned upright frames rather than camera-centric ones, justified by the presence of consistent gravity directions in real-world scenes. This choice directly reduces rotational degrees of freedom between views by construction of the coordinate system, but the paper does not derive this as a 'prediction' from fitted parameters or reduce any central claim to self-citation. No equations, uniqueness theorems, or ansatzes are smuggled via self-citation; the pipeline (fine-tuning on gravity-aligned data, predicting camera-to-gravity poses, and incremental reconstruction) is presented as a coherent new approach without load-bearing reductions to inputs. The abstract and described method are self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on a domain assumption about consistent gravity direction in real scenes and the effectiveness of fine-tuning; no explicit free parameters or invented physical entities are stated in the abstract.

axioms (1)
  • Domain assumption: real-world scenes often exhibit a consistent vertical direction due to gravity that can be exploited across viewpoints.
    Invoked to justify why gravity-aligned frames reduce rotational degrees of freedom compared with camera-centric frames.

pith-pipeline@v0.9.1-grok · 5686 in / 1191 out tokens · 46719 ms · 2026-06-29T18:27:43.637566+00:00 · methodology

Reference graph

Works this paper leans on

52 extracted references · 3 canonical work pages · 3 internal anchors

  1. [1]

    Neural rgb-d surface reconstruction

    Dejan Azinović, Ricardo Martin-Brualla, Dan B Goldman, Matthias Nießner, and Justus Thies. Neural rgb-d surface reconstruction. In CVPR, pages 6290–6301, June 2022

  2. [2]

    ARKitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile RGB-d data

    Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, and Elad Shulman. ARKitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile RGB-d data. In NeurIPS Datasets and Benchmarks, 2021

  3. [3]

    Deep regression on manifolds: a 3D rotation case study

    Romain Brégier. Deep regression on manifolds: a 3D rotation case study. In 2021 International Conference on 3D Vision (3DV), 2021

  4. [4]

    Can generative video models help pose estimation?

    Ruojin Cai, Jason Y Zhang, Philipp Henzler, Zhengqi Li, Noah Snavely, and Ricardo Martin-Brualla. Can generative video models help pose estimation? In CVPR, 2025

  5. [5]

    Using vanishing points for camera calibration

    Bruno Caprile and Vincent Torre. Using vanishing points for camera calibration. In IJCV, 1990

  6. [6]

    Easi3r: Estimating disentangled motion from dust3r without training

    Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. Easi3r: Estimating disentangled motion from dust3r without training. In ICCV, 2025

  7. [7]

    Ttt3r: 3d reconstruction as test-time training

    Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. Ttt3r: 3d reconstruction as test-time training. ICLR, 2026

  8. [8]

    VGGT-Long: Chunk it, Loop it, Align it -- Pushing VGGT's Limits on Kilometer-scale Long RGB Sequences

    Kai Deng, Zexin Ti, Jiawei Xu, Jian Yang, and Jin Xie. Vggt-long: Chunk it, loop it, align it – pushing vggt’s limits on kilometer-scale long rgb sequences. arXiv preprint arXiv:2507.16443, 2025

  9. [9]

    St4rtrack: Simultaneous 4d reconstruction and tracking in the world

    Haiwen Feng*, Junyi Zhang*, Qianqian Wang, Yufei Ye, Pengcheng Yu, Michael J. Black, Trevor Darrell, and Angjoo Kanazawa. St4rtrack: Simultaneous 4d reconstruction and tracking in the world. In ICCV, 2025

  10. [10]

    Manhattan-world stereo

    Yasutaka Furukawa, Brian Curless, Steven M. Seitz, and Richard Szeliski. Manhattan-world stereo. In CVPR, 2009

  11. [11]

    ZipMap: Linear-time stateful 3d reconstruction via test-time training

    Haian Jin, Rundi Wu, Tianyuan Zhang, Ruiqi Gao, Jonathan T. Barron, Noah Snavely, and Aleksander Holynski. ZipMap: Linear-time stateful 3d reconstruction via test-time training. In CVPR, 2026

  12. [12]

    Stereo4d: Learning how things move in 3d from internet stereo videos

    Linyi Jin, Richard Tucker, Zhengqi Li, David Fouhey, Noah Snavely, and Aleksander Holynski. Stereo4d: Learning how things move in 3d from internet stereo videos. In CVPR, 2025

  13. [13]

    Linyi Jin, Jianming Zhang, Yannick Hold-Geoffroy, Oliver Wang, Kevin Matzen, Matthew Sticha, and David F. Fouhey. Perspective fields for single image camera calibration. In CVPR, 2023

  14. [14]

    MapAnything: Universal feed-forward metric 3D reconstruction

    Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, Jonathon Luiten, Manuel Lopez-Antequera, Samuel Rota Bulò, Christian Richardt, Deva Ramanan, Sebastian Scherer, and Peter Kontschieder. MapAnything: Universal feed-forward metric 3D reconstruction....

  15. [15]

    gradslam: Dense slam meets automatic differentiation

    Jatavallabhula Krishna Murthy, Soroush Saryazdi, Ganesh Iyer, and Liam Paull. gradslam: Dense slam meets automatic differentiation. In IEEE International Conference on Robotics & Automation (ICRA), 2020

  16. [16]

    Closed-form solutions to minimal absolute pose problems with known vertical direction

    Zuzana Kukelova, Martin Bujnak, and Tomas Pajdla. Closed-form solutions to minimal absolute pose problems with known vertical direction. In ACCV, 2010

  17. [17]

    Canonical surface mapping via geometric cycle consistency

    Nilesh Kulkarni, Abhinav Gupta, and Shubham Tulsiani. Canonical surface mapping via geometric cycle consistency. In ICCV, 2019

  18. [18]

    Automatic upright adjustment of photographs with robust camera calibration

    Hyunjoon Lee, Eli Shechtman, Jue Wang, and Seungyong Lee. Automatic upright adjustment of photographs with robust camera calibration. In IEEE TPAMI, 2014

  19. [19]

    Grounding image matching in 3d with mast3r

    Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r. In ECCV, 2024

  20. [20]

    Megadepth: Learning single-view depth prediction from internet photos

    Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. In CVPR, 2018

  21. [21]

    Depth Anything 3: Recovering the Visual Space from Any Views

    Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y. Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: recovering the visual space from any views. arXiv preprint arXiv:2511.10647, 2025

  22. [22]

    Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision

    Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. In CVPR, pages 22160–22169, 2024

  23. [23]

    Vggt-slam: Dense rgb slam optimized on the SL(4) manifold

    Dominic Maggio, Hyungtae Lim, and Luca Carlone. Vggt-slam: Dense rgb slam optimized on the SL(4) manifold. In NeurIPS, 2025

  24. [24]

    Gravity-aligned rotation averaging with circular regression

    Linfei Pan, Marc Pollefeys, and Dániel Baráth. Gravity-aligned rotation averaging with circular regression. In ECCV, 2024

  25. [25]

    MP-SfM: Monocular Surface Priors for Robust Structure-from-Motion

    Zador Pataki, Paul-Edouard Sarlin, Johannes L. Schönberger, and Marc Pollefeys. MP-SfM: Monocular Surface Priors for Robust Structure-from-Motion. In CVPR, 2025

  26. [26]

    Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding

    Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M. Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In ICCV, 2021

  27. [27]

    Homography based egomotion estimation with a common direction

    Olivier Saurer, Pascal Vasseur, Rémi Boutteau, Cédric Demonceaux, Marc Pollefeys, and Friedrich Fraundorfer. Homography based egomotion estimation with a common direction. IEEE TPAMI, 2017

  28. [28]

    Structure-from-motion revisited

    Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016

  29. [29]

    Pixelwise view selection for unstructured multi-view stereo

    Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2016

  30. [30]

    A multi-view stereo benchmark with high-resolution images and multi-camera videos

    Thomas Schöps, Johannes L. Schönberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In CVPR, 2017

  31. [31]

    World-grounded human motion recovery via gravity-view coordinates

    Zehong Shen, Huaijin Pi, Yan Xia, Zhi Cen, Sida Peng, Zechen Hu, Hujun Bao, Ruizhen Hu, and Xiaowei Zhou. World-grounded human motion recovery via gravity-view coordinates. In SIGGRAPH Asia Conference Proceedings, 2024

  32. [32]

    Scene coordinate regression forests for camera relocalization in rgb-d images

    Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. Scene coordinate regression forests for camera relocalization in rgb-d images. In CVPR, June 2013

  33. [33]

    Jefferey A. Shufelt. Performance evaluation and analysis of vanishing point detection techniques. IEEE TPAMI, 21(3):282–288, 1999

  34. [34]

    A benchmark for the evaluation of rgb-d slam systems

    J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers. A benchmark for the evaluation of rgb-d slam systems. In Proc. of the Int. Conf. on Intelligent Robot Systems (IROS), 2012

  35. [35]

    Mv-dust3r+: Single-stage scene reconstruction from sparse views in 2 seconds

    Zhenggang Tang, Yuchen Fan, Dilin Wang, Hongyu Xu, Rakesh Ranjan, Alexander Schwing, and Zhicheng Yan. Mv-dust3r+: Single-stage scene reconstruction from sparse views in 2 seconds. In CVPR, 2025

  36. [36]

    Non-iterative approach for fast and accurate vanishing point detection

    Jean-Philippe Tardif. Non-iterative approach for fast and accurate vanishing point detection. In ICCV, 2009

  37. [37]

    Megascenes: Scene-level view synthesis at scale

    Joseph Tung, Gene Chou, Ruojin Cai, Guandao Yang, Kai Zhang, Gordon Wetzstein, Bharath Hariharan, and Noah Snavely. Megascenes: Scene-level view synthesis at scale. In ECCV, 2024

  38. [38]

    S. Umeyama. Least-squares estimation of transformation parameters between two point patterns. IEEE TPAMI, 1991

  39. [39]

    GeoCalib: Single-image Calibration with Geometric Optimization

    Alexander Veicht, Paul-Edouard Sarlin, Philipp Lindenberger, and Marc Pollefeys. GeoCalib: Single-image Calibration with Geometric Optimization. In ECCV, 2024

  40. [40]

    PyPose: A library for robot learning with physics-based optimization

    Chen Wang, Dasong Gao, Kuan Xu, Junyi Geng, Yaoyu Hu, Yuheng Qiu, Bowen Li, Fan Yang, Brady Moon, Abhinav Pandey, Aryan, Jiahe Xu, Tianhao Wu, Haonan He, Daning Huang, Zhongqiang Ren, Shibo Zhao, Taimeng Fu, Pranay Reddy, Xiao Lin, Wenshan Wang, Jingnan Shi, Rajat Talak, Kun Cao, Yi Du, Han Wang, Huai Yu, Shanzhao Wang, Siyu Chen, Ananth Kashyap, Rohan Ba...

  41. [41]

    He Wang, Srinath Sridhar, Jingwei Huang, Julien Valentin, Shuran Song, and Leonidas J. Guibas. Normalized object coordinate space for category-level 6d object pose and size estimation. In CVPR, 2019

  42. [42]

    3d reconstruction with spatial memory

    Hengyi Wang and Lourdes Agapito. 3d reconstruction with spatial memory. In International Conference on 3D Vision (3DV), 2025

  43. [43]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In CVPR, 2025

  44. [44]

    Continuous 3d perception model with persistent state

    Qianqian Wang*, Yifei Zhang*, Aleksander Holynski, Alexei A. Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. In CVPR, 2025

  45. [45]

    Dust3r: Geometric 3d vision made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In CVPR, 2024

  46. [46]

    Tartanair: A dataset to push the limits of visual slam

    Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu, Yuheng Qiu, Chen Wang, Yafei Hu, Ashish Kapoor, and Sebastian Scherer. Tartanair: A dataset to push the limits of visual slam. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020

  47. [47]

    π3: Permutation-equivariant visual geometry learning

    Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. π3: Permutation-equivariant visual geometry learning. ICLR, 2026

  48. [48]

    Changchang Wu, Sameer Agarwal, Brian Curless, and Steven M. Seitz. Schematic surface reconstruction. In CVPR, 2012

  49. [49]

    Uprightnet: geometry-aware camera orientation estimation from single images

    Wenqi Xian, Zhengqi Li, Matthew Fisher, Jonathan Eisenmann, Eli Shechtman, and Noah Snavely. Uprightnet: geometry-aware camera orientation estimation from single images. In ICCV, 2019

  50. [50]

    Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass

    Jianing Yang, Alexander Sax, Kevin J. Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass. In CVPR, 2025

  51. [51]

    Monst3r: A simple approach for estimating geometry in the presence of motion

    Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. Monst3r: A simple approach for estimating geometry in the presence of motion. ICLR, 2025

  52. [52]

    LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory

    Junyi Zhang, Charles Herrmann, Junhwa Hur, Chen Sun, Ming-Hsuan Yang, Forrester Cole, Trevor Darrell, and Deqing Sun. Loger: Long-context geometric reconstruction with hybrid memory. arXiv preprint arXiv:2603.03269, 2026.

Supplementary Material

A Additional qualitative results

We have attached additional qualitative results in Figure 6. We can see that ...