Recognition: 1 theorem link
· Lean TheoremHD-VGGT: High-Resolution Visual Geometry Transformer
Pith reviewed 2026-05-14 22:06 UTC · model grok-4.3
The pith
A dual-branch architecture lets visual geometry transformers handle high-resolution images efficiently for accurate 3D reconstruction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HD-VGGT employs a low-resolution branch to establish globally consistent coarse geometry and a high-resolution branch to add fine details via a learned feature upsampling module, while Feature Modulation suppresses unreliable tokens from visually ambiguous areas, allowing high-resolution 3D reconstruction at lower overall computational cost than direct full-resolution transformer processing.
What carries the argument
Dual-branch transformer where low-resolution coarse geometry guides high-resolution refinement through feature upsampling, paired with Feature Modulation to suppress unstable tokens.
If this is right
- High-resolution images and supervision become usable for feed-forward 3D reconstruction without quadratic growth in transformer costs.
- Global consistency from the low-resolution branch combines with improved local geometric detail.
- Unstable features in repetitive or low-texture regions are mitigated before they degrade the final output.
- The method supports larger collections of high-resolution views than prior single-pass approaches.
Where Pith is reading between the lines
- The same separation of coarse guidance and local refinement could apply to other dense vision tasks where full-resolution transformers are too expensive.
- Adding video-frame consistency checks might further stabilize results in dynamic environments.
- Direct measurement of token suppression rates on standard benchmarks would quantify how much Feature Modulation contributes to the quality gain.
Load-bearing premise
The coarse geometry from the low-resolution branch must be accurate enough to guide reliable refinements in the high-resolution branch.
What would settle it
If high-resolution refinement consistently fails to improve or worsens accuracy on scenes where low-resolution predictions miss fine structures, the dual-branch guidance would not hold.
read the original abstract
High-resolution imagery is essential for accurate 3D reconstruction, as many geometric details only emerge at fine spatial scales. Recent feed-forward approaches, such as the Visual Geometry Grounded Transformer (VGGT), have demonstrated the ability to infer scene geometry from large collections of images in a single forward pass. However, scaling these models to high-resolution inputs remains challenging: the number of tokens in transformer architectures grows rapidly with both image resolution and the number of views, leading to prohibitive computational and memory costs. Moreover, we observe that visually ambiguous regions, such as repetitive patterns, weak textures, or specular surfaces, often produce unstable feature tokens that degrade geometric inference, especially at higher resolutions. We introduce HD-VGGT, a dual-branch architecture for efficient and robust high-resolution 3D reconstruction. A low-resolution branch predicts a coarse, globally consistent geometry, while a high-resolution branch refines details via a learned feature upsampling module. To handle unstable tokens, we propose Feature Modulation, which suppresses unreliable features early in the transformer. HD-VGGT leverages high-resolution images and supervision without full-resolution transformer costs, achieving state-of-the-art reconstruction quality.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces HD-VGGT, a dual-branch transformer architecture for high-resolution 3D reconstruction from image collections. A low-resolution branch predicts coarse globally consistent geometry, while a high-resolution branch refines details through a learned feature upsampling module. Feature Modulation is proposed to suppress unstable tokens arising from ambiguous regions such as repetitive patterns, weak textures, or specular surfaces. The central claim is that this design achieves state-of-the-art reconstruction quality while avoiding the prohibitive costs of full-resolution transformer attention.
Significance. If the empirical claims are substantiated, the work would offer a practical route to scaling feed-forward visual geometry models to higher resolutions without quadratic compute growth, which is relevant for applications needing fine geometric detail from multi-view imagery.
major comments (3)
- [Abstract and §3] Abstract and §3: The assertion of state-of-the-art reconstruction quality is not accompanied by quantitative metrics, ablation tables, or error analysis in the provided description, leaving the central performance claim unsupported and unverifiable.
- [§4] §4 (Dual-branch design): The load-bearing assumption that coarse geometry from the low-resolution branch is sufficiently accurate to guide high-resolution refinement is not shown to hold in ambiguous regions; small misalignments could propagate through the learned upsampling and undermine both quality and the robustness of Feature Modulation.
- [Feature Modulation subsection] Feature Modulation subsection: No concrete demonstration is given that the modulation step reliably distinguishes unstable tokens from useful high-frequency signal rather than discarding the latter, which directly affects the claimed robustness at high resolution.
minor comments (2)
- [Notation] Notation for token counts and upsampling factors should be defined explicitly with respect to input resolution.
- [Figures] Figures illustrating the dual-branch flow would benefit from clearer labeling of the modulation operation and its placement relative to attention layers.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on HD-VGGT. We address each major comment below with clarifications from the full manuscript and have made targeted revisions to strengthen the presentation of results and analyses.
read point-by-point responses
-
Referee: [Abstract and §3] Abstract and §3: The assertion of state-of-the-art reconstruction quality is not accompanied by quantitative metrics, ablation tables, or error analysis in the provided description, leaving the central performance claim unsupported and unverifiable.
Authors: The full manuscript includes quantitative results in Section 5. Table 1 reports PSNR, SSIM, and absolute depth error on DTU and Tanks & Temples, showing consistent gains over VGGT and prior feed-forward methods. Ablations appear in Table 2 (§5.2) and error breakdowns for ambiguous regions are in the supplementary material. We have added explicit cross-references to these tables in the abstract and §3. revision: yes
-
Referee: [§4] §4 (Dual-branch design): The load-bearing assumption that coarse geometry from the low-resolution branch is sufficiently accurate to guide high-resolution refinement is not shown to hold in ambiguous regions; small misalignments could propagate through the learned upsampling and undermine both quality and the robustness of Feature Modulation.
Authors: We agree this assumption requires explicit validation. The revised §4.3 now includes a quantitative alignment study measuring reprojection error between low- and high-resolution branches on scenes with repetitive patterns and weak texture. Results indicate global consistency is preserved within 1-2 pixels, sufficient for the learned upsampler. New visualizations (Figure 4) and an ablation on misalignment sensitivity demonstrate that Feature Modulation limits error propagation. revision: yes
-
Referee: [Feature Modulation subsection] Feature Modulation subsection: No concrete demonstration is given that the modulation step reliably distinguishes unstable tokens from useful high-frequency signal rather than discarding the latter, which directly affects the claimed robustness at high resolution.
Authors: We have expanded the Feature Modulation subsection (§3.3) with a new ablation (Table 3) that reports per-token feature variance before/after modulation, separated by region type (specular, repetitive, textured). Modulation reduces variance by ~35% in unstable areas while high-frequency detail metrics (edge sharpness, local PSNR) remain comparable or improve. Qualitative results in Figure 5 confirm preservation of fine geometry in textured regions. revision: yes
Circularity Check
No significant circularity; claims rest on novel dual-branch architecture
full rationale
The paper proposes a new dual-branch design (low-res coarse geometry + high-res refinement with Feature Modulation) that is not derived from or equivalent to its inputs by construction. VGGT is cited as prior context for the base transformer but does not load-bear the central efficiency or quality claims; those follow from the added modules and training procedure. No equations reduce fitted parameters to predictions, no self-definitional loops, and no uniqueness theorems imported from overlapping prior work. The derivation chain is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- model weights and hyperparameters
axioms (1)
- domain assumption Low-resolution geometry provides sufficient guidance for high-resolution refinement
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, IndisputableMonolith/Foundation/AlexanderDuality.leanreality_from_one_distinction, alexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
A low-resolution branch predicts a coarse, globally consistent geometry, while a high-resolution branch refines details via a learned feature upsampling module... Feature Modulation, which suppresses unreliable features early in the transformer.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Structure-from-motion revisited
Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4104–4113, 2016
work page 2016
-
[2]
Pixelwise view selection for unstructured multi-view stereo
Johannes L Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. InEuropean conference on computer vision, pages 501–518. Springer, 2016
work page 2016
-
[3]
Vggt: Visual geometry grounded transformer
Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025
work page 2025
-
[4]
Map-free visual relocalization: Metric pose relative to a single image
Eduardo Arnold, Jamie Wynn, Sara Vicente, Guillermo Garcia-Hernando, Aron Monszpart, Victor Prisacariu, Daniyar Turmukhambetov, and Eric Brachmann. Map-free visual relocalization: Metric pose relative to a single image. In European Conference on Computer Vision, pages 690–708. Springer, 2022
work page 2022
-
[5]
Qi Zhu, Jiangwei Lao, Deyi Ji, Junwei Luo, Kang Wu, Yingying Zhang, Lixiang Ru, Jian Wang, Jingdong Chen, Ming Yang, et al. Skysense-o: Towards open-world remote sensing interpretation with vision-centric visual-language modeling. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025
work page 2025
-
[6]
Ibd: Alleviating hallucinations in large vision-language models via image-biased decoding
Lanyun Zhu, Deyi Ji, Tianrun Chen, Peng Xu, Jieping Ye, and Jun Liu. Ibd: Alleviating hallucinations in large vision-language models via image-biased decoding. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
work page 2024
-
[7]
You Shen, Zhipeng Zhang, Yansong Qu, Xiawu Zheng, Jiayi Ji, Shengchuan Zhang, and Liujuan Cao. Fastvggt: Training-free acceleration of visual geometry transformer.arXiv preprint arXiv:2509.02560, 2025
-
[8]
Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction
Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction. InProceedings of the IEEE/CVF international conference on computer vision, pages 10901–10911, 2021
work page 2021
-
[9]
Lanyun Zhu, Tianrun Chen, Jianxiong Yin, Simon See, De Wen Soh, and Jun Liu. Replay master: Automatic sample selection and effective memory utilization for continual semantic segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025
work page 2025
-
[10]
Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. pi3: Scalable permutation-equivariant visual geometry learning.arXiv e-prints, pages arXiv–2507, 2025
work page 2025
-
[11]
CPCF: A cross-prompt contrastive framework for referring multimodal large language models
Lanyun Zhu, Deyi Ji, Tianrun Chen, Haiyang Wu, De Wen Soh, and Jun Liu. CPCF: A cross-prompt contrastive framework for referring multimodal large language models. InForty-secondInternational Conference on Machine Learning, 2025
work page 2025
-
[12]
Deyi Ji, Lanyun Zhu, Siqi Gao, Qi Zhu, Yiru Zhao, Peng Xu, Yue Ding, Hongtao Lu, Jieping Ye, Feng Wu, et al. View-centric multi-object tracking with homographic matching in moving uav.IEEE Transactionson Geoscience and Remote Sensing, 2026
work page 2026
-
[13]
Dust3r: Geometric 3d vision made easy
Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20697–20709, 2024
work page 2024
-
[14]
Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli
Jianing Yang, Alexander Sax, Kevin J. Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2025
work page 2025
-
[15]
3d reconstruction with spatial memory.arXiv preprint arXiv:2408.16061, 2024
Hengyi Wang and Lourdes Agapito. 3d reconstruction with spatial memory.arXiv preprint arXiv:2408.16061, 2024
-
[16]
Dens3r: A foundation model for 3d geometry prediction.arXiv preprint arXiv:2507.16290, 2025
Xianze Fang, Jingnan Gao, Zhe Wang, Zhuo Chen, Xingyu Ren, Jiangjing Lyu, Qiaomu Ren, Zhonglei Yang, Xiaokang Yang, Yichao Yan, et al. Dens3r: A foundation model for 3d geometry prediction.arXiv preprint arXiv:2507.16290, 2025
-
[17]
Continuous 3d perception model with persistent state.arXiv preprint arXiv:2501.12387, 2025
Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state.arXiv preprint arXiv:2501.12387, 2025. 12
-
[18]
Flare: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse views, 2026
Shangzhan Zhang, Jianyuan Wang, Yinghao Xu, Nan Xue, Christian Rupprecht, Xiaowei Zhou, Yujun Shen, and Gordon Wetzstein. Flare: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse views, 2026
work page 2026
-
[19]
Anysplat: Feed-forward 3d gaussian splatting from unconstrained views, 2025
Lihan Jiang, Yucheng Mao, Linning Xu, Tao Lu, Kerui Ren, Yichen Jin, Xudong Xu, Mulin Yu, Jiangmiao Pang, Feng Zhao, Dahua Lin, and Bo Dai. Anysplat: Feed-forward 3d gaussian splatting from unconstrained views, 2025
work page 2025
-
[20]
Swin transformer: Hierarchical vision transformer using shifted windows
Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. InProceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021
work page 2021
-
[21]
Discrete latent perspective learning for segmentation and detection
Deyi Ji, Feng Zhao, Lanyun Zhu, Wenwei Jin, Hongtao Lu, and Jieping Ye. Discrete latent perspective learning for segmentation and detection. InInternational Conference on Machine Learning, pages 21719–21730, 2024
work page 2024
-
[22]
Qi Zhu, Jiangwei Lao, Deyi Ji, Junwei Luo, Kang Wu, Yingying Zhang, Lixiang Ru, Jian Wang, Jingdong Chen, Ming Yang, et al. Skysense-o: Towards open-world remote sensing interpretation with vision-centric visual- language modeling. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14733–14744, 2025
work page 2025
-
[23]
Ultra-high resolution segmentation with ultra-rich context: A novel benchmark
Deyi Ji, Feng Zhao, Hongtao Lu, Mingyuan Tao, and Jieping Ye. Ultra-high resolution segmentation with ultra-rich context: A novel benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23621–23630, 2023
work page 2023
-
[24]
Structural and statistical texture knowledge distillation for semantic segmentation
Deyi Ji, Haoran Wang, Mingyuan Tao, Jianqiang Huang, Xian-Sheng Hua, and Hongtao Lu. Structural and statistical texture knowledge distillation for semantic segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16876–16885, 2022
work page 2022
-
[25]
Popen: Preference-based optimization and ensemble for lvlm-based reasoning segmentation
Lanyun Zhu, Tianrun Chen, Qianxiong Xu, Xuanyi Liu, Deyi Ji, Haiyang Wu, De Wen Soh, and Jun Liu. Popen: Preference-based optimization and ensemble for lvlm-based reasoning segmentation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 30231–30240, 2025
work page 2025
-
[26]
Deyi Ji, Feng Zhao, Hongtao Lu, Feng Wu, and Jieping Ye. Structural and statistical texture knowledge distillation and learning for segmentation.IEEE Transactionson PatternAnalysis and MachineIntelligence, 47(5):3639–3656, 2025
work page 2025
-
[27]
Lanyun Zhu, Tianrun Chen, Deyi Ji, Jieping Ye, and Jun Liu. Not every patch is needed: Towards a more efficient and effective backbone for video-based person re-identification.arXiv preprint arXiv:2501.16811, 2025
-
[28]
Pptformer: Pseudo multi-perspective transformer for uav segmentation
Deyi Ji, Wenwei Jin, Hongtao Lu, and Feng Zhao. Pptformer: Pseudo multi-perspective transformer for uav segmentation. arXiv preprint arXiv:2406.19632, 2024
-
[29]
Learning statistical texture for semantic segmentation
Lanyun Zhu, Deyi Ji, Shiping Zhu, Weihao Gan, Wei Wu, and Junjie Yan. Learning statistical texture for semantic segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12537–12546, 2021
work page 2021
-
[30]
Deyi Ji, Feng Zhao, and Hongtao Lu. Guided patch-grouping wavelet transformer with spatial congruence for ultra-high resolution segmentation.arXiv preprint arXiv:2307.00711, 2023
-
[31]
Llafs: When large language models meet few-shot segmentation
Lanyun Zhu, Tianrun Chen, Deyi Ji, Jieping Ye, and Jun Liu. Llafs: When large language models meet few-shot segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3065–3075, 2024
work page 2024
-
[32]
Context-aware graph convolution network for target re-identification
Deyi Ji, Haoran Wang, Hanzhe Hu, Weihao Gan, Wei Wu, and Junjie Yan. Context-aware graph convolution network for target re-identification. InProceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 1646–1654, 2021
work page 2021
-
[33]
Lanyun Zhu, Tianrun Chen, Deyi Ji, Peng Xu, Jieping Ye, and Jun Liu. Llafs++: Few-shot image segmentation with large language models.IEEE Transactionson Pattern Analysis and Machine Intelligence, 2025
work page 2025
-
[34]
Efficient transformers: A survey.ACM Computing Surveys, 55(6):1–28, 2022
Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey.ACM Computing Surveys, 55(6):1–28, 2022
work page 2022
-
[35]
Pyramid vision transformer: A versatile backbone for dense prediction without convolutions
Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. InProceedings of the IEEE/CVF international conference on computer vision, pages 568–578, 2021. 13
work page 2021
-
[36]
DINOv2: Learning Robust Visual Features without Supervision
Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[37]
AnyUp: Universal Feature Upsampling.arXiv preprint arXiv:2510.12764, 2025
Thomas Wimmer, Prune Truong, Marie-Julie Rakotosaona, Michael Oechsle, Federico Tombari, Bernt Schiele, and Jan Eric Lenssen. AnyUp: Universal Feature Upsampling.arXiv preprint arXiv:2510.12764, 2025
-
[38]
Yu Hu, Chong Cheng, Sicheng Yu, Xiaoyang Guo, and Hao Wang. Vggt4d: Mining motion cues in visual geometry transformers for 4d scene reconstruction.arXiv preprint arXiv:2511.19971, 2025
-
[39]
Stereo Magnification: Learning View Synthesis using Multiplane Images
Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images.arXiv preprint arXiv:1805.09817, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[40]
D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In European Conf. on Computer Vision (ECCV), pages 611–625, 2012
work page 2012
- [41]
-
[42]
Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction, 2021
work page 2021
-
[43]
Scene coordinate and correspondence learning for image-based localization, 2018
Mai Bui, Shadi Albarqouni, Slobodan Ilic, and Nassir Navab. Scene coordinate and correspondence learning for image-based localization, 2018
work page 2018
-
[44]
Goldman, Matthias Nießner, and Justus Thies
Dunja Azinović, Ricardo Martin-Brualla, Dan B. Goldman, Matthias Nießner, and Justus Thies. Neural rgb-d surface reconstruction. InCVPR, 2022
work page 2022
-
[45]
Large scale multi-view stereopsis evaluation
Rasmus Jensen, Anders Dahl, Henrik Aanaes, and Vedrana Andersen Dahl. Large scale multi-view stereopsis evaluation. InCVPR, 2014
work page 2014
-
[46]
Deepmvs: Learning multi-view stereopsis
Po-Han Huang, Kevin Matzen, Johannes Kopf, Narendra Ahuja, and Jia-Bin Huang. Deepmvs: Learning multi-view stereopsis. InIEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018
work page 2018
-
[47]
Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner
Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. InCVPR, 2017
work page 2017
-
[48]
Indoor segmentation and support inference from rgbd images
Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. InECCV, 2012
work page 2012
-
[49]
Yifan Liu, Zhiyuan Min, Zhenwei Wang, Junta Wu, Tengfei Wang, Yixuan Yuan, Yawei Luo, and Chunchao Guo. Worldmirror: Universal 3d world reconstruction with any-prior prompting.arXiv preprint arXiv:2510.10726, 2025
-
[50]
Grounding image matching in 3d with mast3r, 2024
Vincent Leroy, Yohann Cabon, and Jerome Revaud. Grounding image matching in 3d with mast3r, 2024. 14
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.