arxiv: 2603.27222 · v2 · submitted 2026-03-28 · 💻 cs.CV

Recognition: 1 theorem link

· Lean Theorem

HD-VGGT: High-Resolution Visual Geometry Transformer

Tianrun Chen , Yuanqi Hu , Yidong Han , Hanjie Xu , Deyi Ji , Qi Zhu , Chunan Yu , Xin Zhang

show 6 more authors

Cheng Chen Chaotao Ding Ying Zang Xuanfu Li Jin Ma Lanyun Zhu

Authors on Pith no claims yet

Pith reviewed 2026-05-14 22:06 UTC · model grok-4.3

classification 💻 cs.CV

keywords 3D reconstructionhigh-resolution imageryvisual geometry transformerdual-branch architecturefeature modulationfeed-forward modelscene geometry

0 comments

The pith

A dual-branch architecture lets visual geometry transformers handle high-resolution images efficiently for accurate 3D reconstruction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

High-resolution images contain fine geometric details essential for precise 3D scene reconstruction, yet transformer token counts grow rapidly with resolution and view count, creating prohibitive costs. The paper presents HD-VGGT as a dual-branch model that first predicts coarse global geometry in a low-resolution branch and then refines local details in a high-resolution branch through learned feature upsampling. Feature Modulation is added to suppress unstable tokens from ambiguous regions such as repetitive patterns or specular surfaces early in processing. This structure enables the use of high-resolution inputs and supervision without running the full transformer at native resolution, yielding improved reconstruction accuracy.

Core claim

HD-VGGT employs a low-resolution branch to establish globally consistent coarse geometry and a high-resolution branch to add fine details via a learned feature upsampling module, while Feature Modulation suppresses unreliable tokens from visually ambiguous areas, allowing high-resolution 3D reconstruction at lower overall computational cost than direct full-resolution transformer processing.

What carries the argument

Dual-branch transformer where low-resolution coarse geometry guides high-resolution refinement through feature upsampling, paired with Feature Modulation to suppress unstable tokens.

If this is right

High-resolution images and supervision become usable for feed-forward 3D reconstruction without quadratic growth in transformer costs.
Global consistency from the low-resolution branch combines with improved local geometric detail.
Unstable features in repetitive or low-texture regions are mitigated before they degrade the final output.
The method supports larger collections of high-resolution views than prior single-pass approaches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same separation of coarse guidance and local refinement could apply to other dense vision tasks where full-resolution transformers are too expensive.
Adding video-frame consistency checks might further stabilize results in dynamic environments.
Direct measurement of token suppression rates on standard benchmarks would quantify how much Feature Modulation contributes to the quality gain.

Load-bearing premise

The coarse geometry from the low-resolution branch must be accurate enough to guide reliable refinements in the high-resolution branch.

What would settle it

If high-resolution refinement consistently fails to improve or worsens accuracy on scenes where low-resolution predictions miss fine structures, the dual-branch guidance would not hold.

read the original abstract

High-resolution imagery is essential for accurate 3D reconstruction, as many geometric details only emerge at fine spatial scales. Recent feed-forward approaches, such as the Visual Geometry Grounded Transformer (VGGT), have demonstrated the ability to infer scene geometry from large collections of images in a single forward pass. However, scaling these models to high-resolution inputs remains challenging: the number of tokens in transformer architectures grows rapidly with both image resolution and the number of views, leading to prohibitive computational and memory costs. Moreover, we observe that visually ambiguous regions, such as repetitive patterns, weak textures, or specular surfaces, often produce unstable feature tokens that degrade geometric inference, especially at higher resolutions. We introduce HD-VGGT, a dual-branch architecture for efficient and robust high-resolution 3D reconstruction. A low-resolution branch predicts a coarse, globally consistent geometry, while a high-resolution branch refines details via a learned feature upsampling module. To handle unstable tokens, we propose Feature Modulation, which suppresses unreliable features early in the transformer. HD-VGGT leverages high-resolution images and supervision without full-resolution transformer costs, achieving state-of-the-art reconstruction quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

HD-VGGT adds a low-res branch plus feature modulation to VGGT so high-res images can be used without full transformer cost, but the approach still depends on the coarse predictions staying reliable enough to guide refinement.

read the letter

The main thing here is that HD-VGGT splits the work into a low-resolution branch that produces coarse geometry and a high-resolution branch that refines details through learned upsampling, while Feature Modulation drops unstable tokens from ambiguous regions early. This keeps the model from paying quadratic attention costs at full resolution and still pulls supervision from the fine images. It is a direct extension of the cited VGGT baseline rather than a complete redesign.

Referee Report

3 major / 2 minor

Summary. The paper introduces HD-VGGT, a dual-branch transformer architecture for high-resolution 3D reconstruction from image collections. A low-resolution branch predicts coarse globally consistent geometry, while a high-resolution branch refines details through a learned feature upsampling module. Feature Modulation is proposed to suppress unstable tokens arising from ambiguous regions such as repetitive patterns, weak textures, or specular surfaces. The central claim is that this design achieves state-of-the-art reconstruction quality while avoiding the prohibitive costs of full-resolution transformer attention.

Significance. If the empirical claims are substantiated, the work would offer a practical route to scaling feed-forward visual geometry models to higher resolutions without quadratic compute growth, which is relevant for applications needing fine geometric detail from multi-view imagery.

major comments (3)

[Abstract and §3] Abstract and §3: The assertion of state-of-the-art reconstruction quality is not accompanied by quantitative metrics, ablation tables, or error analysis in the provided description, leaving the central performance claim unsupported and unverifiable.
[§4] §4 (Dual-branch design): The load-bearing assumption that coarse geometry from the low-resolution branch is sufficiently accurate to guide high-resolution refinement is not shown to hold in ambiguous regions; small misalignments could propagate through the learned upsampling and undermine both quality and the robustness of Feature Modulation.
[Feature Modulation subsection] Feature Modulation subsection: No concrete demonstration is given that the modulation step reliably distinguishes unstable tokens from useful high-frequency signal rather than discarding the latter, which directly affects the claimed robustness at high resolution.

minor comments (2)

[Notation] Notation for token counts and upsampling factors should be defined explicitly with respect to input resolution.
[Figures] Figures illustrating the dual-branch flow would benefit from clearer labeling of the modulation operation and its placement relative to attention layers.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on HD-VGGT. We address each major comment below with clarifications from the full manuscript and have made targeted revisions to strengthen the presentation of results and analyses.

read point-by-point responses

Referee: [Abstract and §3] Abstract and §3: The assertion of state-of-the-art reconstruction quality is not accompanied by quantitative metrics, ablation tables, or error analysis in the provided description, leaving the central performance claim unsupported and unverifiable.

Authors: The full manuscript includes quantitative results in Section 5. Table 1 reports PSNR, SSIM, and absolute depth error on DTU and Tanks & Temples, showing consistent gains over VGGT and prior feed-forward methods. Ablations appear in Table 2 (§5.2) and error breakdowns for ambiguous regions are in the supplementary material. We have added explicit cross-references to these tables in the abstract and §3. revision: yes
Referee: [§4] §4 (Dual-branch design): The load-bearing assumption that coarse geometry from the low-resolution branch is sufficiently accurate to guide high-resolution refinement is not shown to hold in ambiguous regions; small misalignments could propagate through the learned upsampling and undermine both quality and the robustness of Feature Modulation.

Authors: We agree this assumption requires explicit validation. The revised §4.3 now includes a quantitative alignment study measuring reprojection error between low- and high-resolution branches on scenes with repetitive patterns and weak texture. Results indicate global consistency is preserved within 1-2 pixels, sufficient for the learned upsampler. New visualizations (Figure 4) and an ablation on misalignment sensitivity demonstrate that Feature Modulation limits error propagation. revision: yes
Referee: [Feature Modulation subsection] Feature Modulation subsection: No concrete demonstration is given that the modulation step reliably distinguishes unstable tokens from useful high-frequency signal rather than discarding the latter, which directly affects the claimed robustness at high resolution.

Authors: We have expanded the Feature Modulation subsection (§3.3) with a new ablation (Table 3) that reports per-token feature variance before/after modulation, separated by region type (specular, repetitive, textured). Modulation reduces variance by ~35% in unstable areas while high-frequency detail metrics (edge sharpness, local PSNR) remain comparable or improve. Qualitative results in Figure 5 confirm preservation of fine geometry in textured regions. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on novel dual-branch architecture

full rationale

The paper proposes a new dual-branch design (low-res coarse geometry + high-res refinement with Feature Modulation) that is not derived from or equivalent to its inputs by construction. VGGT is cited as prior context for the base transformer but does not load-bear the central efficiency or quality claims; those follow from the added modules and training procedure. No equations reduce fitted parameters to predictions, no self-definitional loops, and no uniqueness theorems imported from overlapping prior work. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The approach relies on standard transformer scaling assumptions and the empirical effectiveness of dual-branch designs; no new physical entities are postulated.

free parameters (1)

model weights and hyperparameters
All neural network parameters are fitted during training on 3D reconstruction datasets.

axioms (1)

domain assumption Low-resolution geometry provides sufficient guidance for high-resolution refinement
Invoked to justify the dual-branch split without full-resolution transformer cost.

pith-pipeline@v0.9.0 · 5540 in / 1140 out tokens · 44891 ms · 2026-05-14T22:06:14.681812+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, IndisputableMonolith/Foundation/AlexanderDuality.lean reality_from_one_distinction, alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

A low-resolution branch predicts a coarse, globally consistent geometry, while a high-resolution branch refines details via a learned feature upsampling module... Feature Modulation, which suppresses unreliable features early in the transformer.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 2 internal anchors

[1]

Structure-from-motion revisited

Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4104–4113, 2016

work page 2016
[2]

Pixelwise view selection for unstructured multi-view stereo

Johannes L Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. InEuropean conference on computer vision, pages 501–518. Springer, 2016

work page 2016
[3]

Vggt: Visual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025

work page 2025
[4]

Map-free visual relocalization: Metric pose relative to a single image

Eduardo Arnold, Jamie Wynn, Sara Vicente, Guillermo Garcia-Hernando, Aron Monszpart, Victor Prisacariu, Daniyar Turmukhambetov, and Eric Brachmann. Map-free visual relocalization: Metric pose relative to a single image. In European Conference on Computer Vision, pages 690–708. Springer, 2022

work page 2022
[5]

Skysense-o: Towards open-world remote sensing interpretation with vision-centric visual-language modeling

Qi Zhu, Jiangwei Lao, Deyi Ji, Junwei Luo, Kang Wu, Yingying Zhang, Lixiang Ru, Jian Wang, Jingdong Chen, Ming Yang, et al. Skysense-o: Towards open-world remote sensing interpretation with vision-centric visual-language modeling. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

work page 2025
[6]

Ibd: Alleviating hallucinations in large vision-language models via image-biased decoding

Lanyun Zhu, Deyi Ji, Tianrun Chen, Peng Xu, Jieping Ye, and Jun Liu. Ibd: Alleviating hallucinations in large vision-language models via image-biased decoding. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

work page 2024
[7]

Fastvggt: Training-free acceleration of visual geometry transformer.arXiv preprint arXiv:2509.02560, 2025

You Shen, Zhipeng Zhang, Yansong Qu, Xiawu Zheng, Jiayi Ji, Shengchuan Zhang, and Liujuan Cao. Fastvggt: Training-free acceleration of visual geometry transformer.arXiv preprint arXiv:2509.02560, 2025

work page arXiv 2025
[8]

Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction

Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction. InProceedings of the IEEE/CVF international conference on computer vision, pages 10901–10911, 2021

work page 2021
[9]

Replay master: Automatic sample selection and effective memory utilization for continual semantic segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

Lanyun Zhu, Tianrun Chen, Jianxiong Yin, Simon See, De Wen Soh, and Jun Liu. Replay master: Automatic sample selection and effective memory utilization for continual semantic segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

work page 2025
[10]

pi3: Scalable permutation-equivariant visual geometry learning.arXiv e-prints, pages arXiv–2507, 2025

Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. pi3: Scalable permutation-equivariant visual geometry learning.arXiv e-prints, pages arXiv–2507, 2025

work page 2025
[11]

CPCF: A cross-prompt contrastive framework for referring multimodal large language models

Lanyun Zhu, Deyi Ji, Tianrun Chen, Haiyang Wu, De Wen Soh, and Jun Liu. CPCF: A cross-prompt contrastive framework for referring multimodal large language models. InForty-secondInternational Conference on Machine Learning, 2025

work page 2025
[12]

View-centric multi-object tracking with homographic matching in moving uav.IEEE Transactionson Geoscience and Remote Sensing, 2026

Deyi Ji, Lanyun Zhu, Siqi Gao, Qi Zhu, Yiru Zhao, Peng Xu, Yue Ding, Hongtao Lu, Jieping Ye, Feng Wu, et al. View-centric multi-object tracking with homographic matching in moving uav.IEEE Transactionson Geoscience and Remote Sensing, 2026

work page 2026
[13]

Dust3r: Geometric 3d vision made easy

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20697–20709, 2024

work page 2024
[14]

Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli

Jianing Yang, Alexander Sax, Kevin J. Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2025

work page 2025
[15]

3d reconstruction with spatial memory.arXiv preprint arXiv:2408.16061, 2024

Hengyi Wang and Lourdes Agapito. 3d reconstruction with spatial memory.arXiv preprint arXiv:2408.16061, 2024

work page arXiv 2024
[16]

Dens3r: A foundation model for 3d geometry prediction.arXiv preprint arXiv:2507.16290, 2025

Xianze Fang, Jingnan Gao, Zhe Wang, Zhuo Chen, Xingyu Ren, Jiangjing Lyu, Qiaomu Ren, Zhonglei Yang, Xiaokang Yang, Yichao Yan, et al. Dens3r: A foundation model for 3d geometry prediction.arXiv preprint arXiv:2507.16290, 2025

work page arXiv 2025
[17]

Continuous 3d perception model with persistent state.arXiv preprint arXiv:2501.12387, 2025

Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state.arXiv preprint arXiv:2501.12387, 2025. 12

work page arXiv 2025
[18]

Flare: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse views, 2026

Shangzhan Zhang, Jianyuan Wang, Yinghao Xu, Nan Xue, Christian Rupprecht, Xiaowei Zhou, Yujun Shen, and Gordon Wetzstein. Flare: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse views, 2026

work page 2026
[19]

Anysplat: Feed-forward 3d gaussian splatting from unconstrained views, 2025

Lihan Jiang, Yucheng Mao, Linning Xu, Tao Lu, Kerui Ren, Yichen Jin, Xudong Xu, Mulin Yu, Jiangmiao Pang, Feng Zhao, Dahua Lin, and Bo Dai. Anysplat: Feed-forward 3d gaussian splatting from unconstrained views, 2025

work page 2025
[20]

Swin transformer: Hierarchical vision transformer using shifted windows

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. InProceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021

work page 2021
[21]

Discrete latent perspective learning for segmentation and detection

Deyi Ji, Feng Zhao, Lanyun Zhu, Wenwei Jin, Hongtao Lu, and Jieping Ye. Discrete latent perspective learning for segmentation and detection. InInternational Conference on Machine Learning, pages 21719–21730, 2024

work page 2024
[22]

Skysense-o: Towards open-world remote sensing interpretation with vision-centric visual- language modeling

Qi Zhu, Jiangwei Lao, Deyi Ji, Junwei Luo, Kang Wu, Yingying Zhang, Lixiang Ru, Jian Wang, Jingdong Chen, Ming Yang, et al. Skysense-o: Towards open-world remote sensing interpretation with vision-centric visual- language modeling. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14733–14744, 2025

work page 2025
[23]

Ultra-high resolution segmentation with ultra-rich context: A novel benchmark

Deyi Ji, Feng Zhao, Hongtao Lu, Mingyuan Tao, and Jieping Ye. Ultra-high resolution segmentation with ultra-rich context: A novel benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23621–23630, 2023

work page 2023
[24]

Structural and statistical texture knowledge distillation for semantic segmentation

Deyi Ji, Haoran Wang, Mingyuan Tao, Jianqiang Huang, Xian-Sheng Hua, and Hongtao Lu. Structural and statistical texture knowledge distillation for semantic segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16876–16885, 2022

work page 2022
[25]

Popen: Preference-based optimization and ensemble for lvlm-based reasoning segmentation

Lanyun Zhu, Tianrun Chen, Qianxiong Xu, Xuanyi Liu, Deyi Ji, Haiyang Wu, De Wen Soh, and Jun Liu. Popen: Preference-based optimization and ensemble for lvlm-based reasoning segmentation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 30231–30240, 2025

work page 2025
[26]

Structural and statistical texture knowledge distillation and learning for segmentation.IEEE Transactionson PatternAnalysis and MachineIntelligence, 47(5):3639–3656, 2025

Deyi Ji, Feng Zhao, Hongtao Lu, Feng Wu, and Jieping Ye. Structural and statistical texture knowledge distillation and learning for segmentation.IEEE Transactionson PatternAnalysis and MachineIntelligence, 47(5):3639–3656, 2025

work page 2025
[27]

Not every patch is needed: Towards a more efficient and effective backbone for video-based person re-identification.arXiv preprint arXiv:2501.16811, 2025

Lanyun Zhu, Tianrun Chen, Deyi Ji, Jieping Ye, and Jun Liu. Not every patch is needed: Towards a more efficient and effective backbone for video-based person re-identification.arXiv preprint arXiv:2501.16811, 2025

work page arXiv 2025
[28]

Pptformer: Pseudo multi-perspective transformer for uav segmentation

Deyi Ji, Wenwei Jin, Hongtao Lu, and Feng Zhao. Pptformer: Pseudo multi-perspective transformer for uav segmentation. arXiv preprint arXiv:2406.19632, 2024

work page arXiv 2024
[29]

Learning statistical texture for semantic segmentation

Lanyun Zhu, Deyi Ji, Shiping Zhu, Weihao Gan, Wei Wu, and Junjie Yan. Learning statistical texture for semantic segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12537–12546, 2021

work page 2021
[30]

Guided patch-grouping wavelet transformer with spatial congruence for ultra-high resolution segmentation.arXiv preprint arXiv:2307.00711, 2023

Deyi Ji, Feng Zhao, and Hongtao Lu. Guided patch-grouping wavelet transformer with spatial congruence for ultra-high resolution segmentation.arXiv preprint arXiv:2307.00711, 2023

work page arXiv 2023
[31]

Llafs: When large language models meet few-shot segmentation

Lanyun Zhu, Tianrun Chen, Deyi Ji, Jieping Ye, and Jun Liu. Llafs: When large language models meet few-shot segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3065–3075, 2024

work page 2024
[32]

Context-aware graph convolution network for target re-identification

Deyi Ji, Haoran Wang, Hanzhe Hu, Weihao Gan, Wei Wu, and Junjie Yan. Context-aware graph convolution network for target re-identification. InProceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 1646–1654, 2021

work page 2021
[33]

Llafs++: Few-shot image segmentation with large language models.IEEE Transactionson Pattern Analysis and Machine Intelligence, 2025

Lanyun Zhu, Tianrun Chen, Deyi Ji, Peng Xu, Jieping Ye, and Jun Liu. Llafs++: Few-shot image segmentation with large language models.IEEE Transactionson Pattern Analysis and Machine Intelligence, 2025

work page 2025
[34]

Efficient transformers: A survey.ACM Computing Surveys, 55(6):1–28, 2022

Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey.ACM Computing Surveys, 55(6):1–28, 2022

work page 2022
[35]

Pyramid vision transformer: A versatile backbone for dense prediction without convolutions

Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. InProceedings of the IEEE/CVF international conference on computer vision, pages 568–578, 2021. 13

work page 2021
[36]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[37]

AnyUp: Universal Feature Upsampling.arXiv preprint arXiv:2510.12764, 2025

Thomas Wimmer, Prune Truong, Marie-Julie Rakotosaona, Michael Oechsle, Federico Tombari, Bernt Schiele, and Jan Eric Lenssen. AnyUp: Universal Feature Upsampling.arXiv preprint arXiv:2510.12764, 2025

work page arXiv 2025
[38]

Vggt4d: Mining motion cues in visual geometry transformers for 4d scene reconstruction.arXiv preprint arXiv:2511.19971, 2025

Yu Hu, Chong Cheng, Sicheng Yu, Xiaoyang Guo, and Hao Wang. Vggt4d: Mining motion cues in visual geometry transformers for 4d scene reconstruction.arXiv preprint arXiv:2511.19971, 2025

work page arXiv 2025
[39]

Stereo Magnification: Learning View Synthesis using Multiplane Images

Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images.arXiv preprint arXiv:1805.09817, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[40]

D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In European Conf. on Computer Vision (ECCV), pages 611–625, 2012

work page 2012
[41]

Sturm, N

J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers. A benchmark for the evaluation of rgb-d slam systems. In Proc. of the International Conference on Intelligent Robot Systems (IROS), Oct. 2012

work page 2012
[42]

Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction, 2021

Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction, 2021

work page 2021
[43]

Scene coordinate and correspondence learning for image-based localization, 2018

Mai Bui, Shadi Albarqouni, Slobodan Ilic, and Nassir Navab. Scene coordinate and correspondence learning for image-based localization, 2018

work page 2018
[44]

Goldman, Matthias Nießner, and Justus Thies

Dunja Azinović, Ricardo Martin-Brualla, Dan B. Goldman, Matthias Nießner, and Justus Thies. Neural rgb-d surface reconstruction. InCVPR, 2022

work page 2022
[45]

Large scale multi-view stereopsis evaluation

Rasmus Jensen, Anders Dahl, Henrik Aanaes, and Vedrana Andersen Dahl. Large scale multi-view stereopsis evaluation. InCVPR, 2014

work page 2014
[46]

Deepmvs: Learning multi-view stereopsis

Po-Han Huang, Kevin Matzen, Johannes Kopf, Narendra Ahuja, and Jia-Bin Huang. Deepmvs: Learning multi-view stereopsis. InIEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018

work page 2018
[47]

Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner

Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. InCVPR, 2017

work page 2017
[48]

Indoor segmentation and support inference from rgbd images

Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. InECCV, 2012

work page 2012
[49]

Worldmirror: Universal 3d world reconstruction with any-prior prompting.arXiv preprint arXiv:2510.10726, 2025

Yifan Liu, Zhiyuan Min, Zhenwei Wang, Junta Wu, Tengfei Wang, Yixuan Yuan, Yawei Luo, and Chunchao Guo. Worldmirror: Universal 3d world reconstruction with any-prior prompting.arXiv preprint arXiv:2510.10726, 2025

work page arXiv 2025
[50]

Grounding image matching in 3d with mast3r, 2024

Vincent Leroy, Yohann Cabon, and Jerome Revaud. Grounding image matching in 3d with mast3r, 2024. 14

work page 2024