Recognition: 2 theorem links · Lean theorems
Attention Itself Could Retrieve. RetrieveVGGT: Training-Free Long Context Streaming 3D Reconstruction via Query-Key Similarity Retrieval
Pith reviewed 2026-05-12 04:30 UTC · model grok-4.3
The pith
RetrieveVGGT uses query-key similarity to retrieve relevant frames and keep memory constant during long-context streaming 3D reconstruction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By casting context construction as retrieval, RetrieveVGGT selects a fixed number of relevant history frames at each step using the similarity between the current frame's queries and the cached keys from VGGT's first global attention layer. This similarity alone serves as a strong relevance signal, so no separate learned retriever is needed. Segment Sampling spreads the selection across distinct temporal segments, while pose-aware spatial memory arranges stored frames by their estimated camera poses to support location-sensitive lookup. The result is a streaming system whose memory footprint stays near the model's original training context length and whose output quality exceeds that of prior streaming approaches.
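To make the retrieval step concrete, the sketch below scores each cached history frame by query-key similarity and keeps a fixed budget of K frames. It is a minimal illustration, not the authors' implementation: the cosine normalization, the mean pooling of token-level similarities into a per-frame score, and the helper names (score_history_frames, retrieve_topk) are assumptions of this sketch.

```python
# Minimal sketch of budget-K frame retrieval by query-key similarity.
# The mean-pooling aggregation and cosine normalization are illustrative
# assumptions, not necessarily the paper's exact aggregation.
import torch
import torch.nn.functional as F

def score_history_frames(curr_q: torch.Tensor, cached_k: list[torch.Tensor]) -> torch.Tensor:
    """Score each cached history frame by the average cosine similarity between
    the current frame's query tokens and that frame's key tokens.

    curr_q:   (Nq, d) queries of the current frame at the first global attention layer
    cached_k: list of (Nk_i, d) key tensors, one per history frame
    returns:  (num_frames,) relevance scores
    """
    q = F.normalize(curr_q, dim=-1)
    scores = []
    for k in cached_k:
        k = F.normalize(k, dim=-1)
        sim = q @ k.T              # (Nq, Nk_i) token-level cosine similarities
        scores.append(sim.mean())  # frame-level relevance via mean pooling
    return torch.stack(scores)

def retrieve_topk(curr_q: torch.Tensor, cached_k: list[torch.Tensor], budget_k: int) -> torch.Tensor:
    """Return indices of the budget_k most relevant history frames."""
    scores = score_history_frames(curr_q, cached_k)
    k = min(budget_k, len(cached_k))
    return torch.topk(scores, k).indices

# Toy usage: 3 history frames, retrieve the 2 most similar to the current frame.
torch.manual_seed(0)
history = [torch.randn(64, 128) for _ in range(3)]
current = torch.randn(64, 128)
print(retrieve_topk(current, history, budget_k=2))
```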
What carries the argument
Query-key similarity retrieval at the first global attention layer, which identifies relevant history frames for inclusion in the current context window.
Load-bearing premise
The query-key similarities from the first global attention layer provide a sufficient signal for choosing which past frames contribute most to accurate current-frame reconstruction.
What would settle it
A direct comparison on long video sequences showing that full-history attention or random frame selection yields higher accuracy than the similarity-based retrieval would disprove the central premise.
Original abstract
Visual Geometry Grounded Transformer (VGGT) advances 3D reconstruction via scalable Transformer architecture, but the quadratic complexity of global attention prevents long context application. StreamVGGT enables streaming with causal attention, yet its KV cache grows linearly with frames, causing memory overflow and quality degradation. We present RetrieveVGGT, a training-free framework, which formulates context construction for VGGT as a retrieval problem. By retrieving a fixed number of relevant frames at each step, VGGT maintains a controllable memory budget, which is close to its training context length. Interestingly, we find that the similarity between current frame queries and cached history frame keys at the first global attention layer of VGGT is already a strong indicator of relevance, eliminating the need for additional learned scoring. To enhance information diversity similar to a recommender system, we propose Segment Sampling so that the retrieval spans distinct relevant segments rather than a single high-similarity region. We design a pose-aware spatial memory mechanism that organizes history frames according to their already estimated camera poses, enabling location-aware retrieval. Extensive experiments demonstrate that RetrieveVGGT achieves state-of-the-art performance, outperforming StreamVGGT, TTT3R, and InfiniteVGGT while maintaining constant memory usage regardless of sequence length. Code is available at https://github.com/zzctmd/RetrieveVGGT.
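The abstract's recommender-style Segment Sampling can be pictured with a small sketch: partition the history into temporal segments and fill the budget round-robin across segments instead of taking the K globally highest-scoring frames. The fixed-length segmentation and the round-robin fill policy here are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch of Segment Sampling: spread the retrieval budget across distinct
# temporal segments so one high-similarity region cannot dominate.
# Segmentation and fill policy are illustrative assumptions only.
import random

def segment_sample(scores: list[float], budget_k: int, segment_len: int) -> list[int]:
    """scores: per-history-frame relevance scores, indexed in temporal order."""
    n = len(scores)
    # Split the timeline into contiguous segments of at most segment_len frames.
    segments = [list(range(s, min(s + segment_len, n))) for s in range(0, n, segment_len)]
    # Order frames within each segment by score, and segments by their best frame.
    for seg in segments:
        seg.sort(key=lambda i: scores[i], reverse=True)
    segments.sort(key=lambda seg: scores[seg[0]], reverse=True)
    # Round-robin: take the best unused frame from each segment until the budget is met.
    picked: list[int] = []
    rank = 0
    while len(picked) < min(budget_k, n):
        for seg in segments:
            if rank < len(seg) and len(picked) < min(budget_k, n):
                picked.append(seg[rank])
        rank += 1
    return sorted(picked)

# Toy usage: 12 history frames, budget of 4, segments of 4 frames each.
random.seed(0)
frame_scores = [random.random() for _ in range(12)]
print(segment_sample(frame_scores, budget_k=4, segment_len=4))
```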
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces RetrieveVGGT, a training-free method for long-context streaming 3D reconstruction with VGGT. It treats context construction as a retrieval problem, selecting a fixed budget of relevant frames via cosine similarity between the current frame's queries and cached history keys at the first global attention layer, augmented by Segment Sampling for diversity across segments and a pose-aware spatial memory that organizes frames by estimated camera poses. The central claim is that this maintains constant memory usage close to the model's training context length while achieving state-of-the-art performance, outperforming StreamVGGT, TTT3R, and InfiniteVGGT on long sequences.
Significance. If the empirical claims hold under rigorous validation, RetrieveVGGT would offer a practical advance for scalable video-based 3D reconstruction by eliminating linear memory growth without retraining or architectural changes. The training-free reuse of existing first-layer attention scores for retrieval is a notable strength, as is the explicit handling of redundancy via segment sampling and pose organization; these could generalize to other long-context vision transformers.
major comments (2)
- [Abstract] The claim of state-of-the-art performance with constant memory 'regardless of sequence length' is load-bearing for the contribution but unsupported by any reported datasets, metrics (e.g., reconstruction accuracy or completeness), sequence lengths, or quantitative deltas versus the listed baselines. Without these, the SOTA assertion cannot be assessed and risks selection bias in the retrieval process.
- [Method description (retrieval mechanism)] The assumption that first global attention layer Q-K similarity is already a strong indicator of geometric relevance is central yet unvalidated. No ablation compares it to last-layer similarity, explicit pose-distance retrieval, or random selection at the same budget K; first-layer embeddings primarily encode local appearance, which may retrieve texture matches rather than multi-view geometric overlap, silently degrading reconstruction quality in the constant-memory regime.
minor comments (2)
- [Title] The title contains the apparent typographical artifact 'Retrieve.RetrieveVGGT'.
- [Method] The free parameters (retrieval budget K and segment sampling parameters) are mentioned but lack explicit notation or sensitivity analysis in the provided description.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.
Point-by-point responses
-
Referee: [Abstract] The claim of state-of-the-art performance with constant memory 'regardless of sequence length' is load-bearing for the contribution but unsupported by any reported datasets, metrics (e.g., reconstruction accuracy or completeness), sequence lengths, or quantitative deltas versus the listed baselines. Without these, the SOTA assertion cannot be assessed and risks selection bias in the retrieval process.
Authors: We agree that the abstract would be strengthened by including concrete supporting details. The full manuscript reports experiments on long video sequences from standard 3D reconstruction benchmarks, using metrics such as reconstruction accuracy and completeness. These evaluations cover sequence lengths substantially exceeding the model's training context while maintaining constant memory, with quantitative comparisons showing outperformance over StreamVGGT, TTT3R, and InfiniteVGGT. The Segment Sampling and pose-aware spatial memory mechanisms are explicitly introduced to mitigate redundancy and selection bias. We will revise the abstract to briefly report the datasets, example sequence lengths, key metrics, and performance deltas. revision: yes
-
Referee: [Method description (retrieval mechanism)] The assumption that first global attention layer Q-K similarity is already a strong indicator of geometric relevance is central yet unvalidated. No ablation compares it to last-layer similarity, explicit pose-distance retrieval, or random selection at the same budget K; first-layer embeddings primarily encode local appearance, which may retrieve texture matches rather than multi-view geometric overlap, silently degrading reconstruction quality in the constant-memory regime.
Authors: The manuscript states that we empirically observed first-layer Q-K similarity to be effective, but we acknowledge the absence of systematic ablations against alternatives. We will add a dedicated ablation study in the revised manuscript comparing first-layer similarity to last-layer similarity, explicit pose-distance retrieval, and random selection, all under the same fixed budget K. These results will quantify reconstruction quality (accuracy and completeness) to demonstrate that first-layer attention provides superior geometric relevance. While first-layer features do encode appearance, the downstream 3D reconstruction metrics in our experiments indicate that the selected frames support multi-view geometric consistency rather than mere texture matching; the new ablations will make this distinction explicit. revision: yes
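A sketch of the kind of fixed-budget ablation harness this response describes, comparing similarity-based, random, and pose-distance retrieval under the same budget K. Everything here is illustrative: the policy names and the reconstruct_and_score callable are placeholders for whatever reconstruction accuracy/completeness evaluation the experiments actually use, not the authors' code.

```python
# Hypothetical ablation harness: same budget K, different retrieval policies.
# reconstruct_and_score stands in for a real accuracy/completeness evaluation.
import random
from typing import Callable

def similarity_policy(scores: list[float], poses: list[tuple], k: int) -> list[int]:
    """Top-k frames by first-layer query-key similarity score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

def random_policy(scores: list[float], poses: list[tuple], k: int) -> list[int]:
    """Uniform random selection at the same budget."""
    return random.sample(range(len(scores)), min(k, len(scores)))

def pose_distance_policy(scores: list[float], poses: list[tuple], k: int,
                         current_pose: tuple = (0.0, 0.0, 0.0)) -> list[int]:
    """k nearest history frames by camera-center distance to the current pose."""
    def dist(p: tuple) -> float:
        return sum((a - b) ** 2 for a, b in zip(p, current_pose)) ** 0.5
    order = sorted(range(len(poses)), key=lambda i: dist(poses[i]))
    return order[:k]

def run_ablation(policies: dict[str, Callable], scores, poses, k: int,
                 reconstruct_and_score: Callable[[list[int]], float]) -> dict[str, float]:
    """Run every retrieval policy with the same budget K and report its metric."""
    return {name: reconstruct_and_score(policy(scores, poses, k))
            for name, policy in policies.items()}
```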
Circularity Check
No circularity: retrieval directly reuses unmodified first-layer Q-K similarity from the base VGGT model without fitted parameters, self-referential equations, or load-bearing self-citations.
full rationale
The paper's central mechanism is explicitly training-free and formulates context selection as retrieval using cosine similarity on existing attention computations at the first global attention layer of VGGT. This is presented as an empirical observation ('we find that the similarity... is already a strong indicator') rather than a derived quantity. No equations, ansatzes, or uniqueness theorems are introduced that reduce the claimed relevance or SOTA performance back to fitted inputs or prior self-work by construction. Segment Sampling and pose-aware memory are additional heuristics built on top of the same unmodified signals. The derivation chain remains independent of the target result and does not match any enumerated circularity pattern.
Axiom & Free-Parameter Ledger
free parameters (2)
- retrieval budget K
- segment sampling parameters
axioms (1)
- domain assumption: Query-key similarity at the first global attention layer indicates relevance for 3D reconstruction context selection.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear?
unclear: Relation between the paper passage and the cited Recognition theorem.
the similarity between current frame queries and cached history frame keys at the first global attention layer of VGGT is already a strong indicator of relevance, eliminating the need for additional learned scoring
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Tri-perspective view for vision-based 3d semantic occupancy prediction
Yuanhui Huang, Wenzhao Zheng, Yunpeng Zhang, Jie Zhou, and Jiwen Lu. Tri-perspective view for vision-based 3d semantic occupancy prediction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9223–9232, 2023
work page 2023
-
[2]
Occworld: Learning a 3d occupancy world model for autonomous driving
Wenzhao Zheng, Weiliang Chen, Yuanhui Huang, Borui Zhang, Yueqi Duan, and Jiwen Lu. Occworld: Learning a 3d occupancy world model for autonomous driving. In European conference on computer vision, pages 55–72. Springer, 2024
work page 2024
-
[3]
3d clothed human reconstruction from sparse multi-view images
Jin Gyu Hong, Seung Young Noh, Hee Kyung Lee, Won Sik Cheong, and Ju Yong Chang. 3d clothed human reconstruction from sparse multi-view images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 677–687, 2024
work page 2024
-
[4]
Mosca: Dynamic gaussian fusion from casual videos via 4d motion scaffolds
Jiahui Lei, Yijia Weng, Adam W Harley, Leonidas Guibas, and Kostas Daniilidis. Mosca: Dynamic gaussian fusion from casual videos via 4d motion scaffolds. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 6165–6177, 2025
work page 2025
-
[5]
Gps-gaussian: Generalizable pixel-wise 3d gaussian splatting for real-time human novel view synthesis
Shunyuan Zheng, Boyao Zhou, Ruizhi Shao, Boning Liu, Shengping Zhang, Liqiang Nie, and Yebin Liu. Gps-gaussian: Generalizable pixel-wise 3d gaussian splatting for real-time human novel view synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19680–19690, 2024
work page 2024
-
[6]
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. $\pi_0$: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024
work page · Pith review · arXiv 2024
-
[7]
OpenVLA: An Open-Source Vision-Language-Action Model
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024
work page · Pith review · arXiv 2024
-
[8]
Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation
Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, et al. Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation. arXiv preprint arXiv:2411.19650, 2024
work page Pith review arXiv 2024
-
[9]
4dtam: Non-rigid tracking and mapping via dynamic surface gaussians
Hidenobu Matsuki, Gwangbin Bae, and Andrew J Davison. 4dtam: Non-rigid tracking and mapping via dynamic surface gaussians. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26921–26932, 2025
work page 2025
-
[10]
Building rome in a day
Sameer Agarwal, Yasutaka Furukawa, Noah Snavely, Ian Simon, Brian Curless, Steven M Seitz, and Richard Szeliski. Building rome in a day. Communications of the ACM, 54(10):105–112, 2011
work page 2011
-
[11]
Building rome on a cloudless day
Jan-Michael Frahm, Pierre Fite-Georgel, David Gallup, Tim Johnson, Rahul Raguram, Changchang Wu, Yi-Hung Jen, Enrique Dunn, Brian Clipp, Svetlana Lazebnik, et al. Building rome on a cloudless day. In European conference on computer vision, pages 368–381. Springer, 2010
work page 2010
-
[12]
Robust incremental structure-from-motion with hybrid features
Shaohui Liu, Yidan Gao, Tianyi Zhang, Rémi Pautrat, Johannes L Schönberger, Viktor Larsson, and Marc Pollefeys. Robust incremental structure-from-motion with hybrid features. In European Conference on Computer Vision, pages 249–269. Springer, 2024
work page 2024
-
[13]
Structure-from-motion revisited
Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4104–4113, 2016
work page 2016
-
[14]
Towards linear-time incremental structure from motion
Changchang Wu. Towards linear-time incremental structure from motion. In 2013 International Conference on 3D Vision (3DV 2013), pages 127–134. IEEE, 2013
work page 2013
-
[15]
Pixel-perfect structure-from-motion with featuremetric refinement
Philipp Lindenberger, Paul-Edouard Sarlin, Viktor Larsson, and Marc Pollefeys. Pixel-perfect structure-from-motion with featuremetric refinement. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5987–5997, 2021
work page 2021
-
[16]
Mvsnet: Depth inference for unstructured multi-view stereo
Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. Mvsnet: Depth inference for unstructured multi-view stereo. In Proceedings of the European conference on computer vision (ECCV), pages 767–783, 2018
work page 2018
-
[17]
Recurrent mvsnet for high-resolution multi-view stereo depth inference
Yao Yao, Zixin Luo, Shiwei Li, Tianwei Shen, Tian Fang, and Long Quan. Recurrent mvsnet for high-resolution multi-view stereo depth inference. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5525–5534, 2019
work page 2019
-
[18]
Accurate, dense, and robust multiview stereopsis
Yasutaka Furukawa and Jean Ponce. Accurate, dense, and robust multiview stereopsis. IEEE transactions on pattern analysis and machine intelligence, 32(8):1362–1376, 2009
work page 2009
-
[19]
Cascade cost volume for high-resolution multi-view stereo and stereo matching
Xiaodong Gu, Zhiwen Fan, Siyu Zhu, Zuozhuo Dai, Feitong Tan, and Ping Tan. Cascade cost volume for high-resolution multi-view stereo and stereo matching. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2495–2504, 2020
work page 2020
-
[20]
Point-based multi-view stereo network
Rui Chen, Songfang Han, Jing Xu, and Hao Su. Point-based multi-view stereo network. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1538–1547, 2019
work page 2019
-
[21]
A surface-growing approach to multi-view stereo reconstruction
Martin Habbecke and Leif Kobbelt. A surface-growing approach to multi-view stereo reconstruction. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2007
work page 2007
-
[22]
Vggt: Visual geometry grounded transformer
Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025
work page 2025
-
[23]
3d reconstruction with spatial memory
Hengyi Wang and Lourdes Agapito. 3d reconstruction with spatial memory. In 2025 International Conference on 3D Vision (3DV), pages 78–89. IEEE, 2025
work page 2025
-
[24]
Point3r: Streaming 3d reconstruction with explicit spatial pointer memory
Yuqi Wu, Wenzhao Zheng, Jie Zhou, and Jiwen Lu. Point3r: Streaming 3d reconstruction with explicit spatial pointer memory. arXiv preprint arXiv:2507.02863, 2025
-
[25]
Continuous 3d perception model with persistent state
Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10510–10522, 2025
work page 2025
-
[26]
Ttt3r: 3d reconstruction as test-time training
Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. Ttt3r: 3d reconstruction as test-time training. arXiv preprint arXiv:2509.26645, 2025
-
[27]
Streaming 4d visual geometry transformer
Dong Zhuo, Wenzhao Zheng, Jiahe Guo, Yuqi Wu, Jie Zhou, and Jiwen Lu. Streaming 4d visual geometry transformer. arXiv preprint arXiv:2507.11539, 2025
-
[28]
InfiniteVGGT: Visual geometry grounded transformer for endless streams
Shuai Yuan, Yantai Yang, Xiaotian Yang, Xupeng Zhang, Zhonghao Zhao, Lingming Zhang, and Zhipeng Zhang. Infinitevggt: Visual geometry grounded transformer for endless streams. arXiv preprint arXiv:2601.02281, 2026
-
[29]
D2-net: A trainable cnn for joint description and detection of local features
Mihai Dusmanu, Ignacio Rocco, Tomas Pajdla, Marc Pollefeys, Josef Sivic, Akihiko Torii, and Torsten Sattler. D2-net: A trainable cnn for joint description and detection of local features. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8092–8101, 2019
work page 2019
-
[30]
Distinctive image features from scale-invariant keypoints
David G Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2):91–110, 2004
work page 2004
-
[31]
Orb: An efficient alternative to sift or surf
Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. Orb: An efficient alternative to sift or surf. In 2011 International conference on computer vision, pages 2564–2571. IEEE, 2011
work page 2011
-
[32]
Learning to match features with seeded graph matching network
Hongkai Chen, Zixin Luo, Jiahui Zhang, Lei Zhou, Xuyang Bai, Zeyu Hu, Chiew-Lan Tai, and Long Quan. Learning to match features with seeded graph matching network. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6301–6310, 2021
work page 2021
-
[33]
Lightglue: Local feature matching at light speed
Philipp Lindenberger, Paul-Edouard Sarlin, and Marc Pollefeys. Lightglue: Local feature matching at light speed. In Proceedings of the IEEE/CVF international conference on computer vision, pages 17627–17638, 2023
work page 2023
-
[34]
Clustergnn: Cluster-based coarse-to-fine graph neural network for efficient feature matching
Yan Shi, Jun-Xiong Cai, Yoli Shavit, Tai-Jiang Mu, Wensen Feng, and Kai Zhang. Clustergnn: Cluster-based coarse-to-fine graph neural network for efficient feature matching. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12517–12526, 2022
work page 2022
-
[35]
Bundle adjustment in the large
Sameer Agarwal, Noah Snavely, Steven M Seitz, and Richard Szeliski. Bundle adjustment in the large. In European conference on computer vision, pages 29–42. Springer, 2010
work page 2010
-
[36]
Bundle adjustment—a modern synthesis
Bill Triggs, Philip F McLauchlan, Richard I Hartley, and Andrew W Fitzgibbon. Bundle adjustment—a modern synthesis. In International workshop on vision algorithms, pages 298–372. Springer, 1999
work page 1999
-
[37]
Mip-nerf 360: Unbounded anti-aliased neural radiance fields
Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5470–5479, 2022
work page 2022
-
[38]
Tensorf: Tensorial radiance fields
Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su. Tensorf: Tensorial radiance fields. In European conference on computer vision, pages 333–350. Springer, 2022
work page 2022
-
[39]
Nerf: Representing scenes as neural radiance fields for view synthesis
Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021
work page 2021
-
[40]
Nerf++: Analyzing and improving neural radiance fields
Kai Zhang, Gernot Riegler, Noah Snavely, and Vladlen Koltun. Nerf++: Analyzing and improving neural radiance fields. arXiv preprint arXiv:2010.07492, 2020
-
[41]
Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction
Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. arXiv preprint arXiv:2106.10689, 2021
-
[42]
3d gaussian splatting for real-time radiance field rendering
Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, George Drettakis, et al. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1, 2023
work page 2023
-
[43]
pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction
David Charatan, Sizhe Lester Li, Andrea Tagliasacchi, and Vincent Sitzmann. pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19457–19467, 2024
work page 2024
-
[44]
Gaussianroom: Improving 3d gaussian splatting with sdf guidance and monocular cues for indoor scene reconstruction
Haodong Xiang, Xinghui Li, Kai Cheng, Xiansong Lai, Wanting Zhang, Zhichao Liao, Long Zeng, and Xueping Liu. Gaussianroom: Improving 3d gaussian splatting with sdf guidance and monocular cues for indoor scene reconstruction. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 2686–2693. IEEE, 2025
work page 2025
-
[45]
Robust and efficient 3d gaussian splatting for urban scene reconstruction
Zhensheng Yuan, Haozhi Huang, Zhen Xiong, Di Wang, and Guanghua Yang. Robust and efficient 3d gaussian splatting for urban scene reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 26209–26219, 2025
work page 2025
-
[46]
Monoslam: Real-time single camera slam
Andrew J Davison, Ian D Reid, Nicholas D Molton, and Olivier Stasse. Monoslam: Real-time single camera slam. IEEE transactions on pattern analysis and machine intelligence, 29(6):1052–1067, 2007
work page 2007
-
[47]
Lsd-slam: Large-scale direct monocular slam
Jakob Engel, Thomas Schöps, and Daniel Cremers. Lsd-slam: Large-scale direct monocular slam. In European conference on computer vision, pages 834–849. Springer, 2014
work page 2014
-
[48]
Parallel tracking and mapping for small ar workspaces
Georg Klein and David Murray. Parallel tracking and mapping for small ar workspaces. In 2007 6th IEEE and ACM international symposium on mixed and augmented reality, pages 225–234. IEEE, 2007
work page 2007
-
[49]
Dtam: Dense tracking and mapping in real-time
Richard A Newcombe, Steven J Lovegrove, and Andrew J Davison. Dtam: Dense tracking and mapping in real-time. In 2011 international conference on computer vision, pages 2320–2327. IEEE, 2011
work page 2011
-
[50]
Kinectfusion: Real-time dense surface mapping and tracking
Richard A Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J Davison, Pushmeet Kohi, Jamie Shotton, Steve Hodges, and Andrew Fitzgibbon. Kinectfusion: Real-time dense surface mapping and tracking. In 2011 10th IEEE international symposium on mixed and augmented reality, pages 127–136. IEEE, 2011
work page 2011
-
[51]
Dust3r: Geometric 3d vision made easy
Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20697–20709, 2024
work page 2024
-
[52]
Grounding image matching in 3d with mast3r
Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r. In European conference on computer vision, pages 71–91. Springer, 2024
work page 2024
-
[53]
Monst3r: A simple approach for estimating geometry in the presence of motion
Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. Monst3r: A simple approach for estimating geometry in the presence of motion. arXiv preprint arXiv:2410.03825, 2024
-
[54]
Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass
Jianing Yang, Alexander Sax, Kevin J Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21924–21935, 2025
work page 2025
-
[55]
Fastvggt: Training-free acceleration of visual geometry transformer
You Shen, Zhipeng Zhang, Yansong Qu, Xiawu Zheng, Jiayi Ji, Shengchuan Zhang, and Liujuan Cao. Fastvggt: Training-free acceleration of visual geometry transformer. arXiv preprint arXiv:2509.02560, 2025
-
[56]
Vggt-long: Chunk it, loop it, align it–pushing vggt's limits on kilometer-scale long rgb sequences
Kai Deng, Zexin Ti, Jiawei Xu, Jian Yang, and Jin Xie. Vggt-long: Chunk it, loop it, align it–pushing vggt's limits on kilometer-scale long rgb sequences. arXiv preprint arXiv:2507.16443, 2025
-
[57]
Ttsa3r: Training-free temporal-spatial adaptive persistent state for streaming 3d reconstruction
Zhijie Zheng, Xinhao Xiang, and Jiawei Zhang. Ttsa3r: Training-free temporal-spatial adaptive persistent state for streaming 3d reconstruction. arXiv preprint arXiv:2601.22615, 2026
-
[58]
Mut3r: Motion-aware updating transformer for dynamic 3d reconstruction
Guole Shen, Tianchen Deng, Xingrui Qin, Nailin Wang, Jianyu Wang, Yanbo Wang, Yongtao Chen, Hesheng Wang, and Jingchuan Wang. Mut3r: Motion-aware updating transformer for dynamic 3d reconstruction. arXiv preprint arXiv:2512.03939, 2025
-
[59]
Stream3r: Scalable sequential 3d reconstruction with causal transformer
Yushi Lan, Yihang Luo, Fangzhou Hong, Shangchen Zhou, Honghua Chen, Zhaoyang Lyu, Shuai Yang, Bo Dai, Chen Change Loy, and Xingang Pan. Stream3r: Scalable sequential 3d reconstruction with causal transformer. arXiv preprint arXiv:2508.10893, 2025
-
[60]
Wint3r: Window-based streaming reconstruction with camera token pool
Zizun Li, Jianjun Zhou, Yifan Wang, Haoyu Guo, Wenzheng Chang, Yang Zhou, Haoyi Zhu, Junyi Chen, Chunhua Shen, and Tong He. Wint3r: Window-based streaming reconstruction with camera token pool. arXiv preprint arXiv:2509.05296, 2025
-
[61]
Scene coordinate regression forests for camera relocalization in rgb-d images
Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. Scene coordinate regression forests for camera relocalization in rgb-d images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2930–2937, 2013
work page 2013
-
[62]
Neural rgb-d surface reconstruction
Dejan Azinović, Ricardo Martin-Brualla, Dan B Goldman, Matthias Nießner, and Justus Thies. Neural rgb-d surface reconstruction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6290–6301, 2022
work page 2022
-
[63]
Refusion: 3d reconstruction in dynamic environments for rgb-d cameras exploiting residuals
Emanuele Palazzolo, Jens Behley, Philipp Lottes, Philippe Giguere, and Cyrill Stachniss. Refusion: 3d reconstruction in dynamic environments for rgb-d cameras exploiting residuals. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7855–7862. IEEE, 2019
work page 2019
-
[64]
A benchmark for the evaluation of rgb-d slam systems
Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. A benchmark for the evaluation of rgb-d slam systems. In 2012 IEEE/RSJ international conference on intelligent robots and systems, pages 573–580. IEEE, 2012
work page 2012
-
[65]
Quest: Query-aware sparsity for efficient long-context llm inference
Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han. Quest: Query-aware sparsity for efficient long-context llm inference. arXiv preprint arXiv:2406.10774, 2024
-
[66]
H2o: Heavy-hitter oracle for efficient generative inference of large language models
Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2o: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36:34661–34710, 2023
work page 2023
-
[67]
Snapkv: Llm knows what you are looking for before generation
Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation. Advances in Neural Information Processing Systems, 37:22947–22970, 2024
work page 2024
-
[68]
Attention is all you need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017
work page 2017
-
[69]
A naturalistic open source movie for optical flow evaluation
Daniel J Butler, Jonas Wulff, Garrett B Stanley, and Michael J Black. A naturalistic open source movie for optical flow evaluation. In European conference on computer vision, pages 611–625. Springer, 2012
work page 2012
-
[70]
Vision meets robotics: The kitti dataset
Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The international journal of robotics research, 32(11):1231–1237, 2013
work page 2013