pith. machine review for the scientific record.

arxiv: 2605.09644 · v1 · submitted 2026-05-10 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

Attention Itself Could Retrieve. RetrieveVGGT: Training-Free Long Context Streaming 3D Reconstruction via Query-Key Similarity Retrieval

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:30 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D reconstruction · streaming · long context · attention retrieval · memory efficiency · training-free · query-key similarity · pose-aware memory

The pith

RetrieveVGGT uses query-key similarity to retrieve relevant frames and keep memory constant during long-context streaming 3D reconstruction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The quadratic cost of global attention in VGGT makes it hard to process long video sequences for 3D reconstruction. RetrieveVGGT addresses this by treating the selection of context frames as a retrieval task that always selects a bounded number of past frames. The method relies on the observation that query-key similarities already computed in the first attention layer reliably indicate which history frames are useful. Adding segment sampling for diversity and pose-aware organization further improves the retrieved set, resulting in bounded memory and improved reconstruction quality on extended sequences.

Core claim

By casting context construction as retrieval, RetrieveVGGT selects a fixed number of relevant history frames at each step using the similarity between the current frame's queries and the cached keys from VGGT's first global attention layer. This similarity alone serves as a strong relevance signal, so no separate learned retriever is needed. Segment Sampling spreads the selection across distinct temporal segments, while pose-aware spatial memory arranges stored frames by their estimated camera poses to support location-sensitive lookup. The result is a streaming system whose memory footprint stays near the model's original training length and whose output quality exceeds that of prior streaming approaches.
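To make the retrieval step concrete, the sketch below selects a fixed budget of history frames by query-key similarity. The tensor layout, the cosine normalization, and the per-frame mean pooling are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def retrieve_frames(current_queries: torch.Tensor,
                    cached_keys: list[torch.Tensor],
                    budget_k: int) -> list[int]:
    """Pick up to budget_k history frames whose cached keys best match
    the current frame's queries (first-global-attention-layer features).

    current_queries: (Nq, d) queries of the current frame.
    cached_keys: one (Nk, d) key tensor per history frame.
    """
    if len(cached_keys) <= budget_k:
        return list(range(len(cached_keys)))

    q = F.normalize(current_queries, dim=-1)
    scores = []
    for keys in cached_keys:
        k = F.normalize(keys, dim=-1)
        # One scalar relevance score per frame: mean pairwise cosine similarity.
        scores.append((q @ k.T).mean())
    scores = torch.stack(scores)
    return torch.topk(scores, budget_k).indices.tolist()
```

Because only budget_k frames ever enter the context, per-step attention cost and memory stay bounded no matter how long the stream runs.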

What carries the argument

Query-key similarity retrieval at the first global attention layer, which identifies relevant history frames for inclusion in the current context window.

Load-bearing premise

The query-key similarities from the first global attention layer provide a sufficient signal for choosing which past frames contribute most to accurate current-frame reconstruction.

What would settle it

A direct comparison on long video sequences showing that full-history attention or random frame selection yields higher accuracy than the similarity-based retrieval would disprove the central premise.

read the original abstract

Visual Geometry Grounded Transformer (VGGT) advances 3D reconstruction via scalable Transformer architecture, but the quadratic complexity of global attention prevents long context application. StreamVGGT enables streaming with causal attention, yet its KV cache grows linearly with frames, causing memory overflow and quality degradation. We present RetrieveVGGT, a training-free framework, which formulates context construction for VGGT as a retrieval problem. By retrieving a fixed number of relevant frames at each step, VGGT maintains a controllable memory budget, which is close to its training context length. Interestingly, we find that the similarity between current frame queries and cached history frame keys at the first global attention layer of VGGT is already a strong indicator of relevance, eliminating the need for additional learned scoring. To enhance information diversity similar to a recommender system, we propose Segment Sampling so that the retrieval spans distinct relevant segments rather than a single high-similarity region. We design a pose-aware spatial memory mechanism that organizes history frames according to their already estimated camera poses, enabling location-aware retrieval. Extensive experiments demonstrate that RetrieveVGGT achieves state-of-the-art performance, outperforming StreamVGGT, TTT3R, and InfiniteVGGT while maintaining constant memory usage regardless of sequence length. Code is available at https://github.com/zzctmd/RetrieveVGGT.
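To illustrate the abstract's two refinements, the sketch below spreads the retrieval budget across temporal segments and re-ranks candidates by camera-pose distance. The contiguous segmentation, the 4x4 pose matrices, and the camera-center distance are simplifying assumptions rather than the authors' exact design.

```python
import numpy as np

def segment_sample(frame_scores: np.ndarray, budget_k: int, num_segments: int) -> list[int]:
    """Split the history into contiguous temporal segments and keep the
    top-scoring frames within each, so retrieval is not dominated by a
    single high-similarity region."""
    per_segment = max(1, budget_k // num_segments)
    picked = []
    for seg in np.array_split(np.arange(len(frame_scores)), num_segments):
        best_first = seg[np.argsort(frame_scores[seg])[::-1]]
        picked.extend(best_first[:per_segment].tolist())
    # If the per-segment quota overshoots the budget, keep the strongest picks.
    picked.sort(key=lambda i: frame_scores[i], reverse=True)
    return sorted(picked[:budget_k])

def pose_rerank(candidates: list[int], poses: list[np.ndarray], current_pose: np.ndarray) -> list[int]:
    """Order candidates by distance between estimated camera centers,
    one simple form of location-aware lookup over the spatial memory."""
    def dist(i: int) -> float:
        return float(np.linalg.norm(poses[i][:3, 3] - current_pose[:3, 3]))
    return sorted(candidates, key=dist)
```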

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces RetrieveVGGT, a training-free method for long-context streaming 3D reconstruction with VGGT. It treats context construction as a retrieval problem, selecting a fixed budget of relevant frames via cosine similarity between the current frame's queries and cached history keys at the first global attention layer, augmented by Segment Sampling for diversity across segments and a pose-aware spatial memory that organizes frames by estimated camera poses. The central claim is that this maintains constant memory usage close to the model's training context length while achieving state-of-the-art performance, outperforming StreamVGGT, TTT3R, and InfiniteVGGT on long sequences.

Significance. If the empirical claims hold under rigorous validation, RetrieveVGGT would offer a practical advance for scalable video-based 3D reconstruction by eliminating linear memory growth without retraining or architectural changes. The training-free reuse of existing first-layer attention scores for retrieval is a notable strength, as is the explicit handling of redundancy via segment sampling and pose organization; these could generalize to other long-context vision transformers.

major comments (2)
  1. [Abstract] The claim of state-of-the-art performance with constant memory 'regardless of sequence length' is load-bearing for the contribution but unsupported by any reported datasets, metrics (e.g., reconstruction accuracy or completeness), sequence lengths, or quantitative deltas versus the listed baselines. Without these, the SOTA assertion cannot be assessed and risks selection bias in the retrieval process.
  2. [Method description (retrieval mechanism)] The assumption that first global attention layer Q-K similarity is already a strong indicator of geometric relevance is central yet unvalidated. No ablation compares it to last-layer similarity, explicit pose-distance retrieval, or random selection at the same budget K; first-layer embeddings primarily encode local appearance, which may retrieve texture matches rather than multi-view geometric overlap, silently degrading reconstruction quality in the constant-memory regime.
minor comments (2)
  1. [Title] The title contains the apparent typographical artifact 'Retrieve.RetrieveVGGT'.
  2. [Method] The free parameters (retrieval budget K and segment sampling parameters) are mentioned but lack explicit notation or sensitivity analysis in the provided description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] The claim of state-of-the-art performance with constant memory 'regardless of sequence length' is load-bearing for the contribution but unsupported by any reported datasets, metrics (e.g., reconstruction accuracy or completeness), sequence lengths, or quantitative deltas versus the listed baselines. Without these, the SOTA assertion cannot be assessed and risks selection bias in the retrieval process.

    Authors: We agree that the abstract would be strengthened by including concrete supporting details. The full manuscript reports experiments on long video sequences from standard 3D reconstruction benchmarks, using metrics such as reconstruction accuracy and completeness. These evaluations cover sequence lengths substantially exceeding the model's training context while maintaining constant memory, with quantitative comparisons showing outperformance over StreamVGGT, TTT3R, and InfiniteVGGT. The Segment Sampling and pose-aware spatial memory mechanisms are explicitly introduced to mitigate redundancy and selection bias. We will revise the abstract to briefly report the datasets, example sequence lengths, key metrics, and performance deltas. revision: yes

  2. Referee: [Method description (retrieval mechanism)] The assumption that first global attention layer Q-K similarity is already a strong indicator of geometric relevance is central yet unvalidated. No ablation compares it to last-layer similarity, explicit pose-distance retrieval, or random selection at the same budget K; first-layer embeddings primarily encode local appearance, which may retrieve texture matches rather than multi-view geometric overlap, silently degrading reconstruction quality in the constant-memory regime.

    Authors: The manuscript states that we empirically observed first-layer Q-K similarity to be effective, but we acknowledge the absence of systematic ablations against alternatives. We will add a dedicated ablation study in the revised manuscript comparing first-layer similarity to last-layer similarity, explicit pose-distance retrieval, and random selection, all under the same fixed budget K. These results will quantify reconstruction quality (accuracy and completeness) to demonstrate that first-layer attention provides superior geometric relevance. While first-layer features do encode appearance, the downstream 3D reconstruction metrics in our experiments indicate that the selected frames support multi-view geometric consistency rather than mere texture matching; the new ablations will make this distinction explicit. revision: yes
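A small harness along these lines would make the promised ablation concrete; the variant names, the reconstruct and evaluate hooks, and the loop over a held-out sequence are hypothetical placeholders, not code from the manuscript.

```python
import random

def ablate_retrieval(frames, budget_k, variants, reconstruct, evaluate):
    """Score several retrieval strategies under an identical frame budget.

    variants: dict mapping a name to a function
        (history_frames, current_frame, budget_k) -> list of indices.
    reconstruct / evaluate: hooks that run the model on the retrieved
        context and return a scalar quality score (e.g., accuracy).
    """
    results = {}
    for name, select in variants.items():
        scores = []
        for t in range(1, len(frames)):
            context = select(frames[:t], frames[t], budget_k)
            scores.append(evaluate(reconstruct(frames[t], context)))
        results[name] = sum(scores) / len(scores)
    return results

def random_select(history, current, budget_k):
    """Random baseline at the same budget, for comparison."""
    return random.sample(range(len(history)), min(budget_k, len(history)))
```

Running this with first-layer similarity, last-layer similarity, pose-distance retrieval, and random_select at the same budget_k would directly address the referee's concern.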

Circularity Check

0 steps flagged

No circularity: retrieval directly reuses unmodified first-layer Q-K similarity from the base VGGT model without fitted parameters, self-referential equations, or load-bearing self-citations.

full rationale

The paper's central mechanism is explicitly training-free and formulates context selection as retrieval using cosine similarity on existing attention computations at the first global attention layer of VGGT. This is presented as an empirical observation ('we find that the similarity... is already a strong indicator') rather than a derived quantity. No equations, ansatzes, or uniqueness theorems are introduced that reduce the claimed relevance or SOTA performance back to fitted inputs or prior self-work by construction. Segment Sampling and pose-aware memory are additional heuristics built on top of the same unmodified signals. The derivation chain remains independent of the target result and does not match any enumerated circularity pattern.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The method depends on a small set of hyperparameters for retrieval count and sampling plus one key domain assumption about attention similarity; no new entities are postulated.

free parameters (2)
  • retrieval budget K
    Fixed number of frames retrieved per step to enforce constant memory budget.
  • segment sampling parameters
    Controls for selecting across distinct segments to promote information diversity.
axioms (1)
  • domain assumption: Query-key similarity at the first global attention layer indicates relevance for 3D reconstruction context selection.
    Directly invoked to justify training-free retrieval without learned scorers.
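For orientation, the ledger's free parameters could be gathered into one configuration object; the names and default values below are hypothetical, chosen only to show where each knob lives.

```python
from dataclasses import dataclass

@dataclass
class RetrievalConfig:
    budget_k: int = 16        # fixed number of history frames retrieved per step
    num_segments: int = 4     # temporal segments Segment Sampling spreads picks over
    pose_aware: bool = True   # organize / re-rank memory by estimated camera poses
```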

pith-pipeline@v0.9.0 · 5554 in / 1387 out tokens · 70409 ms · 2026-05-12T04:30:24.937304+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 2 internal anchors

  1. [1]

    Tri-perspective view for vision-based 3d semantic occupancy prediction

    Yuanhui Huang, Wenzhao Zheng, Yunpeng Zhang, Jie Zhou, and Jiwen Lu. Tri-perspective view for vision-based 3d semantic occupancy prediction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9223–9232, 2023

  2. [2]

    Occworld: Learning a 3d occupancy world model for autonomous driving

    Wenzhao Zheng, Weiliang Chen, Yuanhui Huang, Borui Zhang, Yueqi Duan, and Jiwen Lu. Occworld: Learning a 3d occupancy world model for autonomous driving. In European conference on computer vision, pages 55–72. Springer, 2024

  3. [3]

    3d clothed human reconstruction from sparse multi-view images

    Jin Gyu Hong, Seung Young Noh, Hee Kyung Lee, Won Sik Cheong, and Ju Yong Chang. 3d clothed human reconstruction from sparse multi-view images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 677–687, 2024

  4. [4]

    Mosca: Dynamic gaussian fusion from casual videos via 4d motion scaffolds

    Jiahui Lei, Yijia Weng, Adam W Harley, Leonidas Guibas, and Kostas Daniilidis. Mosca: Dynamic gaussian fusion from casual videos via 4d motion scaffolds. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 6165–6177, 2025

  5. [5]

    Gps-gaussian: Generalizable pixel-wise 3d gaussian splatting for real-time human novel view synthesis

    Shunyuan Zheng, Boyao Zhou, Ruizhi Shao, Boning Liu, Shengping Zhang, Liqiang Nie, and Yebin Liu. Gps-gaussian: Generalizable pixel-wise 3d gaussian splatting for real-time human novel view synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19680–19690, 2024

  6. [6]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

  7. [7]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024

  8. [8]

    CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

    Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, et al. Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation. arXiv preprint arXiv:2411.19650, 2024

  9. [9]

    4dtam: Non-rigid tracking and mapping via dynamic surface gaussians

    Hidenobu Matsuki, Gwangbin Bae, and Andrew J Davison. 4dtam: Non-rigid tracking and mapping via dynamic surface gaussians. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26921–26932, 2025

  10. [10]

    Building rome in a day

    Sameer Agarwal, Yasutaka Furukawa, Noah Snavely, Ian Simon, Brian Curless, Steven M Seitz, and Richard Szeliski. Building rome in a day. Communications of the ACM, 54(10):105–112, 2011

  11. [11]

    Building rome on a cloudless day

    Jan-Michael Frahm, Pierre Fite-Georgel, David Gallup, Tim Johnson, Rahul Raguram, Changchang Wu, Yi-Hung Jen, Enrique Dunn, Brian Clipp, Svetlana Lazebnik, et al. Building rome on a cloudless day. In European conference on computer vision, pages 368–381. Springer, 2010

  12. [12]

    Robust incremental structure-from-motion with hybrid features

    Shaohui Liu, Yidan Gao, Tianyi Zhang, Rémi Pautrat, Johannes L Schönberger, Viktor Larsson, and Marc Pollefeys. Robust incremental structure-from-motion with hybrid features. In European Conference on Computer Vision, pages 249–269. Springer, 2024

  13. [13]

    Structure-from-motion revisited

    Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4104–4113, 2016

  14. [14]

    Towards linear-time incremental structure from motion

    Changchang Wu. Towards linear-time incremental structure from motion. In 2013 International Conference on 3D Vision - 3DV 2013, pages 127–134. IEEE, 2013

  15. [15]

    Pixel-perfect structure-from-motion with featuremetric refinement

    Philipp Lindenberger, Paul-Edouard Sarlin, Viktor Larsson, and Marc Pollefeys. Pixel-perfect structure-from-motion with featuremetric refinement. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5987–5997, 2021

  16. [16]

    Mvsnet: Depth inference for unstructured multi-view stereo

    Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. Mvsnet: Depth inference for unstructured multi-view stereo. In Proceedings of the European conference on computer vision (ECCV), pages 767–783, 2018

  17. [17]

    Recurrent mvsnet for high-resolution multi-view stereo depth inference

    Yao Yao, Zixin Luo, Shiwei Li, Tianwei Shen, Tian Fang, and Long Quan. Recurrent mvsnet for high-resolution multi-view stereo depth inference. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5525–5534, 2019

  18. [18]

    Accurate, dense, and robust multiview stereopsis

    Yasutaka Furukawa and Jean Ponce. Accurate, dense, and robust multiview stereopsis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(8):1362–1376, 2009

  19. [19]

    Cascade cost volume for high-resolution multi-view stereo and stereo matching

    Xiaodong Gu, Zhiwen Fan, Siyu Zhu, Zuozhuo Dai, Feitong Tan, and Ping Tan. Cascade cost volume for high-resolution multi-view stereo and stereo matching. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2495–2504, 2020

  20. [20]

    Point-based multi-view stereo network

    Rui Chen, Songfang Han, Jing Xu, and Hao Su. Point-based multi-view stereo network. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1538–1547, 2019

  21. [21]

    A surface-growing approach to multi-view stereo reconstruction

    Martin Habbecke and Leif Kobbelt. A surface-growing approach to multi-view stereo reconstruction. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2007

  22. [22]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025

  23. [23]

    3d reconstruction with spatial memory

    Hengyi Wang and Lourdes Agapito. 3d reconstruction with spatial memory. In 2025 International Conference on 3D Vision (3DV), pages 78–89. IEEE, 2025

  24. [24]

    Point3r: Streaming 3d reconstruction with explicit spatial pointer memory

    Yuqi Wu, Wenzhao Zheng, Jie Zhou, and Jiwen Lu. Point3r: Streaming 3d reconstruction with explicit spatial pointer memory. arXiv preprint arXiv:2507.02863, 2025

  25. [25]

    Continuous 3d perception model with persistent state

    Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10510–10522, 2025

  26. [26]

    Ttt3r: 3d reconstruction as test-time training

    Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. Ttt3r: 3d reconstruction as test-time training. arXiv preprint arXiv:2509.26645, 2025

  27. [27]

    Streaming 4d visual geometry transformer

    Dong Zhuo, Wenzhao Zheng, Jiahe Guo, Yuqi Wu, Jie Zhou, and Jiwen Lu. Streaming 4d visual geometry transformer. arXiv preprint arXiv:2507.11539, 2025

  28. [28]

    InfiniteVGGT: Visual geometry grounded transformer for endless streams

    Shuai Yuan, Yantai Yang, Xiaotian Yang, Xupeng Zhang, Zhonghao Zhao, Lingming Zhang, and Zhipeng Zhang. Infinitevggt: Visual geometry grounded transformer for endless streams. arXiv preprint arXiv:2601.02281, 2026

  29. [29]

    D2-net: A trainable cnn for joint description and detection of local features

    Mihai Dusmanu, Ignacio Rocco, Tomas Pajdla, Marc Pollefeys, Josef Sivic, Akihiko Torii, and Torsten Sattler. D2-net: A trainable cnn for joint description and detection of local features. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8092–8101, 2019

  30. [30]

    Distinctive image features from scale-invariant keypoints

    David G Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004

  31. [31]

    Orb: An efficient alternative to sift or surf

    Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. Orb: An efficient alternative to sift or surf. In 2011 International Conference on Computer Vision, pages 2564–2571. IEEE, 2011

  32. [32]

    Learning to match features with seeded graph matching network

    Hongkai Chen, Zixin Luo, Jiahui Zhang, Lei Zhou, Xuyang Bai, Zeyu Hu, Chiew-Lan Tai, and Long Quan. Learning to match features with seeded graph matching network. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6301–6310, 2021

  33. [33]

    Lightglue: Local feature matching at light speed

    Philipp Lindenberger, Paul-Edouard Sarlin, and Marc Pollefeys. Lightglue: Local feature matching at light speed. In Proceedings of the IEEE/CVF international conference on computer vision, pages 17627–17638, 2023

  34. [34]

    Clustergnn: Cluster-based coarse-to-fine graph neural network for efficient feature matching

    Yan Shi, Jun-Xiong Cai, Yoli Shavit, Tai-Jiang Mu, Wensen Feng, and Kai Zhang. Clustergnn: Cluster-based coarse-to-fine graph neural network for efficient feature matching. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12517–12526, 2022

  35. [35]

    Bundle adjustment in the large

    Sameer Agarwal, Noah Snavely, Steven M Seitz, and Richard Szeliski. Bundle adjustment in the large. In European conference on computer vision, pages 29–42. Springer, 2010

  36. [36]

    Bundle adjustment—a modern synthesis

    Bill Triggs, Philip F McLauchlan, Richard I Hartley, and Andrew W Fitzgibbon. Bundle adjustment—a modern synthesis. In International workshop on vision algorithms, pages 298–372. Springer, 1999

  37. [37]

    Mip-nerf 360: Unbounded anti-aliased neural radiance fields

    Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5470–5479, 2022

  38. [38]

    Tensorf: Tensorial radiance fields

    Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su. Tensorf: Tensorial radiance fields. In European conference on computer vision, pages 333–350. Springer, 2022

  39. [39]

    Nerf: Representing scenes as neural radiance fields for view synthesis

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021

  40. [40]

    Nerf++: Analyzing and improving neural radiance fields

    Kai Zhang, Gernot Riegler, Noah Snavely, and Vladlen Koltun. Nerf++: Analyzing and improving neural radiance fields. arXiv preprint arXiv:2010.07492, 2020

  41. [41]

    Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction

    Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. arXiv preprint arXiv:2106.10689, 2021

  42. [42]

    3d gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, George Drettakis, et al. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1, 2023

  43. [43]

    pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction

    David Charatan, Sizhe Lester Li, Andrea Tagliasacchi, and Vincent Sitzmann. pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19457–19467, 2024

  44. [44]

    Gaussianroom: Improving 3d gaussian splatting with sdf guidance and monocular cues for indoor scene reconstruction

    Haodong Xiang, Xinghui Li, Kai Cheng, Xiansong Lai, Wanting Zhang, Zhichao Liao, Long Zeng, and Xueping Liu. Gaussianroom: Improving 3d gaussian splatting with sdf guidance and monocular cues for indoor scene reconstruction. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 2686–2693. IEEE, 2025

  45. [45]

    Robust and efficient 3d gaussian splatting for urban scene reconstruction

    Zhensheng Yuan, Haozhi Huang, Zhen Xiong, Di Wang, and Guanghua Yang. Robust and efficient 3d gaussian splatting for urban scene reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 26209–26219, 2025

  46. [46]

    Monoslam: Real-time single camera slam

    Andrew J Davison, Ian D Reid, Nicholas D Molton, and Olivier Stasse. Monoslam: Real-time single camera slam. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6):1052–1067, 2007

  47. [47]

    Lsd-slam: Large-scale direct monocular slam

    Jakob Engel, Thomas Schöps, and Daniel Cremers. Lsd-slam: Large-scale direct monocular slam. In European conference on computer vision, pages 834–849. Springer, 2014

  48. [48]

    Parallel tracking and mapping for small ar workspaces

    Georg Klein and David Murray. Parallel tracking and mapping for small ar workspaces. In 2007 6th IEEE and ACM International Symposium on Mixed and Augmented Reality, pages 225–234. IEEE, 2007

  49. [49]

    Dtam: Dense tracking and mapping in real-time

    Richard A Newcombe, Steven J Lovegrove, and Andrew J Davison. Dtam: Dense tracking and mapping in real-time. In 2011 International Conference on Computer Vision, pages 2320–2327. IEEE, 2011

  50. [50]

    Kinectfusion: Real-time dense surface mapping and tracking

    Richard A Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J Davison, Pushmeet Kohi, Jamie Shotton, Steve Hodges, and Andrew Fitzgibbon. Kinectfusion: Real-time dense surface mapping and tracking. In 2011 10th IEEE International Symposium on Mixed and Augmented Reality, pages 127–136. IEEE, 2011

  51. [51]

    Dust3r: Geometric 3d vision made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20697–20709, 2024

  52. [52]

    Grounding image matching in 3d with mast3r

    Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r. In European conference on computer vision, pages 71–91. Springer, 2024

  53. [53]

    Monst3r: A simple approach for estimating geometry in the presence of motion

    Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. Monst3r: A simple approach for estimating geometry in the presence of motion. arXiv preprint arXiv:2410.03825, 2024

  54. [54]

    Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass

    Jianing Yang, Alexander Sax, Kevin J Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21924–21935, 2025

  55. [55]

    Fastvggt: Training-free acceleration of visual geometry transformer

    You Shen, Zhipeng Zhang, Yansong Qu, Xiawu Zheng, Jiayi Ji, Shengchuan Zhang, and Liujuan Cao. Fastvggt: Training-free acceleration of visual geometry transformer. arXiv preprint arXiv:2509.02560, 2025

  56. [56]

    Vggt-long: Chunk it, loop it, align it–pushing vggt’s limits on kilometer-scale long rgb sequences

    Kai Deng, Zexin Ti, Jiawei Xu, Jian Yang, and Jin Xie. Vggt-long: Chunk it, loop it, align it–pushing vggt’s limits on kilometer-scale long rgb sequences. arXiv preprint arXiv:2507.16443, 2025

  57. [57]

    Ttsa3r: Training-free temporal-spatial adaptive persistent state for streaming 3d reconstruction

    Zhijie Zheng, Xinhao Xiang, and Jiawei Zhang. Ttsa3r: Training-free temporal-spatial adaptive persistent state for streaming 3d reconstruction. arXiv preprint arXiv:2601.22615, 2026

  58. [58]

    Mut3r: Motion-aware updating transformer for dynamic 3d reconstruction

    Guole Shen, Tianchen Deng, Xingrui Qin, Nailin Wang, Jianyu Wang, Yanbo Wang, Yongtao Chen, Hesheng Wang, and Jingchuan Wang. Mut3r: Motion-aware updating transformer for dynamic 3d reconstruction. arXiv preprint arXiv:2512.03939, 2025

  59. [59]

    Stream3r: Scalable sequential 3d reconstruction with causal transformer

    Yushi Lan, Yihang Luo, Fangzhou Hong, Shangchen Zhou, Honghua Chen, Zhaoyang Lyu, Shuai Yang, Bo Dai, Chen Change Loy, and Xingang Pan. Stream3r: Scalable sequential 3d reconstruction with causal transformer. arXiv preprint arXiv:2508.10893, 2025

  60. [60]

    Wint3r: Window-based streaming reconstruction with camera token pool

    Zizun Li, Jianjun Zhou, Yifan Wang, Haoyu Guo, Wenzheng Chang, Yang Zhou, Haoyi Zhu, Junyi Chen, Chunhua Shen, and Tong He. Wint3r: Window-based streaming reconstruction with camera token pool. arXiv preprint arXiv:2509.05296, 2025

  61. [61]

    Scene coordinate regression forests for camera relocalization in rgb-d images

    Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. Scene coordinate regression forests for camera relocalization in rgb-d images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2930–2937, 2013

  62. [62]

    Neural rgb-d surface reconstruction

    Dejan Azinović, Ricardo Martin-Brualla, Dan B Goldman, Matthias Nießner, and Justus Thies. Neural rgb-d surface reconstruction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6290–6301, 2022

  63. [63]

    Refusion: 3d reconstruction in dynamic environments for rgb-d cameras exploiting residuals

    Emanuele Palazzolo, Jens Behley, Philipp Lottes, Philippe Giguere, and Cyrill Stachniss. Refusion: 3d reconstruction in dynamic environments for rgb-d cameras exploiting residuals. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7855–7862. IEEE, 2019

  64. [64]

    A benchmark for the evaluation of rgb-d slam systems

    Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. A benchmark for the evaluation of rgb-d slam systems. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 573–580. IEEE, 2012

  65. [65]

    Quest: Query-aware sparsity for efficient long-context llm inference

    Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han. Quest: Query-aware sparsity for efficient long-context llm inference. arXiv preprint arXiv:2406.10774, 2024

  66. [66]

    H2o: Heavy-hitter oracle for efficient generative inference of large language models

    Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2o: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36:34661–34710, 2023

  67. [67]

    Snapkv: Llm knows what you are looking for before generation

    Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation. Advances in Neural Information Processing Systems, 37:22947–22970, 2024

  68. [68]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017

  69. [69]

    A naturalistic open source movie for optical flow evaluation

    Daniel J Butler, Jonas Wulff, Garrett B Stanley, and Michael J Black. A naturalistic open source movie for optical flow evaluation. In European conference on computer vision, pages 611–625. Springer, 2012

  70. [70]

    Vision meets robotics: The kitti dataset

    Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013

  71. [71]

    The standard scaling factor 1/√d_h [68] is frame-independent and preserves relative rankings, so scaled dot product yields identical selections and is excluded

    Penalizes both directional misalignment and magnitude discrepancy. The standard scaling factor 1/√d_h [68] is frame-independent and preserves relative rankings, so scaled dot product yields identical selections and is excluded. Analysis (Tab. 4). (1) Magnitude encodes geometric importance. VGGT’s key descriptors encode both viewing direction and geometric in...
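The excluded scaling factor cannot change which frames are retrieved, since dividing every score by the same positive constant preserves their ordering:

```latex
\[
\operatorname{top\text{-}K}_{j}\!\left(\frac{q^{\top} k_{j}}{\sqrt{d_h}}\right)
\;=\;
\operatorname{top\text{-}K}_{j}\!\left(q^{\top} k_{j}\right),
\qquad \sqrt{d_h} > 0 .
\]
```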