pith. machine review for the scientific record.

arxiv: 2605.09644 · v1 · submitted 2026-05-10 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

Attention Itself Could Retrieve. RetrieveVGGT: Training-Free Long Context Streaming 3D Reconstruction via Query-Key Similarity Retrieval

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:30 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D reconstruction · streaming · long context · attention retrieval · memory efficiency · training-free · query-key similarity · pose-aware memory

The pith

RetrieveVGGT uses query-key similarity to retrieve relevant frames and keep memory constant during long-context streaming 3D reconstruction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The quadratic cost of global attention in VGGT makes it hard to process long video sequences for 3D reconstruction. RetrieveVGGT addresses this by treating the selection of context frames as a retrieval task that always selects a bounded number of past frames. The method relies on the observation that query-key similarities already computed in the first attention layer reliably indicate which history frames are useful. Adding segment sampling for diversity and pose-aware organization further improves the retrieved set, resulting in bounded memory and improved reconstruction quality on extended sequences.

Core claim

By casting context construction as retrieval, RetrieveVGGT selects a fixed number of relevant history frames at each step using the similarity between the current frame's queries and the cached keys from VGGT's first global attention layer. This similarity alone serves as a strong relevance signal, so no separate learned retriever is needed. Segment Sampling spreads the selection across distinct temporal segments, while pose-aware spatial memory arranges stored frames by their estimated camera poses to support location-sensitive lookup. The result is a streaming system whose memory footprint stays near the model's original training length and whose output quality exceeds that of prior streaming approaches.
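To make the retrieval step concrete, the sketch below selects a fixed budget of history frames by query-key similarity. The tensor layout, the cosine normalization, and the per-frame mean pooling are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def retrieve_frames(current_queries: torch.Tensor,
                    cached_keys: list[torch.Tensor],
                    budget_k: int) -> list[int]:
    """Pick up to budget_k history frames whose cached keys best match
    the current frame's queries (first-global-attention-layer features).

    current_queries: (Nq, d) queries of the current frame.
    cached_keys: one (Nk, d) key tensor per history frame.
    """
    if len(cached_keys) <= budget_k:
        return list(range(len(cached_keys)))

    q = F.normalize(current_queries, dim=-1)
    scores = []
    for keys in cached_keys:
        k = F.normalize(keys, dim=-1)
        # One scalar relevance score per frame: mean pairwise cosine similarity.
        scores.append((q @ k.T).mean())
    scores = torch.stack(scores)
    return torch.topk(scores, budget_k).indices.tolist()
```

Because only budget_k frames ever enter the context, per-step attention cost and memory stay bounded no matter how long the stream runs.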

What carries the argument

Query-key similarity retrieval at the first global attention layer, which identifies relevant history frames for inclusion in the current context window.

Load-bearing premise

The query-key similarities from the first global attention layer provide a sufficient signal for choosing which past frames contribute most to accurate current-frame reconstruction.

What would settle it

A direct comparison on long video sequences showing that full-history attention or random frame selection yields higher accuracy than the similarity-based retrieval would disprove the central premise.

read the original abstract

Visual Geometry Grounded Transformer (VGGT) advances 3D reconstruction via scalable Transformer architecture, but the quadratic complexity of global attention prevents long context application. StreamVGGT enables streaming with causal attention, yet its KV cache grows linearly with frames, causing memory overflow and quality degradation. We present RetrieveVGGT, a training-free framework, which formulates context construction for VGGT as a retrieval problem. By retrieving a fixed number of relevant frames at each step, VGGT maintains a controllable memory budget, which is close to its training context length. Interestingly, we find that the similarity between current frame queries and cached history frame keys at the first global attention layer of VGGT is already a strong indicator of relevance, eliminating the need for additional learned scoring. To enhance information diversity similar to a recommender system, we propose Segment Sampling so that the retrieval spans distinct relevant segments rather than a single high-similarity region. We design a pose-aware spatial memory mechanism that organizes history frames according to their already estimated camera poses, enabling location-aware retrieval. Extensive experiments demonstrate that RetrieveVGGT achieves state-of-the-art performance, outperforming StreamVGGT, TTT3R, and InfiniteVGGT while maintaining constant memory usage regardless of sequence length. Code is available at https://github.com/zzctmd/RetrieveVGGT.
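To illustrate the abstract's two refinements, the sketch below spreads the retrieval budget across temporal segments and re-ranks candidates by camera-pose distance. The contiguous segmentation, the 4x4 pose matrices, and the camera-center distance are simplifying assumptions rather than the authors' exact design.

```python
import numpy as np

def segment_sample(frame_scores: np.ndarray, budget_k: int, num_segments: int) -> list[int]:
    """Split the history into contiguous temporal segments and keep the
    top-scoring frames within each, so retrieval is not dominated by a
    single high-similarity region."""
    per_segment = max(1, budget_k // num_segments)
    picked = []
    for seg in np.array_split(np.arange(len(frame_scores)), num_segments):
        best_first = seg[np.argsort(frame_scores[seg])[::-1]]
        picked.extend(best_first[:per_segment].tolist())
    # If the per-segment quota overshoots the budget, keep the strongest picks.
    picked.sort(key=lambda i: frame_scores[i], reverse=True)
    return sorted(picked[:budget_k])

def pose_rerank(candidates: list[int], poses: list[np.ndarray], current_pose: np.ndarray) -> list[int]:
    """Order candidates by distance between estimated camera centers,
    one simple form of location-aware lookup over the spatial memory."""
    def dist(i: int) -> float:
        return float(np.linalg.norm(poses[i][:3, 3] - current_pose[:3, 3]))
    return sorted(candidates, key=dist)
```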

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces RetrieveVGGT, a training-free method for long-context streaming 3D reconstruction with VGGT. It treats context construction as a retrieval problem, selecting a fixed budget of relevant frames via cosine similarity between the current frame's queries and cached history keys at the first global attention layer, augmented by Segment Sampling for diversity across segments and a pose-aware spatial memory that organizes frames by estimated camera poses. The central claim is that this maintains constant memory usage close to the model's training context length while achieving state-of-the-art performance, outperforming StreamVGGT, TTT3R, and InfiniteVGGT on long sequences.

Significance. If the empirical claims hold under rigorous validation, RetrieveVGGT would offer a practical advance for scalable video-based 3D reconstruction by eliminating linear memory growth without retraining or architectural changes. The training-free reuse of existing first-layer attention scores for retrieval is a notable strength, as is the explicit handling of redundancy via segment sampling and pose organization; these could generalize to other long-context vision transformers.

major comments (2)
  1. [Abstract] The claim of state-of-the-art performance with constant memory 'regardless of sequence length' is load-bearing for the contribution but unsupported by any reported datasets, metrics (e.g., reconstruction accuracy or completeness), sequence lengths, or quantitative deltas versus the listed baselines. Without these, the SOTA assertion cannot be assessed and risks selection bias in the retrieval process.
  2. [Method description (retrieval mechanism)] The assumption that first global attention layer Q-K similarity is already a strong indicator of geometric relevance is central yet unvalidated. No ablation compares it to last-layer similarity, explicit pose-distance retrieval, or random selection at the same budget K; first-layer embeddings primarily encode local appearance, which may retrieve texture matches rather than multi-view geometric overlap, silently degrading reconstruction quality in the constant-memory regime.
minor comments (2)
  1. [Title] The title contains the apparent typographical artifact 'Retrieve.RetrieveVGGT'.
  2. [Method] The free parameters (retrieval budget K and segment sampling parameters) are mentioned but lack explicit notation or sensitivity analysis in the provided description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] The claim of state-of-the-art performance with constant memory 'regardless of sequence length' is load-bearing for the contribution but unsupported by any reported datasets, metrics (e.g., reconstruction accuracy or completeness), sequence lengths, or quantitative deltas versus the listed baselines. Without these, the SOTA assertion cannot be assessed and risks selection bias in the retrieval process.

    Authors: We agree that the abstract would be strengthened by including concrete supporting details. The full manuscript reports experiments on long video sequences from standard 3D reconstruction benchmarks, using metrics such as reconstruction accuracy and completeness. These evaluations cover sequence lengths substantially exceeding the model's training context while maintaining constant memory, with quantitative comparisons showing outperformance over StreamVGGT, TTT3R, and InfiniteVGGT. The Segment Sampling and pose-aware spatial memory mechanisms are explicitly introduced to mitigate redundancy and selection bias. We will revise the abstract to briefly report the datasets, example sequence lengths, key metrics, and performance deltas. revision: yes

  2. Referee: [Method description (retrieval mechanism)] The assumption that first global attention layer Q-K similarity is already a strong indicator of geometric relevance is central yet unvalidated. No ablation compares it to last-layer similarity, explicit pose-distance retrieval, or random selection at the same budget K; first-layer embeddings primarily encode local appearance, which may retrieve texture matches rather than multi-view geometric overlap, silently degrading reconstruction quality in the constant-memory regime.

    Authors: The manuscript states that we empirically observed first-layer Q-K similarity to be effective, but we acknowledge the absence of systematic ablations against alternatives. We will add a dedicated ablation study in the revised manuscript comparing first-layer similarity to last-layer similarity, explicit pose-distance retrieval, and random selection, all under the same fixed budget K. These results will quantify reconstruction quality (accuracy and completeness) to demonstrate that first-layer attention provides superior geometric relevance. While first-layer features do encode appearance, the downstream 3D reconstruction metrics in our experiments indicate that the selected frames support multi-view geometric consistency rather than mere texture matching; the new ablations will make this distinction explicit. revision: yes
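A small harness along these lines would make the promised ablation concrete; the variant names, the reconstruct and evaluate hooks, and the loop over a held-out sequence are hypothetical placeholders, not code from the manuscript.

```python
import random

def ablate_retrieval(frames, budget_k, variants, reconstruct, evaluate):
    """Score several retrieval strategies under an identical frame budget.

    variants: dict mapping a name to a function
        (history_frames, current_frame, budget_k) -> list of indices.
    reconstruct / evaluate: hooks that run the model on the retrieved
        context and return a scalar quality score (e.g., accuracy).
    """
    results = {}
    for name, select in variants.items():
        scores = []
        for t in range(1, len(frames)):
            context = select(frames[:t], frames[t], budget_k)
            scores.append(evaluate(reconstruct(frames[t], context)))
        results[name] = sum(scores) / len(scores)
    return results

def random_select(history, current, budget_k):
    """Random baseline at the same budget, for comparison."""
    return random.sample(range(len(history)), min(budget_k, len(history)))
```

Running this with first-layer similarity, last-layer similarity, pose-distance retrieval, and random_select at the same budget_k would directly address the referee's concern.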

Circularity Check

0 steps flagged

No circularity: retrieval directly reuses unmodified first-layer Q-K similarity from the base VGGT model without fitted parameters, self-referential equations, or load-bearing self-citations.

full rationale

The paper's central mechanism is explicitly training-free and formulates context selection as retrieval using cosine similarity on existing attention computations at the first global attention layer of VGGT. This is presented as an empirical observation ('we find that the similarity... is already a strong indicator') rather than a derived quantity. No equations, ansatzes, or uniqueness theorems are introduced that reduce the claimed relevance or SOTA performance back to fitted inputs or prior self-work by construction. Segment Sampling and pose-aware memory are additional heuristics built on top of the same unmodified signals. The derivation chain remains independent of the target result and does not match any enumerated circularity pattern.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The method depends on a small set of hyperparameters for retrieval count and sampling plus one key domain assumption about attention similarity; no new entities are postulated.

free parameters (2)
  • retrieval budget K
    Fixed number of frames retrieved per step to enforce constant memory budget.
  • segment sampling parameters
    Controls for selecting across distinct segments to promote information diversity.
axioms (1)
  • domain assumption: Query-key similarity at the first global attention layer indicates relevance for 3D reconstruction context selection.
    Directly invoked to justify training-free retrieval without learned scorers.
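For orientation, the ledger's free parameters could be gathered into one configuration object; the names and default values below are hypothetical, chosen only to show where each knob lives.

```python
from dataclasses import dataclass

@dataclass
class RetrievalConfig:
    budget_k: int = 16        # fixed number of history frames retrieved per step
    num_segments: int = 4     # temporal segments Segment Sampling spreads picks over
    pose_aware: bool = True   # organize / re-rank memory by estimated camera poses
```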

pith-pipeline@v0.9.0 · 5554 in / 1387 out tokens · 70409 ms · 2026-05-12T04:30:24.937304+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 2 internal anchors

  1. [1]

    Tri-perspective view for vision-based 3d semantic occupancy prediction

    Yuanhui Huang, Wenzhao Zheng, Yunpeng Zhang, Jie Zhou, and Jiwen Lu. Tri-perspective view for vision-based 3d semantic occupancy prediction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9223–9232, 2023

  2. [2]

    Occworld: Learning a 3d occupancy world model for autonomous driving

    Wenzhao Zheng, Weiliang Chen, Yuanhui Huang, Borui Zhang, Yueqi Duan, and Jiwen Lu. Occworld: Learning a 3d occupancy world model for autonomous driving. In European conference on computer vision, pages 55–72. Springer, 2024

  3. [3]

    3d clothed human reconstruction from sparse multi-view images

    Jin Gyu Hong, Seung Young Noh, Hee Kyung Lee, Won Sik Cheong, and Ju Yong Chang. 3d clothed human reconstruction from sparse multi-view images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 677–687, 2024

  4. [4]

    Mosca: Dynamic gaussian fusion from casual videos via 4d motion scaffolds

    Jiahui Lei, Yijia Weng, Adam W Harley, Leonidas Guibas, and Kostas Daniilidis. Mosca: Dynamic gaussian fusion from casual videos via 4d motion scaffolds. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 6165–6177, 2025

  5. [5]

    Gps-gaussian: Generalizable pixel-wise 3d gaussian splatting for real-time human novel view synthesis

    Shunyuan Zheng, Boyao Zhou, Ruizhi Shao, Boning Liu, Shengping Zhang, Liqiang Nie, and Yebin Liu. Gps-gaussian: Generalizable pixel-wise 3d gaussian splatting for real-time human novel view synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19680–19690, 2024

  6. [6]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

  7. [7]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024

  8. [8]

    CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

    Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, et al. Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation. arXiv preprint arXiv:2411.19650, 2024

  9. [9]

    4dtam: Non-rigid tracking and mapping via dynamic surface gaussians

    Hidenobu Matsuki, Gwangbin Bae, and Andrew J Davison. 4dtam: Non-rigid tracking and mapping via dynamic surface gaussians. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26921–26932, 2025

  10. [10]

    Building rome in a day

    Sameer Agarwal, Yasutaka Furukawa, Noah Snavely, Ian Simon, Brian Curless, Steven M Seitz, and Richard Szeliski. Building rome in a day. Communications of the ACM, 54(10):105–112, 2011

  11. [11]

    Building rome on a cloudless day

    Jan-Michael Frahm, Pierre Fite-Georgel, David Gallup, Tim Johnson, Rahul Raguram, Changchang Wu, Yi-Hung Jen, Enrique Dunn, Brian Clipp, Svetlana Lazebnik, et al. Building rome on a cloudless day. In European conference on computer vision, pages 368–381. Springer, 2010

  12. [12]

    Robust incremental structure-from-motion with hybrid features

    Shaohui Liu, Yidan Gao, Tianyi Zhang, Rémi Pautrat, Johannes L Schönberger, Viktor Larsson, and Marc Pollefeys. Robust incremental structure-from-motion with hybrid features. In European Conference on Computer Vision, pages 249–269. Springer, 2024

  13. [13]

    Structure-from-motion revisited

    Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4104–4113, 2016

  14. [14]

    Towards linear-time incremental structure from motion

    Changchang Wu. Towards linear-time incremental structure from motion. In 2013 International Conference on 3D Vision - 3DV 2013, pages 127–134. IEEE, 2013

  15. [15]

    Pixel-perfect structure-from-motion with featuremetric refinement

    Philipp Lindenberger, Paul-Edouard Sarlin, Viktor Larsson, and Marc Pollefeys. Pixel-perfect structure-from-motion with featuremetric refinement. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5987–5997, 2021

  16. [16]

    Mvsnet: Depth inference for unstructured multi-view stereo

    Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. Mvsnet: Depth inference for unstructured multi-view stereo. In Proceedings of the European conference on computer vision (ECCV), pages 767–783, 2018

  17. [17]

    Recurrent mvsnet for high-resolution multi-view stereo depth inference

    Yao Yao, Zixin Luo, Shiwei Li, Tianwei Shen, Tian Fang, and Long Quan. Recurrent mvsnet for high-resolution multi-view stereo depth inference. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5525–5534, 2019

  18. [18]

    Accurate, dense, and robust multiview stereopsis

    Yasutaka Furukawa and Jean Ponce. Accurate, dense, and robust multiview stereopsis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(8):1362–1376, 2009

  19. [19]

    Cascade cost volume for high-resolution multi-view stereo and stereo matching

    Xiaodong Gu, Zhiwen Fan, Siyu Zhu, Zuozhuo Dai, Feitong Tan, and Ping Tan. Cascade cost volume for high-resolution multi-view stereo and stereo matching. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2495–2504, 2020

  20. [20]

    Point-based multi-view stereo network

    Rui Chen, Songfang Han, Jing Xu, and Hao Su. Point-based multi-view stereo network. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1538–1547, 2019

  21. [21]

    A surface-growing approach to multi-view stereo reconstruction

    Martin Habbecke and Leif Kobbelt. A surface-growing approach to multi-view stereo reconstruction. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2007

  22. [22]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025

  23. [23]

    3d reconstruction with spatial memory

    Hengyi Wang and Lourdes Agapito. 3d reconstruction with spatial memory. In 2025 International Conference on 3D Vision (3DV), pages 78–89. IEEE, 2025

  24. [24]

    Point3r: Streaming 3d reconstruction with explicit spatial pointer memory

    Yuqi Wu, Wenzhao Zheng, Jie Zhou, and Jiwen Lu. Point3r: Streaming 3d reconstruction with explicit spatial pointer memory. arXiv preprint arXiv:2507.02863, 2025

  25. [25]

    Continuous 3d perception model with persistent state

    Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10510–10522, 2025

  26. [26]

    Ttt3r: 3d reconstruction as test-time training

    Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. Ttt3r: 3d reconstruction as test-time training. arXiv preprint arXiv:2509.26645, 2025

  27. [27]

    Streaming 4d visual geometry transformer

    Dong Zhuo, Wenzhao Zheng, Jiahe Guo, Yuqi Wu, Jie Zhou, and Jiwen Lu. Streaming 4d visual geometry transformer. arXiv preprint arXiv:2507.11539, 2025

  28. [28]

    InfiniteVGGT: Visual geometry grounded transformer for endless streams

    Shuai Yuan, Yantai Yang, Xiaotian Yang, Xupeng Zhang, Zhonghao Zhao, Lingming Zhang, and Zhipeng Zhang. Infinitevggt: Visual geometry grounded transformer for endless streams. arXiv preprint arXiv:2601.02281, 2026

  29. [29]

    D2-net: A trainable cnn for joint description and detection of local features

    Mihai Dusmanu, Ignacio Rocco, Tomas Pajdla, Marc Pollefeys, Josef Sivic, Akihiko Torii, and Torsten Sattler. D2-net: A trainable cnn for joint description and detection of local features. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8092–8101, 2019

  30. [30]

    Distinctive image features from scale-invariant keypoints

    David G Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004

  31. [31]

    Orb: An efficient alternative to sift or surf

    Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. Orb: An efficient alternative to sift or surf. In 2011 International Conference on Computer Vision, pages 2564–2571. IEEE, 2011

  32. [32]

    Learning to match features with seeded graph matching network

    Hongkai Chen, Zixin Luo, Jiahui Zhang, Lei Zhou, Xuyang Bai, Zeyu Hu, Chiew-Lan Tai, and Long Quan. Learning to match features with seeded graph matching network. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6301–6310, 2021

  33. [33]

    Lightglue: Local feature matching at light speed

    Philipp Lindenberger, Paul-Edouard Sarlin, and Marc Pollefeys. Lightglue: Local feature matching at light speed. In Proceedings of the IEEE/CVF international conference on computer vision, pages 17627–17638, 2023

  34. [34]

    Clustergnn: Cluster-based coarse-to-fine graph neural network for efficient feature matching

    Yan Shi, Jun-Xiong Cai, Yoli Shavit, Tai-Jiang Mu, Wensen Feng, and Kai Zhang. Clustergnn: Cluster-based coarse-to-fine graph neural network for efficient feature matching. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12517–12526, 2022

  35. [35]

    Bundle adjustment in the large

    Sameer Agarwal, Noah Snavely, Steven M Seitz, and Richard Szeliski. Bundle adjustment in the large. In European conference on computer vision, pages 29–42. Springer, 2010

  36. [36]

    Bundle adjustment—a modern synthesis

    Bill Triggs, Philip F McLauchlan, Richard I Hartley, and Andrew W Fitzgibbon. Bundle adjustment—a modern synthesis. In International workshop on vision algorithms, pages 298–372. Springer, 1999

  37. [37]

    Mip-nerf 360: Unbounded anti-aliased neural radiance fields

    Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5470–5479, 2022

  38. [38]

    Tensorf: Tensorial radiance fields

    Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su. Tensorf: Tensorial radiance fields. In European conference on computer vision, pages 333–350. Springer, 2022

  39. [39]

    Nerf: Representing scenes as neural radiance fields for view synthesis

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021

  40. [40]

    Nerf++: Analyzing and improving neural radiance fields

    Kai Zhang, Gernot Riegler, Noah Snavely, and Vladlen Koltun. Nerf++: Analyzing and improving neural radiance fields. arXiv preprint arXiv:2010.07492, 2020

  41. [41]

    Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction

    Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. arXiv preprint arXiv:2106.10689, 2021

  42. [42]

    3d gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, George Drettakis, et al. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1, 2023

  43. [43]

    pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction

    David Charatan, Sizhe Lester Li, Andrea Tagliasacchi, and Vincent Sitzmann. pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19457–19467, 2024

  44. [44]

    Gaussianroom: Improving 3d gaussian splatting with sdf guidance and monocular cues for indoor scene reconstruction

    Haodong Xiang, Xinghui Li, Kai Cheng, Xiansong Lai, Wanting Zhang, Zhichao Liao, Long Zeng, and Xueping Liu. Gaussianroom: Improving 3d gaussian splatting with sdf guidance and monocular cues for indoor scene reconstruction. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 2686–2693. IEEE, 2025

  45. [45]

    Robust and efficient 3d gaussian splatting for urban scene reconstruction

    Zhensheng Yuan, Haozhi Huang, Zhen Xiong, Di Wang, and Guanghua Yang. Robust and efficient 3d gaussian splatting for urban scene reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 26209–26219, 2025

  46. [46]

    Monoslam: Real-time single camera slam

    Andrew J Davison, Ian D Reid, Nicholas D Molton, and Olivier Stasse. Monoslam: Real-time single camera slam. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6):1052–1067, 2007

  47. [47]

    Lsd-slam: Large-scale direct monocular slam

    Jakob Engel, Thomas Schöps, and Daniel Cremers. Lsd-slam: Large-scale direct monocular slam. In European conference on computer vision, pages 834–849. Springer, 2014

  48. [48]

    Parallel tracking and mapping for small ar workspaces

    Georg Klein and David Murray. Parallel tracking and mapping for small ar workspaces. In 2007 6th IEEE and ACM International Symposium on Mixed and Augmented Reality, pages 225–234. IEEE, 2007

  49. [49]

    Dtam: Dense tracking and mapping in real-time

    Richard A Newcombe, Steven J Lovegrove, and Andrew J Davison. Dtam: Dense tracking and mapping in real-time. In 2011 International Conference on Computer Vision, pages 2320–2327. IEEE, 2011

  50. [50]

    Kinectfusion: Real-time dense surface mapping and tracking

    Richard A Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J Davison, Pushmeet Kohi, Jamie Shotton, Steve Hodges, and Andrew Fitzgibbon. Kinectfusion: Real-time dense surface mapping and tracking. In 2011 10th IEEE International Symposium on Mixed and Augmented Reality, pages 127–136. IEEE, 2011

  51. [51]

    Dust3r: Geometric 3d vision made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20697–20709, 2024

  52. [52]

    Grounding image matching in 3d with mast3r

    Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r. In European conference on computer vision, pages 71–91. Springer, 2024

  53. [53]

    Monst3r: A simple approach for estimating geometry in the presence of motion

    Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. Monst3r: A simple approach for estimating geometry in the presence of motion. arXiv preprint arXiv:2410.03825, 2024

  54. [54]

    Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass

    Jianing Yang, Alexander Sax, Kevin J Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21924–21935, 2025

  55. [55]

    Fastvggt: Training-free acceleration of visual geometry transformer

    You Shen, Zhipeng Zhang, Yansong Qu, Xiawu Zheng, Jiayi Ji, Shengchuan Zhang, and Liujuan Cao. Fastvggt: Training-free acceleration of visual geometry transformer. arXiv preprint arXiv:2509.02560, 2025

  56. [56]

    Vggt-long: Chunk it, loop it, align it–pushing vggt’s limits on kilometer-scale long rgb sequences

    Kai Deng, Zexin Ti, Jiawei Xu, Jian Yang, and Jin Xie. Vggt-long: Chunk it, loop it, align it–pushing vggt’s limits on kilometer-scale long rgb sequences. arXiv preprint arXiv:2507.16443, 2025

  57. [57]

    Ttsa3r: Training-free temporal-spatial adaptive persistent state for streaming 3d reconstruction

    Zhijie Zheng, Xinhao Xiang, and Jiawei Zhang. Ttsa3r: Training-free temporal-spatial adaptive persistent state for streaming 3d reconstruction. arXiv preprint arXiv:2601.22615, 2026

  58. [58]

    Mut3r: Motion-aware updating transformer for dynamic 3d reconstruction

    Guole Shen, Tianchen Deng, Xingrui Qin, Nailin Wang, Jianyu Wang, Yanbo Wang, Yongtao Chen, Hesheng Wang, and Jingchuan Wang. Mut3r: Motion-aware updating transformer for dynamic 3d reconstruction. arXiv preprint arXiv:2512.03939, 2025

  59. [59]

    Stream3r: Scalable sequential 3d reconstruction with causal transformer

    Yushi Lan, Yihang Luo, Fangzhou Hong, Shangchen Zhou, Honghua Chen, Zhaoyang Lyu, Shuai Yang, Bo Dai, Chen Change Loy, and Xingang Pan. Stream3r: Scalable sequential 3d reconstruction with causal transformer. arXiv preprint arXiv:2508.10893, 2025

  60. [60]

    Wint3r: Window-based streaming reconstruction with camera token pool

    Zizun Li, Jianjun Zhou, Yifan Wang, Haoyu Guo, Wenzheng Chang, Yang Zhou, Haoyi Zhu, Junyi Chen, Chunhua Shen, and Tong He. Wint3r: Window-based streaming reconstruction with camera token pool. arXiv preprint arXiv:2509.05296, 2025

  61. [61]

    Scene coordinate regression forests for camera relocalization in rgb-d images

    Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. Scene coordinate regression forests for camera relocalization in rgb-d images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2930–2937, 2013

  62. [62]

    Neural rgb-d surface reconstruction

    Dejan Azinović, Ricardo Martin-Brualla, Dan B Goldman, Matthias Nießner, and Justus Thies. Neural rgb-d surface reconstruction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6290–6301, 2022

  63. [63]

    Refusion: 3d reconstruction in dynamic environments for rgb-d cameras exploiting residuals

    Emanuele Palazzolo, Jens Behley, Philipp Lottes, Philippe Giguere, and Cyrill Stachniss. Refusion: 3d reconstruction in dynamic environments for rgb-d cameras exploiting residuals. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7855–7862. IEEE, 2019

  64. [64]

    A benchmark for the evaluation of rgb-d slam systems

    Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. A benchmark for the evaluation of rgb-d slam systems. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 573–580. IEEE, 2012

  65. [65]

    Quest: Query-aware sparsity for efficient long-context llm inference

    Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han. Quest: Query-aware sparsity for efficient long-context llm inference. arXiv preprint arXiv:2406.10774, 2024

  66. [66]

    H2o: Heavy-hitter oracle for efficient generative inference of large language models

    Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2o: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36:34661–34710, 2023

  67. [67]

    Snapkv: Llm knows what you are looking for before generation

    Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation. Advances in Neural Information Processing Systems, 37:22947–22970, 2024

  68. [68]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017

  69. [69]

    A naturalistic open source movie for optical flow evaluation

    Daniel J Butler, Jonas Wulff, Garrett B Stanley, and Michael J Black. A naturalistic open source movie for optical flow evaluation. In European conference on computer vision, pages 611–625. Springer, 2012

  70. [70]

    Vision meets robotics: The kitti dataset

    Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013

  71. [71]

    The standard scaling factor 1/√d_h [68] is frame-independent and preserves relative rankings, so scaled dot product yields identical selections and is excluded

    Penalizes both directional misalignment and magnitude discrepancy. The standard scaling factor 1/√d_h [68] is frame-independent and preserves relative rankings, so scaled dot product yields identical selections and is excluded. Analysis (Tab. 4). (1) Magnitude encodes geometric importance. VGGT’s key descriptors encode both viewing direction and geometric in...
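The excluded scaling factor cannot change which frames are retrieved, since dividing every score by the same positive constant preserves their ordering:

```latex
\[
\operatorname{top\text{-}K}_{j}\!\left(\frac{q^{\top} k_{j}}{\sqrt{d_h}}\right)
\;=\;
\operatorname{top\text{-}K}_{j}\!\left(q^{\top} k_{j}\right),
\qquad \sqrt{d_h} > 0 .
\]
```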