pith. machine review for the scientific record.

arxiv: 2603.20284 · v2 · submitted 2026-03-18 · 💻 cs.CV · cs.GR · eess.IV

Recognition: 1 theorem link · Lean Theorem

STAC: Plug-and-Play Spatio-Temporal Aware Cache Compression for Streaming 3D Reconstruction

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 10:01 UTC · model grok-4.3

classification 💻 cs.CV · cs.GR · eess.IV
keywords streaming 3D reconstruction · KV cache compression · causal transformers · spatio-temporal sparsity · memory efficiency · real-time inference · temporal consistency · voxel caching

The pith

STAC reduces memory use nearly 10x and speeds inference 4x in streaming 3D reconstruction by compressing the growing KV cache while preserving quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to remove the linear memory growth that limits long streaming 3D reconstruction with causal transformers. It does so by showing that attention inside these models already carries exploitable spatio-temporal sparsity, then building three compression components on top of that sparsity: decayed-score temporal token retention, voxel-based spatial compression, and chunk-wise multi-frame processing. A reader would care because the result turns an otherwise unscalable memory wall into a practical system that still delivers state-of-the-art geometry and temporal coherence. The method is presented as plug-and-play, requiring no retraining of the underlying transformer.

Core claim

STAC is a plug-and-play Spatio-Temporally Aware Cache Compression framework that keeps long-term informative tokens via decayed cumulative attention scores, folds spatially redundant tokens into voxel-aligned representations, and jointly optimizes consecutive frames in chunks; together these steps cut memory consumption by nearly 10x and raise inference speed by 4x without degrading reconstruction quality or temporal consistency.

What carries the argument

The central mechanism is the three-part STAC cache compressor: Working Temporal Token Caching that ranks tokens by decayed cumulative attention, Long-term Spatial Token Caching that collapses redundant tokens into voxel grids, and Chunk-based Multi-frame Optimization that processes frame groups together.
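To make the mechanism concrete, here is a minimal sketch of how the three components could compose. It is an assumed formulation, not the authors' code: the class name, the exponential decay constant, the token budget, and the mean-pooling voxel fold are all illustrative choices.

```python
import numpy as np
from collections import defaultdict

class STACCacheSketch:
    """Illustrative STAC-style cache compression (assumed formulation).

    Working temporal cache: tokens ranked by decayed cumulative attention.
    Long-term spatial cache: evicted tokens folded into voxel-aligned averages.
    """

    def __init__(self, budget=4096, decay=0.9, voxel_size=0.1):
        self.budget = budget          # max tokens in the working temporal cache
        self.decay = decay            # exponential decay on accumulated attention
        self.voxel_size = voxel_size  # edge length of the spatial voxel grid
        self.keys, self.values, self.pos = [], [], []  # working temporal cache
        self.scores = []                               # decayed cumulative scores
        self.spatial = defaultdict(list)               # voxel index -> token list

    def append(self, k, v, xyz):
        """Add one new token (key, value, 3D position) to the temporal cache."""
        self.keys.append(k); self.values.append(v); self.pos.append(xyz)
        self.scores.append(0.0)

    def update_scores(self, attn):
        """attn[i]: attention mass the current chunk's queries put on token i."""
        for i in range(len(self.scores)):
            self.scores[i] = self.decay * self.scores[i] + attn[i]

    def compress(self):
        """Evict low-score tokens into the voxel grid, then fold each voxel."""
        n_evict = len(self.keys) - self.budget
        if n_evict <= 0:
            return
        order = np.argsort(self.scores)  # ascending: least informative first
        for i in sorted(order[:n_evict].tolist(), reverse=True):
            vox = tuple((np.asarray(self.pos[i]) // self.voxel_size).astype(int))
            self.spatial[vox].append((self.keys[i], self.values[i]))
            for buf in (self.keys, self.values, self.pos, self.scores):
                del buf[i]
        for vox, toks in self.spatial.items():  # one representative per voxel
            if len(toks) > 1:
                k = np.mean([t[0] for t in toks], axis=0)
                v = np.mean([t[1] for t in toks], axis=0)
                self.spatial[vox] = [(k, v)]
```

A chunk-based driver would then append the tokens of several consecutive frames at once, run causal attention against both caches, feed the resulting attention mass back through `update_scores`, and call `compress` once per chunk rather than per frame.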

If this is right

  • Long input streams can be processed under fixed memory budgets without early eviction.
  • Temporal coherence across frames remains high because informative tokens are retained by attention history.
  • GPU throughput rises because both cache size and attention computation shrink.
  • The same compression layers can be inserted into any existing causal 3D transformer without retraining.
  • Real-time deployment on resource-limited devices becomes feasible for extended sessions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The sparsity observation could be tested in other causal transformer pipelines such as video generation or long-horizon robotics perception.
  • Adaptive compression thresholds might further reduce memory when scene complexity is low.
  • The voxel-aligned storage format may lend itself to direct integration with downstream meshing or rendering pipelines.
  • Similar decay-and-compress logic could be applied to non-3D streaming tasks that already use KV caches.
  • The chunk optimization strategy might generalize to any multi-frame causal model to improve both speed and coherence.

Load-bearing premise

The assumption that attention patterns inside causal transformers for 3D reconstruction contain intrinsic spatio-temporal sparsity that can be removed without harming geometry accuracy or frame-to-frame consistency.

What would settle it

Running STAC on a long video stream and measuring a clear rise in reconstruction error or loss of temporal coherence relative to the uncompressed KV cache would show the sparsity claim does not hold.
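That test is straightforward to phrase as code. Below is a hedged harness sketch: `reconstruct`, `chamfer`, and `temporal_coherence` are caller-supplied, hypothetical stand-ins for a real evaluation pipeline, and the tolerances are placeholders rather than thresholds from the paper.

```python
def sparsity_claim_holds(reconstruct, chamfer, temporal_coherence,
                         model, stream, stac_cache, full_cache,
                         err_tol=0.05, coh_tol=0.05):
    """Compare a STAC-compressed run against the uncompressed KV cache.

    reconstruct, chamfer, and temporal_coherence are hypothetical callables
    standing in for a real pipeline; err_tol and coh_tol are placeholders.
    """
    ref = reconstruct(model, stream, full_cache)   # uncompressed baseline
    out = reconstruct(model, stream, stac_cache)   # STAC-compressed run
    err_rise = chamfer(out, ref)                   # geometry degradation vs. baseline
    coh_drop = temporal_coherence(ref) - temporal_coherence(out)
    # A clear rise in error or drop in coherence would refute the premise.
    return err_rise <= err_tol and coh_drop <= coh_tol
```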

Figures

Figures reproduced from arXiv: 2603.20284 by Ligang Liu, Runze Wang, Youcheng Cai, Yuxuan Song.

Figure 1: Runtime–memory scaling in streaming 3D reconstruction. Bars show per-frame runtime (ms) and lines show KV cache memory (GB, log scale) as the stream grows. Compared with Causal-VGGT, STAC reduces KV cache growth and stabilizes per-frame runtime as the stream length increases. view at source ↗
Figure 2: Spatio-temporal attention sparsity. Representative attention patterns in the global causal attention map of Causal-VGGT, retaining the top 1024 keys per query for visualization. (a) Spatial attention aligned with camera motion; (b) persistent focus on first-frame tokens as global references; (c) temporal anchoring via tokens from semantically stable landmark frames; (d) long-range attention to camera token… view at source ↗
Figure 3: Overview of STAC. The framework reconstructs 3D scenes online using spatio-temporal token caching and chunk-based causal inference. (a) The Causal-VGGT module processes ViT-tokenized frames in each chunk using causal attention over the working temporal cache M_temp and spatial cache M_spat retrieved from a 3D voxel grid. (b) During inference, Working Temporal Token Caching updates token scores after each KV… view at source ↗
Figure 4: Qualitative results on streaming inputs from the 7-Scenes and NRGBD datasets. view at source ↗
Figure 5: Visualization of a spatial attention head, where the top 1024 relevant key tokens are retained for each query. view at source ↗
Figure 8: Layer-wise intra-voxel similarity in Causal-VGGT. Average intra-voxel similarity scores computed across all attention heads per layer. The accompanying text relates similarity to 3D distance: given two voxels $u_a$ and $u_b$ with associated token sets $M_{u_a}$ and $M_{u_b}$, their inter-voxel similarity in the key and value embedding spaces is defined as

$$\mathrm{InterSim}_k(u_a, u_b) = \frac{1}{|M_{u_a}|\,|M_{u_b}|} \sum_{m_a \in M_{u_a}} \sum_{m_b \in M_{u_b}} \cos_k(m_a, m_b), \qquad \mathrm{InterSim}_v(u_a, u_b) = \frac{1}{|M_{u_a}|\,|M_{u_b}|} \sum_{m_a \in M_{u_a}} \sum_{m_b \in M_{u_b}} \cos_v(m_a, m_b).$$

view at source ↗
Figure 7: Inter-voxel similarity vs. 3D distance. Mean and samples of $\mathrm{InterSim}(u_a, u_b)$ are shown as a function of Euclidean distance between voxel centers, averaged over sampled voxel pairs across scenes. … insufficient token samples. Empirically, we find that n = 4 provides a good trade-off between representational granularity and robustness. Notably, value-state tokens follow a similar trend, showing consistent be… view at source ↗
Figure 9: Additional qualitative comparisons with baselines. view at source ↗
Figure 10: Behavior of STAC on a long indoor stream from the … view at source ↗
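The inter-voxel similarity defined in the Figure 8 caption is a mean over pairwise cosines, so it can be computed in a few lines. The sketch below assumes each voxel's token set is stored as an (n, d) matrix of embeddings, which is a data-layout assumption rather than the paper's implementation.

```python
import numpy as np

def inter_sim(M_ua, M_ub):
    """Mean pairwise cosine similarity between two voxels' token sets.

    M_ua: (n_a, d) array of key (or value) embeddings for voxel u_a.
    M_ub: (n_b, d) array for voxel u_b; the same formula covers both spaces.
    """
    A = M_ua / np.linalg.norm(M_ua, axis=1, keepdims=True)
    B = M_ub / np.linalg.norm(M_ub, axis=1, keepdims=True)
    # (A @ B.T)[i, j] = cos(m_a_i, m_b_j); .mean() divides by |M_ua| * |M_ub|.
    return float((A @ B.T).mean())
```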
read the original abstract

Online 3D reconstruction from streaming inputs requires both long-term temporal consistency and efficient memory usage. Although causal variants of VGGT address this challenge through a key-value (KV) cache mechanism, the cache grows linearly with the stream length, creating a major memory bottleneck. Under limited memory budgets, early cache eviction significantly degrades reconstruction quality and temporal consistency. In this work, we observe that attention in causal transformers for 3D reconstruction exhibits intrinsic spatio-temporal sparsity. Based on this insight, we propose STAC, a Spatio-Temporally Aware Cache Compression framework for streaming 3D reconstruction with large causal transformers. STAC consists of three key components: (1) a Working Temporal Token Caching mechanism that preserves long-term informative tokens using decayed cumulative attention scores; (2) a Long-term Spatial Token Caching scheme that compresses spatially redundant tokens into voxel-aligned representations for memory-efficient storage; and (3) a Chunk-based Multi-frame Optimization strategy that jointly processes consecutive frames to improve temporal coherence and GPU efficiency. Extensive experiments show that STAC achieves state-of-the-art reconstruction quality while reducing memory consumption by nearly 10x and accelerating inference by 4x, substantially improving the scalability of real-time 3D reconstruction in streaming settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes STAC, a plug-and-play Spatio-Temporally Aware Cache Compression framework for streaming 3D reconstruction with large causal transformers. It observes intrinsic spatio-temporal sparsity in attention and introduces three components: Working Temporal Token Caching that retains long-term tokens via decayed cumulative attention scores, Long-term Spatial Token Caching that compresses spatially redundant tokens into voxel-aligned representations, and Chunk-based Multi-frame Optimization to jointly process frames for coherence and efficiency. The central claim is that STAC achieves state-of-the-art reconstruction quality while reducing memory consumption by nearly 10x and accelerating inference by 4x.

Significance. If the empirical results hold, the work would meaningfully advance scalable real-time 3D reconstruction by addressing the linear KV-cache growth bottleneck in streaming settings, enabling longer sequences under fixed memory budgets without compromising geometry accuracy or temporal consistency.

major comments (2)
  1. [§3.1] Working Temporal Token Caching: The mechanism relies on decayed cumulative attention scores to evict tokens, but the manuscript provides no formal bound or analysis demonstrating that this preserves all salient tokens under rapid viewpoint changes or fine surface details; this assumption is load-bearing for the claim of negligible quality loss.
  2. [§4] Experiments: The reported 10x memory reduction and 4x speedup with SOTA quality lack detailed ablations on the sparsity exploitation, comparisons to standard KV eviction baselines across varying stream lengths, and quantitative temporal consistency metrics, making it impossible to verify that compression does not degrade reconstruction under complex motion.
minor comments (2)
  1. [Abstract] The abstract asserts 'extensive experiments' without naming the specific datasets, metrics (e.g., Chamfer distance, temporal coherence scores), or baseline methods, which reduces clarity.
  2. [§3.2] Notation for voxel-aligned representations in the Long-term Spatial Token Caching description could be made more precise to avoid ambiguity with standard voxel grid definitions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important areas for strengthening the theoretical justification and experimental rigor. We address each major comment below and will incorporate revisions to improve the manuscript.

read point-by-point responses
  1. Referee: [§3.1] Working Temporal Token Caching: The mechanism relies on decayed cumulative attention scores to evict tokens, but the manuscript provides no formal bound or analysis demonstrating that this preserves all salient tokens under rapid viewpoint changes or fine surface details; this assumption is load-bearing for the claim of negligible quality loss.

    Authors: We agree that a formal bound would strengthen the claims. The current design is motivated by observed attention sparsity patterns in 3D reconstruction transformers, and our experiments across sequences with rapid viewpoint changes and fine details show negligible quality degradation. In revision, we will add a dedicated analysis subsection in §3.1 providing further justification for the decay mechanism, including sensitivity studies on eviction thresholds and their effect on surface details. revision: yes

  2. Referee: [§4] Experiments: The reported 10x memory reduction and 4x speedup with SOTA quality lack detailed ablations on the sparsity exploitation, comparisons to standard KV eviction baselines across varying stream lengths, and quantitative temporal consistency metrics, making it impossible to verify that compression does not degrade reconstruction under complex motion.

    Authors: We acknowledge these gaps in the experimental section. The manuscript currently reports overall performance gains and qualitative results, but we will expand §4 with: (i) component-wise ablations isolating spatio-temporal sparsity exploitation, (ii) comparisons against standard KV eviction baselines (such as attention-score-based eviction) across short-to-long stream lengths, and (iii) new quantitative temporal consistency metrics including frame-to-frame geometry alignment error and surface normal variance to verify robustness under complex motion. revision: yes
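As a sketch of what the promised temporal consistency metrics might compute (assumed formulations, since the rebuttal does not define them): frame-to-frame geometry alignment error as mean nearest-neighbor distance between consecutive reconstructed point clouds, and surface normal variance as the dispersion of corresponding unit normals over time.

```python
import numpy as np
from scipy.spatial import cKDTree

def frame_alignment_error(pts_t, pts_t1):
    """Mean nearest-neighbor distance from frame t's points to frame t+1's
    (an assumed proxy for frame-to-frame geometry alignment error)."""
    dists, _ = cKDTree(pts_t1).query(pts_t)
    return float(dists.mean())

def normal_variance(normals_seq):
    """Dispersion of corresponding unit normals across T frames:
    1 - mean length of the per-point average normal (0 = perfectly stable)."""
    stacked = np.stack(normals_seq)    # (T, N, 3), normals matched across frames
    mean_dir = stacked.mean(axis=0)    # (N, 3)
    return float(1.0 - np.linalg.norm(mean_dir, axis=1).mean())
```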

Circularity Check

0 steps flagged

No significant circularity; the derivation rests on empirical observation and new mechanisms.

full rationale

The paper grounds its central claim in an empirical observation that attention in causal transformers for 3D reconstruction exhibits intrinsic spatio-temporal sparsity, then introduces three new mechanisms (Working Temporal Token Caching via decayed cumulative scores, Long-term Spatial Token Caching via voxel-aligned compression, and Chunk-based Multi-frame Optimization) whose design is not derived from or equivalent to any fitted parameter or prior self-citation. No equations or uniqueness theorems are invoked that reduce the proposed compression ratios or quality preservation to the input data by construction; the 10x memory and 4x speedup claims are presented as outcomes of experimental validation rather than tautological re-derivations. The framework is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The central claim rests on one domain assumption about attention sparsity and introduces three new mechanisms without external independent evidence or reduction to prior published results.

axioms (1)
  • domain assumption: Attention in causal transformers for 3D reconstruction exhibits intrinsic spatio-temporal sparsity.
    This observation is stated as the foundation for the entire STAC framework in the abstract.
invented entities (3)
  • Working Temporal Token Caching (no independent evidence)
    purpose: Preserves long-term informative tokens using decayed cumulative attention scores
    New mechanism introduced to handle temporal consistency under memory constraints.
  • Long-term Spatial Token Caching (no independent evidence)
    purpose: Compresses spatially redundant tokens into voxel-aligned representations for memory-efficient storage
    New scheme for handling spatial redundancy in the cache.
  • Chunk-based Multi-frame Optimization (no independent evidence)
    purpose: Jointly processes consecutive frames to improve temporal coherence and GPU efficiency
    Strategy proposed to enhance both quality and runtime performance.

pith-pipeline@v0.9.0 · 5537 in / 1472 out tokens · 68393 ms · 2026-05-15T10:01:12.965760+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem.

    we observe that attention in causal transformers for 3D reconstruction exhibits intrinsic spatio-temporal sparsity... Working Temporal Token Caching mechanism that preserves long-term informative tokens using decayed cumulative attention scores; Long-term Spatial Token Caching scheme that compresses spatially redundant tokens into voxel-aligned representations

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · 4 internal anchors

  1. [1] S. Agarwal, N. Snavely, S. M. Seitz, and R. Szeliski. Bundle adjustment in the large. In ECCV, pages 29–42, 2010.
  2. [2] Dejan Azinović, Ricardo Martin-Brualla, Dan B Goldman, Matthias Nießner, and Justus Thies. Neural RGB-D surface reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6290–6301, 2022.
  3. [3] Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your ViT but faster. arXiv preprint arXiv:2210.09461, 2022.
  4. [4] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
  5. [5] Daniel J Butler, Jonas Wulff, Garrett B Stanley, and Michael J Black. A naturalistic open source movie for optical flow evaluation. In European Conference on Computer Vision, pages 611–625, 2012.
  6. [6] Hongkai Chen, Zixin Luo, Jiahui Zhang, Lei Zhou, Xuyang Bai, Zeyu Hu, Chiew-Lan Tai, and Long Quan. Learning to match features with seeded graph matching network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6301–6310, 2021.
  7. [7] Michael Connor and Piyush Kumar. Fast construction of k-nearest neighbor graphs for point clouds. IEEE Transactions on Visualization and Computer Graphics, 16(4):599–608, 2010.
  8. [8] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5828–5839, 2017.
  9. [9] Kai Deng, Zexin Ti, Jiawei Xu, Jian Yang, and Jin Xie. VGGT-Long: Chunk it, loop it, align it – pushing VGGT's limits on kilometer-scale long RGB sequences. arXiv preprint arXiv:2507.16443, 2025.
  10. [10] Shangzhe Di, Zhelun Yu, Guanghao Zhang, Haoyuan Li, Tao Zhong, Hao Cheng, Bolin Li, Wanggui He, Fangxun Shu, and Hao Jiang. Streaming video question-answering with in-context video KV-cache retrieval. arXiv preprint arXiv:2503.00540, 2025.
  11. [11] Mihai Dusmanu, Ignacio Rocco, Tomas Pajdla, Marc Pollefeys, Josef Sivic, Akihiko Torii, and Torsten Sattler. D2-Net: A trainable CNN for joint description and detection of local features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8092–8101, 2019.
  12. [12] Yasutaka Furukawa, Carlos Hernández, et al. Multi-view stereo: A tutorial. Foundations and Trends in Computer Graphics and Vision, 9(1-2):1–148, 2015.
  13. [13] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
  14. [14] Ying Jiang, Chang Yu, Tianyi Xie, Xuan Li, Yutao Feng, Huamin Wang, Minchen Li, Henry Lau, Feng Gao, Yin Yang, et al. VR-GS: A physical dynamics-aware interactive Gaussian splatting system in virtual reality. In ACM SIGGRAPH 2024 Conference Papers, pages 1–1, 2024.
  15. [15] Xin Jin and Jiawei Han. K-Means Clustering, pages 563–
  16. [16] Yushi Lan, Yihang Luo, Fangzhou Hong, Shangchen Zhou, Honghua Chen, Zhaoyang Lyu, Shuai Yang, Bo Dai, Chen Change Loy, and Xingang Pan. Stream3R: Scalable sequential 3D reconstruction with causal transformer. arXiv preprint arXiv:2508.10893, 2025.
  17. [17] Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3D with MASt3R. In European Conference on Computer Vision, pages 71–91. Springer, 2024.
  18. [18] David G Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
  19. [19] Guy M Morton. A Computer Oriented Geodetic Data Base and a New Technique in File Sequencing. International Business Machines Company, 1966.
  20. [20] Zhenyu Ning, Guangda Liu, Qihao Jin, Wenchao Ding, Minyi Guo, and Jieru Zhao. LiveVLM: Efficient online video understanding via streaming-oriented KV cache and retrieval. arXiv preprint arXiv:2505.15269, 2025.
  21. [21] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
  22. [22] Emanuele Palazzolo, Jens Behley, Philipp Lottes, Philippe Giguere, and Cyrill Stachniss. ReFusion: 3D reconstruction in dynamic environments for RGB-D cameras exploiting residuals. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7855–7862. IEEE, 2019.
  23. [23] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12179–12188, 2021.
  24. [24] Mohamed Reda, Ahmed Onsy, Amira Y Haikal, and Ali Ghanbari. Path planning algorithms in the autonomous driving system: A comprehensive review. Robotics and Autonomous Systems, 174:104630, 2024.
  25. [25] Asmaa Sakr and Tariq Abdullah. Virtual, augmented reality and learning analytics impact on learners, and educators: A systematic review. Education and Information Technologies, 29(15):19913–19962, 2024.
  26. [26] Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperGlue: Learning feature matching with graph neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4938–4947, 2020.
  27. [27] Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4104–4113, 2016.
  28. [28] Johannes L Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision, pages 501–518, 2016.
  29. [29] You Shen, Zhipeng Zhang, Yansong Qu, and Liujuan Cao. FastVGGT: Training-free acceleration of visual geometry transformer. arXiv preprint arXiv:2509.02560, 2025.
  30. [30] Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. Scene coordinate regression forests for camera relocalization in RGB-D images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2930–2937, 2013.
  31. [31] Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 573–580. IEEE, 2012.
  32. [32] Chris Sweeney, Torsten Sattler, Tobias Hollerer, Matthew Turk, and Marc Pollefeys. Optimizing the viewing graph for structure-from-motion. In Proceedings of the IEEE International Conference on Computer Vision, pages 801–809, 2015.
  33. [33] Yuxuan Tian, Zihan Wang, Yebo Peng, Aomufei Yuan, Zhiming Wang, Bairen Yi, Xin Liu, Yong Cui, and Tong Yang. KeepKV: Eliminating output perturbation in KV cache compression for efficient LLMs inference. arXiv preprint arXiv:2504.09936, 2025.
  34. [34] Bill Triggs, Philip F McLauchlan, Richard I Hartley, and Andrew W Fitzgibbon. Bundle adjustment – a modern synthesis. In International Workshop on Vision Algorithms, pages 298–372, 1999.
  35. [35] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
  36. [36] Vibhas K Vats, Sripad Joshi, David J Crandall, Md Alimoor Reza, and Soon-heung Jung. GC-MVSNet: Multi-view, multi-scale, geometrically-consistent multi-view stereo. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3242–3252, 2024.
  37. [37] Zhongwei Wan, Xinjian Wu, Yu Zhang, Yi Xin, Chaofan Tao, Zhihong Zhu, Xin Wang, Siqi Luo, Jing Xiong, Longyue Wang, et al. D2O: Dynamic discriminative operations for efficient long-context inference of large language models. arXiv preprint arXiv:2406.13035, 2024.
  38. [38] Hengyi Wang and Lourdes Agapito. 3D reconstruction with spatial memory. arXiv preprint arXiv:2408.16061, 2024.
  39. [39] Jiaqi Wang, Enze Shi, Huawen Hu, Chong Ma, Yiheng Liu, Xuhui Wang, Yincheng Yao, Xuan Liu, Bao Ge, and Shu Zhang. Large language models for robotics: Opportunities, challenges, and perspectives. Journal of Automation and Intelligence, 2024.
  40. [40] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025.
  41. [41] Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3D perception model with persistent state. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10510–10522, 2025.
  42. [42] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D vision made easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20697–20709, 2024.
  43. [43] Zheng Wang, Boxiao Jin, Zhongzhi Yu, and Minjia Zhang. Model tells you where to merge: Adaptive KV cache merging for LLMs on long-context tasks. arXiv preprint arXiv:2407.08454, 2024.
  44. [44] Kyle Wilson and Noah Snavely. Robust global translations with 1DSfM. In European Conference on Computer Vision, pages 61–75, 2014.
  45. [45] Changchang Wu. Towards linear-time incremental structure from motion. In 2013 International Conference on 3D Vision (3DV), pages 127–134. IEEE, 2013.
  46. [46] Yuqi Wu, Wenzhao Zheng, Jie Zhou, and Jiwen Lu. Point3R: Streaming 3D reconstruction with explicit spatial pointer memory. arXiv preprint arXiv:2507.02863, 2025.
  47. [47] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023.
  48. [48] Jianing Yang, Alexander Sax, Kevin J Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3R: Towards 3D reconstruction of 1000+ images in one forward pass. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21924–21935, 2025.
  49. [49] Yanlai Yang, Zhuokai Zhao, Satya Narayan Shukla, Aashu Singh, Shlok Kumar Mishra, Lizhu Zhang, and Mengye Ren. StreamMem: Query-agnostic KV cache memory for streaming video understanding. arXiv preprint arXiv:2508.15717, 2025.
  50. [50] Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. MonST3R: A simple approach for estimating geometry in the presence of motion. arXiv preprint arXiv:2410.03825, 2024.
  51. [51] Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2O: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36:34661–34710, 2023.
  52. [52] Dong Zhuo, Wenzhao Zheng, Jiahe Guo, Yuqi Wu, Jie Zhou, and Jiwen Lu. Streaming 4D visual geometry transformer. arXiv preprint arXiv:2507.11539, 2025.