Recognition: 3 theorem links (Lean)
FastVGGT: Training-Free Acceleration of Visual Geometry Transformer
Pith reviewed 2026-05-15 23:32 UTC · model grok-4.3
The pith
A 3D-specific token partitioning strategy lets token merging accelerate VGGT fourfold on thousand-image sequences without retraining.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By introducing a unique token partitioning strategy tailored to 3D architectures and tasks, FastVGGT applies token merging inside VGGT in a training-free manner, thereby eliminating redundant computation while preserving the model's full reconstruction capacity and reducing error accumulation on long image sequences.
What carries the argument
The 3D-specific token partitioning strategy that groups tokens according to the model's architectural and geometric-task properties so that merging removes only redundant work.
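To make the machinery concrete: the paper builds on training-free token merging in the style of ToMe (Bolya et al., reference [2]). The sketch below is a minimal NumPy rendering of bipartite soft matching — it is illustrative only, not FastVGGT's actual 3D-specific partitioning rule, which the paper tailors to VGGT's architecture.

```python
import numpy as np

def bipartite_soft_matching(tokens, r):
    """Training-free token merging sketch in the style of ToMe.

    tokens: (N, D) array of token features; r: number of tokens to merge.
    Alternating tokens form sets A and B; each A-token is matched to its
    most similar B-token, and the r most redundant A-tokens are averaged
    into their matches. Illustrative only, not the FastVGGT rule.
    """
    a, b = tokens[::2], tokens[1::2]
    # Cosine similarity between every A-token and every B-token.
    an = a / np.linalg.norm(a, axis=1, keepdims=True)
    bn = b / np.linalg.norm(b, axis=1, keepdims=True)
    sim = an @ bn.T                      # (|A|, |B|)
    best = sim.argmax(axis=1)            # best B match for each A-token
    score = sim.max(axis=1)
    merge_idx = np.argsort(-score)[:r]   # r most redundant A-tokens
    keep_idx = np.argsort(-score)[r:]

    merged = b.copy()
    counts = np.ones(len(b))
    for i in merge_idx:                  # average each merged A-token in
        j = best[i]
        merged[j] = (merged[j] * counts[j] + a[i]) / (counts[j] + 1)
        counts[j] += 1
    return np.concatenate([a[keep_idx], merged], axis=0)
```

FastVGGT's contribution, per the abstract, is the partitioning choice (which tokens land in which set), since applying a scheme like the above directly to 3D models "proves challenging."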
If this is right
- Inference on 1000-image inputs becomes four times faster than the original VGGT.
- Error accumulation is reduced in long-sequence 3D reconstruction.
- The same training-free merging can be used directly on any pretrained VGGT checkpoint.
- Token merging becomes a practical route for scaling feed-forward 3D geometry models.
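A back-of-envelope sketch of where a 4x figure could come from, under the assumption that quadratic global self-attention dominates inference cost and that merging keeps a fixed fraction of tokens (the function names and the 0.5 keep fraction below are illustrative assumptions, not values reported by the paper):

```python
def attention_flops(n_tokens, dim):
    """Rough FLOPs of one self-attention layer: the QK^T and AV matmuls,
    each ~n^2 * dim multiply-adds. Projections are ignored."""
    return 2 * n_tokens**2 * dim

def speedup(n_tokens, dim, keep_frac):
    """Attention-only speedup if merging keeps keep_frac of tokens;
    cost scales with keep_frac**2 under this assumed model."""
    return attention_flops(n_tokens, dim) / attention_flops(int(n_tokens * keep_frac), dim)

# Keeping half the tokens quadruples attention throughput in this model:
# speedup(1000, 64, 0.5) == 4.0
```

In practice the end-to-end speedup also depends on the non-attention fraction of compute, which is why the measured 4x at 1000 images is an empirical result rather than a direct consequence of this arithmetic.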
Where Pith is reading between the lines
- The same partitioning idea could be tested on other feed-forward 3D vision transformers to see whether the speedup generalizes.
- Combining the merging step with quantization or pruning might yield still larger gains on very long sequences.
- Practical multi-view 3D pipelines could now process hundreds of additional frames without retraining the core model.
Load-bearing premise
The custom partitioning removes only redundant tokens and leaves the full geometric reconstruction capacity of VGGT intact.
What would settle it
Measure reconstruction error on a standard multi-view 3D benchmark using exactly 1000 input images; if FastVGGT's error rises substantially above VGGT's error, the preservation claim does not hold.
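The settling experiment could be scored with a standard point-cloud metric. A minimal sketch, assuming Chamfer distance as the error measure and an arbitrary 5% relative tolerance as the threshold for "rises substantially" (both are assumptions, not choices made by the paper):

```python
import numpy as np

def chamfer_distance(pred, gt):
    """Symmetric Chamfer distance between point clouds (N,3) and (M,3).
    Brute-force O(N*M) sketch; real benchmarks use KD-trees."""
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def preservation_holds(err_fast, err_base, tol=0.05):
    """The preservation claim survives if FastVGGT's error stays within
    a small relative tolerance of VGGT's (tol is an assumed threshold)."""
    return err_fast <= err_base * (1 + tol)
```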
read the original abstract
Foundation models for 3D vision have recently demonstrated remarkable capabilities in 3D perception. However, scaling these models to long-sequence image inputs remains a significant challenge due to inference-time inefficiency. In this work, we present a detailed analysis of VGGT, a state-of-the-art feed-forward visual geometry model, and identify its primary bottleneck. Visualization further reveals a token collapse phenomenon in the attention maps. Motivated by these findings, we explore the potential of token merging in the feed-forward visual geometry model. Owing to the unique architectural and task-specific properties of 3D models, directly applying existing merging techniques proves challenging. To this end, we propose FastVGGT, which, for the first time, leverages token merging in the 3D domain through a training-free mechanism for accelerating VGGT. We devise a unique token partitioning strategy tailored to 3D architectures and tasks, effectively eliminating redundant computation while preserving VGGT's powerful reconstruction capacity. Extensive experiments on multiple 3D geometry benchmarks validate the effectiveness of our approach. Notably, with 1000 input images, FastVGGT achieves a 4x speedup over VGGT while mitigating error accumulation in long-sequence scenarios. These findings underscore the potential of token merging as a principled solution for scalable 3D vision systems. Code is available at: https://mystorm16.github.io/fastvggt/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes FastVGGT, a training-free acceleration method for the VGGT feed-forward visual geometry model. It identifies attention bottlenecks and token collapse, then introduces a custom 3D-specific token partitioning strategy to enable token merging. The central claim is that this approach yields a 4x speedup on 1000-image sequences while preserving reconstruction capacity and mitigating error accumulation across multiple 3D geometry benchmarks.
Significance. If the partitioning strategy is shown to be lossless for geometrically salient features, the work offers a practical, training-free route to scaling feed-forward 3D models to long sequences, addressing a key inference bottleneck in the field. The empirical results on speedup and error mitigation are potentially impactful for applications requiring dense 3D reconstruction from many views, though the contribution remains primarily engineering-oriented without new theoretical derivations.
major comments (2)
- [Section 3.2] Token partitioning strategy: The claim that the 3D-specific partitioning removes only redundant tokens while fully preserving VGGT's reconstruction capacity is load-bearing for the 4x speedup result. The manuscript should include an explicit ablation or metric (e.g., token retention rate for depth-discontinuity or long-range correspondence tokens) demonstrating that the criterion does not discard information that would compound across 1000-image sequences; without this, the preservation of quality remains unverified.
- [Section 4.3] Long-sequence experiments: The headline result of a 4x speedup with error mitigation at 1000 images requires more detailed controls on the exact merging thresholds and partitioning rules applied per layer. Current reporting leaves open the possibility of hidden quality degradation; per-sequence error curves or a direct comparison against VGGT with naive merging would be needed to substantiate that the custom strategy is responsible for both the speedup and the error reduction.
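The retention metric the first major comment asks for reduces to a mask overlap. A sketch, assuming the salient-token mask comes from some external depth-discontinuity or correspondence detector (an assumed input, not something the paper specifies):

```python
import numpy as np

def retention_rate(kept_mask, salient_mask):
    """Fraction of geometrically salient tokens (e.g. depth-discontinuity
    or long-range correspondence tokens) that survive merging.

    kept_mask, salient_mask: boolean arrays over the token grid.
    Returns 1.0 when no tokens are marked salient."""
    n_salient = salient_mask.sum()
    if n_salient == 0:
        return 1.0
    return float((kept_mask & salient_mask).sum() / n_salient)
```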
minor comments (2)
- [Abstract, Section 2] The description of the 'token collapse phenomenon' would benefit from a quantitative definition or a reference to the specific attention-map statistic used for visualization.
- [Section 4] Figure captions and experimental tables: ensure all reported speedups include the precise hardware, batch size, and sequence-length settings to allow direct reproduction.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major point below and will incorporate the requested analyses in the revised version.
read point-by-point responses
- Referee: [Section 3.2] The claim that the 3D-specific partitioning removes only redundant tokens while fully preserving VGGT's reconstruction capacity is load-bearing for the 4x speedup result. The manuscript should include an explicit ablation or metric (e.g., token retention rate for depth-discontinuity or long-range correspondence tokens) demonstrating that the criterion does not discard information that would compound across 1000-image sequences; without this, the preservation of quality remains unverified.
  Authors: We agree that an explicit ablation would strengthen verification of the claim. In the revised manuscript we will add an ablation reporting token retention rates for depth-discontinuity and long-range correspondence tokens across layers, together with reconstruction metrics on sequences of increasing length, to confirm the absence of compounding loss. revision: yes
- Referee: [Section 4.3] The headline result of a 4x speedup with error mitigation at 1000 images requires more detailed controls on the exact merging thresholds and partitioning rules applied per layer. Current reporting leaves open the possibility of hidden quality degradation; per-sequence error curves or a direct comparison against VGGT with naive merging would be needed to substantiate that the custom strategy is responsible for both the speedup and the error reduction.
  Authors: We will expand Section 4.3 to specify the exact merging thresholds and per-layer partitioning rules used for the 1000-image experiments. We will also add per-sequence error curves and a direct comparison against VGGT with naive merging to demonstrate that the 3D-specific strategy accounts for both the speedup and the error mitigation. revision: yes
Circularity Check
No circularity: empirical engineering technique validated by experiments
full rationale
The paper performs an analysis of VGGT bottlenecks via visualization of attention maps and token collapse, then introduces a training-free token merging method using a custom 3D-specific partitioning strategy. All performance claims, including 4x speedup at 1000 images and error mitigation, rest on benchmark experiments rather than any derivation, equation, or prediction that reduces to its inputs by construction. No self-citations, fitted parameters renamed as predictions, or ansatzes smuggled via prior work appear in the load-bearing steps; the approach is a direct empirical optimization without closed-loop theoretical reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Token collapse observed in VGGT attention maps can be exploited for merging without retraining.
invented entities (1)
- 3D-specific token partitioning strategy (no independent evidence)
Forward citations
Cited by 18 Pith papers
- PaceVGGT: Pre-Alternating-Attention Token Pruning for Visual Geometry Transformers
  PaceVGGT reduces VGGT inference latency by up to 5.1x on ScanNet-50 via pre-AA token pruning with a distilled Token Scorer, per-frame keep budgets, adaptive merge/prune, and feature-guided restoration, while preservin...
- Ground4D: Spatially-Grounded Feedforward 4D Reconstruction for Unstructured Off-Road Scenes
  Ground4D resolves temporal conflicts in feedforward 4D Gaussian reconstruction for off-road scenes via voxel-grounded temporal aggregation with intra-voxel softmax and surface normal regularization, outperforming prio...
- RobotPan: A 360° Surround-View Robotic Vision System for Embodied Perception
  RobotPan predicts metric-scaled compact 3D Gaussians from calibrated multi-view inputs via spherical coordinates and hierarchical voxel priors for real-time 360° robotic perception and reconstruction.
- Mem3R: Streaming 3D Reconstruction with Hybrid Memory via Test-Time Training
  Mem3R achieves better long-sequence 3D reconstruction by decoupling tracking and mapping with a hybrid memory of TTT-updated MLP and explicit tokens, reducing model size and trajectory errors.
- VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation
  VGGT-360 delivers geometry-consistent zero-shot panoramic depth by converting panoramas into multi-view 3D reconstructions via VGGT models and three plug-and-play correction modules, then reprojecting the result.
- STAC: Plug-and-Play Spatio-Temporal Aware Cache Compression for Streaming 3D Reconstruction
  STAC compresses KV caches in streaming 3D reconstruction transformers via temporal token preservation with decayed attention, spatial voxel compression, and chunked multi-frame optimization, delivering 10x memory redu...
- ZipMap: Linear-Time Stateful 3D Reconstruction via Test-Time Training
  ZipMap achieves linear-time bidirectional 3D reconstruction by zipping image collections into a compact stateful representation via test-time training layers.
- 4DVGGT-D: 4D Visual Geometry Transformer with Improved Dynamic Depth Estimation
  A training-free progressive decoupling framework improves dynamic depth estimation in 4D reconstruction via mask-guided pose decoupling, topological subspace surgery, and Bayesian fusion, yielding better point-cloud m...
- Attention Itself Could Retrieve. RetrieveVGGT: Training-Free Long Context Streaming 3D Reconstruction via Query-Key Similarity Retrieval
  RetrieveVGGT enables constant-memory long-context streaming 3D reconstruction by retrieving relevant frames via query-key similarities in VGGT's first attention layer, outperforming StreamVGGT and others.
- Spark3R: Asymmetric Token Reduction Makes Fast Feed-Forward 3D Reconstruction
  Asymmetric token reduction, with distinct merging for queries and pruning for key-values plus layer-wise adaptation, delivers up to 28x speedup on 1000-frame 3D reconstruction inputs while preserving competitive quality.
- SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric Awareness
  SpatialFusion internalizes 3D geometric awareness into unified image generation models by pairing an MLLM with a spatial transformer that produces depth maps to constrain diffusion generation.
- ELSA: Exact Linear-Scan Attention for Fast and Memory-Light Vision Transformers
  ELSA casts online softmax attention as a prefix scan over monoid (m,S,W) to deliver exact FP32 semantics, O(n) memory, O(log n) depth, and Tensor-Core independence as a drop-in kernel.
- Geometric Context Transformer for Streaming 3D Reconstruction
  LingBot-Map is a streaming 3D reconstruction model built on a geometric context transformer that combines anchor context, pose-reference window, and trajectory memory to deliver accurate, drift-resistant results at 20...
- Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective
  The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temp...
- Robust 4D Visual Geometry Transformer with Uncertainty-Aware Priors
  The Robust 4D Visual Geometry Transformer with Uncertainty-Aware Priors outperforms prior methods on dynamic benchmarks by cutting Mean Accuracy error 13.43% and raising segmentation F-measure 10.49% via three uncerta...
- Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction
  Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video sequences.
- HD-VGGT: High-Resolution Visual Geometry Transformer
  HD-VGGT achieves state-of-the-art high-resolution 3D reconstruction from image collections via a dual-branch architecture that predicts coarse geometry at low resolution and refines details at high resolution while mo...
- StreamCacheVGGT: Streaming Visual Geometry Transformers with Robust Scoring and Hybrid Cache Compression
  StreamCacheVGGT improves streaming 3D geometry reconstruction accuracy and stability under fixed memory by using cross-layer token importance scoring and hybrid cache compression instead of pure eviction.
Reference graph
Works this paper leans on
[1] Daniel Bolya and Judy Hoffman. Token merging for fast stable diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4599–4603.
[2] Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your ViT but faster. arXiv preprint arXiv:2210.09461, 2022.
[3] Qingqing Cao, Bhargavi Paranjape, and Hannaneh Hajishirzi. PuMer: Pruning and merging tokens for efficient vision language models. arXiv preprint arXiv:2305.17530.
[4] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021.
[5] Joonmyung Choi, Sanghyeok Lee, Jaewon Chu, Minhyuk Choi, and Hyunwoo J Kim. vid-TLDR: Training free token merging for light-weight video transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18771–18781, 2024.
[6] Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023.
[7] Kai Deng, Zexin Ti, Jiawei Xu, Jian Yang, and Jin Xie. VGGT-Long: Chunk it, loop it, align it – pushing VGGT's limits on kilometer-scale long RGB sequences. arXiv preprint arXiv:2507.16443, 2025.
[8] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[9] Joakim Bruslund Haurum, Sergio Escalera, Graham W Taylor, and Thomas B Moeslund. Which tokens to use? Investigating token reduction in vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 773–783, 2023.
[10] Seon-Ho Lee, Jue Wang, Zhikang Zhang, David Fan, and Xinyu Li. Video token merging for long-form video understanding. arXiv preprint arXiv:2410.23782, 2024.
[11] Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3D with MASt3R. In European Conference on Computer Vision, pages 71–91. Springer, 2024.
[12] Raul Mur-Artal and Juan D Tardós. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics, 33(5):1255–1262, 2017.
[13] Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 31(5):1147–1163.
[14] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
[15] Yansong Qu, Yuze Wang, and Yue Qi. SG-NeRF: Semantic-guided point-based neural radiance fields. In 2023 IEEE International Conference on Multimedia and Expo (ICME), pages 570–575. IEEE, 2023.
[16] Yansong Qu, Shaohui Dai, Xinyang Li, Yuze Wang, You Shen, Liujuan Cao, and Rongrong Ji. DeOcc-1-to-3: 3D de-occlusion from a single image via self-supervised multi-view diffusion. arXiv preprint arXiv:2506.21544, 2025.
[17] Cedric Renggli, André Susano Pinto, Neil Houlsby, Basil Mustafa, Joan Puigcerver, and Carlos Riquelme. Learning to merge tokens in vision transformers. arXiv preprint arXiv:2202.12015, 2022.
[18] Michael Ryoo, AJ Piergiovanni, Anurag Arnab, Mostafa Dehghani, and Anelia Angelova. TokenLearner: Adaptive space-time tokenization for videos. Advances in Neural Information Processing Systems, 34:12786–12797, 2021.
[19] Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4104–4113, 2016.
[20] Leqi Shen, Tianxiang Hao, Tao He, Sicheng Zhao, Yifeng Zhang, Pengzhang Liu, Yongjun Bao, and Guiguang Ding. TempMe: Video temporal token merging for efficient text-video retrieval. arXiv preprint arXiv:2409.01156, 2024.
[21] You Shen, Zhipeng Zhang, Xinyang Li, Yansong Qu, Yu Lin, Shengchuan Zhang, and Liujuan Cao. Evolving high-quality rendering and reconstruction in a unified framework with contribution-adaptive regularization. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 16346–16355, 2025.
[22] Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104, 2025.
[23] Zhenggang Tang, Yuchen Fan, Dilin Wang, Hongyu Xu, Rakesh Ranjan, Alexander Schwing, and Zhicheng Yan. MV-DUSt3R+: Single-stage scene reconstruction from sparse views in 2 seconds. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5283–5293.
[24] Chau Tran, Duy MH Nguyen, Manh-Duy Nguyen, TrungTin Nguyen, Ngan Le, Pengtao Xie, Daniel Sonntag, James Y Zou, Binh Nguyen, and Mathias Niepert. Accelerating transformers with spectrum-preserving token merging. Advances in Neural Information Processing Systems, 37:30772–30810.
[25] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025.
[26] Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3D perception model with persistent state. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10510–10522, 2025.
[27] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D vision made easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20697–20709, 2024.
[28] Yuze Wang, Junyi Wang, Ruicheng Gao, Yansong Qu, Wantong Duan, Shuo Yang, and Yue Qi. Look at the sky: Sky-aware efficient 3D Gaussian splatting in the wild. IEEE Transactions on Visualization and Computer Graphics, 2025.
[29] Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. $\pi^3$: Scalable permutation-equivariant visual geometry learning. arXiv preprint arXiv:2507.13347, 2025.
[30] Jianing Yang, Alexander Sax, Kevin J Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3R: Towards 3D reconstruction of 1000+ images in one forward pass. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21924–21935.
[31] Wang Zeng, Sheng Jin, Wentao Liu, Chen Qian, Ping Luo, Wanli Ouyang, and Xiaogang Wang. Not all tokens are equal: Human-centric visual analysis via token clustering transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11101–11111, 2022.
[32] Shangzhan Zhang, Jianyuan Wang, Yinghao Xu, Nan Xue, Christian Rupprecht, Xiaowei Zhou, Yujun Shen, and Gordon Wetzstein. FLARE: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse views. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21936–21947, 2025.
[33] Dong Zhuo, Wenzhao Zheng, Jiahe Guo, Yuqi Wu, Jie Zhou, and Jiwen Lu. Streaming 4D visual geometry transformer. arXiv preprint arXiv:2507.11539, 2025.
discussion (0)