FastVGGT: Training-Free Acceleration of Visual Geometry Transformer

You Shen, Zhipeng Zhang, Yansong Qu, Xiawu Zheng, Jiayi Ji, Shengchuan Zhang · 2025 · cs.CV · arXiv 2509.02560

Canonical reference. 86% of citing Pith papers cite this work as background.

37 Pith papers citing it

Background 86% of classified citations

abstract

Foundation models for 3D vision have recently demonstrated remarkable capabilities in 3D perception. However, scaling these models to long-sequence image inputs remains a significant challenge due to inference-time inefficiency. In this work, we present a detailed analysis of VGGT, a state-of-the-art feed-forward visual geometry model, and identify its primary bottleneck. Visualization further reveals a token collapse phenomenon in the attention maps. Motivated by these findings, we explore the potential of token merging in the feed-forward visual geometry model. Owing to the unique architectural and task-specific properties of 3D models, directly applying existing merging techniques proves challenging. To this end, we propose FastVGGT, which, for the first time, leverages token merging in the 3D domain through a training-free mechanism for accelerating VGGT. We devise a unique token partitioning strategy tailored to 3D architectures and tasks, effectively eliminating redundant computation while preserving VGGT's powerful reconstruction capacity. Extensive experiments on multiple 3D geometry benchmarks validate the effectiveness of our approach. Notably, with 1000 input images, FastVGGT achieves a 4x speedup over VGGT while mitigating error accumulation in long-sequence scenarios. These findings underscore the potential of token merging as a principled solution for scalable 3D vision systems. Code is available at: https://mystorm16.github.io/fastvggt/.

citation-role summary

background 6 baseline 1

citation-polarity summary

background 6 baseline 1

representative citing papers

SpatialBench: Is Your Spatial Foundation Model an All-Round Player?

cs.CV · 2026-05-26 · unverdicted · novelty 8.0

SpatialBench evaluates 41 spatial foundation models across 6 paradigms and 5 task suites, finds they are not all-round players, and introduces the DA-Next-5M dataset plus DA-Next baseline model.

DrivingDepth: Sparse-Prompted Pixel-wise Scale Correction for Driving Depth Estimation

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

DrivingDepth achieves SOTA metric depth on nuScenes by residual pixel-wise scale correction on frozen foundation models using sparse LiDAR prompts, preserving geometric consistency.

CasaMaestro: Multi-View Panoramas for House-Scale 3D Reconstruction

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

CasaMaestro predicts metric depth and poses from sparse multi-view panoramas to enable fast house-scale 3D reconstruction.

PaceVGGT: Pre-Alternating-Attention Token Pruning for Visual Geometry Transformers

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

PaceVGGT reduces VGGT inference latency by up to 5.1x on ScanNet-50 via pre-AA token pruning with a distilled Token Scorer, per-frame keep budgets, adaptive merge/prune, and feature-guided restoration, while preserving reconstruction quality on ScanNet-50 and 7-Scenes.

Ground4D: Spatially-Grounded Feedforward 4D Reconstruction for Unstructured Off-Road Scenes

cs.CV · 2026-05-06 · unverdicted · novelty 7.0

Ground4D resolves temporal conflicts in feedforward 4D Gaussian reconstruction for off-road scenes via voxel-grounded temporal aggregation with intra-voxel softmax and surface normal regularization, outperforming prior methods on ORAD-3D and RELLIS-3D while generalizing zero-shot.

RobotPan: A 360$^\circ$ Surround-View Robotic Vision System for Embodied Perception

cs.RO · 2026-04-15 · unverdicted · novelty 7.0

RobotPan predicts metric-scaled compact 3D Gaussians from calibrated multi-view inputs via spherical coordinates and hierarchical voxel priors for real-time 360° robotic perception and reconstruction.

Mem3R: Streaming 3D Reconstruction with Hybrid Memory via Test-Time Training

cs.CV · 2026-04-08 · unverdicted · novelty 7.0

Mem3R achieves better long-sequence 3D reconstruction by decoupling tracking and mapping with a hybrid memory of TTT-updated MLP and explicit tokens, reducing model size and trajectory errors.

VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation

cs.CV · 2026-03-19 · unverdicted · novelty 7.0

VGGT-360 delivers geometry-consistent zero-shot panoramic depth by converting panoramas into multi-view 3D reconstructions via VGGT models and three plug-and-play correction modules, then reprojecting the result.

STAC: Plug-and-Play Spatio-Temporal Aware Cache Compression for Streaming 3D Reconstruction

cs.CV · 2026-03-18 · unverdicted · novelty 7.0

STAC compresses KV caches in streaming 3D reconstruction transformers via temporal token preservation with decayed attention, spatial voxel compression, and chunked multi-frame optimization, delivering 10x memory reduction and 4x faster inference at SOTA quality.

ZipMap: Linear-Time Stateful 3D Reconstruction via Test-Time Training

cs.CV · 2026-03-04 · unverdicted · novelty 7.0

ZipMap achieves linear-time bidirectional 3D reconstruction by zipping image collections into a compact stateful representation via test-time training layers.

FFAvatar: Feed-Forward 4D Head Avatar Reconstruction from Sparse Portrait Images

cs.CV · 2026-06-29 · unverdicted · novelty 6.0

FFAvatar uses a Transformer-based 3D Gaussian model with alternating attention and sparse-to-dense learning to enable feed-forward, incremental reconstruction of animatable 4D head avatars from sparse portrait images.

Argus: Metric Panoramic 3D Reconstruction for Indoor Scenes

cs.CV · 2026-06-29 · unverdicted · novelty 6.0 · 2 refs

Argus introduces a covisibility module and decomposed pixel-to-world mapping to deliver SOTA metric performance on camera pose, depth, and point cloud tasks using the Realsee3D panoramic dataset.

RegimeVGGT: Layer-Wise Spatially Preserving Redundancy Removal for Visual Geometry Grounded Transformer

cs.CV · 2026-06-16 · unverdicted · novelty 6.0

RegimeVGGT applies layer-wise U-shaped compression via saliency-guided banded merging and selectively protected K/V downsampling to deliver 6.7x speedup on VGGT at matched reconstruction quality.

EPS3D: End-to-End Feed-Forward 3D Panoptic Segmentation

cs.CV · 2026-06-08 · unverdicted · novelty 6.0

EPS3D is an end-to-end architecture for 3D panoptic segmentation from multi-view images that uses distillation and semantic-instance mutual enhancement to achieve higher benchmark performance and speed than prior methods.

Anchor3R: Streaming 3D Reconstruction with Transient Anchors for Long-Horizon Visual Mapping

cs.CV · 2026-06-03 · unverdicted · novelty 6.0

Anchor3R reframes feed-forward 3D reconstruction as current-centric local measurement prediction, using loop-closure and motion averaging to produce coherent global maps from visual streams.

Zero-Shot 3D Question Answering via Hierarchical View-to-Token Transportation

cs.CV · 2026-06-02 · unverdicted · novelty 6.0

KeyVT improves zero-shot 3D question answering by hierarchically selecting semantically and geometrically relevant views and using optimal transport to extract representative tokens from them.

DeblurNVS: Geometric Latent Diffusion for Novel View Synthesis from Sparse Motion-Blurred Images

cs.CV · 2026-05-31 · unverdicted · novelty 6.0

DeblurNVS restores geometric representations via latent diffusion to enable high-fidelity novel view synthesis directly from sparse motion-blurred inputs.

RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video

cs.CV · 2026-05-29 · unverdicted · novelty 6.0

RayDer is a unified transformer backbone for self-supervised static-scene novel view synthesis that absorbs dynamic content as a nuisance factor and shows power-law scaling with data and compute while matching supervised methods in zero-shot settings.

UniT: Unified Geometry Learning with Group Autoregressive Transformer

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

UniT unifies online and offline 3D geometry perception via a Group Autoregressive Transformer that processes observation groups with anchor-free point map prediction and a scale-adaptive loss.

Rethinking the State Update Gate for Long-Sequence Recurrent 3D Reconstruction

cs.CV · 2026-05-16 · unverdicted · novelty 6.0

A closed-form scalar frame-level gate α_t derived from internal feature changes extends effective memory in recurrent 3D reconstruction and improves accuracy on long sequences up to 4541 frames.

Not All Tasks Quantize Equally: Fisher-Guided Quantization for Visual Geometry Transformer

cs.CV · 2026-05-15 · unverdicted · novelty 6.0 · 2 refs

FGQ applies diagonal Fisher information to guide learnable affine transformations in PTQ for multi-task VGGT, yielding up to 39% relative gains over baselines at 4-bit quantization.

4DVGGT-D: 4D Visual Geometry Transformer with Improved Dynamic Depth Estimation

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

A training-free progressive decoupling framework improves dynamic depth estimation in 4D reconstruction via mask-guided pose decoupling, topological subspace surgery, and Bayesian fusion, yielding better point-cloud metrics on benchmarks.

Attention Itself Could Retrieve. RetrieveVGGT: Training-Free Long Context Streaming 3D Reconstruction via Query-Key Similarity Retrieval

cs.CV · 2026-05-10 · unverdicted · novelty 6.0

RetrieveVGGT enables constant-memory long-context streaming 3D reconstruction by retrieving relevant frames via query-key similarities in VGGT's first attention layer, outperforming StreamVGGT and others.

Spark3R: Asymmetric Token Reduction Makes Fast Feed-Forward 3D Reconstruction

cs.CV · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

Spark3R achieves up to 28x speedup on 1000-frame 3D reconstruction inputs by asymmetrically reducing query and key-value tokens in Vision Transformers while keeping competitive quality.

citing papers explorer

Showing 34 of 34 citing papers after filters.

SpatialBench: Is Your Spatial Foundation Model an All-Round Player? cs.CV · 2026-05-26 · unverdicted · none · ref 87 · internal anchor
SpatialBench evaluates 41 spatial foundation models across 6 paradigms and 5 task suites, finds they are not all-round players, and introduces the DA-Next-5M dataset plus DA-Next baseline model.
DrivingDepth: Sparse-Prompted Pixel-wise Scale Correction for Driving Depth Estimation cs.CV · 2026-06-30 · unverdicted · none · ref 25 · internal anchor
DrivingDepth achieves SOTA metric depth on nuScenes by residual pixel-wise scale correction on frozen foundation models using sparse LiDAR prompts, preserving geometric consistency.
CasaMaestro: Multi-View Panoramas for House-Scale 3D Reconstruction cs.CV · 2026-06-30 · unverdicted · none · ref 28 · internal anchor
CasaMaestro predicts metric depth and poses from sparse multi-view panoramas to enable fast house-scale 3D reconstruction.
PaceVGGT: Pre-Alternating-Attention Token Pruning for Visual Geometry Transformers cs.CV · 2026-05-08 · unverdicted · none · ref 10 · internal anchor
PaceVGGT reduces VGGT inference latency by up to 5.1x on ScanNet-50 via pre-AA token pruning with a distilled Token Scorer, per-frame keep budgets, adaptive merge/prune, and feature-guided restoration, while preserving reconstruction quality on ScanNet-50 and 7-Scenes.
Ground4D: Spatially-Grounded Feedforward 4D Reconstruction for Unstructured Off-Road Scenes cs.CV · 2026-05-06 · unverdicted · none · ref 36 · internal anchor
Ground4D resolves temporal conflicts in feedforward 4D Gaussian reconstruction for off-road scenes via voxel-grounded temporal aggregation with intra-voxel softmax and surface normal regularization, outperforming prior methods on ORAD-3D and RELLIS-3D while generalizing zero-shot.
Mem3R: Streaming 3D Reconstruction with Hybrid Memory via Test-Time Training cs.CV · 2026-04-08 · unverdicted · none · ref 45 · internal anchor
Mem3R achieves better long-sequence 3D reconstruction by decoupling tracking and mapping with a hybrid memory of TTT-updated MLP and explicit tokens, reducing model size and trajectory errors.
VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation cs.CV · 2026-03-19 · unverdicted · none · ref 31 · internal anchor
VGGT-360 delivers geometry-consistent zero-shot panoramic depth by converting panoramas into multi-view 3D reconstructions via VGGT models and three plug-and-play correction modules, then reprojecting the result.
STAC: Plug-and-Play Spatio-Temporal Aware Cache Compression for Streaming 3D Reconstruction cs.CV · 2026-03-18 · unverdicted · none · ref 29 · internal anchor
STAC compresses KV caches in streaming 3D reconstruction transformers via temporal token preservation with decayed attention, spatial voxel compression, and chunked multi-frame optimization, delivering 10x memory reduction and 4x faster inference at SOTA quality.
ZipMap: Linear-Time Stateful 3D Reconstruction via Test-Time Training cs.CV · 2026-03-04 · unverdicted · none · ref 57 · internal anchor
ZipMap achieves linear-time bidirectional 3D reconstruction by zipping image collections into a compact stateful representation via test-time training layers.
FFAvatar: Feed-Forward 4D Head Avatar Reconstruction from Sparse Portrait Images cs.CV · 2026-06-29 · unverdicted · none · ref 54 · internal anchor
FFAvatar uses a Transformer-based 3D Gaussian model with alternating attention and sparse-to-dense learning to enable feed-forward, incremental reconstruction of animatable 4D head avatars from sparse portrait images.
Argus: Metric Panoramic 3D Reconstruction for Indoor Scenes cs.CV · 2026-06-29 · unverdicted · none · ref 75 · 2 links · internal anchor
Argus introduces a covisibility module and decomposed pixel-to-world mapping to deliver SOTA metric performance on camera pose, depth, and point cloud tasks using the Realsee3D panoramic dataset.
RegimeVGGT: Layer-Wise Spatially Preserving Redundancy Removal for Visual Geometry Grounded Transformer cs.CV · 2026-06-16 · unverdicted · none · ref 25 · internal anchor
RegimeVGGT applies layer-wise U-shaped compression via saliency-guided banded merging and selectively protected K/V downsampling to deliver 6.7x speedup on VGGT at matched reconstruction quality.
EPS3D: End-to-End Feed-Forward 3D Panoptic Segmentation cs.CV · 2026-06-08 · unverdicted · none · ref 16 · internal anchor
EPS3D is an end-to-end architecture for 3D panoptic segmentation from multi-view images that uses distillation and semantic-instance mutual enhancement to achieve higher benchmark performance and speed than prior methods.
Anchor3R: Streaming 3D Reconstruction with Transient Anchors for Long-Horizon Visual Mapping cs.CV · 2026-06-03 · unverdicted · none · ref 53 · internal anchor
Anchor3R reframes feed-forward 3D reconstruction as current-centric local measurement prediction, using loop-closure and motion averaging to produce coherent global maps from visual streams.
Zero-Shot 3D Question Answering via Hierarchical View-to-Token Transportation cs.CV · 2026-06-02 · unverdicted · none · ref 60 · internal anchor
KeyVT improves zero-shot 3D question answering by hierarchically selecting semantically and geometrically relevant views and using optimal transport to extract representative tokens from them.
DeblurNVS: Geometric Latent Diffusion for Novel View Synthesis from Sparse Motion-Blurred Images cs.CV · 2026-05-31 · unverdicted · none · ref 44 · internal anchor
DeblurNVS restores geometric representations via latent diffusion to enable high-fidelity novel view synthesis directly from sparse motion-blurred inputs.
RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video cs.CV · 2026-05-29 · unverdicted · none · ref 60 · internal anchor
RayDer is a unified transformer backbone for self-supervised static-scene novel view synthesis that absorbs dynamic content as a nuisance factor and shows power-law scaling with data and compute while matching supervised methods in zero-shot settings.
UniT: Unified Geometry Learning with Group Autoregressive Transformer cs.CV · 2026-05-20 · unverdicted · none · ref 20 · internal anchor
UniT unifies online and offline 3D geometry perception via a Group Autoregressive Transformer that processes observation groups with anchor-free point map prediction and a scale-adaptive loss.
Rethinking the State Update Gate for Long-Sequence Recurrent 3D Reconstruction cs.CV · 2026-05-16 · unverdicted · none · ref 16 · internal anchor
A closed-form scalar frame-level gate α_t derived from internal feature changes extends effective memory in recurrent 3D reconstruction and improves accuracy on long sequences up to 4541 frames.
Not All Tasks Quantize Equally: Fisher-Guided Quantization for Visual Geometry Transformer cs.CV · 2026-05-15 · unverdicted · none · ref 8 · 2 links · internal anchor
FGQ applies diagonal Fisher information to guide learnable affine transformations in PTQ for multi-task VGGT, yielding up to 39% relative gains over baselines at 4-bit quantization.
4DVGGT-D: 4D Visual Geometry Transformer with Improved Dynamic Depth Estimation cs.CV · 2026-05-12 · unverdicted · none · ref 17 · internal anchor
A training-free progressive decoupling framework improves dynamic depth estimation in 4D reconstruction via mask-guided pose decoupling, topological subspace surgery, and Bayesian fusion, yielding better point-cloud metrics on benchmarks.
Attention Itself Could Retrieve. RetrieveVGGT: Training-Free Long Context Streaming 3D Reconstruction via Query-Key Similarity Retrieval cs.CV · 2026-05-10 · unverdicted · none · ref 55 · internal anchor
RetrieveVGGT enables constant-memory long-context streaming 3D reconstruction by retrieving relevant frames via query-key similarities in VGGT's first attention layer, outperforming StreamVGGT and others.
Spark3R: Asymmetric Token Reduction Makes Fast Feed-Forward 3D Reconstruction cs.CV · 2026-05-07 · unverdicted · none · ref 22 · 2 links · internal anchor
Spark3R achieves up to 28x speedup on 1000-frame 3D reconstruction inputs by asymmetrically reducing query and key-value tokens in Vision Transformers while keeping competitive quality.
SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric Awareness cs.CV · 2026-04-29 · unverdicted · none · ref 43 · internal anchor
SpatialFusion internalizes 3D geometric awareness into unified image generation models by pairing an MLLM with a spatial transformer that produces depth maps to constrain diffusion generation.
Geometric Context Transformer for Streaming 3D Reconstruction cs.CV · 2026-04-15 · unverdicted · none · ref 59 · internal anchor
LingBot-Map is a streaming 3D reconstruction model built on a geometric context transformer that combines anchor context, pose-reference window, and trajectory memory to deliver accurate, drift-resistant results at 20 FPS over sequences longer than 10,000 frames.
Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective cs.CV · 2026-04-15 · unverdicted · none · ref 166 · internal anchor
The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temporal-aware modeling.
Robust 4D Visual Geometry Transformer with Uncertainty-Aware Priors cs.CV · 2026-04-10 · unverdicted · none · ref 15 · internal anchor
The Robust 4D Visual Geometry Transformer with Uncertainty-Aware Priors outperforms prior methods on dynamic benchmarks by cutting Mean Accuracy error 13.43% and raising segmentation F-measure 10.49% via three uncertainty mechanisms while keeping feed-forward speed.
Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction cs.CV · 2026-04-09 · unverdicted · none · ref 61 · internal anchor
Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video sequences.
HD-VGGT: High-Resolution Visual Geometry Transformer cs.CV · 2026-03-28 · unverdicted · none · ref 7 · internal anchor
HD-VGGT achieves state-of-the-art high-resolution 3D reconstruction from image collections via a dual-branch architecture that predicts coarse geometry at low resolution and refines details at high resolution while modulating unreliable features.
High-Fidelity 4D Hand-Object Capture via Multi-View Spatiotemporal Tracking and Physics-Aware Gaussians cs.CV · 2026-06-14 · unverdicted · none · ref 46 · internal anchor
A multi-view feed-forward transformer provides initial poses and geometry from calibrated videos, followed by physics-aware Gaussian optimization with tetrahedral and collision constraints to produce robust 4D hand-object reconstructions.
$R^3$: 3D Reconstruction via Relative Regression cs.CV · 2026-05-26 · unverdicted · none · ref 54 · internal anchor
R³ uses relative regression with confidence-weighted constraints from an MLP to support long-context offline and streaming 3D reconstruction without global coordinate assumptions.
Global Structure-from-Motion Meets Feedforward Reconstruction cs.CV · 2026-05-25 · unverdicted · none · ref 40 · internal anchor
A new SfM pipeline combining classical and feedforward methods reports state-of-the-art results across multiple datasets and is released as open source.
HorizonStream: Long-Horizon Attention for Streaming 3D Reconstruction cs.CV · 2026-05-22 · unverdicted · none · ref 33 · internal anchor
HorizonStream is a long-horizon Transformer that factorizes geometric evidence influence into channel-wise linear attention for long-range temporal propagation and local spatiotemporal attention for short-range matching, claiming stable generalization from 48-frame training to over 10,000-frame test
StreamCacheVGGT: Streaming Visual Geometry Transformers with Robust Scoring and Hybrid Cache Compression cs.CV · 2026-04-16 · unverdicted · none · ref 18 · internal anchor
StreamCacheVGGT improves streaming 3D geometry reconstruction accuracy and stability under fixed memory by using cross-layer token importance scoring and hybrid cache compression instead of pure eviction.
