Streaming 4D Visual Geometry Transformer

Dong Zhuo, Wenzhao Zheng, Jiahe Guo, Yuqi Wu, Jie Zhou, Jiwen Lu · 2025 · cs.CV · arXiv 2507.11539

Canonical reference. 71% of citing Pith papers cite this work as background.

37 Pith papers citing it

abstract

Perceiving and reconstructing 3D geometry from videos is a fundamental yet challenging computer vision task. To facilitate interactive and low-latency applications, we propose a streaming visual geometry transformer that shares a similar philosophy with autoregressive large language models. We explore a simple and efficient design and employ a causal transformer architecture to process the input sequence in an online manner. We use temporal causal attention and cache the historical keys and values as implicit memory to enable efficient streaming long-term 3D reconstruction. This design can handle low-latency 3D reconstruction by incrementally integrating historical information while maintaining high-quality spatial consistency. For efficient training, we propose to distill knowledge from the dense bidirectional visual geometry grounded transformer (VGGT) to our causal model. For inference, our model supports the migration of optimized efficient attention operators (e.g., FlashAttention) from large language models. Extensive experiments on various 3D geometry perception benchmarks demonstrate that our model enhances inference speed in online scenarios while maintaining competitive performance, thereby facilitating scalable and interactive 3D vision systems. Code is available at: https://github.com/wzzheng/StreamVGGT.
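
The caching idea in the abstract can be illustrated with a small sketch. This is not the StreamVGGT code. It is a single-head toy in plain Python that assumes identity projections (queries, keys, and values are the raw tokens), and the names `StreamingAttention`, `attend`, and `block_causal_attention` are hypothetical. It shows that attending over cached keys and values one frame at a time gives the same output as recomputing frame-level causal attention from scratch.

```python
import math
import random


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def attend(queries, keys, values):
    # Scaled dot-product attention: every query attends to all given keys.
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values)) for j in range(d)])
    return out


class StreamingAttention:
    # Hypothetical layer: keeps keys/values of all past frames as implicit memory.
    def __init__(self):
        self.k_cache = []
        self.v_cache = []

    def step(self, frame_tokens):
        # Assumption: identity projections, so tokens serve as Q, K, and V.
        self.k_cache.extend(frame_tokens)
        self.v_cache.extend(frame_tokens)
        # Tokens of the new frame see every cached frame plus their own frame.
        return attend(frame_tokens, self.k_cache, self.v_cache)


def block_causal_attention(frames):
    # Reference without a cache: frame t is recomputed against frames 0..t.
    outputs = []
    for t in range(len(frames)):
        context = [tok for f in frames[: t + 1] for tok in f]
        outputs.append(attend(frames[t], context, context))
    return outputs


random.seed(0)
# 5 frames, 3 tokens per frame, 4 channels per token (arbitrary toy sizes).
frames = [[[random.gauss(0, 1) for _ in range(4)] for _ in range(3)] for _ in range(5)]

layer = StreamingAttention()
streamed = [layer.step(f) for f in frames]
reference = block_causal_attention(frames)

max_diff = max(
    abs(a - b)
    for s, r in zip(streamed, reference)
    for ts, tr in zip(s, r)
    for a, b in zip(ts, tr)
)
print("cached tokens:", len(layer.k_cache))
print("streaming matches recomputation:", max_diff < 1e-9)
```

In this sketch each `step` call attends over the cache instead of reprocessing earlier frames, which is the property that makes per-frame online inference cheap. The cache itself grows linearly with the number of frames, which is the cost that several of the citing papers below (cache compression, eviction, retrieval) set out to reduce.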

citation-role summary

background 5 baseline 2

citation-polarity summary

background 5 baseline 2

representative citing papers

SpatialBench: Is Your Spatial Foundation Model an All-Round Player?

cs.CV · 2026-05-26 · unverdicted · novelty 8.0

SpatialBench evaluates 41 spatial foundation models across 6 paradigms and 5 task suites, finds they are not all-round players, and introduces the DA-Next-5M dataset plus DA-Next baseline model.

Geo-Align: Video Generation Alignment via Metric Geometry Reward

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

Geo-Align applies RL with a perceptual reward derived from 3D camera trajectory estimation to improve controllability and fidelity in video generation without paired training data.

Mamba-VGGT: Persistent Long-Sequence Video Geometry Grounded Transformer via External Sliding Window Mamba Memory

cs.CV · 2026-05-17 · unverdicted · novelty 7.0

Mamba-VGGT introduces a Sliding Window Mamba memory module and Zero-Init Spatial Memory Injector to enable persistent long-range geometric reasoning in VGGT for extended video sequences.

3D-Belief: Embodied Belief Inference via Generative 3D World Modeling

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

3D-Belief maintains and updates explicit 3D beliefs about partially observed environments to enable multi-hypothesis imagination and improved performance on embodied tasks.

PaceVGGT: Pre-Alternating-Attention Token Pruning for Visual Geometry Transformers

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

PaceVGGT reduces VGGT inference latency by up to 5.1x on ScanNet-50 via pre-AA token pruning with a distilled Token Scorer, per-frame keep budgets, adaptive merge/prune, and feature-guided restoration, while preserving reconstruction quality on ScanNet-50 and 7-Scenes.

GlobalSplat: Efficient Feed-Forward 3D Gaussian Splatting via Global Scene Tokens

cs.CV · 2026-04-16 · unverdicted · novelty 7.0

GlobalSplat achieves competitive novel-view synthesis on RealEstate10K and ACID using only 16K Gaussians via global scene tokens and coarse-to-fine training, with a 4MB footprint and under 78ms inference.

Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation

cs.LG · 2026-04-11 · unverdicted · novelty 7.0

The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.

Mem3R: Streaming 3D Reconstruction with Hybrid Memory via Test-Time Training

cs.CV · 2026-04-08 · unverdicted · novelty 7.0

Mem3R achieves better long-sequence 3D reconstruction by decoupling tracking and mapping with a hybrid memory of TTT-updated MLP and explicit tokens, reducing model size and trajectory errors.

AnyImageNav: Any-View Geometry for Precise Last-Meter Image-Goal Navigation

cs.RO · 2026-04-07 · unverdicted · novelty 7.0

AnyImageNav uses a semantic-to-geometric cascade with 3D multi-view foundation models to recover precise 6-DoF poses from goal images, achieving 0.27m position error and state-of-the-art success rates on Gibson and HM3D benchmarks.

STAC: Plug-and-Play Spatio-Temporal Aware Cache Compression for Streaming 3D Reconstruction

cs.CV · 2026-03-18 · unverdicted · novelty 7.0

STAC compresses KV caches in streaming 3D reconstruction transformers via temporal token preservation with decayed attention, spatial voxel compression, and chunked multi-frame optimization, delivering 10x memory reduction and 4x faster inference at SOTA quality.

FastVGGT: Training-Free Acceleration of Visual Geometry Transformer

cs.CV · 2025-09-02 · conditional · novelty 7.0

FastVGGT achieves 4x speedup on VGGT for 1000-image inputs using training-free token merging tailored to 3D architectures while reducing error accumulation.

Argus: Metric Panoramic 3D Reconstruction for Indoor Scenes

cs.CV · 2026-06-29 · unverdicted · novelty 6.0

Argus is a feed-forward network for metric panoramic 3D reconstruction, trained on the new Realsee3D dataset of 10K indoor scenes and using a learned covisibility module plus decomposed mapping supervision to achieve SOTA on camera pose, depth, and point cloud tasks.

HiReFF: High-Resolution Feedforward Human Reconstruction from Uncalibrated Sparse-View Video

cs.CV · 2026-06-28 · unverdicted · novelty 6.0

HiReFF presents a feed-forward framework for 2K human video reconstruction from uncalibrated sparse-view videos via scale-synchronized calibration, Gaussian masking, and high-resolution side-tuning.

UniT: Unified Geometry Learning with Group Autoregressive Transformer

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

UniT unifies online and offline 3D geometry perception via a Group Autoregressive Transformer that processes observation groups with anchor-free point map prediction and a scale-adaptive loss.

OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond

cs.LG · 2026-05-19 · unverdicted · novelty 6.0

OScaR mitigates token norm imbalance via canalized rotation and omni-token scaling to enable near-lossless INT2 KV cache quantization with up to 3x decoding speedup and 5.3x memory reduction.

LongDPM: Overlap-Aware 4D Reconstruction from Long Monocular Videos

cs.CV · 2026-05-17 · unverdicted · novelty 6.0

LongDPM introduces an overlap-aware chunk-based framework that registers and fuses local dynamic reconstructions to achieve coherent long-range 4D geometry and tracking from monocular video.

Rethinking the State Update Gate for Long-Sequence Recurrent 3D Reconstruction

cs.CV · 2026-05-16 · unverdicted · novelty 6.0

A closed-form scalar frame-level gate α_t derived from internal feature changes extends effective memory in recurrent 3D reconstruction and improves accuracy on long sequences up to 4541 frames.

VGGT-CD: Training-Free Robust Registration for 3D Change Detection

cs.CV · 2026-05-16 · unverdicted · novelty 6.0

VGGT-CD decouples cross-temporal registration from dynamic changes using VGGT reconstructions, achieving 44% and 59% lower Absolute Trajectory Error outdoors and indoors on an 11-scene benchmark while running over 6 times faster.

Attention Itself Could Retrieve. RetrieveVGGT: Training-Free Long Context Streaming 3D Reconstruction via Query-Key Similarity Retrieval

cs.CV · 2026-05-10 · unverdicted · novelty 6.0

RetrieveVGGT enables constant-memory long-context streaming 3D reconstruction by retrieving relevant frames via query-key similarities in VGGT's first attention layer, outperforming StreamVGGT and others.

Spark3R: Asymmetric Token Reduction Makes Fast Feed-Forward 3D Reconstruction

cs.CV · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

Spark3R achieves up to 28x speedup on 1000-frame 3D reconstruction inputs by asymmetrically reducing query and key-value tokens in Vision Transformers while keeping competitive quality.

Vista4D: Video Reshooting with 4D Point Clouds

cs.CV · 2026-04-23 · unverdicted · novelty 6.0

Vista4D re-synthesizes dynamic videos from new viewpoints by grounding them in a 4D point cloud built with static segmentation and multiview training.

Geometry-Guided 3D Visual Token Pruning for Video-Language Models

cs.CV · 2026-04-20 · conditional · novelty 6.0

Geo3DPruner uses geometry-aware global attention and two-stage voxel pruning to remove 90% of visual tokens from spatial videos while keeping over 90% of original performance on 3D scene benchmarks.

Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective

cs.CV · 2026-04-15 · unverdicted · novelty 6.0

The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temporal-aware modeling.

Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction

cs.CV · 2026-04-09 · unverdicted · novelty 6.0

Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video sequences.

citing papers explorer

Showing 37 of 37 citing papers.

SpatialBench: Is Your Spatial Foundation Model an All-Round Player? cs.CV · 2026-05-26 · unverdicted · none · ref 142 · internal anchor
SpatialBench evaluates 41 spatial foundation models across 6 paradigms and 5 task suites, finds they are not all-round players, and introduces the DA-Next-5M dataset plus DA-Next baseline model.
Geo-Align: Video Generation Alignment via Metric Geometry Reward cs.CV · 2026-05-22 · unverdicted · none · ref 33 · internal anchor
Geo-Align applies RL with a perceptual reward derived from 3D camera trajectory estimation to improve controllability and fidelity in video generation without paired training data.
Mamba-VGGT: Persistent Long-Sequence Video Geometry Grounded Transformer via External Sliding Window Mamba Memory cs.CV · 2026-05-17 · unverdicted · none · ref 41 · internal anchor
Mamba-VGGT introduces a Sliding Window Mamba memory module and Zero-Init Spatial Memory Injector to enable persistent long-range geometric reasoning in VGGT for extended video sequences.
3D-Belief: Embodied Belief Inference via Generative 3D World Modeling cs.CV · 2026-05-12 · unverdicted · none · ref 27 · internal anchor
3D-Belief maintains and updates explicit 3D beliefs about partially observed environments to enable multi-hypothesis imagination and improved performance on embodied tasks.
PaceVGGT: Pre-Alternating-Attention Token Pruning for Visual Geometry Transformers cs.CV · 2026-05-08 · unverdicted · none · ref 19 · internal anchor
PaceVGGT reduces VGGT inference latency by up to 5.1x on ScanNet-50 via pre-AA token pruning with a distilled Token Scorer, per-frame keep budgets, adaptive merge/prune, and feature-guided restoration, while preserving reconstruction quality on ScanNet-50 and 7-Scenes.
GlobalSplat: Efficient Feed-Forward 3D Gaussian Splatting via Global Scene Tokens cs.CV · 2026-04-16 · unverdicted · none · ref 44 · internal anchor
GlobalSplat achieves competitive novel-view synthesis on RealEstate10K and ACID using only 16K Gaussians via global scene tokens and coarse-to-fine training, with a 4MB footprint and under 78ms inference.
Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation cs.LG · 2026-04-11 · unverdicted · none · ref 195 · internal anchor
The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.
Mem3R: Streaming 3D Reconstruction with Hybrid Memory via Test-Time Training cs.CV · 2026-04-08 · unverdicted · none · ref 80 · internal anchor
Mem3R achieves better long-sequence 3D reconstruction by decoupling tracking and mapping with a hybrid memory of TTT-updated MLP and explicit tokens, reducing model size and trajectory errors.
AnyImageNav: Any-View Geometry for Precise Last-Meter Image-Goal Navigation cs.RO · 2026-04-07 · unverdicted · none · ref 44 · internal anchor
AnyImageNav uses a semantic-to-geometric cascade with 3D multi-view foundation models to recover precise 6-DoF poses from goal images, achieving 0.27m position error and state-of-the-art success rates on Gibson and HM3D benchmarks.
STAC: Plug-and-Play Spatio-Temporal Aware Cache Compression for Streaming 3D Reconstruction cs.CV · 2026-03-18 · unverdicted · none · ref 52 · internal anchor
STAC compresses KV caches in streaming 3D reconstruction transformers via temporal token preservation with decayed attention, spatial voxel compression, and chunked multi-frame optimization, delivering 10x memory reduction and 4x faster inference at SOTA quality.
FastVGGT: Training-Free Acceleration of Visual Geometry Transformer cs.CV · 2025-09-02 · conditional · none · ref 33 · internal anchor
FastVGGT achieves 4x speedup on VGGT for 1000-image inputs using training-free token merging tailored to 3D architectures while reducing error accumulation.
Argus: Metric Panoramic 3D Reconstruction for Indoor Scenes cs.CV · 2026-06-29 · unverdicted · none · ref 107 · internal anchor
Argus is a feed-forward network for metric panoramic 3D reconstruction, trained on the new Realsee3D dataset of 10K indoor scenes and using a learned covisibility module plus decomposed mapping supervision to achieve SOTA on camera pose, depth, and point cloud tasks.
HiReFF: High-Resolution Feedforward Human Reconstruction from Uncalibrated Sparse-View Video cs.CV · 2026-06-28 · unverdicted · none · ref 79 · internal anchor
HiReFF presents a feed-forward framework for 2K human video reconstruction from uncalibrated sparse-view videos via scale-synchronized calibration, Gaussian masking, and high-resolution side-tuning.
UniT: Unified Geometry Learning with Group Autoregressive Transformer cs.CV · 2026-05-20 · unverdicted · none · ref 7 · internal anchor
UniT unifies online and offline 3D geometry perception via a Group Autoregressive Transformer that processes observation groups with anchor-free point map prediction and a scale-adaptive loss.
OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond cs.LG · 2026-05-19 · unverdicted · none · ref 78 · internal anchor
OScaR mitigates token norm imbalance via canalized rotation and omni-token scaling to enable near-lossless INT2 KV cache quantization with up to 3x decoding speedup and 5.3x memory reduction.
LongDPM: Overlap-Aware 4D Reconstruction from Long Monocular Videos cs.CV · 2026-05-17 · unverdicted · none · ref 51 · internal anchor
LongDPM introduces an overlap-aware chunk-based framework that registers and fuses local dynamic reconstructions to achieve coherent long-range 4D geometry and tracking from monocular video.
Rethinking the State Update Gate for Long-Sequence Recurrent 3D Reconstruction cs.CV · 2026-05-16 · unverdicted · none · ref 31 · internal anchor
A closed-form scalar frame-level gate α_t derived from internal feature changes extends effective memory in recurrent 3D reconstruction and improves accuracy on long sequences up to 4541 frames.
VGGT-CD: Training-Free Robust Registration for 3D Change Detection cs.CV · 2026-05-16 · unverdicted · none · ref 16 · internal anchor
VGGT-CD decouples cross-temporal registration from dynamic changes using VGGT reconstructions, achieving 44% and 59% lower Absolute Trajectory Error outdoors and indoors on an 11-scene benchmark while running over 6 times faster.
Attention Itself Could Retrieve. RetrieveVGGT: Training-Free Long Context Streaming 3D Reconstruction via Query-Key Similarity Retrieval cs.CV · 2026-05-10 · unverdicted · none · ref 27 · internal anchor
RetrieveVGGT enables constant-memory long-context streaming 3D reconstruction by retrieving relevant frames via query-key similarities in VGGT's first attention layer, outperforming StreamVGGT and others.
Spark3R: Asymmetric Token Reduction Makes Fast Feed-Forward 3D Reconstruction cs.CV · 2026-05-07 · unverdicted · none · ref 27 · 2 links · internal anchor
Spark3R achieves up to 28x speedup on 1000-frame 3D reconstruction inputs by asymmetrically reducing query and key-value tokens in Vision Transformers while keeping competitive quality.
Vista4D: Video Reshooting with 4D Point Clouds cs.CV · 2026-04-23 · unverdicted · none · ref 46 · internal anchor
Vista4D re-synthesizes dynamic videos from new viewpoints by grounding them in a 4D point cloud built with static segmentation and multiview training.
Geometry-Guided 3D Visual Token Pruning for Video-Language Models cs.CV · 2026-04-20 · conditional · none · ref 53 · internal anchor
Geo3DPruner uses geometry-aware global attention and two-stage voxel pruning to remove 90% of visual tokens from spatial videos while keeping over 90% of original performance on 3D scene benchmarks.
Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective cs.CV · 2026-04-15 · unverdicted · none · ref 103 · internal anchor
The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temporal-aware modeling.
Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction cs.CV · 2026-04-09 · unverdicted · none · ref 101 · internal anchor
Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video sequences.
Fast Spatial Memory with Elastic Test-Time Training cs.CV · 2026-04-08 · unverdicted · none · ref 78 · internal anchor
Elastic Test-Time Training stabilizes test-time updates via an elastic prior and moving-average anchor, enabling Fast Spatial Memory for scalable long-sequence 4D reconstruction with reduced memory use and fewer shortcuts.
DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale cs.CV · 2026-04-01 · unverdicted · none · ref 92 · internal anchor
DVGT-2 is a streaming vision-geometry-action model that jointly reconstructs dense 3D geometry and plans trajectories online, achieving better reconstruction than prior batch methods while transferring directly to planning benchmarks without fine-tuning.
Co-Me: Confidence-Guided Token Merging for Visual Geometric Transformers cs.CV · 2025-11-18 · unverdicted · none · ref 44 · internal anchor
Co-Me distills a confidence predictor to selectively merge low-confidence tokens in visual geometric transformers, delivering up to 21.5x speedup on VGGT and 20.4x on Pi3 while preserving spatial coverage and performance.
Depth Anything 3: Recovering the Visual Space from Any Views cs.CV · 2025-11-13 · unverdicted · none · ref 19 · internal anchor
DA3 recovers consistent visual geometry from arbitrary views via a vanilla DINO transformer and depth-ray target, setting new SOTA on a visual geometry benchmark while outperforming DA2 on monocular depth.
Towards Consistent Video Geometry Estimation cs.CV · 2026-05-28 · unverdicted · none · ref 94 · internal anchor
ViGeo is a feed-forward transformer for video geometry that introduces dynamic chunking attention and a completion-based data refinement framework to achieve SOTA on depth, normals, and point map estimation.
$R^3$: 3D Reconstruction via Relative Regression cs.CV · 2026-05-26 · unverdicted · none · ref 82 · internal anchor
R³ uses relative regression with confidence-weighted constraints from an MLP to support long-context offline and streaming 3D reconstruction without global coordinate assumptions.
Global Structure-from-Motion Meets Feedforward Reconstruction cs.CV · 2026-05-25 · unverdicted · none · ref 58 · internal anchor
A new SfM pipeline combining classical and feedforward methods reports state-of-the-art results across multiple datasets and is released as open source.
HorizonStream: Long-Horizon Attention for Streaming 3D Reconstruction cs.CV · 2026-05-22 · unverdicted · none · ref 62 · internal anchor
HorizonStream is a long-horizon Transformer that factorizes geometric evidence influence into channel-wise linear attention for long-range temporal propagation and local spatiotemporal attention for short-range matching, claiming stable generalization from 48-frame training to over 10,000-frame test
StreamCacheVGGT: Streaming Visual Geometry Transformers with Robust Scoring and Hybrid Cache Compression cs.CV · 2026-04-16 · unverdicted · none · ref 6 · internal anchor
StreamCacheVGGT improves streaming 3D geometry reconstruction accuracy and stability under fixed memory by using cross-layer token importance scoring and hybrid cache compression instead of pure eviction.
TTT3R: 3D Reconstruction as Test-Time Training cs.CV · 2025-09-30 · unverdicted · none · ref 106 · internal anchor
TTT3R derives a closed-form learning rate from memory-observation alignment confidence to boost length generalization in RNN-based 3D reconstruction by 2x in global pose estimation.
Stream3D: Sequential Multi-View 3D Generation via Evidential Memory cs.CV · 2026-05-20 · unreviewed · ref 89 · internal anchor
FrameVGGT: Coherence-Preserving Memory for Bounded Streaming Geometry cs.CV · 2026-03-08 · unreviewed · ref 14 · internal anchor
PAGE-4D: VGGT-4D Perception via Disentangled Pose and Geometry Estimation cs.CV · 2025-10-20 · unreviewed · ref 20 · internal anchor
