VGGT-Long: Chunk it, Loop it, Align it -- Pushing VGGT's Limits on Kilometer-scale Long RGB Sequences

Kai Deng, Zexin Ti, Jiawei Xu, Jian Yang, Jin Xie · 2025 · cs.CV · arXiv 2507.16443

23 Pith papers cite this work. Polarity classification is still indexing.

abstract

Foundation models for 3D vision have recently demonstrated remarkable capabilities in 3D perception. However, extending these models to large-scale RGB stream 3D reconstruction remains challenging due to memory limitations. In this work, we propose VGGT-Long, a simple yet effective system that pushes the limits of monocular 3D reconstruction to kilometer-scale, unbounded outdoor environments. Our approach addresses the scalability bottlenecks of existing models through a chunk-based processing strategy combined with overlapping alignment and lightweight loop closure optimization. Without requiring camera calibration, depth supervision, or model retraining, VGGT-Long achieves trajectory and reconstruction performance comparable to traditional methods. We evaluate our method on the KITTI, Waymo, and Virtual KITTI datasets. VGGT-Long not only runs successfully on long RGB sequences where foundation models typically fail, but also produces accurate and consistent geometry across various conditions. Our results highlight the potential of leveraging foundation models for scalable monocular 3D scene reconstruction in real-world settings, especially for autonomous driving scenarios. Code is available at https://github.com/DengKaiCQ/VGGT-Long.
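
The chunk-based strategy with overlapping alignment mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `split_into_chunks` and `shared_frames`, and the chunk size and overlap values, are hypothetical and chosen only for illustration. The sketch shows how a long frame sequence might be split into fixed-size chunks whose neighbours share a few frames, which is what gives a later alignment step something to match on.

```python
from typing import List, Tuple


def split_into_chunks(num_frames: int, chunk_size: int, overlap: int) -> List[Tuple[int, int]]:
    """Return half-open (start, end) frame ranges covering [0, num_frames).

    Consecutive ranges share `overlap` frames (hypothetical helper, not from the paper).
    """
    if num_frames <= 0:
        return []
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    stride = chunk_size - overlap  # how far each new chunk advances
    chunks = []
    start = 0
    while True:
        end = min(start + chunk_size, num_frames)
        chunks.append((start, end))
        if end == num_frames:  # last chunk reaches the end of the sequence
            break
        start += stride
    return chunks


def shared_frames(a: Tuple[int, int], b: Tuple[int, int]) -> List[int]:
    """Frame indices present in both chunks, i.e. the overlap available for alignment."""
    return list(range(max(a[0], b[0]), min(a[1], b[1])))


if __name__ == "__main__":
    # Illustrative values only: 10 frames, chunks of 4, 1 shared frame.
    chunks = split_into_chunks(10, 4, 1)
    print(chunks)
    print(shared_frames(chunks[0], chunks[1]))
```

In a pipeline of this kind, each chunk would be small enough to fit in memory for a single model pass, and the shared frames would supply corresponding predictions from which a transform between neighbouring chunks could be estimated. How VGGT-Long actually performs that alignment and its loop closure is described in the paper and repository, not here.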

citation-role summary

background 4

citation-polarity summary

background 4

representative citing papers

Geo-Align: Video Generation Alignment via Metric Geometry Reward

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

Geo-Align applies RL with a perceptual reward derived from 3D camera trajectory estimation to improve controllability and fidelity in video generation without paired training data.

Mamba-VGGT: Persistent Long-Sequence Video Geometry Grounded Transformer via External Sliding Window Mamba Memory

cs.CV · 2026-05-17 · unverdicted · novelty 7.0

Mamba-VGGT introduces a Sliding Window Mamba memory module and Zero-Init Spatial Memory Injector to enable persistent long-range geometric reasoning in VGGT for extended video sequences.

LEXI-SG: Monocular 3D Scene Graph Mapping with Room-Guided Feed-Forward Reconstruction

cs.RO · 2026-05-13 · unverdicted · novelty 7.0

LEXI-SG is the first monocular RGB system for dense open-vocabulary 3D scene graphs that partitions scenes into rooms and performs feed-forward reconstruction per room before global factor-graph alignment.

PaceVGGT: Pre-Alternating-Attention Token Pruning for Visual Geometry Transformers

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

PaceVGGT reduces VGGT inference latency by up to 5.1x on ScanNet-50 via pre-AA token pruning with a distilled Token Scorer, per-frame keep budgets, adaptive merge/prune, and feature-guided restoration, while preserving reconstruction quality on ScanNet-50 and 7-Scenes.

Keep It CALM: Toward Calibration-Free Kilometer-Level SLAM with Visual Geometry Foundation Models via an Assistant Eye

cs.RO · 2026-04-16 · unverdicted · novelty 7.0

CAL2M achieves calibration-free kilometer-level SLAM by using an assistant eye for scale, epipolar-guided intrinsic correction, and anchor propagation for nonlinear sub-map alignment.

STAC: Plug-and-Play Spatio-Temporal Aware Cache Compression for Streaming 3D Reconstruction

cs.CV · 2026-03-18 · unverdicted · novelty 7.0

STAC compresses KV caches in streaming 3D reconstruction transformers via temporal token preservation with decayed attention, spatial voxel compression, and chunked multi-frame optimization, delivering 10x memory reduction and 4x faster inference at SOTA quality.

FastVGGT: Training-Free Acceleration of Visual Geometry Transformer

cs.CV · 2025-09-02 · conditional · novelty 7.0

FastVGGT achieves 4x speedup on VGGT for 1000-image inputs using training-free token merging tailored to 3D architectures while reducing error accumulation.

RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video

cs.CV · 2026-05-29 · unverdicted · novelty 6.0

RayDer is a unified transformer backbone for self-supervised static-scene novel view synthesis that absorbs dynamic content as a nuisance factor and shows power-law scaling with data and compute while matching supervised methods in zero-shot settings.

UniT: Unified Geometry Learning with Group Autoregressive Transformer

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

UniT unifies online and offline 3D geometry perception via a Group Autoregressive Transformer that processes observation groups with anchor-free point map prediction and a scale-adaptive loss.

LongDPM: Overlap-Aware 4D Reconstruction from Long Monocular Videos

cs.CV · 2026-05-17 · unverdicted · novelty 6.0

LongDPM introduces an overlap-aware chunk-based framework that registers and fuses local dynamic reconstructions to achieve coherent long-range 4D geometry and tracking from monocular video.

Attention Itself Could Retrieve. RetrieveVGGT: Training-Free Long Context Streaming 3D Reconstruction via Query-Key Similarity Retrieval

cs.CV · 2026-05-10 · unverdicted · novelty 6.0

RetrieveVGGT enables constant-memory long-context streaming 3D reconstruction by retrieving relevant frames via query-key similarities in VGGT's first attention layer, outperforming StreamVGGT and others.

Spark3R: Asymmetric Token Reduction Makes Fast Feed-Forward 3D Reconstruction

cs.CV · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

Spark3R achieves up to 28x speedup on 1000-frame 3D reconstruction inputs by asymmetrically reducing query and key-value tokens in Vision Transformers while keeping competitive quality.

Ray-Aware Pointer Memory with Adaptive Updates for Streaming 3D Reconstruction

cs.CV · 2026-05-07 · unverdicted · novelty 6.0 · 3 refs

The paper proposes ray-aware pointer memory with adaptive retain-or-replace updates to improve long-term stability and pose accuracy in streaming 3D reconstruction.

Geometric Context Transformer for Streaming 3D Reconstruction

cs.CV · 2026-04-15 · unverdicted · novelty 6.0

LingBot-Map is a streaming 3D reconstruction model built on a geometric context transformer that combines anchor context, pose-reference window, and trajectory memory to deliver accurate, drift-resistant results at 20 FPS over sequences longer than 10,000 frames.

Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction

cs.CV · 2026-04-09 · unverdicted · novelty 6.0

Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video sequences.

HorizonStream: Long-Horizon Attention for Streaming 3D Reconstruction

cs.CV · 2026-05-22 · unverdicted · novelty 5.0

HorizonStream is a long-horizon Transformer that factorizes geometric evidence influence into channel-wise linear attention for long-range temporal propagation and local spatiotemporal attention for short-range matching, claiming stable generalization from 48-frame training to over 10,000-frame test

Real-Scale Island Area and Coastline Estimation using Only its Place Name or Coordinates

cs.CV · 2026-05-11 · unverdicted · novelty 5.0

A monocular vision system estimates real-scale island area and coastline length with around 10% error using only place name or coordinates input via automated image capture, point cloud generation, and trajectory alignment.

ReorgGS: Equivalent Distribution Reorganization for 3D Gaussian Splatting

cs.CV · 2026-05-09 · unverdicted · novelty 5.0

ReorgGS reorganizes the Gaussian distribution in converged 3DGS models by resampling centers and covariances to reduce parameterization degeneration and enable better subsequent optimization.

MR.ScaleMaster: Scale-Consistent Collaborative Mapping from Crowd-Sourced Monocular Videos

cs.RO · 2026-04-13 · unverdicted · novelty 5.0

MR.ScaleMaster adds a false-loop alarm and per-session Sim(3) scale estimation to enable accurate multi-agent monocular mapping, showing 7.2x ATE improvement on KITTI with up to 15 agents.

MonoEM-GS: Monocular Expectation-Maximization Gaussian Splatting SLAM

cs.RO · 2026-04-12 · unverdicted · novelty 5.0

MonoEM-GS stabilizes view-dependent geometry from foundation models inside a global Gaussian Splatting representation via EM and adds multi-modal features for in-place open-set segmentation.

TTT3R: 3D Reconstruction as Test-Time Training

cs.CV · 2025-09-30 · unverdicted · novelty 5.0

TTT3R derives a closed-form learning rate from memory-observation alignment confidence to boost length generalization in RNN-based 3D reconstruction by 2x in global pose estimation.

Rapid Forest Fuel Load Estimation via Virtual Remote Sensing and Metric-Scale Feed-Forward 3D Reconstruction

cs.CV · 2026-05-11 · unverdicted · novelty 4.0

A pipeline using virtual remote sensing from Google Earth Studio, Pi-Long 3D reconstruction, metric alignment, and watershed segmentation estimates forest fuel load as a scalable alternative to traditional surveys.

FrameVGGT: Coherence-Preserving Memory for Bounded Streaming Geometry

cs.CV · 2026-03-08
