VGGT-Long: Chunk it, Loop it, Align it -- Pushing VGGT's Limits on Kilometer-scale Long RGB Sequences

34 Pith papers cite this work. Polarity classification is still indexing.

abstract

Foundation models for 3D vision have recently demonstrated remarkable capabilities in 3D perception. However, extending these models to large-scale 3D reconstruction from RGB streams remains challenging due to memory limitations. In this work, we propose VGGT-Long, a simple yet effective system that pushes the limits of monocular 3D reconstruction to kilometer-scale, unbounded outdoor environments. Our approach addresses the scalability bottlenecks of existing models through a chunk-based processing strategy combined with overlapping alignment and lightweight loop closure optimization. Without requiring camera calibration, depth supervision, or model retraining, VGGT-Long achieves trajectory and reconstruction performance comparable to traditional methods. We evaluate our method on the KITTI, Waymo, and Virtual KITTI datasets. VGGT-Long not only runs successfully on long RGB sequences where foundation models typically fail, but also produces accurate and consistent geometry across various conditions. Our results highlight the potential of leveraging foundation models for scalable monocular 3D scene reconstruction in real-world settings, especially for autonomous driving scenarios. Code is available at https://github.com/DengKaiCQ/VGGT-Long.
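The chunk-based strategy with overlapping alignment mentioned in the abstract can be illustrated with a minimal sketch of the first step only: splitting a long frame sequence into fixed-size chunks that share a few frames with their neighbors. This is not taken from the VGGT-Long code; the function names `split_into_chunks` and `shared_frames`, and the `chunk_size` and `overlap` parameters, are hypothetical and chosen for illustration.

```python
def split_into_chunks(num_frames, chunk_size, overlap):
    """Return (start, end) frame-index pairs, end exclusive, covering the sequence.

    Hypothetical helper: consecutive chunks share `overlap` frames, which a
    later step could use to align one chunk's reconstruction to the next.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if num_frames <= 0:
        return []

    stride = chunk_size - overlap  # how far each new chunk advances
    chunks = []
    start = 0
    while True:
        end = min(start + chunk_size, num_frames)
        chunks.append((start, end))
        if end >= num_frames:  # the last chunk reaches the end of the sequence
            break
        start += stride
    return chunks


def shared_frames(prev_chunk, next_chunk):
    """Frame indices that appear in both of two consecutive chunks."""
    return list(range(next_chunk[0], prev_chunk[1]))


if __name__ == "__main__":
    chunks = split_into_chunks(num_frames=10, chunk_size=4, overlap=1)
    print(chunks)
    print(shared_frames(chunks[0], chunks[1]))
```

In a full pipeline, the frames returned by `shared_frames` would be the ones whose predicted geometry is compared across neighboring chunks to estimate an alignment transform; that step, and loop closure, are left out of this sketch.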

hub tools

citation-role summary: background 4

citation-polarity summary

fields: cs.CV 28, cs.RO 6

years: 2026 32, 2025 2

roles: background 4

polarities: background 4

representative citing papers

VOCA: Visual Odometry with Codec Awareness

cs.CV · 2026-06-30 · unverdicted · novelty 6.0

VOCA is a causal stereo visual odometry system that achieves state-of-the-art performance on compressed streams by exploiting codec awareness.

RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video

cs.CV · 2026-05-29 · unverdicted · novelty 6.0

RayDer is a unified transformer backbone for self-supervised static-scene novel view synthesis that absorbs dynamic content as a nuisance factor and shows power-law scaling with data and compute while matching supervised methods in zero-shot settings.

Geometric Context Transformer for Streaming 3D Reconstruction

cs.CV · 2026-04-15 · unverdicted · novelty 6.0

LingBot-Map is a streaming 3D reconstruction model built on a geometric context transformer that combines anchor context, pose-reference window, and trajectory memory to deliver accurate, drift-resistant results at 20 FPS over sequences longer than 10,000 frames.

LIST3R: Long-sequence Instance-aware 3D Reconstruction

cs.CV · 2026-07-01 · unverdicted · novelty 5.0

LIST3R reconnects fragmented video subsequences using persistent instance anchors with semantic and geometric evidence to produce consistent global 3D reconstructions.

citing papers explorer

Showing 4 of 4 citing papers after filters.