VGGT-$\Omega$

2026 · cs.CV · arXiv 2605.15195

6 Pith papers cite this work. Polarity classification is still indexing.

abstract

Recent feed-forward reconstruction models, such as VGGT, have proven competitive with traditional optimization-based reconstructors while also providing geometry-aware features useful for other tasks. Here, we show that the quality of these models scales predictably with model and data size. We do so by introducing VGGT-$\Omega$, which substantially improves reconstruction accuracy, efficiency, and capabilities for both static and dynamic scenes. To enable training this model at an unprecedented scale, we introduce architectural changes that improve training efficiency, a high-quality data annotation pipeline that supports dynamic scenes, and a self-supervised learning protocol. We simplify VGGT's architecture by using a single dense prediction head with multi-task supervision and removing the expensive high-resolution convolutional layers. We also use registers to aggregate scene information into a compact representation and introduce register attention, which restricts inter-frame information exchange to these registers, in part replacing global attention. In this way, during training, VGGT-$\Omega$ uses only about 30% of the GPU memory of its predecessor, allowing us to train with 15x more supervised data than prior work and to leverage vast amounts of unlabeled video data. VGGT-$\Omega$ achieves strong results for reconstruction of static and dynamic scenes across multiple benchmarks, for example, improving over the previous best camera estimation accuracy on Sintel by 77%. We also show that the learned registers can improve vision-language-action models and support alignment with language, suggesting that reconstruction can be a powerful and scalable proxy task for spatial understanding. Project Page: http://vggt-omega.github.io/
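The abstract describes register attention only at a high level, so the following is a minimal sketch of one possible reading, not the authors' implementation. It assumes each frame contributes a fixed number of register tokens followed by patch tokens, that any token may attend to every token in its own frame, and that only register tokens may attend across frames. The function names `register_attention_mask` and `masked_attention`, the token layout, and the single-head NumPy attention are all illustrative assumptions.

```python
import numpy as np


def register_attention_mask(num_frames, num_registers, num_patches):
    """Boolean mask; entry [i, j] is True when token i may attend to token j.

    Assumed layout (hypothetical): frames are concatenated, and within each
    frame the register tokens come first, followed by the patch tokens.
    """
    tokens_per_frame = num_registers + num_patches
    idx = np.arange(num_frames * tokens_per_frame)
    frame_id = idx // tokens_per_frame
    is_register = (idx % tokens_per_frame) < num_registers

    # Every token sees all tokens of its own frame.
    same_frame = frame_id[:, None] == frame_id[None, :]
    # Cross-frame exchange is allowed only between register tokens.
    both_registers = is_register[:, None] & is_register[None, :]
    return same_frame | both_registers


def masked_attention(q, k, v, mask):
    """Single-head scaled dot-product attention restricted by a boolean mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    # Each row allows at least its own frame, so the row maximum is finite.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


if __name__ == "__main__":
    mask = register_attention_mask(num_frames=2, num_registers=1, num_patches=2)
    rng = np.random.default_rng(0)
    tokens = rng.standard_normal((mask.shape[0], 8))
    out = masked_attention(tokens, tokens, tokens, mask)
    print(mask.astype(int))
    print(out.shape)
```

Under these assumptions the number of allowed query-key pairs grows roughly linearly in the number of frames for patch tokens, with the quadratic cross-frame term confined to the much smaller set of registers. Whether this is the source of the memory saving reported in the abstract is not stated there.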

representative citing papers

SpatialBench: Is Your Spatial Foundation Model an All-Round Player?

cs.CV · 2026-05-26 · unverdicted · novelty 8.0

SpatialBench evaluates 41 spatial foundation models across 6 paradigms and 5 task suites, finds they are not all-round players, and introduces the DA-Next-5M dataset plus DA-Next baseline model.

Geometric Action Model for Robot Policy Learning

cs.RO · 2026-06-15 · unverdicted · novelty 6.0

GAM splits a geometric foundation model to enable language-conditioned future geometry prediction and action decoding for robot policies, claiming superior performance on manipulation benchmarks.

Modality Forcing for Scalable Spatial Generation

cs.CV · 2026-06-11 · unverdicted · novelty 6.0

Modality Forcing lets a single DiT produce image and depth outputs in any order after training on sparse real-world depth, with larger image-pretrained models yielding better depth accuracy and a 57% AbsRel reduction versus prior joint generative baselines.

Dexterity-BEV: Aligning 3D World and Actions for Generalizable Robot Policies Learning

cs.RO · 2026-06-01 · unverdicted · novelty 6.0

Dexterity-BEV creates 3D vertex-based inputs and BEV-aligned outputs to reduce spatial-temporal misalignments in end-to-end robot policies trained on diverse datasets and embodiments.

$\text{VG}^2$GT: Voxel-Gaussian Splatting Visual Geometry Grounded Transformer

cs.CV · 2026-06-01 · unverdicted · novelty 5.0

VG²GT regresses Gaussian primitive parameters from multi-scale voxel features of a frozen VFM and uses stochastic solid volume rendering for depth supervision to produce geometrically accurate reconstructions that outperform prior methods on DTU, Replica, TAT, and ScanNet.

Argus: Metric Panoramic 3D Reconstruction for Indoor Scenes

cs.CV · 2026-06-29

citing papers explorer

SpatialBench: Is Your Spatial Foundation Model an All-Round Player? cs.CV · 2026-05-26 · unverdicted · none · ref 100 · internal anchor
Modality Forcing for Scalable Spatial Generation cs.CV · 2026-06-11 · unverdicted · none · ref 43 · internal anchor
$\text{VG}^2$GT: Voxel-Gaussian Splatting Visual Geometry Grounded Transformer cs.CV · 2026-06-01 · unverdicted · none · ref 38 · internal anchor
Argus: Metric Panoramic 3D Reconstruction for Indoor Scenes cs.CV · 2026-06-29 · unreviewed · ref 85 · internal anchor
