hub Canonical reference

MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion

Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole · 2024 · cs.CV · arXiv 2410.03825

Canonical reference. 88% of citing Pith papers cite this work as background.

42 Pith papers citing it

Background 88% of classified citations

open full Pith review browse 42 citing papers arXiv PDF

abstract

Estimating geometry from dynamic scenes, where objects move and deform over time, remains a core challenge in computer vision. Current approaches often rely on multi-stage pipelines or global optimizations that decompose the problem into subtasks, like depth and flow, leading to complex systems prone to errors. In this paper, we present Motion DUSt3R (MonST3R), a novel geometry-first approach that directly estimates per-timestep geometry from dynamic scenes. Our key insight is that by simply estimating a pointmap for each timestep, we can effectively adapt DUST3R's representation, previously only used for static scenes, to dynamic scenes. However, this approach presents a significant challenge: the scarcity of suitable training data, namely dynamic, posed videos with depth labels. Despite this, we show that by posing the problem as a fine-tuning task, identifying several suitable datasets, and strategically training the model on this limited data, we can surprisingly enable the model to handle dynamics, even without an explicit motion representation. Based on this, we introduce new optimizations for several downstream video-specific tasks and demonstrate strong performance on video depth and camera pose estimation, outperforming prior work in terms of robustness and efficiency. Moreover, MonST3R shows promising results for primarily feed-forward 4D reconstruction.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 7 method 1

citation-polarity summary

background 7 use method 1

representative citing papers

Learning Global Motion with Compact Gaussians for Feed-Forward 4D Reconstruction

cs.CV · 2026-05-29 · unverdicted · novelty 7.0

C4G introduces compact timestamp-conditioned Gaussian query tokens that aggregate full temporal context to decode 3D Gaussians with timestamp-modulated positions for feed-forward 4D reconstruction from monocular video, plus a diffusion-based rendering module and extension to 4D feature fields.

Geo-Align: Video Generation Alignment via Metric Geometry Reward

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

Geo-Align applies RL with a perceptual reward derived from 3D camera trajectory estimation to improve controllability and fidelity in video generation without paired training data.

No Pose, No Problem in 4D: Feed-Forward Dynamic Gaussians from Unposed Multi-View Videos

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

NoPo4D is the first feed-forward system for dynamic 4D Gaussian splatting from unposed multi-view videos, using velocity decomposition supervised by optical flow and a bidirectional motion encoder.

Ground4D: Spatially-Grounded Feedforward 4D Reconstruction for Unstructured Off-Road Scenes

cs.CV · 2026-05-06 · unverdicted · novelty 7.0

Ground4D resolves temporal conflicts in feedforward 4D Gaussian reconstruction for off-road scenes via voxel-grounded temporal aggregation with intra-voxel softmax and surface normal regularization, outperforming prior methods on ORAD-3D and RELLIS-3D while generalizing zero-shot.

AirZoo: A Unified Large-Scale Dataset for Grounding Aerial Geometric 3D Vision

cs.CV · 2026-04-29 · unverdicted · novelty 7.0 · 2 refs

AirZoo is a new dataset covering 378 regions across 22 countries with pixel-level metric depth and 6-DoF poses, shown via benchmarks to improve SoTA models on aerial image retrieval, cross-view matching, and multi-view 3D reconstruction.

Learning 3D Reconstruction with Priors in Test Time

cs.CV · 2026-04-04 · unverdicted · novelty 7.0

Test-time constrained optimization incorporates priors into pre-trained multiview transformers via self-supervised losses and penalty terms to improve 3D reconstruction accuracy.

STAC: Plug-and-Play Spatio-Temporal Aware Cache Compression for Streaming 3D Reconstruction

cs.CV · 2026-03-18 · unverdicted · novelty 7.0

STAC compresses KV caches in streaming 3D reconstruction transformers via temporal token preservation with decayed attention, spatial voxel compression, and chunked multi-frame optimization, delivering 10x memory reduction and 4x faster inference at SOTA quality.

ZipMap: Linear-Time Stateful 3D Reconstruction via Test-Time Training

cs.CV · 2026-03-04 · unverdicted · novelty 7.0

ZipMap achieves linear-time bidirectional 3D reconstruction by zipping image collections into a compact stateful representation via test-time training layers.

3AM: 3egment Anything with Geometric Consistency in Videos

cs.CV · 2026-01-13 · unverdicted · novelty 7.0

3AM integrates MUSt3R 3D features into SAM2 via a Feature Merger and FOV-aware sampling to deliver geometry-consistent video object segmentation from RGB alone, with large gains on wide-baseline datasets.

$\pi^3$: Permutation-Equivariant Visual Geometry Learning

cs.CV · 2025-07-17 · conditional · novelty 7.0

π³ is a feed-forward network with full permutation equivariance that outputs affine-invariant poses and scale-invariant local point maps without reference frames, reaching state-of-the-art on camera pose, depth, and dense reconstruction benchmarks.

World from Motion: Generative Dynamic Gaussian Reconstruction from Monocular Video

cs.CV · 2026-07-01 · unverdicted · novelty 6.0

A generative video model conditioned on pixel-aligned 3D renderings produces consistent dynamic 3D Gaussian splats from monocular video and sets new SOTA in 4D reconstruction.

Revisiting Articulated Parts Perception in Robot Manipulation

cs.RO · 2026-06-06 · unverdicted · novelty 6.0

Proposes GPS representation for articulated parts, uses VR to annotate 41K frames across 234 objects, trains an RGB-D model, and achieves 73% success in heuristic manipulation policies on 9 objects.

Stabilizing Streaming Video Geometry via Dynamic Feature Normalization

cs.CV · 2026-05-25 · unverdicted · novelty 6.0

DyFN is a lightweight recurrent module that dynamically normalizes latent feature statistics to remove scale-shift drift and achieve state-of-the-art temporal consistency in streaming monocular geometry estimation while updating only 2% of parameters.

UniT: Unified Geometry Learning with Group Autoregressive Transformer

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

UniT unifies online and offline 3D geometry perception via a Group Autoregressive Transformer that processes observation groups with anchor-free point map prediction and a scale-adaptive loss.

Fast 4D Mesh Generation by Spatio-Temporal Attention Chains

cs.CV · 2026-05-19 · unverdicted · novelty 6.0

A training-free Spatio-Temporal Attention Chain framework accelerates 4D mesh generation 13x, improves quality, scales to 16x longer videos, and supports downstream tracking and camera estimation.

LongDPM: Overlap-Aware 4D Reconstruction from Long Monocular Videos

cs.CV · 2026-05-17 · unverdicted · novelty 6.0

LongDPM introduces an overlap-aware chunk-based framework that registers and fuses local dynamic reconstructions to achieve coherent long-range 4D geometry and tracking from monocular video.

CoGE: Sim-to-Real Online Geometric Estimation for Monocular Colonoscopy

cs.CV · 2026-05-13 · unverdicted · novelty 6.0

CoGE achieves state-of-the-art monocular geometric estimation in colonoscopy by training solely on simulated data via an illumination-aware Retinex-based module and a wavelet-based structure-aware module.

Attention Itself Could Retrieve.RetrieveVGGT: Training-Free Long Context Streaming 3D Reconstruction via Query-Key Similarity Retrieval

cs.CV · 2026-05-10 · unverdicted · novelty 6.0

RetrieveVGGT enables constant-memory long-context streaming 3D reconstruction by retrieving relevant frames via query-key similarities in VGGT's first attention layer, outperforming StreamVGGT and others.

RigidFormer: Learning Rigid Dynamics using Transformers

cs.CV · 2026-05-09 · unverdicted · novelty 6.0

RigidFormer learns mesh-free rigid dynamics from point clouds using object-centric anchors, Anchor-Vertex Pooling, Anchor-based RoPE, and differentiable Kabsch alignment to enforce rigidity.

Generative 3D Gaussians with Learned Density Control

cs.GR · 2026-05-08 · unverdicted · novelty 6.0

DeG models 3D Gaussians via learned octree density and uses VecSeq Sobol re-indexing to turn set generation into sequence modeling, claiming SOTA quality in single-image-to-3D.

Sat3R: Satellite DSM Reconstruction via RPC-Aware Depth Fine-tuning

cs.CV · 2026-05-08 · unverdicted · novelty 6.0

Sat3R adapts Depth Anything V2 via RPC-aware metric depth fine-tuning to deliver satellite DSM reconstruction with 38% lower MAE than zero-shot baselines and over 300x speedup versus optimization methods.

Ray-Aware Pointer Memory with Adaptive Updates for Streaming 3D Reconstruction

cs.CV · 2026-05-07 · unverdicted · novelty 6.0 · 3 refs

The paper proposes ray-aware pointer memory with adaptive retain-or-replace updates to improve long-term stability and pose accuracy in streaming 3D reconstruction.

Long-tail Internet photo reconstruction

cs.CV · 2026-04-24 · unverdicted · novelty 6.0

Finetuning 3D foundation models on simulated sparse subsets from MegaDepth-X produces robust reconstructions from extremely sparse, noisy internet photos while preserving performance on dense benchmarks.

Vista4D: Video Reshooting with 4D Point Clouds

cs.CV · 2026-04-23 · unverdicted · novelty 6.0

Vista4D re-synthesizes dynamic videos from new viewpoints by grounding them in a 4D point cloud built with static segmentation and multiview training.

citing papers explorer

Showing 36 of 36 citing papers after filters.

Learning Global Motion with Compact Gaussians for Feed-Forward 4D Reconstruction cs.CV · 2026-05-29 · unverdicted · none · ref 110 · internal anchor
C4G introduces compact timestamp-conditioned Gaussian query tokens that aggregate full temporal context to decode 3D Gaussians with timestamp-modulated positions for feed-forward 4D reconstruction from monocular video, plus a diffusion-based rendering module and extension to 4D feature fields.
Geo-Align: Video Generation Alignment via Metric Geometry Reward cs.CV · 2026-05-22 · unverdicted · none · ref 42 · internal anchor
Geo-Align applies RL with a perceptual reward derived from 3D camera trajectory estimation to improve controllability and fidelity in video generation without paired training data.
No Pose, No Problem in 4D: Feed-Forward Dynamic Gaussians from Unposed Multi-View Videos cs.CV · 2026-05-21 · unverdicted · none · ref 73 · internal anchor
NoPo4D is the first feed-forward system for dynamic 4D Gaussian splatting from unposed multi-view videos, using velocity decomposition supervised by optical flow and a bidirectional motion encoder.
Ground4D: Spatially-Grounded Feedforward 4D Reconstruction for Unstructured Off-Road Scenes cs.CV · 2026-05-06 · unverdicted · none · ref 59 · internal anchor
Ground4D resolves temporal conflicts in feedforward 4D Gaussian reconstruction for off-road scenes via voxel-grounded temporal aggregation with intra-voxel softmax and surface normal regularization, outperforming prior methods on ORAD-3D and RELLIS-3D while generalizing zero-shot.
AirZoo: A Unified Large-Scale Dataset for Grounding Aerial Geometric 3D Vision cs.CV · 2026-04-29 · unverdicted · none · ref 58 · 2 links · internal anchor
AirZoo is a new dataset covering 378 regions across 22 countries with pixel-level metric depth and 6-DoF poses, shown via benchmarks to improve SoTA models on aerial image retrieval, cross-view matching, and multi-view 3D reconstruction.
Learning 3D Reconstruction with Priors in Test Time cs.CV · 2026-04-04 · unverdicted · none · ref 52 · internal anchor
Test-time constrained optimization incorporates priors into pre-trained multiview transformers via self-supervised losses and penalty terms to improve 3D reconstruction accuracy.
STAC: Plug-and-Play Spatio-Temporal Aware Cache Compression for Streaming 3D Reconstruction cs.CV · 2026-03-18 · unverdicted · none · ref 50 · internal anchor
STAC compresses KV caches in streaming 3D reconstruction transformers via temporal token preservation with decayed attention, spatial voxel compression, and chunked multi-frame optimization, delivering 10x memory reduction and 4x faster inference at SOTA quality.
ZipMap: Linear-Time Stateful 3D Reconstruction via Test-Time Training cs.CV · 2026-03-04 · unverdicted · none · ref 84 · internal anchor
ZipMap achieves linear-time bidirectional 3D reconstruction by zipping image collections into a compact stateful representation via test-time training layers.
3AM: 3egment Anything with Geometric Consistency in Videos cs.CV · 2026-01-13 · unverdicted · none · ref 115 · internal anchor
3AM integrates MUSt3R 3D features into SAM2 via a Feature Merger and FOV-aware sampling to deliver geometry-consistent video object segmentation from RGB alone, with large gains on wide-baseline datasets.
World from Motion: Generative Dynamic Gaussian Reconstruction from Monocular Video cs.CV · 2026-07-01 · unverdicted · none · ref 65 · internal anchor
A generative video model conditioned on pixel-aligned 3D renderings produces consistent dynamic 3D Gaussian splats from monocular video and sets new SOTA in 4D reconstruction.
Revisiting Articulated Parts Perception in Robot Manipulation cs.RO · 2026-06-06 · unverdicted · none · ref 46 · internal anchor
Proposes GPS representation for articulated parts, uses VR to annotate 41K frames across 234 objects, trains an RGB-D model, and achieves 73% success in heuristic manipulation policies on 9 objects.
Stabilizing Streaming Video Geometry via Dynamic Feature Normalization cs.CV · 2026-05-25 · unverdicted · none · ref 58 · internal anchor
DyFN is a lightweight recurrent module that dynamically normalizes latent feature statistics to remove scale-shift drift and achieve state-of-the-art temporal consistency in streaming monocular geometry estimation while updating only 2% of parameters.
UniT: Unified Geometry Learning with Group Autoregressive Transformer cs.CV · 2026-05-20 · unverdicted · none · ref 78 · internal anchor
UniT unifies online and offline 3D geometry perception via a Group Autoregressive Transformer that processes observation groups with anchor-free point map prediction and a scale-adaptive loss.
Fast 4D Mesh Generation by Spatio-Temporal Attention Chains cs.CV · 2026-05-19 · unverdicted · none · ref 98 · internal anchor
A training-free Spatio-Temporal Attention Chain framework accelerates 4D mesh generation 13x, improves quality, scales to 16x longer videos, and supports downstream tracking and camera estimation.
LongDPM: Overlap-Aware 4D Reconstruction from Long Monocular Videos cs.CV · 2026-05-17 · unverdicted · none · ref 46 · internal anchor
LongDPM introduces an overlap-aware chunk-based framework that registers and fuses local dynamic reconstructions to achieve coherent long-range 4D geometry and tracking from monocular video.
CoGE: Sim-to-Real Online Geometric Estimation for Monocular Colonoscopy cs.CV · 2026-05-13 · unverdicted · none · ref 18 · internal anchor
CoGE achieves state-of-the-art monocular geometric estimation in colonoscopy by training solely on simulated data via an illumination-aware Retinex-based module and a wavelet-based structure-aware module.
Attention Itself Could Retrieve.RetrieveVGGT: Training-Free Long Context Streaming 3D Reconstruction via Query-Key Similarity Retrieval cs.CV · 2026-05-10 · unverdicted · none · ref 53 · internal anchor
RetrieveVGGT enables constant-memory long-context streaming 3D reconstruction by retrieving relevant frames via query-key similarities in VGGT's first attention layer, outperforming StreamVGGT and others.
RigidFormer: Learning Rigid Dynamics using Transformers cs.CV · 2026-05-09 · unverdicted · none · ref 48 · internal anchor
RigidFormer learns mesh-free rigid dynamics from point clouds using object-centric anchors, Anchor-Vertex Pooling, Anchor-based RoPE, and differentiable Kabsch alignment to enforce rigidity.
Generative 3D Gaussians with Learned Density Control cs.GR · 2026-05-08 · unverdicted · none · ref 42 · internal anchor
DeG models 3D Gaussians via learned octree density and uses VecSeq Sobol re-indexing to turn set generation into sequence modeling, claiming SOTA quality in single-image-to-3D.
Sat3R: Satellite DSM Reconstruction via RPC-Aware Depth Fine-tuning cs.CV · 2026-05-08 · unverdicted · none · ref 30 · internal anchor
Sat3R adapts Depth Anything V2 via RPC-aware metric depth fine-tuning to deliver satellite DSM reconstruction with 38% lower MAE than zero-shot baselines and over 300x speedup versus optimization methods.
Ray-Aware Pointer Memory with Adaptive Updates for Streaming 3D Reconstruction cs.CV · 2026-05-07 · unverdicted · none · ref 46 · 3 links · internal anchor
The paper proposes ray-aware pointer memory with adaptive retain-or-replace updates to improve long-term stability and pose accuracy in streaming 3D reconstruction.
Long-tail Internet photo reconstruction cs.CV · 2026-04-24 · unverdicted · none · ref 49 · internal anchor
Finetuning 3D foundation models on simulated sparse subsets from MegaDepth-X produces robust reconstructions from extremely sparse, noisy internet photos while preserving performance on dense benchmarks.
Vista4D: Video Reshooting with 4D Point Clouds cs.CV · 2026-04-23 · unverdicted · none · ref 45 · internal anchor
Vista4D re-synthesizes dynamic videos from new viewpoints by grounding them in a 4D point cloud built with static segmentation and multiview training.
Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective cs.CV · 2026-04-15 · unverdicted · none · ref 187 · internal anchor
The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temporal-aware modeling.
Self-Improving 4D Perception via Self-Distillation cs.CV · 2026-04-09 · unverdicted · none · ref 75 · internal anchor
SelfEvo enables pretrained 4D perception models to self-improve on unlabeled videos via self-distillation, delivering up to 36.5% relative gains in video depth estimation and 20.1% in camera estimation across eight benchmarks.
SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations cs.CV · 2026-04-09 · unverdicted · none · ref 69 · internal anchor
SceneScribe-1M is a new dataset of 1 million videos with semantic text, camera parameters, dense depth, and consistent 3D point tracks to support monocular depth estimation, scene reconstruction, point tracking, and text-to-video synthesis.
OVGGT: O(1) Constant-Cost Streaming Visual Geometry Transformer cs.CV · 2026-03-06 · conditional · none · ref 44 · internal anchor
OVGGT achieves constant O(1) memory and compute for streaming 3D geometry reconstruction by using FFN-residual-based KV cache compression and dynamic anchor protection, matching state-of-the-art accuracy on long sequences.
High-Fidelity 4D Hand-Object Capture via Multi-View Spatiotemporal Tracking and Physics-Aware Gaussians cs.CV · 2026-06-14 · unverdicted · none · ref 66 · internal anchor
A multi-view feed-forward transformer provides initial poses and geometry from calibrated videos, followed by physics-aware Gaussian optimization with tetrahedral and collision constraints to produce robust 4D hand-object reconstructions.
DarkVGGT: Seeing Through Darkness Using Thermal Geometry without Daylight Tax cs.CV · 2026-06-09 · unverdicted · none · ref 18 · 2 links · internal anchor
DarkVGGT introduces physics-aware thermal factorization and geometry-shared routing modules in an RGB-T feed-forward framework to improve depth and camera pose estimation under degraded RGB conditions.
$R^3$: 3D Reconstruction via Relative Regression cs.CV · 2026-05-26 · unverdicted · none · ref 77 · internal anchor
R³ uses relative regression with confidence-weighted constraints from an MLP to support long-context offline and streaming 3D reconstruction without global coordinate assumptions.
IVGT: Implicit Visual Geometry Transformer for Neural Scene Representation cs.CV · 2026-05-15 · unverdicted · none · ref 17 · 2 links · internal anchor
IVGT implicitly models continuous neural scene representations from pose-free multi-view images to enable coherent surface extraction, novel view synthesis, and related 3D tasks via SDF and color prediction.
VGGT-$\Omega$ cs.CV · 2026-05-14 · unverdicted · none · ref 218 · internal anchor
VGGT-Ω improves feed-forward reconstruction accuracy and efficiency by architectural simplifications, register-based attention, and training on much larger supervised and unlabeled video data.
WildPose: A Unified Framework for Robust Pose Estimation in the Wild cs.CV · 2026-05-12 · unverdicted · none · ref 57 · internal anchor
WildPose unifies feedforward 3D features from MASt3R with differentiable bundle adjustment for robust monocular pose estimation across dynamic, static, and low-ego-motion scenes.
LychSim: A Controllable and Interactive Simulation Framework for Vision Research cs.CV · 2026-05-12 · unverdicted · none · ref 62 · internal anchor
LychSim introduces a controllable simulation platform on Unreal Engine 5 with Python API, procedural generation, and LLM integration for vision research tasks.
DINO_4D: Semantic-Aware 4D Reconstruction cs.CV · 2026-04-10 · unverdicted · none · ref 12 · internal anchor
DINO_4D uses frozen DINOv3 features to inject semantic awareness into 4D dynamic scene reconstruction, improving tracking accuracy and completeness on benchmarks while preserving O(T) complexity.
Holo360D: A Large-Scale Real-World Dataset with Continuous Trajectories for Advancing Panoramic 3D Reconstruction and Beyond cs.CV · 2026-04-24 · unreviewed · ref 37 · internal anchor

MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer