Depth Anything 3: Recovering the Visual Space from Any Views
Mixed citation behavior. Most common role is method (42%).
abstract
We present Depth Anything 3 (DA3), a model that predicts spatially consistent geometry from an arbitrary number of visual inputs, with or without known camera poses. In pursuit of minimal modeling, DA3 yields two key insights: a single plain transformer (e.g., vanilla DINO encoder) is sufficient as a backbone without architectural specialization, and a singular depth-ray prediction target obviates the need for complex multi-task learning. Through our teacher-student training paradigm, the model achieves a level of detail and generalization on par with Depth Anything 2 (DA2). We establish a new visual geometry benchmark covering camera pose estimation, any-view geometry and visual rendering. On this benchmark, DA3 sets a new state-of-the-art across all tasks, surpassing prior SOTA VGGT by an average of 44.3% in camera pose accuracy and 25.1% in geometric accuracy. Moreover, it outperforms DA2 in monocular depth estimation. All models are trained exclusively on public academic datasets.
citing papers explorer
-
One Video, One World: Turning Monocular Video into Physical 4D Scenes
OVOW reconstructs instance-level, simulation-ready 4D mesh scenes from monocular video via a four-stage training-free pipeline and introduces a new benchmark for structured Video-to-4D evaluation.
-
Every9D-21M: Large-Scale Real-World 9D Canonicalization of Everyday Objects
Every9D-21M supplies 21.8M real-world 9D pose annotations for 700 everyday categories by propagating manual canonical poses through cross-instance alignment in object-centric videos and verifying them multiview.
-
SpatialBench: Is Your Spatial Foundation Model an All-Round Player?
SpatialBench evaluates 41 spatial foundation models across 6 paradigms and 5 task suites, finds they are not all-round players, and introduces the DA-Next-5M dataset plus DA-Next baseline model.
-
TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking
TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.
-
LIME: Learning Intent-aware Camera Motion from Egocentric Video
LIME formulates language-conditioned camera motion as predicting SE(3) target poses from RGB and intent text, using mined multi-intent supervision from egocentric video and a flow-matching pose head.
-
InvSplat: Inverse Feed-Forward Scene Splatting
InvSplat is a feed-forward multi-view model that predicts 3D Gaussians augmented with intrinsic material attributes for inverse rendering and relighting.
-
QWERTY: Training-Free Motion Control via Query-Warped Video Diffusion Transformers
QWERTY enables training-free motion control in pretrained image-to-video DiTs by warping the frame-invariant semantic subspace of queries in 3D full attention and using the predicted noise as self-guidance for latent optimization.
-
MindEdit-Bench: Benchmarking Object-Level Counterfactual Spatial Reasoning in VLMs from In-the-Wild Photos
MindEdit-Bench introduces six spatial reasoning tasks from 120 private indoor photo triplets, with two new counterfactual editing tasks where VLMs score 8-31% against 81-97% human accuracy.
-
WarpHammer: Densifying Scene Warps with 3D Object Priors for Extreme View Synthesis
WarpHammer densifies scene warps with 3D object priors from generative models and fuses pose-unknown auxiliary views via multi-view geometry to enable stable extreme novel view synthesis.
-
CasaMaestro: Multi-View Panoramas for House-Scale 3D Reconstruction
CasaMaestro predicts metric depth and poses from sparse multi-view panoramas to enable fast house-scale 3D reconstruction.
-
Walking in the Implicit: Interactive World Exploration via Neural Scene Representation
NeuWorld uses a transformer VAE to learn compact Neural Implicit Scenes from sparse posed frames and a diffusion transformer to evolve them conditioned on camera trajectories for consistent interactive exploration.
-
Pondering the Way: Spatial-perceiving World Action Model for Embodied Navigation
SWAM jointly generates intermediate RGB-D sequences and action trajectories from monocular RGB start/goal observations for embodied navigation.
-
Reasoning Text-to-Video Retrieval for Operating Room Clips via Action-Driven Digital Twins
OR3 converts OR clips to action-driven digital twins, uses LLM imagination for hypothetical ActDTs, and achieves 57.6 R@1 and 77.3 R@5 on 276 implicit queries from 386 robotic knee procedure clips, outperforming baselines.
-
World Tracing: Generative Pixel-Aligned Geometry Beyond the Visible
World Tracing introduces a multi-layer pixel-aligned 3D point representation instantiated via a diffusion transformer (WT-DiT) trained with pixel-space flow matching to jointly reconstruct visible surfaces and generate occluded geometry.
-
DepthMaster: Unified Monocular Depth Estimation for Perspective and Panoramic Images
DepthMaster unifies metric monocular depth estimation for perspective and panoramic images by patching panoramas into perspective views, adding a consistency loss and virtual cameras, and training mostly on perspective data to reach SOTA zero-shot results on 13 datasets.
-
PhysAgent: Automating Physics-Based 4D Synthesis via Trajectory-Grounded Multi-Agent Feedback
PhysAgent is a simulator-in-the-loop multi-agent system that automates physically grounded 4D synthesis from multimodal prompts by using trajectory feedback from vision models and LLM reasoning to optimize force fields.
-
ExMesh: EXplicit Mesh Reconstruction with Topology Adaptation
ExMesh introduces a framework for explicit mesh reconstruction from images that integrates adaptive topology updates into differentiable optimization while maintaining UV coordinates.
-
RigPAPR: Rig-Based Animation of Static Neural Point Clouds from a Fixed-Viewpoint Video
RigPAPR auto-rigs static PAPR point clouds and drives them via direct LBS from monocular fixed-view video, matching baselines at supervised views and exceeding them by 3+dB PSNR at novel views with cleaner joints.
-
From Pixels to Newtons: Predicting In Vivo Joint Contact Forces from Monocular Video
A transformer model predicts in vivo hip and knee contact forces from uncalibrated monocular video at accuracy matching subject-specific musculoskeletal simulations under leave-one-subject-out validation.
-
Geometry Gaussians: Decoupling Appearance and Geometry in Gaussian Splatting
A dedicated geometry opacity parameter per 3D Gaussian decouples appearance from geometry and yields better novel-view rendering plus surface reconstruction on varied datasets.
-
ZipSplat: Fewer Gaussians, Better Splats
ZipSplat uses multi-view token extraction followed by k-means clustering and attention to decode compact scene tokens into unconstrained 3D Gaussians, achieving SOTA pose-free results with ~6x fewer primitives.
-
Z-FLoc: Zero-Shot Floorplan Localization via Geometric Primitives
Z-FLoc performs zero-shot floorplan localization by matching geometric primitives from BEV projections of monocular 3D reconstructions to floorplans using dedicated minimal solvers in a robust framework.
-
Honey, I Shrunk the Arc de Triomphe!
The MetricScenes dataset, built from web photos and stereo imagery, together with a two-stage Poisson depth completion method, enables fine-tuning MoGe-2 to mitigate scale collapse in metric monocular geometry while preserving benchmark performance.
-
MBench: A Comprehensive Benchmark on Memory Capability for Video World Models
MBench is a new benchmark that quantifies long-term memory in video world models via three hierarchical consistency dimensions evaluated on curated real videos.
-
WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation
WBench is a benchmark with 289 test cases and 1,058 turns for evaluating interactive world models using 22 automated metrics validated against human judgments.
-
Geo-Align: Video Generation Alignment via Metric Geometry Reward
Geo-Align applies RL with a perceptual reward derived from 3D camera trajectory estimation to improve controllability and fidelity in video generation without paired training data.
-
GenRecon: Bridging Generative Priors for Multi-View 3D Scene Reconstruction
GenRecon lifts object-level generative priors to scene-scale reconstruction by chunking scenes and using projection-based conditioning on multi-view features, claiming 16% better results than prior methods.
-
SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation
SpaceDG is the first large-scale benchmark dataset (~1M QA pairs) simulating nine visual degradations in 3DGS-rendered scenes to measure and improve spatial intelligence robustness in MLLMs.
-
No Pose, No Problem in 4D: Feed-Forward Dynamic Gaussians from Unposed Multi-View Videos
NoPo4D is the first feed-forward system for dynamic 4D Gaussian splatting from unposed multi-view videos, using velocity decomposition supervised by optical flow and a bidirectional motion encoder.
-
Goodbye Drift: Anchored Tree Sampling for Long-Horizon Video-to-Video Generation
Anchored Tree Sampling converts horizon-compounding drift into anchor-bounded drift by organizing video generation as a sparse-to-dense tree of imputations instead of left-to-right autoregressive rollout.
-
Depth2Pose: A Pose-Based Benchmark for Monocular Depth Estimation without Ground-Truth Depth
Depth2Pose is a new evaluation framework for monocular depth estimators that uses relative camera pose accuracy as a task-driven proxy and introduces the D2P dataset of challenging out-of-distribution scenes.
-
StableHand: Quality-Aware Flow Matching for World-Space Dual-Hand Motion Estimation from Egocentric Video
StableHand introduces a quality-aware flow matching framework conditioned on predicted four-channel per-frame hand observation quality to estimate dual-hand world-space motion from egocentric video, achieving SOTA results with 20-25% W-MPJPE reduction on HOT3D and ARCTIC benchmarks.
-
Efficient Feature-Free Initialization for Monocular Visual-Inertial Systems Using a Feed-Forward 3D Model
A feature-free monocular VINS initialization method that uses feed-forward 3D model point cloud predictions achieves over 90% success rate with under 1.2 seconds of data and performs robustly in degraded environments.
-
Probing into Camera Control of Video Models
A training-free method reformulates camera control as geometric displacement fields applied via differentiable latent resampling, enabling control and bias probing in video diffusion models.
-
Revisiting Photometric Ambiguity for Accurate Gaussian-Splatting Surface Reconstruction
AmbiSuR adds intrinsic photometric disambiguation and a self-indication module to Gaussian Splatting to resolve ambiguities and improve surface reconstruction accuracy.
-
PointForward: Feedforward Driving Reconstruction through Point-Aligned Representations
PointForward uses sparse world-space 3D queries and scene graphs to deliver consistent single-pass reconstruction of dynamic driving scenes via point-aligned representations.
-
Seeing Across Skies and Streets: Feedforward 3D Reconstruction from Satellite, Drone, and Ground Images
Cross3R performs feed-forward 3D reconstruction and 6-DoF pose estimation from any combination of satellite, UAV, and ground images, outperforming baselines on a new 278K-image tri-view dataset.
-
Mix3R: Mixing Feed-forward Reconstruction and Generative 3D Priors for Joint Multi-view Aligned 3D Reconstruction and Pose Estimation
Mix3R mixes feed-forward reconstruction and generative 3D priors via Mixture-of-Transformers and overlap-based attention bias to achieve better-aligned 3D shapes and more accurate poses than either approach alone.
-
AirZoo: A Unified Large-Scale Dataset for Grounding Aerial Geometric 3D Vision
AirZoo is a new dataset covering 378 regions across 22 countries with pixel-level metric depth and 6-DoF poses, shown via benchmarks to improve SoTA models on aerial image retrieval, cross-view matching, and multi-view 3D reconstruction.
-
View-Consistent 3D Scene Editing via Dual-Path Structural Correspondence and Semantic Continuity
A dual-path consistency framework for text-driven 3D scene editing models cross-view dependencies via structural correspondence and semantic continuity, and is trained on a newly constructed paired multi-view dataset.
-
GlobalSplat: Efficient Feed-Forward 3D Gaussian Splatting via Global Scene Tokens
GlobalSplat achieves competitive novel-view synthesis on RealEstate10K and ACID using only 16K Gaussians via global scene tokens and coarse-to-fine training, with a 4MB footprint and under 78ms inference.
-
Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale
A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.
-
EgoFun3D: Modeling Interactive Objects from Egocentric Videos using Function Templates
EgoFun3D introduces a new task, a 271-video dataset, and a pipeline that uses function templates to model interactive 3D objects from egocentric videos for simulation.
-
LuMon: A Comprehensive Benchmark and Development Suite with Novel Datasets for Lunar Monocular Depth Estimation
A new benchmark with real lunar stereo ground truth and analog data shows that sim-to-real fine-tuned monocular depth models achieve large in-domain gains but minimal generalization to actual lunar images.
-
TAIHRI: Task-Aware 3D Human Keypoints Localization for Close-Range Human-Robot Interaction
TAIHRI is the first task-aware VLM for close-range HRI that localizes metric-scale 3D coordinates of critical keypoints by quantizing space and performing 2D keypoint reasoning via next-token prediction.
-
MoRight: Motion Control Done Right
MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply causality in video generation.
-
SEM-ROVER: Semantic Voxel-Guided Diffusion for Large-Scale Driving Scene Generation
SEM-ROVER generates large multiview-consistent 3D urban driving scenes via semantic-conditioned diffusion on Σ-Voxfield voxel grids with progressive outpainting and deferred rendering.
-
AnyImageNav: Any-View Geometry for Precise Last-Meter Image-Goal Navigation
AnyImageNav uses a semantic-to-geometric cascade with 3D multi-view foundation models to recover precise 6-DoF poses from goal images, achieving 0.27m position error and state-of-the-art success rates on Gibson and HM3D benchmarks.
-
MoCA3D: Monocular 3D Bounding Box Prediction in the Image Plane
MoCA3D formulates monocular 3D box prediction as dense pixel-space tasks using corner heatmaps and depth maps, with a new PAG metric for image-plane evaluation.
-
Efficient Test-Time Optimization for Depth Completion via Low-Rank Decoder Adaptation
Low-rank decoder adaptation enables efficient test-time optimization for zero-shot depth completion by updating only the subspace containing depth-relevant information.