TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.
hub Mixed citations
Depth Anything 3: Recovering the Visual Space from Any Views
Mixed citation behavior. Most common role is method (42%).
abstract
We present Depth Anything 3 (DA3), a model that predicts spatially consistent geometry from an arbitrary number of visual inputs, with or without known camera poses. In pursuit of minimal modeling, DA3 yields two key insights: a single plain transformer (e.g., vanilla DINO encoder) is sufficient as a backbone without architectural specialization, and a singular depth-ray prediction target obviates the need for complex multi-task learning. Through our teacher-student training paradigm, the model achieves a level of detail and generalization on par with Depth Anything 2 (DA2). We establish a new visual geometry benchmark covering camera pose estimation, any-view geometry and visual rendering. On this benchmark, DA3 sets a new state-of-the-art across all tasks, surpassing prior SOTA VGGT by an average of 44.3% in camera pose accuracy and 25.1% in geometric accuracy. Moreover, it outperforms DA2 in monocular depth estimation. All models are trained exclusively on public academic datasets.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract We present Depth Anything 3 (DA3), a model that predicts spatially consistent geometry from an arbitrary number of visual inputs, with or without known camera poses. In pursuit of minimal modeling, DA3 yields two key insights: a single plain transformer (e.g., vanilla DINO encoder) is sufficient as a backbone without architectural specialization, and a singular depth-ray prediction target obviates the need for complex multi-task learning. Through our teacher-student training paradigm, the model achieves a level of detail and generalization on par with Depth Anything 2 (DA2). We establish a new
co-cited works
representative citing papers
MBench is a new benchmark that quantifies long-term memory in video world models via three hierarchical consistency dimensions evaluated on curated real videos.
Geo-Align applies RL with a perceptual reward derived from 3D camera trajectory estimation to improve controllability and fidelity in video generation without paired training data.
GenRecon lifts object-level generative priors to scene-scale reconstruction by chunking scenes and using projection-based conditioning on multi-view features, claiming 16% better results than prior methods.
NoPo4D is the first feed-forward system for dynamic 4D Gaussian splatting from unposed multi-view videos, using velocity decomposition supervised by optical flow and a bidirectional motion encoder.
Anchored Tree Sampling converts horizon-compounding drift into anchor-bounded drift by organizing video generation as a sparse-to-dense tree of imputations instead of left-to-right autoregressive rollout.
Depth2Pose is a new evaluation framework for monocular depth estimators that uses relative camera pose accuracy as a task-driven proxy and introduces the D2P dataset of challenging out-of-distribution scenes.
StableHand introduces a quality-aware flow matching framework conditioned on predicted four-channel per-frame hand observation quality to estimate dual-hand world-space motion from egocentric video, achieving SOTA results with 20-25% W-MPJPE reduction on HOT3D and ARCTIC benchmarks.
A feature-free monocular VINS initialization method that uses feed-forward 3D model point cloud predictions achieves over 90% success rate with under 1.2 seconds of data and performs robustly in degraded environments.
AmbiSuR adds intrinsic photometric disambiguation and a self-indication module to Gaussian Splatting to resolve ambiguities and improve surface reconstruction accuracy.
PointForward uses sparse world-space 3D queries and scene graphs to deliver consistent single-pass reconstruction of dynamic driving scenes via point-aligned representations.
Cross3R performs feed-forward 3D reconstruction and 6-DoF pose estimation from any combination of satellite, UAV, and ground images, outperforming baselines on a new 278K-image tri-view dataset.
Mix3R mixes feed-forward reconstruction and generative 3D priors via Mixture-of-Transformers and overlap-based attention bias to achieve better-aligned 3D shapes and more accurate poses than either approach alone.
AirZoo is a new large-scale synthetic dataset for aerial 3D vision that improves state-of-the-art models on image retrieval, cross-view matching, and 3D reconstruction when used for fine-tuning.
A single transformer model jointly predicts depth and normalized canonical coordinates to deliver state-of-the-art 4D facial geometry and tracking with 3x lower correspondence error and 16% better depth accuracy.
URoPE is a parameter-free relative position embedding for transformers that works across arbitrary geometric spaces by ray sampling and projection, yielding consistent gains on novel view synthesis, 3D detection, tracking, and depth estimation.
A dual-path consistency framework for text-driven 3D scene editing that models cross-view dependencies via structural correspondence and semantic continuity, trained on a newly constructed paired multi-view dataset.
GlobalSplat achieves competitive novel-view synthesis on RealEstate10K and ACID using only 16K Gaussians via global scene tokens and coarse-to-fine training, with a 4MB footprint and under 78ms inference.
A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.
EgoFun3D creates a new task, 271-video dataset, and pipeline using function templates to model interactive 3D objects from egocentric videos for simulation.
A new benchmark with real lunar stereo ground truth and analog data shows that sim-to-real fine-tuned monocular depth models achieve large in-domain gains but minimal generalization to actual lunar images.
TAIHRI is the first task-aware VLM for close-range HRI that localizes metric-scale 3D coordinates of critical keypoints by quantizing space and performing 2D keypoint reasoning via next-token prediction.
MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply causality in video generation.
SEM-ROVER generates large multiview-consistent 3D urban driving scenes via semantic-conditioned diffusion on Σ-Voxfield voxel grids with progressive outpainting and deferred rendering.
citing papers explorer
-
TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking
TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.
-
MBench: A Comprehensive Benchmark on Memory Capability for Video World Models
MBench is a new benchmark that quantifies long-term memory in video world models via three hierarchical consistency dimensions evaluated on curated real videos.
-
Geo-Align: Video Generation Alignment via Metric Geometry Reward
Geo-Align applies RL with a perceptual reward derived from 3D camera trajectory estimation to improve controllability and fidelity in video generation without paired training data.
-
GenRecon: Bridging Generative Priors for Multi-View 3D Scene Reconstruction
GenRecon lifts object-level generative priors to scene-scale reconstruction by chunking scenes and using projection-based conditioning on multi-view features, claiming 16% better results than prior methods.
-
No Pose, No Problem in 4D: Feed-Forward Dynamic Gaussians from Unposed Multi-View Videos
NoPo4D is the first feed-forward system for dynamic 4D Gaussian splatting from unposed multi-view videos, using velocity decomposition supervised by optical flow and a bidirectional motion encoder.
-
Goodbye Drift: Anchored Tree Sampling for Long-Horizon Video-to-Video Generation
Anchored Tree Sampling converts horizon-compounding drift into anchor-bounded drift by organizing video generation as a sparse-to-dense tree of imputations instead of left-to-right autoregressive rollout.
-
Depth2Pose: A Pose-Based Benchmark for Monocular Depth Estimation without Ground-Truth Depth
Depth2Pose is a new evaluation framework for monocular depth estimators that uses relative camera pose accuracy as a task-driven proxy and introduces the D2P dataset of challenging out-of-distribution scenes.
-
StableHand: Quality-Aware Flow Matching for World-Space Dual-Hand Motion Estimation from Egocentric Video
StableHand introduces a quality-aware flow matching framework conditioned on predicted four-channel per-frame hand observation quality to estimate dual-hand world-space motion from egocentric video, achieving SOTA results with 20-25% W-MPJPE reduction on HOT3D and ARCTIC benchmarks.
-
Efficient Feature-Free Initialization for Monocular Visual-Inertial Systems Using a Feed-Forward 3D Model
A feature-free monocular VINS initialization method that uses feed-forward 3D model point cloud predictions achieves over 90% success rate with under 1.2 seconds of data and performs robustly in degraded environments.
-
Revisiting Photometric Ambiguity for Accurate Gaussian-Splatting Surface Reconstruction
AmbiSuR adds intrinsic photometric disambiguation and a self-indication module to Gaussian Splatting to resolve ambiguities and improve surface reconstruction accuracy.
-
PointForward: Feedforward Driving Reconstruction through Point-Aligned Representations
PointForward uses sparse world-space 3D queries and scene graphs to deliver consistent single-pass reconstruction of dynamic driving scenes via point-aligned representations.
-
Seeing Across Skies and Streets: Feedforward 3D Reconstruction from Satellite, Drone, and Ground Images
Cross3R performs feed-forward 3D reconstruction and 6-DoF pose estimation from any combination of satellite, UAV, and ground images, outperforming baselines on a new 278K-image tri-view dataset.
-
Mix3R: Mixing Feed-forward Reconstruction and Generative 3D Priors for Joint Multi-view Aligned 3D Reconstruction and Pose Estimation
Mix3R mixes feed-forward reconstruction and generative 3D priors via Mixture-of-Transformers and overlap-based attention bias to achieve better-aligned 3D shapes and more accurate poses than either approach alone.
-
AirZoo: A Unified Large-Scale Dataset for Grounding Aerial Geometric 3D Vision
AirZoo is a new large-scale synthetic dataset for aerial 3D vision that improves state-of-the-art models on image retrieval, cross-view matching, and 3D reconstruction when used for fine-tuning.
-
Face Anything: 4D Face Reconstruction from Any Image Sequence
A single transformer model jointly predicts depth and normalized canonical coordinates to deliver state-of-the-art 4D facial geometry and tracking with 3x lower correspondence error and 16% better depth accuracy.
-
URoPE: Universal Relative Position Embedding across Geometric Spaces
URoPE is a parameter-free relative position embedding for transformers that works across arbitrary geometric spaces by ray sampling and projection, yielding consistent gains on novel view synthesis, 3D detection, tracking, and depth estimation.
-
View-Consistent 3D Scene Editing via Dual-Path Structural Correspondense and Semantic Continuity
A dual-path consistency framework for text-driven 3D scene editing that models cross-view dependencies via structural correspondence and semantic continuity, trained on a newly constructed paired multi-view dataset.
-
GlobalSplat: Efficient Feed-Forward 3D Gaussian Splatting via Global Scene Tokens
GlobalSplat achieves competitive novel-view synthesis on RealEstate10K and ACID using only 16K Gaussians via global scene tokens and coarse-to-fine training, with a 4MB footprint and under 78ms inference.
-
Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale
A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.
-
EgoFun3D: Modeling Interactive Objects from Egocentric Videos using Function Templates
EgoFun3D creates a new task, 271-video dataset, and pipeline using function templates to model interactive 3D objects from egocentric videos for simulation.
-
LuMon: A Comprehensive Benchmark and Development Suite with Novel Datasets for Lunar Monocular Depth Estimation
A new benchmark with real lunar stereo ground truth and analog data shows that sim-to-real fine-tuned monocular depth models achieve large in-domain gains but minimal generalization to actual lunar images.
-
TAIHRI: Task-Aware 3D Human Keypoints Localization for Close-Range Human-Robot Interaction
TAIHRI is the first task-aware VLM for close-range HRI that localizes metric-scale 3D coordinates of critical keypoints by quantizing space and performing 2D keypoint reasoning via next-token prediction.
-
MoRight: Motion Control Done Right
MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply causality in video generation.
-
SEM-ROVER: Semantic Voxel-Guided Diffusion for Large-Scale Driving Scene Generation
SEM-ROVER generates large multiview-consistent 3D urban driving scenes via semantic-conditioned diffusion on Σ-Voxfield voxel grids with progressive outpainting and deferred rendering.
-
AnyImageNav: Any-View Geometry for Precise Last-Meter Image-Goal Navigation
AnyImageNav uses a semantic-to-geometric cascade with 3D multi-view foundation models to recover precise 6-DoF poses from goal images, achieving 0.27m position error and state-of-the-art success rates on Gibson and HM3D benchmarks.
-
MoCA3D: Monocular 3D Bounding Box Prediction in the Image Plane
MoCA3D formulates monocular 3D box prediction as dense pixel-space tasks using corner heatmaps and depth maps, with a new PAG metric for image-plane evaluation.
-
Efficient Test-Time Optimization for Depth Completion via Low-Rank Decoder Adaptation
Low-rank decoder adaptation enables efficient test-time optimization for zero-shot depth completion by updating only the subspace containing depth-relevant information.
-
MapAnything: Universal Feed-Forward Metric 3D Reconstruction
MapAnything is a unified feed-forward transformer that regresses metric 3D scene geometry and cameras from images using a factored representation of depth maps, ray maps, poses, and scale.
-
EgoWalk: A Multimodal Dataset for Robot Navigation in the Wild
EgoWalk supplies 50 hours of real-world multimodal human navigation data in varied indoor/outdoor settings together with open pipelines that auto-generate language goal annotations and traversability masks.
-
Cambrian-P: Pose-Grounded Video Understanding
Cambrian-P adds per-frame camera pose tokens and a regression head to video MLLMs, delivering 4.5-6.5% gains on spatial benchmarks, generalization to other video QA tasks, and SOTA streaming pose estimation on ScanNet.
-
Fast 4D Mesh Generation by Spatio-Temporal Attention Chains
A training-free Spatio-Temporal Attention Chain framework accelerates 4D mesh generation 13x, improves quality, scales to 16x longer videos, and supports downstream tracking and camera estimation.
-
PIXLRelight: Controllable Relighting via Intrinsic Conditioning
A transformer-based neural renderer that transfers arbitrary PBR lighting to single images via shared intrinsic conditioning extracted from both multi-illumination photos and path-traced coarse 3D renders.
-
Resolving Representation Ambiguity in Feedforward Novel View Synthesis Transformer via Semantic-Spatial Decoupling
Decouples semantic and spatial tokens in NVS transformers to resolve representation ambiguity, yielding consistent gains with near-zero added latency.
-
GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation
GeoFlow adds a geometry-consistency reward based on rigid camera flow and object appearance preservation, integrated via reinforcement fine-tuning to improve geometric coherence in video generation.
-
LongDPM: Overlap-Aware 4D Reconstruction from Long Monocular Videos
LongDPM introduces an overlap-aware chunk-based framework that registers and fuses local dynamic reconstructions to achieve coherent long-range 4D geometry and tracking from monocular video.
-
Unlocking Dense Metric Depth Estimation in VLMs
DepthVLM converts a standard VLM into a dense metric depth predictor by attaching a lightweight head and training under unified vision-text supervision, outperforming prior VLMs and some pure vision models on a new indoor-outdoor benchmark.
-
GeoQuery: Geometry-Query Diffusion for Sparse-View Reconstruction
GeoQuery replaces corrupted rendering features with geometry-aligned proxy queries and restricts cross-view attention to local windows, enabling robust diffusion-based refinement under extreme view sparsity.
-
UniFixer: A Universal Reference-Guided Fixer for Diffusion-Based View Synthesis
UniFixer is a universal reference-guided framework that fixes spatial, temporal, and backbone-related degradations in diffusion-based view synthesis via coarse-to-fine modules and achieves zero-shot SOTA results on novel view synthesis and stereo conversion.
-
Focusable Monocular Depth Estimation
FocusDepth is a prompt-conditioned framework that fuses SAM3 features into Depth Anything models via Multi-Scale Spatial-Aligned Fusion to improve target-region depth accuracy on the new FDE-Bench.
-
Pixal3D: Pixel-Aligned 3D Generation from Images
Pixal3D performs pixel-aligned 3D generation from images via back-projected multi-scale feature volumes, achieving fidelity close to reconstruction while supporting multi-view and scene synthesis.
-
GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth
GemDepth adds explicit camera-pose geometry embeddings and an alternating spatio-temporal transformer to produce sharper, more temporally consistent video depth maps than prior smoothing-based methods.
-
CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and trajectory accuracy on the NAVSIM v1 benchmark.
-
Geometric 4D Stitching for Grounded 4D Generation
Geometric 4D Stitching explicitly complements missing geometric regions in 4D generated scenes with grounded stitches to achieve consistent 4D representations in under 10 minutes on a single GPU.
-
Sat3R: Satellite DSM Reconstruction via RPC-Aware Depth Fine-tuning
Sat3R adapts Depth Anything V2 via RPC-aware metric depth fine-tuning to deliver satellite DSM reconstruction with 38% lower MAE than zero-shot baselines and over 300x speedup versus optimization methods.
-
Spark3R: Asymmetric Token Reduction Makes Fast Feed-Forward 3D Reconstruction
Spark3R achieves up to 28x speedup on 1000-frame 3D reconstruction inputs by asymmetrically reducing query and key-value tokens in Vision Transformers while keeping competitive quality.
-
AnchorD: Metric Grounding of Monocular Depth Using Factor Graphs
AnchorD anchors monocular depth priors in metric sensor data via patch-wise affine alignment using factor graph optimization, improving accuracy on non-Lambertian objects and introducing a new benchmark dataset with dense ground truth.
-
MeshReGen: A Unified 3D Geometry Regeneration Framework
MeshReGen introduces a conditioned 3D geometry regenerator with VecSet that learns a regeneration prior via self-supervision and reports state-of-the-art results on controllable generation tasks.
-
Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising
X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.
-
SS3D: End2End Self-Supervised 3D from Web Videos
SS3D pretrains an end-to-end feed-forward 3D estimator on filtered YouTube-8M videos via SfM self-supervision, MVS filtering, and expert distillation, delivering stronger zero-shot transfer and fine-tuning than prior self-supervised baselines.
-
MLG-Stereo: ViT Based Stereo Matching with Multi-Stage Local-Global Enhancement
MLG-Stereo adds multi-granularity feature extraction, local-global cost volumes, and guided recurrent refinement to ViT stereo matching, yielding competitive results on Middlebury, KITTI-2015, and strong results on KITTI-2012.