Depth Anything 3: Recovering the Visual Space from Any Views
56 Pith papers cite this work.
abstract
We present Depth Anything 3 (DA3), a model that predicts spatially consistent geometry from an arbitrary number of visual inputs, with or without known camera poses. In pursuit of minimal modeling, DA3 yields two key insights: a single plain transformer (e.g., vanilla DINO encoder) is sufficient as a backbone without architectural specialization, and a singular depth-ray prediction target obviates the need for complex multi-task learning. Through our teacher-student training paradigm, the model achieves a level of detail and generalization on par with Depth Anything 2 (DA2). We establish a new visual geometry benchmark covering camera pose estimation, any-view geometry and visual rendering. On this benchmark, DA3 sets a new state-of-the-art across all tasks, surpassing prior SOTA VGGT by an average of 44.3% in camera pose accuracy and 25.1% in geometric accuracy. Moreover, it outperforms DA2 in monocular depth estimation. All models are trained exclusively on public academic datasets.
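To make the abstract's "minimal modeling" claim concrete, the sketch below pairs a plain transformer backbone with a single depth-ray head: each pixel receives a positive depth value and a unit ray direction, and an arbitrary number of views is handled simply by letting self-attention run over the concatenated tokens of all views. This is a hedged illustration only, not the released DA3 code; the module names, the linear per-patch head, the token-concatenation scheme, and all dimensions are assumptions made for readability.

```python
# Minimal sketch (not the official DA3 implementation): plain ViT-style backbone plus a
# single depth-ray prediction head. Positional/view embeddings and the teacher-student
# training loop are omitted for brevity.
import torch
import torch.nn as nn

class DepthRayHead(nn.Module):
    """Maps backbone tokens to a depth value and a unit ray direction per pixel."""
    def __init__(self, dim: int, patch: int = 14):
        super().__init__()
        self.patch = patch
        # 1 channel for depth + 3 channels for the ray direction, per pixel in a patch.
        self.proj = nn.Linear(dim, patch * patch * 4)

    def forward(self, tokens: torch.Tensor, h: int, w: int):
        # tokens: (B, N, C) with N = (h / patch) * (w / patch) patch tokens.
        b, n, _ = tokens.shape
        x = self.proj(tokens)                                   # (B, N, patch*patch*4)
        x = x.view(b, h // self.patch, w // self.patch, self.patch, self.patch, 4)
        x = x.permute(0, 5, 1, 3, 2, 4).reshape(b, 4, h, w)
        depth = x[:, :1].exp()                                  # positive depth
        rays = nn.functional.normalize(x[:, 1:], dim=1)         # unit ray per pixel
        return depth, rays

class MinimalDA3LikeModel(nn.Module):
    """Plain transformer encoder + one depth-ray head; multi-view inputs are simply
    concatenated along the token axis so attention sees all views at once."""
    def __init__(self, dim: int = 768, depth: int = 12, heads: int = 12):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=14, stride=14)
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = DepthRayHead(dim)

    def forward(self, views: torch.Tensor):
        # views: (B, V, 3, H, W) — an arbitrary number V of input images.
        b, v, c, h, w = views.shape
        tok = self.patch_embed(views.flatten(0, 1)).flatten(2).transpose(1, 2)  # (B*V, N, C)
        tok = tok.view(b, -1, tok.shape[-1])        # concatenate tokens across views
        tok = self.encoder(tok)
        tok = tok.view(b * v, -1, tok.shape[-1])
        return self.head(tok, h, w)                 # per-view depth and ray maps

if __name__ == "__main__":
    model = MinimalDA3LikeModel()
    depth, rays = model(torch.randn(1, 2, 3, 224, 224))
    print(depth.shape, rays.shape)  # (2, 1, 224, 224) (2, 3, 224, 224)
```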
citing papers explorer
-
TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking
TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.
-
Revisiting Photometric Ambiguity for Accurate Gaussian-Splatting Surface Reconstruction
AmbiSuR adds intrinsic photometric disambiguation and a self-indication module to Gaussian Splatting to resolve ambiguities and improve surface reconstruction accuracy.
-
PointForward: Feedforward Driving Reconstruction through Point-Aligned Representations
PointForward uses sparse world-space 3D queries and scene graphs to deliver consistent single-pass reconstruction of dynamic driving scenes via point-aligned representations.
-
Seeing Across Skies and Streets: Feedforward 3D Reconstruction from Satellite, Drone, and Ground Images
Cross3R performs feed-forward 3D reconstruction and 6-DoF pose estimation from any combination of satellite, UAV, and ground images, outperforming baselines on a new 278K-image tri-view dataset.
-
Mix3R: Mixing Feed-forward Reconstruction and Generative 3D Priors for Joint Multi-view Aligned 3D Reconstruction and Pose Estimation
Mix3R mixes feed-forward reconstruction and generative 3D priors via Mixture-of-Transformers and overlap-based attention bias to achieve better-aligned 3D shapes and more accurate poses than either approach alone.
-
AirZoo: A Unified Large-Scale Dataset for Grounding Aerial Geometric 3D Vision
AirZoo is a new large-scale synthetic dataset for aerial 3D vision that improves state-of-the-art models on image retrieval, cross-view matching, and 3D reconstruction when used for fine-tuning.
-
Face Anything: 4D Face Reconstruction from Any Image Sequence
A single transformer model jointly predicts depth and normalized canonical coordinates to deliver state-of-the-art 4D facial geometry and tracking with 3x lower correspondence error and 16% better depth accuracy.
-
URoPE: Universal Relative Position Embedding across Geometric Spaces
URoPE is a parameter-free relative position embedding for transformers that works across arbitrary geometric spaces by ray sampling and projection, yielding consistent gains on novel view synthesis, 3D detection, tracking, and depth estimation.
-
View-Consistent 3D Scene Editing via Dual-Path Structural Correspondence and Semantic Continuity
A dual-path consistency framework for text-driven 3D scene editing that models cross-view dependencies via structural correspondence and semantic continuity, trained on a newly constructed paired multi-view dataset.
-
GlobalSplat: Efficient Feed-Forward 3D Gaussian Splatting via Global Scene Tokens
GlobalSplat achieves competitive novel-view synthesis on RealEstate10K and ACID using only 16K Gaussians via global scene tokens and coarse-to-fine training, with a 4MB footprint and under 78ms inference.
-
Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale
A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.
-
EgoFun3D: Modeling Interactive Objects from Egocentric Videos using Function Templates
EgoFun3D creates a new task, 271-video dataset, and pipeline using function templates to model interactive 3D objects from egocentric videos for simulation.
-
LuMon: A Comprehensive Benchmark and Development Suite with Novel Datasets for Lunar Monocular Depth Estimation
A new benchmark with real lunar stereo ground truth and analog data shows that sim-to-real fine-tuned monocular depth models achieve large in-domain gains but minimal generalization to actual lunar images.
-
TAIHRI: Task-Aware 3D Human Keypoints Localization for Close-Range Human-Robot Interaction
TAIHRI is the first task-aware VLM for close-range HRI that localizes metric-scale 3D coordinates of critical keypoints by quantizing space and performing 2D keypoint reasoning via next-token prediction.
-
MoRight: Motion Control Done Right
MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply causality in video generation.
-
SEM-ROVER: Semantic Voxel-Guided Diffusion for Large-Scale Driving Scene Generation
SEM-ROVER generates large multiview-consistent 3D urban driving scenes via semantic-conditioned diffusion on Σ-Voxfield voxel grids with progressive outpainting and deferred rendering.
-
AnyImageNav: Any-View Geometry for Precise Last-Meter Image-Goal Navigation
AnyImageNav uses a semantic-to-geometric cascade with 3D multi-view foundation models to recover precise 6-DoF poses from goal images, achieving 0.27m position error and state-of-the-art success rates on Gibson and HM3D benchmarks.
-
MapAnything: Universal Feed-Forward Metric 3D Reconstruction
MapAnything is a unified feed-forward transformer that regresses metric 3D scene geometry and cameras from images using a factored representation of depth maps, ray maps, poses, and scale.
-
GeoQuery: Geometry-Query Diffusion for Sparse-View Reconstruction
GeoQuery replaces corrupted rendering features with geometry-aligned proxy queries and restricts cross-view attention to local windows, enabling robust diffusion-based refinement under extreme view sparsity.
-
GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization
GuidedVLA improves VLA success rates by manually supervising separate attention heads in the action decoder with auxiliary signals for task-relevant factors.
-
UniFixer: A Universal Reference-Guided Fixer for Diffusion-Based View Synthesis
UniFixer is a universal reference-guided framework that fixes spatial, temporal, and backbone-related degradations in diffusion-based view synthesis via coarse-to-fine modules and achieves zero-shot SOTA results on novel view synthesis and stereo conversion.
-
Focusable Monocular Depth Estimation
FocusDepth is a prompt-conditioned framework that fuses SAM3 features into Depth Anything models via Multi-Scale Spatial-Aligned Fusion to improve target-region depth accuracy on the new FDE-Bench.
-
Pixal3D: Pixel-Aligned 3D Generation from Images
Pixal3D performs pixel-aligned 3D generation from images via back-projected multi-scale feature volumes, achieving fidelity close to reconstruction while supporting multi-view and scene synthesis.
-
GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth
GemDepth predicts inter-frame camera poses to inject geometric embeddings into a spatio-temporal transformer, yielding state-of-the-art 3D-consistent video depth.
-
CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and trajectory accuracy on the NAVSIM v1 benchmark.
-
Geometric 4D Stitching for Grounded 4D Generation
Geometric 4D Stitching explicitly fills in missing geometric regions of generated 4D scenes with grounded stitches, achieving consistent 4D representations in under 10 minutes on a single GPU.
-
Sat3R: Satellite DSM Reconstruction via RPC-Aware Depth Fine-tuning
Sat3R adapts Depth Anything V2 via RPC-aware metric depth fine-tuning to deliver satellite DSM reconstruction with 38% lower MAE than zero-shot baselines and over 300x speedup versus optimization methods.
-
Spark3R: Asymmetric Token Reduction Makes Fast Feed-Forward 3D Reconstruction
Asymmetric token reduction, with distinct merging for queries and pruning for key-values plus layer-wise adaptation, delivers up to 28x speedup on 1000-frame 3D reconstruction inputs while preserving competitive quality.
-
AnchorD: Metric Grounding of Monocular Depth Using Factor Graphs
AnchorD anchors monocular depth priors in metric sensor data via patch-wise affine alignment using factor graph optimization, improving accuracy on non-Lambertian objects and introducing a new benchmark dataset with dense ground truth.
-
3D-ReGen: A Unified 3D Geometry Regeneration Framework
3D-ReGen is a conditioned 3D regenerator using VecSet that learns a regeneration prior from unlabeled 3D datasets via self-supervised tasks and achieves state-of-the-art results on controllable 3D geometry tasks.
-
Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising
X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.
-
Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation
MoT-HRA learns embodiment-agnostic human-intention priors from the HA-2.2M dataset of 2.2M human video episodes through a three-expert hierarchy to improve robotic motion plausibility and robustness under distribution shift.
-
SS3D: End2End Self-Supervised 3D from Web Videos
SS3D pretrains an end-to-end feed-forward 3D estimator on filtered YouTube-8M videos via SfM self-supervision, MVS filtering, and expert distillation, delivering stronger zero-shot transfer and fine-tuning than prior self-supervised baselines.
-
MLG-Stereo: ViT Based Stereo Matching with Multi-Stage Local-Global Enhancement
MLG-Stereo adds multi-granularity feature extraction, local-global cost volumes, and guided recurrent refinement to ViT stereo matching, yielding competitive results on Middlebury and KITTI-2015 and strong results on KITTI-2012.
-
FurnSet: Exploiting Repeats for 3D Scene Reconstruction
FurnSet improves single-view 3D scene reconstruction by using per-object CLS tokens and set-aware self-attention to group and jointly reconstruct repeated object instances, with added scene-object conditioning and layout optimization.
-
Enhancing Glass Surface Reconstruction via Depth Prior for Robot Navigation
A training-free RANSAC-based fusion of depth foundation model priors with sensor data recovers accurate metric depth on glass, supported by a new GlassRecon RGB-D dataset with derived ground truth (a sketch of the general prior-to-sensor alignment idea follows this list).
-
VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects
VEFX-Bench releases a large human-labeled video editing dataset, a multi-dimensional reward model, and a standardized benchmark that better matches human judgments than generic evaluators.
-
Geometric Context Transformer for Streaming 3D Reconstruction
LingBot-Map is a streaming 3D reconstruction model built on a geometric context transformer that combines anchor context, pose-reference window, and trajectory memory to deliver accurate, drift-resistant results at 20 FPS over sequences longer than 10,000 frames.
-
Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective
The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temporal-aware modeling.
-
Lyra 2.0: Explorable Generative 3D Worlds
Lyra 2.0 produces persistent 3D-consistent video sequences for large explorable worlds by using per-frame geometry for information routing and self-augmented training to correct temporal drift.
-
Grasp in Gaussians: Fast Monocular Reconstruction of Dynamic Hand-Object Interactions
GraG reconstructs dynamic 3D hand-object interactions from monocular video 6.4x faster than prior work by using compact Sum-of-Gaussians tracking initialized from large models and refined with 2D losses.
-
Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories
A video diffusion model learns a joint distribution over videos and camera trajectories by representing cameras as pixel-aligned ray encodings (raxels) denoised jointly with video frames via decoupled attention.
-
Self-Improving 4D Perception via Self-Distillation
SelfEvo enables pretrained 4D perception models to self-improve on unlabeled videos via self-distillation, delivering up to 36.5% relative gains in video depth estimation and 20.1% in camera estimation across eight benchmarks.
-
PhyEdit: Towards Real-World Object Manipulation via Physically-Grounded Image Editing
PhyEdit improves physical accuracy in image object manipulation by using explicit geometric simulation as 3D-aware guidance combined with joint 2D-3D supervision.
-
LSRM: High-Fidelity Object-Centric Reconstruction via Scaled Context Windows
LSRM scales transformer context windows with native sparse attention and geometric routing to deliver high-fidelity feed-forward 3D reconstruction and inverse rendering that approaches dense optimization quality.
-
LoMa: Local Feature Matching Revisited
Scaling data, model size, and compute for local feature matching produces large performance gains on challenging benchmarks, accompanied by a new manually annotated HardMatch dataset.
-
Thinking with Novel Views: A Systematic Analysis of Generative-Augmented Spatial Intelligence
Integrating generative novel-view synthesis into LMM reasoning loops improves accuracy on spatial subtasks by 1.3 to 3.9 percentage points across multiple models and tasks.
-
Syn4D: A Multiview Synthetic 4D Dataset
Syn4D is a new multiview synthetic 4D dataset supplying dense ground-truth annotations for dynamic scene reconstruction, tracking, and human pose estimation.
-
Context Unrolling in Omni Models
Omni is a multimodal model whose native training on diverse data types enables context unrolling, allowing explicit reasoning across modalities to better approximate shared knowledge and improve downstream performance.
-
MonoEM-GS: Monocular Expectation-Maximization Gaussian Splatting SLAM
MonoEM-GS stabilizes view-dependent geometry from foundation models inside a global Gaussian Splatting representation via EM and adds multi-modal features for in-place open-set segmentation.
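As a closing illustration of one recurring idea in the list above (referenced from the GlassRecon-style entry), the sketch below shows the generic training-free recipe of aligning a relative monocular depth prior to sparse metric sensor depth with a robust scale-and-shift fit. It is a minimal example under stated assumptions (a 2-point RANSAC loop, a relative inlier threshold, a synthetic toy scene, and the helper name ransac_scale_shift), not the cited paper's exact algorithm.

```python
# Hedged sketch: robustly fit metric ≈ s * prior + t between a monocular depth prior
# and metric sensor depth, ignoring outlier regions (e.g. wrong returns on glass).
import numpy as np

def ransac_scale_shift(prior: np.ndarray, sensor: np.ndarray, valid: np.ndarray,
                       iters: int = 500, thresh: float = 0.05, seed: int = 0):
    """2-point RANSAC over valid pixels, followed by a least-squares refit on inliers."""
    rng = np.random.default_rng(seed)
    p = prior[valid].ravel()
    m = sensor[valid].ravel()
    best_count = -1
    keep = np.ones(p.size, dtype=bool)               # fallback: use all valid pixels
    for _ in range(iters):
        i, j = rng.choice(p.size, size=2, replace=False)
        if abs(p[i] - p[j]) < 1e-6:
            continue
        s = (m[i] - m[j]) / (p[i] - p[j])             # scale/shift from two samples
        t = m[i] - s * p[i]
        inliers = np.abs(s * p + t - m) < thresh * np.maximum(m, 1e-6)
        if inliers.sum() > best_count:
            best_count, keep = inliers.sum(), inliers
    # Refit scale and shift on the best inlier set.
    A = np.stack([p[keep], np.ones(int(keep.sum()))], axis=1)
    s, t = np.linalg.lstsq(A, m[keep], rcond=None)[0]
    return float(s), float(t)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    prior = rng.random((120, 160))                    # relative depth prior in [0, 1)
    sensor = 2.0 * prior + 0.5                        # pretend metric sensor depth
    sensor[40:60, 40:60] = 0.1                        # corrupted region (e.g. a glass pane)
    valid = sensor > 0
    s, t = ransac_scale_shift(prior, sensor, valid)
    print(round(s, 3), round(t, 3))                   # ≈ 2.0 0.5
```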