Depth Anything 3: Recovering the Visual Space from Any Views

Donny Y. Chen, Guang Shi, Haotong Lin, Junhao Liew, Sili Chen, Zhenyu Li · 2025 · cs.CV · arXiv 2511.10647

Mixed citation behavior. Most common role is method (42%).

205 Pith papers citing it

Method 42% of classified citations

abstract

We present Depth Anything 3 (DA3), a model that predicts spatially consistent geometry from an arbitrary number of visual inputs, with or without known camera poses. In pursuit of minimal modeling, DA3 yields two key insights: a single plain transformer (e.g., vanilla DINO encoder) is sufficient as a backbone without architectural specialization, and a singular depth-ray prediction target obviates the need for complex multi-task learning. Through our teacher-student training paradigm, the model achieves a level of detail and generalization on par with Depth Anything 2 (DA2). We establish a new visual geometry benchmark covering camera pose estimation, any-view geometry and visual rendering. On this benchmark, DA3 sets a new state-of-the-art across all tasks, surpassing prior SOTA VGGT by an average of 44.3% in camera pose accuracy and 25.1% in geometric accuracy. Moreover, it outperforms DA2 in monocular depth estimation. All models are trained exclusively on public academic datasets.

citation-role summary

background: 13 · method: 13 · baseline: 4 · dataset: 1

citation-polarity summary

use method: 13 · background: 12 · baseline: 4 · unclear: 1 · use dataset: 1

representative citing papers

One Video, One World: Turning Monocular Video into Physical 4D Scenes

cs.CV · 2026-06-30 · unverdicted · novelty 8.0

OVOW reconstructs instance-level, simulation-ready 4D mesh scenes from monocular video via a four-stage training-free pipeline and introduces a new benchmark for structured Video-to-4D evaluation.

Every9D-21M: Large-Scale Real-World 9D Canonicalization of Everyday Objects

cs.CV · 2026-05-27 · conditional · novelty 8.0

Every9D-21M supplies 21.8M real-world 9D pose annotations for 700 everyday categories by propagating manual canonical poses through cross-instance alignment in object-centric videos and verifying them multiview.

SpatialBench: Is Your Spatial Foundation Model an All-Round Player?

cs.CV · 2026-05-26 · unverdicted · novelty 8.0

SpatialBench evaluates 41 spatial foundation models across 6 paradigms and 5 task suites, finds they are not all-round players, and introduces the DA-Next-5M dataset plus DA-Next baseline model.

TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking

cs.CV · 2026-05-12 · unverdicted · novelty 8.0

TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.

LIME: Learning Intent-aware Camera Motion from Egocentric Video

cs.RO · 2026-07-02 · unverdicted · novelty 7.0

LIME formulates language-conditioned camera motion as predicting SE(3) target poses from RGB and intent text, using mined multi-intent supervision from egocentric video and a flow-matching pose head.

InvSplat: Inverse Feed-Forward Scene Splatting

cs.CV · 2026-07-02 · unverdicted · novelty 7.0

InvSplat is a feed-forward multi-view model that predicts 3D Gaussians augmented with intrinsic material attributes for inverse rendering and relighting.

QWERTY: Training-Free Motion Control via Query-Warped Video Diffusion Transformers

cs.CV · 2026-07-02 · unverdicted · novelty 7.0

QWERTY enables training-free motion control in pretrained image-to-video DiTs by warping the frame-invariant semantic subspace of queries in 3D full attention and using the predicted noise as self-guidance for latent optimization.

MindEdit-Bench: Benchmarking Object-Level Counterfactual Spatial Reasoning in VLMs from In-the-Wild Photos

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

MindEdit-Bench introduces six spatial reasoning tasks from 120 private indoor photo triplets, with two new counterfactual editing tasks where VLMs score 8-31% against 81-97% human accuracy.

WarpHammer: Densifying Scene Warps with 3D Object Priors for Extreme View Synthesis

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

WarpHammer densifies scene warps with 3D object priors from generative models and fuses pose-unknown auxiliary views via multi-view geometry to enable stable extreme novel view synthesis.

CasaMaestro: Multi-View Panoramas for House-Scale 3D Reconstruction

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

CasaMaestro predicts metric depth and poses from sparse multi-view panoramas to enable fast house-scale 3D reconstruction.

Walking in the Implicit: Interactive World Exploration via Neural Scene Representation

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

NeuWorld uses a transformer VAE to learn compact Neural Implicit Scenes from sparse posed frames and a diffusion transformer to evolve them conditioned on camera trajectories for consistent interactive exploration.

Pondering the Way: Spatial-perceiving World Action Model for Embodied Navigation

cs.RO · 2026-06-29 · unverdicted · novelty 7.0

SWAM jointly generates intermediate RGB-D sequences and action trajectories from monocular RGB start/goal observations for embodied navigation.

SATURN: Symbolic Spatial Reasoning for Multi-Perspective Grounding

cs.CV · 2026-06-21 · unverdicted · novelty 7.0

SATURN reconstructs approximate 3D scenes, derives soft perspective-aware predicates, and executes them symbolically to achieve stable performance on complex multi-perspective spatial grounding tasks where VLMs degrade.

From Uncertainty to Stability and Fidelity: Guiding Sparse-View 3D Gaussian Splatting with Fisher Information

cs.CV · 2026-06-18 · unverdicted · novelty 7.0

Introduces Fisher Information-guided stereo augmentation and uncertainty-aware regularization to mitigate overfitting in sparse-view 3D Gaussian Splatting.

Reasoning Text-to-Video Retrieval for Operating Room Clips via Action-Driven Digital Twins

cs.CV · 2026-06-15 · conditional · novelty 7.0

OR3 converts OR clips to action-driven digital twins, uses LLM imagination for hypothetical ActDTs, and achieves 57.6 R@1 and 77.3 R@5 on 276 implicit queries from 386 robotic knee procedure clips, outperforming baselines.

World Tracing: Generative Pixel-Aligned Geometry Beyond the Visible

cs.CV · 2026-06-11 · unverdicted · novelty 7.0

World Tracing introduces a multi-layer pixel-aligned 3D point representation instantiated via a diffusion transformer (WT-DiT) trained with pixel-space flow matching to jointly reconstruct visible surfaces and generate occluded geometry.

DepthMaster: Unified Monocular Depth Estimation for Perspective and Panoramic Images

cs.CV · 2026-06-10 · unverdicted · novelty 7.0

DepthMaster unifies metric monocular depth estimation for perspective and panoramic images by patching panoramas into perspective views, adding a consistency loss and virtual cameras, and training mostly on perspective data to reach SOTA zero-shot results on 13 datasets.

PhysAgent: Automating Physics-Based 4D Synthesis via Trajectory-Grounded Multi-Agent Feedback

cs.RO · 2026-06-07 · unverdicted · novelty 7.0

PhysAgent is a simulator-in-the-loop multi-agent system that automates physically grounded 4D synthesis from multimodal prompts by using trajectory feedback from vision models and LLM reasoning to optimize force fields.

ExMesh: EXplicit Mesh Reconstruction with Topology Adaptation

cs.CV · 2026-06-05 · unverdicted · novelty 7.0

ExMesh introduces a framework for explicit mesh reconstruction from images that integrates adaptive topology updates into differentiable optimization while maintaining UV coordinates.

RigPAPR: Rig-Based Animation of Static Neural Point Clouds from a Fixed-Viewpoint Video

cs.CV · 2026-06-04 · unverdicted · novelty 7.0

RigPAPR auto-rigs static PAPR point clouds and drives them via direct LBS from monocular fixed-view video, matching baselines at supervised views and exceeding them by 3+dB PSNR at novel views with cleaner joints.

From Pixels to Newtons: Predicting In Vivo Joint Contact Forces from Monocular Video

cs.CV · 2026-06-04 · unverdicted · novelty 7.0

A transformer model predicts in vivo hip and knee contact forces from uncalibrated monocular video at accuracy matching subject-specific musculoskeletal simulations under leave-one-subject-out validation.

Geometry Gaussians: Decoupling Appearance and Geometry in Gaussian Splatting

cs.GR · 2026-06-03 · unverdicted · novelty 7.0

A dedicated geometry opacity parameter per 3D Gaussian decouples appearance from geometry and yields better novel-view rendering plus surface reconstruction on varied datasets.

ZipSplat: Fewer Gaussians, Better Splats

cs.CV · 2026-06-03 · unverdicted · novelty 7.0

ZipSplat uses multi-view token extraction followed by k-means clustering and attention to decode compact scene tokens into unconstrained 3D Gaussians, achieving SOTA pose-free results with ~6x fewer primitives.

Z-FLoc: Zero-Shot Floorplan Localization via Geometric Primitives

cs.CV · 2026-06-03 · unverdicted · novelty 7.0

Z-FLoc performs zero-shot floorplan localization by matching geometric primitives from BEV projections of monocular 3D reconstructions to floorplans using dedicated minimal solvers in a robust framework.

citing papers explorer

Showing 50 of 205 citing papers.

One Video, One World: Turning Monocular Video into Physical 4D Scenes cs.CV · 2026-06-30 · unverdicted · none · ref 48 · internal anchor
OVOW reconstructs instance-level, simulation-ready 4D mesh scenes from monocular video via a four-stage training-free pipeline and introduces a new benchmark for structured Video-to-4D evaluation.
Every9D-21M: Large-Scale Real-World 9D Canonicalization of Everyday Objects cs.CV · 2026-05-27 · conditional · none · ref 25 · internal anchor
Every9D-21M supplies 21.8M real-world 9D pose annotations for 700 everyday categories by propagating manual canonical poses through cross-instance alignment in object-centric videos and verifying them multiview.
SpatialBench: Is Your Spatial Foundation Model an All-Round Player? cs.CV · 2026-05-26 · unverdicted · none · ref 57 · internal anchor
SpatialBench evaluates 41 spatial foundation models across 6 paradigms and 5 task suites, finds they are not all-round players, and introduces the DA-Next-5M dataset plus DA-Next baseline model.
TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking cs.CV · 2026-05-12 · unverdicted · none · ref 46 · internal anchor
TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.
LIME: Learning Intent-aware Camera Motion from Egocentric Video cs.RO · 2026-07-02 · unverdicted · none · ref 63 · internal anchor
LIME formulates language-conditioned camera motion as predicting SE(3) target poses from RGB and intent text, using mined multi-intent supervision from egocentric video and a flow-matching pose head.
InvSplat: Inverse Feed-Forward Scene Splatting cs.CV · 2026-07-02 · unverdicted · none · ref 13 · internal anchor
InvSplat is a feed-forward multi-view model that predicts 3D Gaussians augmented with intrinsic material attributes for inverse rendering and relighting.
QWERTY: Training-Free Motion Control via Query-Warped Video Diffusion Transformers cs.CV · 2026-07-02 · unverdicted · none · ref 23 · internal anchor
QWERTY enables training-free motion control in pretrained image-to-video DiTs by warping the frame-invariant semantic subspace of queries in 3D full attention and using the predicted noise as self-guidance for latent optimization.
MindEdit-Bench: Benchmarking Object-Level Counterfactual Spatial Reasoning in VLMs from In-the-Wild Photos cs.CV · 2026-07-01 · unverdicted · none · ref 46 · internal anchor
MindEdit-Bench introduces six spatial reasoning tasks from 120 private indoor photo triplets, with two new counterfactual editing tasks where VLMs score 8-31% against 81-97% human accuracy.
WarpHammer: Densifying Scene Warps with 3D Object Priors for Extreme View Synthesis cs.CV · 2026-06-30 · unverdicted · none · ref 27 · internal anchor
WarpHammer densifies scene warps with 3D object priors from generative models and fuses pose-unknown auxiliary views via multi-view geometry to enable stable extreme novel view synthesis.
CasaMaestro: Multi-View Panoramas for House-Scale 3D Reconstruction cs.CV · 2026-06-30 · unverdicted · none · ref 15 · internal anchor
CasaMaestro predicts metric depth and poses from sparse multi-view panoramas to enable fast house-scale 3D reconstruction.
Walking in the Implicit: Interactive World Exploration via Neural Scene Representation cs.CV · 2026-06-29 · unverdicted · none · ref 59 · internal anchor
NeuWorld uses a transformer VAE to learn compact Neural Implicit Scenes from sparse posed frames and a diffusion transformer to evolve them conditioned on camera trajectories for consistent interactive exploration.
Pondering the Way: Spatial-perceiving World Action Model for Embodied Navigation cs.RO · 2026-06-29 · unverdicted · none · ref 21 · internal anchor
SWAM jointly generates intermediate RGB-D sequences and action trajectories from monocular RGB start/goal observations for embodied navigation.
SATURN: Symbolic Spatial Reasoning for Multi-Perspective Grounding cs.CV · 2026-06-21 · unverdicted · none · ref 4 · internal anchor
SATURN reconstructs approximate 3D scenes, derives soft perspective-aware predicates, and executes them symbolically to achieve stable performance on complex multi-perspective spatial grounding tasks where VLMs degrade.
From Uncertainty to Stability and Fidelity: Guiding Sparse-View 3D Gaussian Splatting with Fisher Information cs.CV · 2026-06-18 · unverdicted · none · ref 28 · internal anchor
Introduces Fisher Information-guided stereo augmentation and uncertainty-aware regularization to mitigate overfitting in sparse-view 3D Gaussian Splatting.
Reasoning Text-to-Video Retrieval for Operating Room Clips via Action-Driven Digital Twins cs.CV · 2026-06-15 · conditional · none · ref 12 · internal anchor
OR3 converts OR clips to action-driven digital twins, uses LLM imagination for hypothetical ActDTs, and achieves 57.6 R@1 and 77.3 R@5 on 276 implicit queries from 386 robotic knee procedure clips, outperforming baselines.
World Tracing: Generative Pixel-Aligned Geometry Beyond the Visible cs.CV · 2026-06-11 · unverdicted · none · ref 34 · internal anchor
World Tracing introduces a multi-layer pixel-aligned 3D point representation instantiated via a diffusion transformer (WT-DiT) trained with pixel-space flow matching to jointly reconstruct visible surfaces and generate occluded geometry.
DepthMaster: Unified Monocular Depth Estimation for Perspective and Panoramic Images cs.CV · 2026-06-10 · unverdicted · none · ref 47 · internal anchor
DepthMaster unifies metric monocular depth estimation for perspective and panoramic images by patching panoramas into perspective views, adding a consistency loss and virtual cameras, and training mostly on perspective data to reach SOTA zero-shot results on 13 datasets.
PhysAgent: Automating Physics-Based 4D Synthesis via Trajectory-Grounded Multi-Agent Feedback cs.RO · 2026-06-07 · unverdicted · none · ref 15 · internal anchor
PhysAgent is a simulator-in-the-loop multi-agent system that automates physically grounded 4D synthesis from multimodal prompts by using trajectory feedback from vision models and LLM reasoning to optimize force fields.
ExMesh: EXplicit Mesh Reconstruction with Topology Adaptation cs.CV · 2026-06-05 · unverdicted · none · ref 27 · internal anchor
ExMesh introduces a framework for explicit mesh reconstruction from images that integrates adaptive topology updates into differentiable optimization while maintaining UV coordinates.
RigPAPR: Rig-Based Animation of Static Neural Point Clouds from a Fixed-Viewpoint Video cs.CV · 2026-06-04 · unverdicted · none · ref 27 · internal anchor
RigPAPR auto-rigs static PAPR point clouds and drives them via direct LBS from monocular fixed-view video, matching baselines at supervised views and exceeding them by 3+dB PSNR at novel views with cleaner joints.
From Pixels to Newtons: Predicting In Vivo Joint Contact Forces from Monocular Video cs.CV · 2026-06-04 · unverdicted · none · ref 42 · internal anchor
A transformer model predicts in vivo hip and knee contact forces from uncalibrated monocular video at accuracy matching subject-specific musculoskeletal simulations under leave-one-subject-out validation.
Geometry Gaussians: Decoupling Appearance and Geometry in Gaussian Splatting cs.GR · 2026-06-03 · unverdicted · none · ref 18 · internal anchor
A dedicated geometry opacity parameter per 3D Gaussian decouples appearance from geometry and yields better novel-view rendering plus surface reconstruction on varied datasets.
ZipSplat: Fewer Gaussians, Better Splats cs.CV · 2026-06-03 · unverdicted · none · ref 19 · internal anchor
ZipSplat uses multi-view token extraction followed by k-means clustering and attention to decode compact scene tokens into unconstrained 3D Gaussians, achieving SOTA pose-free results with ~6x fewer primitives.
Z-FLoc: Zero-Shot Floorplan Localization via Geometric Primitives cs.CV · 2026-06-03 · unverdicted · none · ref 27 · internal anchor
Z-FLoc performs zero-shot floorplan localization by matching geometric primitives from BEV projections of monocular 3D reconstructions to floorplans using dedicated minimal solvers in a robust framework.
Honey, I Shrunk the Arc de Triomphe! cs.CV · 2026-06-01 · unverdicted · none · ref 20 · 2 links · internal anchor
MetricScenes dataset from web photos and stereo imagery, plus a two-stage Poisson depth completion method, allows fine-tuning MoGe-2 to mitigate scale-collapse in metric monocular geometry while preserving benchmark performance.
MBench: A Comprehensive Benchmark on Memory Capability for Video World Models cs.CV · 2026-05-30 · unverdicted · none · ref 46 · internal anchor
MBench is a new benchmark that quantifies long-term memory in video world models via three hierarchical consistency dimensions evaluated on curated real videos.
WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation cs.CV · 2026-05-25 · unverdicted · none · ref 70 · internal anchor
WBench is a benchmark with 289 test cases and 1,058 turns for evaluating interactive world models using 22 automated metrics validated against human judgments.
Geo-Align: Video Generation Alignment via Metric Geometry Reward cs.CV · 2026-05-22 · unverdicted · none · ref 45 · internal anchor
Geo-Align applies RL with a perceptual reward derived from 3D camera trajectory estimation to improve controllability and fidelity in video generation without paired training data.
GenRecon: Bridging Generative Priors for Multi-View 3D Scene Reconstruction cs.CV · 2026-05-22 · unverdicted · none · ref 13 · internal anchor
GenRecon lifts object-level generative priors to scene-scale reconstruction by chunking scenes and using projection-based conditioning on multi-view features, claiming 16% better results than prior methods.
SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation cs.CV · 2026-05-21 · unverdicted · none · ref 71 · 2 links · internal anchor
SpaceDG is the first large-scale benchmark dataset (~1M QA pairs) simulating nine visual degradations in 3DGS-rendered scenes to measure and improve spatial intelligence robustness in MLLMs.
No Pose, No Problem in 4D: Feed-Forward Dynamic Gaussians from Unposed Multi-View Videos cs.CV · 2026-05-21 · unverdicted · none · ref 40 · internal anchor
NoPo4D is the first feed-forward system for dynamic 4D Gaussian splatting from unposed multi-view videos, using velocity decomposition supervised by optical flow and a bidirectional motion encoder.
Goodbye Drift: Anchored Tree Sampling for Long-Horizon Video-to-Video Generation cs.CV · 2026-05-19 · unverdicted · none · ref 36 · internal anchor
Anchored Tree Sampling converts horizon-compounding drift into anchor-bounded drift by organizing video generation as a sparse-to-dense tree of imputations instead of left-to-right autoregressive rollout.
Depth2Pose: A Pose-Based Benchmark for Monocular Depth Estimation without Ground-Truth Depth cs.CV · 2026-05-19 · unverdicted · none · ref 70 · internal anchor
Depth2Pose is a new evaluation framework for monocular depth estimators that uses relative camera pose accuracy as a task-driven proxy and introduces the D2P dataset of challenging out-of-distribution scenes.
StableHand: Quality-Aware Flow Matching for World-Space Dual-Hand Motion Estimation from Egocentric Video cs.CV · 2026-05-18 · unverdicted · none · ref 18 · internal anchor
StableHand introduces a quality-aware flow matching framework conditioned on predicted four-channel per-frame hand observation quality to estimate dual-hand world-space motion from egocentric video, achieving SOTA results with 20-25% W-MPJPE reduction on HOT3D and ARCTIC benchmarks.
Efficient Feature-Free Initialization for Monocular Visual-Inertial Systems Using a Feed-Forward 3D Model cs.RO · 2026-05-17 · unverdicted · none · ref 51 · internal anchor
A feature-free monocular VINS initialization method that uses feed-forward 3D model point cloud predictions achieves over 90% success rate with under 1.2 seconds of data and performs robustly in degraded environments.
Probing into Camera Control of Video Models cs.CV · 2026-05-14 · unverdicted · none · ref 27 · internal anchor
A training-free method reformulates camera control as geometric displacement fields applied via differentiable latent resampling, enabling control and bias probing in video diffusion models.
Revisiting Photometric Ambiguity for Accurate Gaussian-Splatting Surface Reconstruction cs.CV · 2026-05-12 · unverdicted · none · ref 92 · internal anchor
AmbiSuR adds intrinsic photometric disambiguation and a self-indication module to Gaussian Splatting to resolve ambiguities and improve surface reconstruction accuracy.
PointForward: Feedforward Driving Reconstruction through Point-Aligned Representations cs.CV · 2026-05-12 · unverdicted · none · ref 18 · internal anchor
PointForward uses sparse world-space 3D queries and scene graphs to deliver consistent single-pass reconstruction of dynamic driving scenes via point-aligned representations.
Seeing Across Skies and Streets: Feedforward 3D Reconstruction from Satellite, Drone, and Ground Images cs.CV · 2026-05-08 · unverdicted · none · ref 21 · internal anchor
Cross3R performs feed-forward 3D reconstruction and 6-DoF pose estimation from any combination of satellite, UAV, and ground images, outperforming baselines on a new 278K-image tri-view dataset.
Mix3R: Mixing Feed-forward Reconstruction and Generative 3D Priors for Joint Multi-view Aligned 3D Reconstruction and Pose Estimation cs.CV · 2026-05-05 · unverdicted · none · ref 32 · internal anchor
Mix3R mixes feed-forward reconstruction and generative 3D priors via Mixture-of-Transformers and overlap-based attention bias to achieve better-aligned 3D shapes and more accurate poses than either approach alone.
AirZoo: A Unified Large-Scale Dataset for Grounding Aerial Geometric 3D Vision cs.CV · 2026-04-29 · unverdicted · none · ref 25 · 2 links · internal anchor
AirZoo is a new dataset covering 378 regions across 22 countries with pixel-level metric depth and 6-DoF poses, shown via benchmarks to improve SoTA models on aerial image retrieval, cross-view matching, and multi-view 3D reconstruction.
View-Consistent 3D Scene Editing via Dual-Path Structural Correspondense and Semantic Continuity cs.CV · 2026-04-20 · unverdicted · none · ref 45 · internal anchor
A dual-path consistency framework for text-driven 3D scene editing that models cross-view dependencies via structural correspondence and semantic continuity, trained on a newly constructed paired multi-view dataset.
GlobalSplat: Efficient Feed-Forward 3D Gaussian Splatting via Global Scene Tokens cs.CV · 2026-04-16 · unverdicted · none · ref 17 · internal anchor
GlobalSplat achieves competitive novel-view synthesis on RealEstate10K and ACID using only 16K Gaussians via global scene tokens and coarse-to-fine training, with a 4MB footprint and under 78ms inference.
Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale cs.CV · 2026-04-13 · unverdicted · none · ref 41 · internal anchor
A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.
EgoFun3D: Modeling Interactive Objects from Egocentric Videos using Function Templates cs.CV · 2026-04-13 · unverdicted · none · ref 30 · internal anchor
EgoFun3D creates a new task, 271-video dataset, and pipeline using function templates to model interactive 3D objects from egocentric videos for simulation.
LuMon: A Comprehensive Benchmark and Development Suite with Novel Datasets for Lunar Monocular Depth Estimation cs.CV · 2026-04-10 · unverdicted · none · ref 12 · internal anchor
A new benchmark with real lunar stereo ground truth and analog data shows that sim-to-real fine-tuned monocular depth models achieve large in-domain gains but minimal generalization to actual lunar images.
TAIHRI: Task-Aware 3D Human Keypoints Localization for Close-Range Human-Robot Interaction cs.CV · 2026-04-10 · unverdicted · none · ref 22 · internal anchor
TAIHRI is the first task-aware VLM for close-range HRI that localizes metric-scale 3D coordinates of critical keypoints by quantizing space and performing 2D keypoint reasoning via next-token prediction.
MoRight: Motion Control Done Right cs.CV · 2026-04-08 · unverdicted · none · ref 47 · internal anchor
MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply causality in video generation.
SEM-ROVER: Semantic Voxel-Guided Diffusion for Large-Scale Driving Scene Generation cs.CV · 2026-04-07 · unverdicted · none · ref 13 · internal anchor
SEM-ROVER generates large multiview-consistent 3D urban driving scenes via semantic-conditioned diffusion on Σ-Voxfield voxel grids with progressive outpainting and deferred rendering.
AnyImageNav: Any-View Geometry for Precise Last-Meter Image-Goal Navigation cs.RO · 2026-04-07 · unverdicted · none · ref 18 · internal anchor
AnyImageNav uses a semantic-to-geometric cascade with 3D multi-view foundation models to recover precise 6-DoF poses from goal images, achieving 0.27m position error and state-of-the-art success rates on Gibson and HM3D benchmarks.
