pith. sign in

super hub Mixed citations

Depth Anything 3: Recovering the Visual Space from Any Views

Mixed citation behavior. Most common role is method (42%).

191 Pith papers citing it
Method 42% of classified citations
abstract

We present Depth Anything 3 (DA3), a model that predicts spatially consistent geometry from an arbitrary number of visual inputs, with or without known camera poses. In pursuit of minimal modeling, DA3 yields two key insights: a single plain transformer (e.g., vanilla DINO encoder) is sufficient as a backbone without architectural specialization, and a singular depth-ray prediction target obviates the need for complex multi-task learning. Through our teacher-student training paradigm, the model achieves a level of detail and generalization on par with Depth Anything 2 (DA2). We establish a new visual geometry benchmark covering camera pose estimation, any-view geometry and visual rendering. On this benchmark, DA3 sets a new state-of-the-art across all tasks, surpassing prior SOTA VGGT by an average of 44.3% in camera pose accuracy and 25.1% in geometric accuracy. Moreover, it outperforms DA2 in monocular depth estimation. All models are trained exclusively on public academic datasets.

hub tools

citation-role summary

background 13 method 13 baseline 4 dataset 1

citation-polarity summary

claims ledger

  • abstract We present Depth Anything 3 (DA3), a model that predicts spatially consistent geometry from an arbitrary number of visual inputs, with or without known camera poses. In pursuit of minimal modeling, DA3 yields two key insights: a single plain transformer (e.g., vanilla DINO encoder) is sufficient as a backbone without architectural specialization, and a singular depth-ray prediction target obviates the need for complex multi-task learning. Through our teacher-student training paradigm, the model achieves a level of detail and generalization on par with Depth Anything 2 (DA2). We establish a new

authors

co-cited works

clear filters

representative citing papers

LIME: Learning Intent-aware Camera Motion from Egocentric Video

cs.RO · 2026-07-02 · unverdicted · novelty 7.0

LIME formulates language-conditioned camera motion as predicting SE(3) target poses from RGB and intent text, using mined multi-intent supervision from egocentric video and a flow-matching pose head.

InvSplat: Inverse Feed-Forward Scene Splatting

cs.CV · 2026-07-02 · unverdicted · novelty 7.0

InvSplat is a feed-forward multi-view model that predicts 3D Gaussians augmented with intrinsic material attributes for inverse rendering and relighting.

World Tracing: Generative Pixel-Aligned Geometry Beyond the Visible

cs.CV · 2026-06-11 · unverdicted · novelty 7.0

World Tracing introduces a multi-layer pixel-aligned 3D point representation instantiated via a diffusion transformer (WT-DiT) trained with pixel-space flow matching to jointly reconstruct visible surfaces and generate occluded geometry.

ZipSplat: Fewer Gaussians, Better Splats

cs.CV · 2026-06-03 · unverdicted · novelty 7.0

ZipSplat uses multi-view token extraction followed by k-means clustering and attention to decode compact scene tokens into unconstrained 3D Gaussians, achieving SOTA pose-free results with ~6x fewer primitives.

Honey, I Shrunk the Arc de Triomphe!

cs.CV · 2026-06-01 · unverdicted · novelty 7.0 · 2 refs

MetricScenes dataset from web photos and stereo imagery, plus a two-stage Poisson depth completion method, allows fine-tuning MoGe-2 to mitigate scale-collapse in metric monocular geometry while preserving benchmark performance.

citing papers explorer

Showing 1 of 1 citing paper after filters.

  • AlloSpatial: Agentic Harness Framework for Spatial Reasoning in Foundation Models cs.AI · 2026-06-08 · unverdicted · none · ref 18 · internal anchor

    AlloSpatial adds structured allocentric priors and a harness for tool-use and arbitration to improve spatial reasoning in foundation models, with 5-18% gains on VSI-Bench and MindCube in training-free settings and further gains after RL internalization.