SpatialBench evaluates 41 spatial foundation models across 6 paradigms and 5 task suites, finds they are not all-round players, and introduces the DA-Next-5M dataset plus DA-Next baseline model.
hub Mixed citations
Virtual KITTI 2
Mixed citation behavior. Most common role is background (50%).
abstract
This paper introduces an updated version of the well-known Virtual KITTI dataset which consists of 5 sequence clones from the KITTI tracking benchmark. In addition, the dataset provides different variants of these sequences such as modified weather conditions (e.g. fog, rain) or modified camera configurations (e.g. rotated by 15 degrees). For each sequence, we provide multiple sets of images containing RGB, depth, class segmentation, instance segmentation, flow, and scene flow data. Camera parameters and poses as well as vehicle locations are available as well. In order to showcase some of the dataset's capabilities, we ran multiple relevant experiments using state-of-the-art algorithms from the field of autonomous driving. The dataset is available for download at https://europe.naverlabs.com/Research/Computer-Vision/Proxy-Virtual-Worlds.
hub tools
citation-role summary
citation-polarity summary
fields
cs.CV 49representative citing papers
TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.
SLIM adapts MoGe-2 to truly sparse LiDAR via partial-convolution encoder and multi-scale fusion neck, cutting absolute relative depth error by 39-51% at 100-150 m on Virtual KITTI and CARLA under density-agnostic training.
Depth2Pose is a new evaluation framework for monocular depth estimators that uses relative camera pose accuracy as a task-driven proxy and introduces the D2P dataset of challenging out-of-distribution scenes.
A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.
Mem3R achieves better long-sequence 3D reconstruction by decoupling tracking and mapping with a hybrid memory of TTT-updated MLP and explicit tokens, reducing model size and trajectory errors.
VDPP is an RGB-free video depth post-processor that achieves over 43 FPS on Jetson Orin Nano by refining geometry at low resolution rather than reconstructing full scenes.
ZipMap achieves linear-time bidirectional 3D reconstruction by zipping image collections into a compact stateful representation via test-time training layers.
A viewpoint-conditioned diffusion model generates stereo image pairs from monocular input in a canonical rectified space without using depth or explicit warping.
PointDiT is a from-scratch pixel-space Diffusion Transformer for monocular 3D point map estimation that outperforms latent diffusion models in sharpness and ambiguous regions while using a simpler architecture.
ICDepth adapts text-to-video diffusion transformers for video depth estimation via in-context conditioning, achieving SOTA results on benchmarks with 6-13x less training data than prior generative methods.
UniGP unifies controllable generation and dense prediction in an MMDiT-based diffusion model through simple joint training that preserves backbone priors.
Argus introduces a covisibility module and decomposed pixel-to-world mapping to deliver SOTA metric performance on camera pose, depth, and point cloud tasks using the Realsee3D panoramic dataset.
StereoFactory merges stereo matching foundation models via genetic subset search followed by CMA-ES module routing, reporting lower average errors on four benchmarks than baselines while using 2.7-3.7% of retraining time.
Wild3R is a feed-forward 3D Gaussian Splatting model trained on the new WildCity dataset of 200 scenes with 170 lighting conditions and transients to handle unconstrained sparse photo collections.
GARDEN uses gravity alignment and conditional 3D point classification to factorize RGB reconstructions into explicit rigid bodies plus decoupled background for direct physics simulation.
MDA represents per-pixel depth as a mixture of distributions so that boundary pixels can align hypotheses with distinct surfaces instead of averaging into empty space.
DéjàView applies a single transformer block recurrently for K refinement steps, matching or exceeding larger feed-forward models on five multi-view 3D benchmarks with fewer parameters and comparable compute.
SA4Depth aligns pose-depth scales in self-supervised monocular depth estimation via differentiable feature re-projection refinement, boosting consistency on KITTI, Cityscapes, and NYUv2.
UniT unifies online and offline 3D geometry perception via a Group Autoregressive Transformer that processes observation groups with anchor-free point map prediction and a scale-adaptive loss.
A feed-forward model aligns ground and satellite features to predict Gaussian splats for improved novel-view synthesis on georeferenced outdoor scenes.
GemDepth adds explicit camera-pose geometry embeddings and an alternating spatio-temporal transformer to produce sharper, more temporally consistent video depth maps than prior smoothing-based methods.
Synthetic data complements real data in diffusion-based controllable human video generation, with effective sample selection improving motion realism, temporal consistency, and identity preservation.
LingBot-Map is a streaming 3D reconstruction model built on a geometric context transformer that combines anchor context, pose-reference window, and trajectory memory to deliver accurate, drift-resistant results at 20 FPS over sequences longer than 10,000 frames.
citing papers explorer
-
TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking
TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.
-
Mem3R: Streaming 3D Reconstruction with Hybrid Memory via Test-Time Training
Mem3R achieves better long-sequence 3D reconstruction by decoupling tracking and mapping with a hybrid memory of TTT-updated MLP and explicit tokens, reducing model size and trajectory errors.
-
SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations
SceneScribe-1M is a new dataset of 1 million videos with semantic text, camera parameters, dense depth, and consistent 3D point tracks to support monocular depth estimation, scene reconstruction, point tracking, and text-to-video synthesis.