pith. machine review for the scientific record.


Virtual KITTI 2

22 Pith papers cite this work. Polarity classification is still indexing.

abstract

This paper introduces an updated version of the well-known Virtual KITTI dataset, which consists of 5 sequence clones from the KITTI tracking benchmark. In addition, the dataset provides different variants of these sequences, such as modified weather conditions (e.g. fog, rain) or modified camera configurations (e.g. rotated by 15 degrees). For each sequence, we provide multiple sets of images containing RGB, depth, class segmentation, instance segmentation, flow, and scene flow data. Camera parameters, camera poses, and vehicle locations are also provided. To showcase some of the dataset's capabilities, we ran multiple relevant experiments using state-of-the-art algorithms from the field of autonomous driving. The dataset is available for download at https://europe.naverlabs.com/Research/Computer-Vision/Proxy-Virtual-Worlds.
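The abstract describes a dataset organized by sequence clone, variant, and ground-truth modality. The sketch below illustrates how such a layout could be traversed; the scene names, variant names, and directory/file naming are illustrative assumptions for this example, not the dataset's documented structure.

```python
from pathlib import Path

# Assumed names for the 5 sequence clones and a few of the variants and
# modalities mentioned in the abstract (illustrative, not authoritative).
SCENES = ["Scene01", "Scene02", "Scene06", "Scene18", "Scene20"]
VARIANTS = ["clone", "fog", "rain", "15-deg-left", "15-deg-right"]
MODALITIES = ["rgb", "depth", "classSegmentation", "instanceSegmentation", "forwardFlow"]

def frame_paths(root: str, scene: str, variant: str, frame: int, camera: int = 0) -> dict:
    """Build hypothetical per-modality file paths for one frame of one variant."""
    base = Path(root) / scene / variant / "frames"
    return {
        m: base / m / f"Camera_{camera}" / f"{m}_{frame:05d}.png"
        for m in MODALITIES
    }

# Example: paths for frame 0 of the unmodified ("clone") variant of Scene01.
paths = frame_paths("/data/vkitti2", "Scene01", "clone", 0)
```

In practice one would check the dataset's own README for the actual directory layout and file extensions before iterating over it.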


fields

cs.CV (22)

years

2026 (20) · 2024 (2)

representative citing papers

Image Generators are Generalist Vision Learners

cs.CV · 2026-04-22 · unverdicted · novelty 6.0

Image generation pretraining produces generalist vision models that reframe perception tasks as image synthesis and reach SOTA results on segmentation, depth estimation, and other 2D/3D tasks.

Geometric Context Transformer for Streaming 3D Reconstruction

cs.CV · 2026-04-15 · unverdicted · novelty 6.0

LingBot-Map is a streaming 3D reconstruction model built on a geometric context transformer that combines anchor context, pose-reference window, and trajectory memory to deliver accurate, drift-resistant results at 20 FPS over sequences longer than 10,000 frames.

Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective

cs.CV · 2026-04-15 · unverdicted · novelty 6.0

The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temporal-aware modeling.

LoMa: Local Feature Matching Revisited

cs.CV · 2026-04-06 · unverdicted · novelty 6.0

Scaling data, model size, and compute for local feature matching produces large performance gains on challenging benchmarks and a new manually annotated HardMatch dataset.

SAM 2: Segment Anything in Images and Videos

cs.CV · 2024-08-01 · conditional · novelty 6.0

SAM 2 delivers more accurate video segmentation with 3x fewer user interactions and 6x faster image segmentation than the original SAM by training a streaming-memory transformer on the largest video segmentation dataset collected to date.

Depth Anything V2

cs.CV · 2024-06-13 · unverdicted · novelty 6.0

Depth Anything V2 delivers finer, more robust monocular depth predictions by replacing real labeled images with synthetic data, scaling the teacher model, and using large-scale pseudo-labeled real images for student training.

The Midas Touch for Metric Depth

cs.CV · 2026-05-12 · unverdicted · novelty 5.0

MTD turns relative depth into metric depth via segment-wise sparse graph optimization and discontinuity-aware geodesic pixel refinement, claiming better accuracy and generalization than prior depth methods.

Syn4D: A Multiview Synthetic 4D Dataset

cs.CV · 2026-05-06 · unverdicted · novelty 5.0

Syn4D is a new multiview synthetic 4D dataset supplying dense ground-truth annotations for dynamic scene reconstruction, tracking, and human pose estimation.

citing papers explorer

Showing 22 of 22 citing papers.

  • TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking cs.CV · 2026-05-12 · unverdicted · none · ref 4

    TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.

  • Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale cs.CV · 2026-04-13 · unverdicted · none · ref 8

    A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.

  • Mem3R: Streaming 3D Reconstruction with Hybrid Memory via Test-Time Training cs.CV · 2026-04-08 · unverdicted · none · ref 10

    Mem3R achieves better long-sequence 3D reconstruction by decoupling tracking and mapping with a hybrid memory of TTT-updated MLP and explicit tokens, reducing model size and trajectory errors.

  • VDPP: Video Depth Post-Processing for Speed and Scalability cs.CV · 2026-04-08 · unverdicted · none · ref 3

    VDPP is an RGB-free video depth post-processor that achieves over 43 FPS on Jetson Orin Nano by refining geometry at low resolution rather than reconstructing full scenes.

  • GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth cs.CV · 2026-05-11 · unverdicted · none · ref 3 · 3 links

    GemDepth predicts inter-frame camera poses to inject geometric embeddings into a spatio-temporal transformer, yielding state-of-the-art 3D-consistent video depth.

  • Exploring the Role of Synthetic Data Augmentation in Controllable Human-Centric Video Generation cs.CV · 2026-04-23 · unverdicted · none · ref 1

    Synthetic data complements real data in diffusion-based controllable human video generation, with effective sample selection improving motion realism, temporal consistency, and identity preservation.

  • Image Generators are Generalist Vision Learners cs.CV · 2026-04-22 · unverdicted · none · ref 5

    Image generation pretraining produces generalist vision models that reframe perception tasks as image synthesis and reach SOTA results on segmentation, depth estimation, and other 2D/3D tasks.

  • Geometric Context Transformer for Streaming 3D Reconstruction cs.CV · 2026-04-15 · unverdicted · none · ref 3

    LingBot-Map is a streaming 3D reconstruction model built on a geometric context transformer that combines anchor context, pose-reference window, and trajectory memory to deliver accurate, drift-resistant results at 20 FPS over sequences longer than 10,000 frames.

  • Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective cs.CV · 2026-04-15 · unverdicted · none · ref 282

    The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temporal-aware modeling.

  • Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction cs.CV · 2026-04-09 · unverdicted · none · ref 7

    Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video sequences.

  • SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations cs.CV · 2026-04-09 · unverdicted · none · ref 10

    SceneScribe-1M is a new dataset of 1 million videos with semantic text, camera parameters, dense depth, and consistent 3D point tracks to support monocular depth estimation, scene reconstruction, point tracking, and text-to-video synthesis.

  • LoMa: Local Feature Matching Revisited cs.CV · 2026-04-06 · unverdicted · none · ref 9

    Scaling data, model size, and compute for local feature matching produces large performance gains on challenging benchmarks and a new manually annotated HardMatch dataset.

  • SimpleProc: Fully Procedural Synthetic Data from Simple Rules for Multi-View Stereo cs.CV · 2026-04-06 · unverdicted · none · ref 2

    Procedural rules with NURBS generate MVS training data that outperforms same-scale manual curation and matches or exceeds larger manual datasets.

  • SAM 2: Segment Anything in Images and Videos cs.CV · 2024-08-01 · conditional · none · ref 3

    SAM 2 delivers more accurate video segmentation with 3x fewer user interactions and 6x faster image segmentation than the original SAM by training a streaming-memory transformer on the largest video segmentation dataset collected to date.

  • Depth Anything V2 cs.CV · 2024-06-13 · unverdicted · none · ref 9

    Depth Anything V2 delivers finer, more robust monocular depth predictions by replacing real labeled images with synthetic data, scaling the teacher model, and using large-scale pseudo-labeled real images for student training.

  • The Midas Touch for Metric Depth cs.CV · 2026-05-12 · unverdicted · none · ref 4

    MTD turns relative depth into metric depth via segment-wise sparse graph optimization and discontinuity-aware geodesic pixel refinement, claiming better accuracy and generalization than prior depth methods.

  • ST-Gen4D: Embedding 4D Spatiotemporal Cognition into World Model for 4D Generation cs.CV · 2026-05-08 · unverdicted · none · ref 41

    ST-Gen4D uses a world model that fuses global appearance and local dynamic graphs into a 4D cognition representation to guide consistent 4D Gaussian generation.

  • Syn4D: A Multiview Synthetic 4D Dataset cs.CV · 2026-05-06 · unverdicted · none · ref 15

    Syn4D is a new multiview synthetic 4D dataset supplying dense ground-truth annotations for dynamic scene reconstruction, tracking, and human pose estimation.

  • Who Handles Orientation? Investigating Invariance in Feature Matching cs.CV · 2026-04-13 · accept · none · ref 10

    Learning rotation invariance in descriptors matches the performance of matcher-level invariance but allows earlier invariance, faster matchers, and no loss in upright performance when trained at scale.

  • SMFormer: Empowering Self-supervised Stereo Matching via Foundation Models and Data Augmentation cs.CV · 2026-04-11 · unverdicted · none · ref 79

    SMFormer achieves state-of-the-art self-supervised stereo matching by using vision foundation models for disturbance-resistant features and data augmentation to enforce output consistency, rivaling or exceeding some supervised methods on benchmarks including Booster.

  • Geometry Reinforced Efficient Attention Tuning Equipped with Normals for Robust Stereo Matching cs.CV · 2026-04-10 · unverdicted · none · ref 78

    GREATEN fuses surface normals with image features via gated contextual-geometric fusion and efficient sparse attentions to cut stereo matching errors by up to 30% on real datasets when trained solely on synthetic data.

  • A Hybrid Approach for Closing the Sim2real Appearance Gap in Game Engine Synthetic Datasets cs.CV · 2026-05-04 · unverdicted · none · ref 14

    Combining a diffusion model and an image-to-image translation model produces more photorealistic game-engine synthetic images than either alone while keeping semantic labels intact.