Virtual KITTI 2

Yohann Cabon, Naila Murray, Martin Humenberger · 2020 · cs.CV · arXiv 2001.10773

Mixed citation behavior. The most common role is background (50% of classified citations).

52 Pith papers cite it.

abstract

This paper introduces an updated version of the well-known Virtual KITTI dataset which consists of 5 sequence clones from the KITTI tracking benchmark. In addition, the dataset provides different variants of these sequences such as modified weather conditions (e.g. fog, rain) or modified camera configurations (e.g. rotated by 15 degrees). For each sequence, we provide multiple sets of images containing RGB, depth, class segmentation, instance segmentation, flow, and scene flow data. Camera parameters and poses as well as vehicle locations are available as well. In order to showcase some of the dataset's capabilities, we ran multiple relevant experiments using state-of-the-art algorithms from the field of autonomous driving. The dataset is available for download at https://europe.naverlabs.com/Research/Computer-Vision/Proxy-Virtual-Worlds.

citation-role summary

dataset 8 · background 3 · method 1

citation-polarity summary

background 6 · use dataset 5 · use method 1

representative citing papers

SpatialBench: Is Your Spatial Foundation Model an All-Round Player?

cs.CV · 2026-05-26 · unverdicted · novelty 8.0

SpatialBench evaluates 41 spatial foundation models across 6 paradigms and 5 task suites, finds they are not all-round players, and introduces the DA-Next-5M dataset plus DA-Next baseline model.

TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking

cs.CV · 2026-05-12 · unverdicted · novelty 8.0

TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.

Humanoid-OmniOcc: Stereo-Based Full-View Occupancy Dataset for Embodied AI

cs.RO · 2026-06-22 · unverdicted · novelty 7.0

Humanoid-OmniOcc delivers a large-scale panoramic stereo occupancy dataset for humanoid robots via Real2Sim2Real, with a model that outperforms monocular baselines in both unseen sim scenes and real settings.

Sparse-LiDAR Prompting of Monocular Geometry Foundations: An Empirical Study Toward Long-Range Driving Depth

cs.CV · 2026-05-26 · unverdicted · novelty 7.0

SLIM adapts MoGe-2 to truly sparse LiDAR via partial-convolution encoder and multi-scale fusion neck, cutting absolute relative depth error by 39-51% at 100-150 m on Virtual KITTI and CARLA under density-agnostic training.

Depth2Pose: A Pose-Based Benchmark for Monocular Depth Estimation without Ground-Truth Depth

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

Depth2Pose is a new evaluation framework for monocular depth estimators that uses relative camera pose accuracy as a task-driven proxy and introduces the D2P dataset of challenging out-of-distribution scenes.

Image Generators are Generalist Vision Learners

cs.CV · 2026-04-22 · conditional · novelty 7.0 · 2 refs

An image generator is instruction-tuned to perform diverse vision tasks by representing task outputs as RGB images, achieving SOTA on segmentation and depth estimation.

Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale

cs.CV · 2026-04-13 · unverdicted · novelty 7.0

A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.

Mem3R: Streaming 3D Reconstruction with Hybrid Memory via Test-Time Training

cs.CV · 2026-04-08 · unverdicted · novelty 7.0

Mem3R achieves better long-sequence 3D reconstruction by decoupling tracking and mapping with a hybrid memory of TTT-updated MLP and explicit tokens, reducing model size and trajectory errors.

VDPP: Video Depth Post-Processing for Speed and Scalability

cs.CV · 2026-04-08 · unverdicted · novelty 7.0

VDPP is an RGB-free video depth post-processor that achieves over 43 FPS on Jetson Orin Nano by refining geometry at low resolution rather than reconstructing full scenes.

ZipMap: Linear-Time Stateful 3D Reconstruction via Test-Time Training

cs.CV · 2026-03-04 · unverdicted · novelty 7.0

ZipMap achieves linear-time bidirectional 3D reconstruction by zipping image collections into a compact stateful representation via test-time training layers.

StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space

cs.CV · 2025-12-11 · unverdicted · novelty 7.0

A viewpoint-conditioned diffusion model generates stereo image pairs from monocular input in a canonical rectified space without using depth or explicit warping.

PointDiT: Pixel-Space Diffusion for Monocular Geometry Estimation

cs.CV · 2026-07-02 · unverdicted · novelty 6.0

PointDiT is a from-scratch pixel-space Diffusion Transformer for monocular 3D point map estimation that outperforms latent diffusion models in sharpness and ambiguous regions while using a simpler architecture.

ICDepth: Taming Video Diffusion Models for Video Depth Estimation via In-Context Conditioning

cs.CV · 2026-07-02 · unverdicted · novelty 6.0

ICDepth adapts text-to-video diffusion transformers for video depth estimation via in-context conditioning, achieving SOTA results on benchmarks with 6-13x less training data than prior generative methods.

UniGP: Taming Diffusion Transformer for Prior-Preserved Unified Generation and Perception

cs.CV · 2026-06-29 · unverdicted · novelty 6.0

UniGP unifies controllable generation and dense prediction in an MMDiT-based diffusion model through simple joint training that preserves backbone priors.

Argus: Metric Panoramic 3D Reconstruction for Indoor Scenes

cs.CV · 2026-06-29 · unverdicted · novelty 6.0 · 2 refs

Argus introduces a covisibility module and decomposed pixel-to-world mapping to deliver SOTA metric performance on camera pose, depth, and point cloud tasks using the Realsee3D panoramic dataset.

Prompting Diffusion Models for Zero-Shot Instance Segmentation

cs.CV · 2026-06-21 · unverdicted · novelty 6.0

Prompt2Seg augments diffusion models with an explicit spatial prompt conditioning branch, enabling zero-shot instance segmentation that generalizes from limited synthetic category training to diverse unseen objects and visual domains.

StereoFactory: A Unified Merging Framework for Robust Stereo Matching

cs.CV · 2026-06-16 · unverdicted · novelty 6.0

StereoFactory merges stereo matching foundation models via genetic subset search followed by CMA-ES module routing, reporting lower average errors on four benchmarks than baselines while using 2.7-3.7% of retraining time.

Wild3R: Feed-Forward 3D Gaussian Splatting from Unconstrained Sparse Photo Collection

cs.CV · 2026-06-10 · unverdicted · novelty 6.0

Wild3R is a feed-forward 3D Gaussian Splatting model trained on the new WildCity dataset of 200 scenes with 170 lighting conditions and transients to handle unconstrained sparse photo collections.

GARDEN: Gravity-Aligned Reconstruction of Disentangled ENvironments from RGB images

cs.CV · 2026-06-02 · unverdicted · novelty 6.0

GARDEN uses gravity alignment and conditional 3D point classification to factorize RGB reconstructions into explicit rigid bodies plus decoupled background for direct physics simulation.

Modeling Depth Ambiguity: A Mixture-Density Representation for Flying-Point-Free Depth Estimation

cs.CV · 2026-06-01 · unverdicted · novelty 6.0

MDA represents per-pixel depth as a mixture of distributions so that boundary pixels can align hypotheses with distinct surfaces instead of averaging into empty space.

Déjà View: Looping Transformers for Multi-View 3D Reconstruction

cs.CV · 2026-05-28 · unverdicted · novelty 6.0

DéjàView applies a single transformer block recurrently for K refinement steps, matching or exceeding larger feed-forward models on five multi-view 3D benchmarks with fewer parameters and comparable compute.

SA4Depth: Consistent Pose-Depth Scale Alignment for Self-Supervised Monocular Depth Estimation

cs.CV · 2026-05-27 · unverdicted · novelty 6.0

SA4Depth aligns pose-depth scales in self-supervised monocular depth estimation via differentiable feature re-projection refinement, boosting consistency on KITTI, Cityscapes, and NYUv2.

UniT: Unified Geometry Learning with Group Autoregressive Transformer

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

UniT unifies online and offline 3D geometry perception via a Group Autoregressive Transformer that processes observation groups with anchor-free point map prediction and a scale-adaptive loss.

Cross-View Splatter: Feed-Forward View Synthesis with Georeferenced Images

cs.CV · 2026-05-19 · unverdicted · novelty 6.0

A feed-forward model aligns ground and satellite features to predict Gaussian splats for improved novel-view synthesis on georeferenced outdoor scenes.

citing papers explorer

Showing 50 of 51 citing papers after filters.

SpatialBench: Is Your Spatial Foundation Model an All-Round Player? cs.CV · 2026-05-26 · unverdicted · none · ref 8 · internal anchor
SpatialBench evaluates 41 spatial foundation models across 6 paradigms and 5 task suites, finds they are not all-round players, and introduces the DA-Next-5M dataset plus DA-Next baseline model.
TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking cs.CV · 2026-05-12 · unverdicted · none · ref 4 · internal anchor
TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.
Sparse-LiDAR Prompting of Monocular Geometry Foundations: An Empirical Study Toward Long-Range Driving Depth cs.CV · 2026-05-26 · unverdicted · none · ref 10 · internal anchor
SLIM adapts MoGe-2 to truly sparse LiDAR via partial-convolution encoder and multi-scale fusion neck, cutting absolute relative depth error by 39-51% at 100-150 m on Virtual KITTI and CARLA under density-agnostic training.
Depth2Pose: A Pose-Based Benchmark for Monocular Depth Estimation without Ground-Truth Depth cs.CV · 2026-05-19 · unverdicted · none · ref 45 · internal anchor
Depth2Pose is a new evaluation framework for monocular depth estimators that uses relative camera pose accuracy as a task-driven proxy and introduces the D2P dataset of challenging out-of-distribution scenes.
Image Generators are Generalist Vision Learners cs.CV · 2026-04-22 · conditional · none · ref 5 · 2 links · internal anchor
An image generator is instruction-tuned to perform diverse vision tasks by representing task outputs as RGB images, achieving SOTA on segmentation and depth estimation.
Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale cs.CV · 2026-04-13 · unverdicted · none · ref 8 · internal anchor
A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.
Mem3R: Streaming 3D Reconstruction with Hybrid Memory via Test-Time Training cs.CV · 2026-04-08 · unverdicted · none · ref 10 · internal anchor
Mem3R achieves better long-sequence 3D reconstruction by decoupling tracking and mapping with a hybrid memory of TTT-updated MLP and explicit tokens, reducing model size and trajectory errors.
VDPP: Video Depth Post-Processing for Speed and Scalability cs.CV · 2026-04-08 · unverdicted · none · ref 3 · internal anchor
VDPP is an RGB-free video depth post-processor that achieves over 43 FPS on Jetson Orin Nano by refining geometry at low resolution rather than reconstructing full scenes.
ZipMap: Linear-Time Stateful 3D Reconstruction via Test-Time Training cs.CV · 2026-03-04 · unverdicted · none · ref 11 · internal anchor
ZipMap achieves linear-time bidirectional 3D reconstruction by zipping image collections into a compact stateful representation via test-time training layers.
StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space cs.CV · 2025-12-11 · unverdicted · none · ref 9 · internal anchor
A viewpoint-conditioned diffusion model generates stereo image pairs from monocular input in a canonical rectified space without using depth or explicit warping.
PointDiT: Pixel-Space Diffusion for Monocular Geometry Estimation cs.CV · 2026-07-02 · unverdicted · none · ref 1 · internal anchor
PointDiT is a from-scratch pixel-space Diffusion Transformer for monocular 3D point map estimation that outperforms latent diffusion models in sharpness and ambiguous regions while using a simpler architecture.
ICDepth: Taming Video Diffusion Models for Video Depth Estimation via In-Context Conditioning cs.CV · 2026-07-02 · unverdicted · none · ref 2 · internal anchor
ICDepth adapts text-to-video diffusion transformers for video depth estimation via in-context conditioning, achieving SOTA results on benchmarks with 6-13x less training data than prior generative methods.
UniGP: Taming Diffusion Transformer for Prior-Preserved Unified Generation and Perception cs.CV · 2026-06-29 · unverdicted · none · ref 2 · internal anchor
UniGP unifies controllable generation and dense prediction in an MMDiT-based diffusion model through simple joint training that preserves backbone priors.
Argus: Metric Panoramic 3D Reconstruction for Indoor Scenes cs.CV · 2026-06-29 · unverdicted · none · ref 9 · 2 links · internal anchor
Argus introduces a covisibility module and decomposed pixel-to-world mapping to deliver SOTA metric performance on camera pose, depth, and point cloud tasks using the Realsee3D panoramic dataset.
Prompting Diffusion Models for Zero-Shot Instance Segmentation cs.CV · 2026-06-21 · unverdicted · none · ref 31 · internal anchor
Prompt2Seg augments diffusion models with an explicit spatial prompt conditioning branch, enabling zero-shot instance segmentation that generalizes from limited synthetic category training to diverse unseen objects and visual domains.
StereoFactory: A Unified Merging Framework for Robust Stereo Matching cs.CV · 2026-06-16 · unverdicted · none · ref 6 · internal anchor
StereoFactory merges stereo matching foundation models via genetic subset search followed by CMA-ES module routing, reporting lower average errors on four benchmarks than baselines while using 2.7-3.7% of retraining time.
Wild3R: Feed-Forward 3D Gaussian Splatting from Unconstrained Sparse Photo Collection cs.CV · 2026-06-10 · unverdicted · none · ref 2 · internal anchor
Wild3R is a feed-forward 3D Gaussian Splatting model trained on the new WildCity dataset of 200 scenes with 170 lighting conditions and transients to handle unconstrained sparse photo collections.
GARDEN: Gravity-Aligned Reconstruction of Disentangled ENvironments from RGB images cs.CV · 2026-06-02 · unverdicted · none · ref 50 · internal anchor
GARDEN uses gravity alignment and conditional 3D point classification to factorize RGB reconstructions into explicit rigid bodies plus decoupled background for direct physics simulation.
Modeling Depth Ambiguity: A Mixture-Density Representation for Flying-Point-Free Depth Estimation cs.CV · 2026-06-01 · unverdicted · none · ref 5 · internal anchor
MDA represents per-pixel depth as a mixture of distributions so that boundary pixels can align hypotheses with distinct surfaces instead of averaging into empty space.
Déjà View: Looping Transformers for Multi-View 3D Reconstruction cs.CV · 2026-05-28 · unverdicted · none · ref 2 · internal anchor
DéjàView applies a single transformer block recurrently for K refinement steps, matching or exceeding larger feed-forward models on five multi-view 3D benchmarks with fewer parameters and comparable compute.
SA4Depth: Consistent Pose-Depth Scale Alignment for Self-Supervised Monocular Depth Estimation cs.CV · 2026-05-27 · unverdicted · none · ref 29 · internal anchor
SA4Depth aligns pose-depth scales in self-supervised monocular depth estimation via differentiable feature re-projection refinement, boosting consistency on KITTI, Cityscapes, and NYUv2.
UniT: Unified Geometry Learning with Group Autoregressive Transformer cs.CV · 2026-05-20 · unverdicted · none · ref 52 · internal anchor
UniT unifies online and offline 3D geometry perception via a Group Autoregressive Transformer that processes observation groups with anchor-free point map prediction and a scale-adaptive loss.
Cross-View Splatter: Feed-Forward View Synthesis with Georeferenced Images cs.CV · 2026-05-19 · unverdicted · none · ref 3 · internal anchor
A feed-forward model aligns ground and satellite features to predict Gaussian splats for improved novel-view synthesis on georeferenced outdoor scenes.
GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth cs.CV · 2026-05-11 · unverdicted · none · ref 3 · 4 links · internal anchor
GemDepth adds explicit camera-pose geometry embeddings and an alternating spatio-temporal transformer to produce sharper, more temporally consistent video depth maps than prior smoothing-based methods.
Exploring the Role of Synthetic Data Augmentation in Controllable Human-Centric Video Generation cs.CV · 2026-04-23 · unverdicted · none · ref 1 · internal anchor
Synthetic data complements real data in diffusion-based controllable human video generation, with effective sample selection improving motion realism, temporal consistency, and identity preservation.
Geometric Context Transformer for Streaming 3D Reconstruction cs.CV · 2026-04-15 · unverdicted · none · ref 3 · internal anchor
LingBot-Map is a streaming 3D reconstruction model built on a geometric context transformer that combines anchor context, pose-reference window, and trajectory memory to deliver accurate, drift-resistant results at 20 FPS over sequences longer than 10,000 frames.
Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective cs.CV · 2026-04-15 · unverdicted · none · ref 282 · internal anchor
The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temporal-aware modeling.
Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction cs.CV · 2026-04-09 · unverdicted · none · ref 7 · internal anchor
Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video sequences.
SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations cs.CV · 2026-04-09 · unverdicted · none · ref 10 · internal anchor
SceneScribe-1M is a new dataset of 1 million videos with semantic text, camera parameters, dense depth, and consistent 3D point tracks to support monocular depth estimation, scene reconstruction, point tracking, and text-to-video synthesis.
LoMa: Local Feature Matching Revisited cs.CV · 2026-04-06 · unverdicted · none · ref 9 · internal anchor
Scaling data, model size, and compute for local feature matching produces large performance gains on challenging benchmarks and a new manually annotated HardMatch dataset.
SimpleProc: Fully Procedural Synthetic Data from Simple Rules for Multi-View Stereo cs.CV · 2026-04-06 · unverdicted · none · ref 2 · internal anchor
Procedural rules with NURBS generate MVS training data that outperforms same-scale manual curation and matches or exceeds larger manual datasets.
Need for Speed: Zero-Shot Depth Completion with Single-Step Diffusion cs.CV · 2026-03-11 · unverdicted · none · ref 7 · internal anchor
Marigold-SSD delivers zero-shot depth completion via single-step diffusion with late fusion, achieving fast inference after only 4.5 GPU days of training while showing strong cross-domain results on indoor and outdoor benchmarks.
DePT3R: Joint Dense Point Tracking and 3D Reconstruction of Dynamic Scenes in a Single Forward Pass cs.CV · 2025-12-15 · unverdicted · none · ref 1 · internal anchor
DePT3R performs joint dense point tracking and 3D reconstruction of dynamic scenes from multiple unposed images using a single neural network forward pass.
Lotus-2: Advancing Geometric Dense Prediction with Powerful Image Generative Model cs.CV · 2025-11-30 · unverdicted · none · ref 68 · internal anchor
Lotus-2 is a two-stage deterministic adaptation of diffusion priors that achieves state-of-the-art monocular depth estimation with only 59K training samples.
Diff4Splat: Controllable 4D Scene Generation with Latent Dynamic Reconstruction Models cs.CV · 2025-11-01 · unverdicted · none · ref 9 · internal anchor
A feed-forward video latent transformer that predicts time-varying 3D Gaussian primitives from one image to produce controllable 4D scenes with appearance, geometry, and motion.
Streaming 4D Visual Geometry Transformer cs.CV · 2025-07-15 · unverdicted · none · ref 2 · internal anchor
A causal transformer with key-value caching and distillation from a bidirectional VGGT model enables efficient online 4D geometry reconstruction from videos.
SAM 2: Segment Anything in Images and Videos cs.CV · 2024-08-01 · conditional · none · ref 3 · internal anchor
SAM 2 delivers more accurate video segmentation with 3x fewer user interactions and 6x faster image segmentation than the original SAM by training a streaming-memory transformer on the largest video segmentation dataset collected to date.
Depth Anything V2 cs.CV · 2024-06-13 · unverdicted · none · ref 9 · internal anchor
Depth Anything V2 delivers finer, more robust monocular depth predictions by replacing real labeled images with synthetic data, scaling the teacher model, and using large-scale pseudo-labeled real images for student training.
Robust Onion: Peeling Open Vocab Object Detectors Under Noise cs.CV · 2026-06-25 · unverdicted · none · ref 7 · 2 links · internal anchor
Empirical study finds OV-OD robustness driven by vision backbone and image domain via layer-wise feature collapse analysis, validated with a low-parameter robustness improvement on real data.
SCOPE: Scale-Consistent One-Pass Estimation of 3D Geometry cs.CV · 2026-06-19 · unverdicted · none · ref 42 · internal anchor
SCOPE uses affine-invariant 3D point maps with shared parameters and three consistency innovations to estimate 3D geometry from extended monocular videos, reporting 24.2% and 34.9% error reductions on ScanNet.
R³: 3D Reconstruction via Relative Regression cs.CV · 2026-05-26 · unverdicted · none · ref 5 · internal anchor
R³ uses relative regression with confidence-weighted constraints from an MLP to support long-context offline and streaming 3D reconstruction without global coordinate assumptions.
HorizonStream: Long-Horizon Attention for Streaming 3D Reconstruction cs.CV · 2026-05-22 · unverdicted · none · ref 3 · internal anchor
HorizonStream is a long-horizon Transformer that factorizes geometric evidence influence into channel-wise linear attention for long-range temporal propagation and local spatiotemporal attention for short-range matching, claiming stable generalization from 48-frame training to over 10,000-frame test
VGGT-Ω cs.CV · 2026-05-14 · unverdicted · none · ref 15 · internal anchor
VGGT-Ω improves feed-forward reconstruction accuracy and efficiency by architectural simplifications, register-based attention, and training on much larger supervised and unlabeled video data.
The Midas Touch for Metric Depth cs.CV · 2026-05-12 · unverdicted · none · ref 4 · internal anchor
MTD turns relative depth into metric depth via segment-wise sparse graph optimization and discontinuity-aware geodesic pixel refinement, claiming better accuracy and generalization than prior depth methods.
ST-Gen4D: Embedding 4D Spatiotemporal Cognition into World Model for 4D Generation cs.CV · 2026-05-08 · unverdicted · none · ref 41 · internal anchor
ST-Gen4D uses a world model that fuses global appearance and local dynamic graphs into a 4D cognition representation to guide consistent 4D Gaussian generation.
Syn4D: A Multiview Synthetic 4D Dataset cs.CV · 2026-05-06 · unverdicted · none · ref 15 · internal anchor
Syn4D is a new multiview synthetic 4D dataset supplying dense ground-truth annotations for dynamic scene reconstruction, tracking, and human pose estimation.
Who Handles Orientation? Investigating Invariance in Feature Matching cs.CV · 2026-04-13 · accept · none · ref 10 · internal anchor
Learning rotation invariance in descriptors matches the performance of matcher-level invariance but allows earlier invariance, faster matchers, and no loss in upright performance when trained at scale.
SMFormer: Empowering Self-supervised Stereo Matching via Foundation Models and Data Augmentation cs.CV · 2026-04-11 · unverdicted · none · ref 79 · internal anchor
SMFormer achieves state-of-the-art self-supervised stereo matching by using vision foundation models for disturbance-resistant features and data augmentation to enforce output consistency, rivaling or exceeding some supervised methods on benchmarks including Booster.
DepthMaster: Taming Diffusion Models for Monocular Depth Estimation cs.CV · 2025-01-05 · unverdicted · none · ref 73 · internal anchor
DepthMaster proposes a single-step diffusion model with Feature Alignment and Fourier Enhancement modules in a two-stage training process to improve generalization and detail preservation in monocular depth estimation over prior diffusion methods.
A Hybrid Approach for Closing the Sim2real Appearance Gap in Game Engine Synthetic Datasets cs.CV · 2026-05-04 · unverdicted · none · ref 14 · internal anchor
Combining a diffusion model and an image-to-image translation model produces more photorealistic game-engine synthetic images than either alone while keeping semantic labels intact.