pith. sign in

arxiv: 2506.07826 · v2 · submitted 2025-06-09 · 💻 cs.CV · cs.LG· cs.RO

R3D2: Realistic 3D Asset Insertion via Diffusion for Autonomous Driving Simulation

Pith reviewed 2026-05-19 10:18 UTC · model grok-4.3

classification 💻 cs.CV cs.LGcs.RO
keywords 3D asset insertiondiffusion modelautonomous driving simulationneural rendering3D Gaussian Splattingshadow and lighting generationphotorealistic scene editing
0
0 comments X

The pith

R3D2 uses a one-step diffusion model to insert complete 3D assets into driving scenes while generating matching shadows and lighting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces R3D2 to solve the problem that neural reconstructions like 3D Gaussian Splatting bake lighting and shadows into objects, making them difficult to move or reuse across scenes. R3D2 is a lightweight diffusion model trained on a synthetic dataset where 3D assets generated from driving data are placed into neural-rendered environments so the model learns to produce realistic integration effects. This approach allows real-time insertion of full 3D objects into existing scenes for autonomous driving simulation. A reader would care because it promises to turn limited real-world recordings into many varied, photorealistic test cases without manual modeling of every lighting condition.

Core claim

R3D2 is a lightweight one-step diffusion model trained on synthetic placements of 3DGS-generated assets into neural rendering environments; once trained it generates the shadows, lighting, and other rendering effects required to insert complete reusable 3D assets into existing driving scenes in real time.

What carries the argument

R3D2, a one-step diffusion model that predicts the additional rendering effects needed to blend inserted 3D assets into a scene.

If this is right

  • Enables text-to-3D asset creation followed by realistic insertion into any reconstructed scene.
  • Supports moving the same 3D object from one driving dataset or scene to another while preserving appearance consistency.
  • Increases the number of controllable safety-critical test scenarios that can be generated from a fixed set of base scenes.
  • Reduces reliance on per-scene manual adjustment of lighting when building simulation environments for autonomous driving validation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same insertion technique could be applied to create rare-event test cases by compositing objects recorded in separate real drives.
  • Extending the model to handle object motion across video frames would allow simulation of dynamic interactions such as overtaking or merging.
  • Releasing the trained model and dataset may let other groups adapt the approach to non-driving domains like indoor robotics or urban planning.

Load-bearing premise

The lighting and shadow patterns learned from synthetically placed 3D assets in neural environments match the patterns that appear when the same assets are inserted into real captured driving scenes.

What would settle it

Insert the same 3D assets into a set of real-world driving images and measure whether the generated shadows and illumination from R3D2 match the actual captured shadows and lighting; large mismatches would show the model has not learned the real distribution.

Figures

Figures reproduced from arXiv: 2506.07826 by Adam Tonderski, Bernardo Taveira, Chensheng Peng, Christoffer Petersson, Fredrik Kahl, Kurt Keutzer, Masayoshi Tomizuka, Michael Felsberg, Wei Zhan, Wenzhao Zheng, William Ljungbergh.

Figure 1
Figure 1. Figure 1: R3D2 enables realistic insertion of 3D assets from different origins into existing scene [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Complete 3D asset generated from an in-the-wild object observation using Amodal3R [ [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Two image pairs from R3D3. The top row shows the original image [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Overview of the R3D2 training pipeline. Amodal3R [ [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Image editing results when applying several SotA image-editing tools to our setting. [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative results from R3D3. Realistically added effects from R3D2 are highlighted in [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Qualitative examples of rendering effects generated with R3D2. We show results when [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Additional samples of off-the-shelf image editing models applied to our use case. [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Multi-view images of partially reconstructed dynamic objects. [PITH_FULL_IMAGE:figures/full_fig_p015_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Example of text-to-3D assets generated with TRELLIS [45]. [PITH_FULL_IMAGE:figures/full_fig_p016_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Samples of R3D2 applied on reinserted actors. [PITH_FULL_IMAGE:figures/full_fig_p017_11.png] view at source ↗
read the original abstract

Validating autonomous driving (AD) systems requires diverse and safety-critical testing, making photorealistic virtual environments essential. Traditional simulation platforms, while controllable, are resource-intensive to scale and often suffer from a domain gap with real-world data. In contrast, neural reconstruction methods like 3D Gaussian Splatting (3DGS) offer a scalable solution for creating photorealistic digital twins of real-world driving scenes. However, they struggle with dynamic object manipulation and reusability as their per-scene optimization-based methodology tends to result in incomplete object models with integrated illumination effects. This paper introduces R3D2, a lightweight, one-step diffusion model designed to overcome these limitations and enable realistic insertion of complete 3D assets into existing scenes by generating plausible rendering effects-such as shadows and consistent lighting-in real time. This is achieved by training R3D2 on a novel dataset: 3DGS object assets are generated from in-the-wild AD data using an image-conditioned 3D generative model, and then synthetically placed into neural rendering-based virtual environments, allowing R3D2 to learn realistic integration. Quantitative and qualitative evaluations demonstrate that R3D2 significantly enhances the realism of inserted assets, enabling use-cases like text-to-3D asset insertion and cross-scene/dataset object transfer, allowing for true scalability in AD validation. To promote further research in scalable and realistic AD simulation, we release our code, see https://research.zenseact.com/publications/R3D2/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces R3D2, a lightweight one-step diffusion model for inserting complete 3D assets into existing neural-reconstructed driving scenes while generating plausible shadows, lighting, and other rendering effects in real time. Assets are created via image-conditioned 3D generative models from in-the-wild AD data, then placed synthetically into neural rendering environments to form the training set; the model is evaluated quantitatively and qualitatively on realism improvements and demonstrated on applications including text-to-3D insertion and cross-scene transfer. Code is released.

Significance. If the central claims hold, R3D2 would offer a practical route to scalable, controllable, and photorealistic AD simulation by overcoming the static-object limitation of 3DGS reconstructions. The explicit code release is a clear strength that supports reproducibility and community follow-up on synthetic-to-real generalization.

major comments (2)
  1. [Abstract] Abstract: the statement that 'Quantitative and qualitative evaluations demonstrate that R3D2 significantly enhances the realism of inserted assets' is not supported by any named metrics (FID, LPIPS, user-study scores, etc.), baseline methods, or ablation tables. Without these details it is impossible to judge whether the reported gains are robust or merely reflect the synthetic training distribution.
  2. [Dataset construction and §4] Dataset construction and §4 (Experiments): the training data are generated entirely inside a closed synthetic loop (3DGS assets placed into neural rendering environments). No quantitative analysis or held-out test on real captured driving scenes with ground-truth inserted objects is described, so the domain gap in illumination, inter-reflections, and sensor noise remains unmeasured. This assumption is load-bearing for the claim that R3D2 produces realistic insertions usable for downstream AD validation.
minor comments (1)
  1. [Abstract / Conclusion] The link to the released code is given but the manuscript does not state whether the synthetic dataset generation pipeline and evaluation scripts are included; adding this clarification would strengthen reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive feedback on our work. We have addressed each of the major comments below and outline the revisions we plan to make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the statement that 'Quantitative and qualitative evaluations demonstrate that R3D2 significantly enhances the realism of inserted assets' is not supported by any named metrics (FID, LPIPS, user-study scores, etc.), baseline methods, or ablation tables. Without these details it is impossible to judge whether the reported gains are robust or merely reflect the synthetic training distribution.

    Authors: We appreciate this observation. The quantitative metrics, including FID and LPIPS scores, along with comparisons to baseline methods and ablation studies, are detailed in Section 4 (Experiments) of the manuscript. To improve clarity and allow immediate assessment of our results, we will revise the abstract to specify these metrics and baselines. This change will be incorporated in the revised version. revision: yes

  2. Referee: [Dataset construction and §4] Dataset construction and §4 (Experiments): the training data are generated entirely inside a closed synthetic loop (3DGS assets placed into neural rendering environments). No quantitative analysis or held-out test on real captured driving scenes with ground-truth inserted objects is described, so the domain gap in illumination, inter-reflections, and sensor noise remains unmeasured. This assumption is load-bearing for the claim that R3D2 produces realistic insertions usable for downstream AD validation.

    Authors: We agree that the training data is constructed synthetically, as this enables the creation of a large-scale dataset with controlled placements and ground-truth rendering effects, which would be impractical to obtain from real-world captures. In the experiments, we evaluate on held-out synthetic test sets quantitatively and provide qualitative results on real scenes to show the model's ability to generalize. We recognize the importance of quantifying the domain gap and will add a new subsection in the discussion or limitations to analyze potential gaps in illumination and sensor characteristics, along with suggestions for future real-world validation studies. However, performing quantitative tests with ground-truth insertions in real scenes is beyond the current scope due to the lack of such paired real data. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained via external synthetic data pipeline

full rationale

The paper describes training a one-step diffusion model on a dataset constructed by generating 3DGS assets via an image-conditioned generative model and inserting them into neural rendering environments. This pipeline relies on external components (3DGS, image-conditioned models, neural renderers) rather than defining the target realism metric in terms of the model's own outputs. No equations, fitted parameters renamed as predictions, or self-citation chains are presented that reduce the central claim (plausible shadow/lighting generation) to the training inputs by construction. The method is evaluated for enhancements in realism and use-cases like cross-scene transfer, with the synthetic data serving as a proxy rather than a closed definitional loop. This is a standard training-on-synthetic setup without load-bearing self-referential reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The method rests on the assumption that synthetic placement of 3DGS assets into neural scenes produces training pairs whose illumination statistics match real insertions; no new physical constants or entities are introduced.

axioms (2)
  • domain assumption Image-conditioned 3D generative models can produce complete object assets from in-the-wild driving images without baked-in scene illumination.
    Invoked when creating the training assets from real AD data.
  • domain assumption Neural rendering environments provide a sufficiently accurate base scene for learning insertion effects.
    Used to generate the synthetic training pairs.

pith-pipeline@v0.9.0 · 5850 in / 1449 out tokens · 20799 ms · 2026-05-19T10:18:44.663487+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ChopGrad: Pixel-Wise Losses for Latent Video Diffusion via Truncated Backpropagation

    cs.CV 2026-03 unverdicted novelty 7.0

    ChopGrad truncates backpropagation to local frame windows in video diffusion models, reducing memory from linear in frame count to constant while enabling pixel-wise loss fine-tuning.

  2. CityRAG: Stepping Into a City via Spatially-Grounded Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    CityRAG generates minutes-long 3D-consistent videos of real-world cities by grounding outputs in geo-registered data and using temporally unaligned training to disentangle fixed scenes from transient elements like weather.

  3. Unposed-to-3D: Learning Simulation-Ready Vehicles from Real-World Images

    cs.CV 2026-04 unverdicted novelty 5.0

    Unposed-to-3D learns simulation-ready 3D vehicle models from unposed real images by predicting camera parameters for photometric self-supervision, then adding scale prediction and harmonization.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · cited by 3 Pith papers · 3 internal anchors

  1. [1]

    Zenseact open dataset: A large-scale and diverse multimodal dataset for autonomous driving

    Mina Alibeigi, William Ljungbergh, Adam Tonderski, Georg Hess, Adam Lilja, Carl Lindström, Daria Motorniuk, Junsheng Fu, Jenny Widahl, and Christoffer Petersson. Zenseact open dataset: A large-scale and diverse multimodal dataset for autonomous driving. In ICCV, 2023

  2. [2]

    Tiny autoencoder for stable diffusion

    Ollin Boer Bohan. Tiny autoencoder for stable diffusion. https://github.com/madebyollin/taesd, 2023

  3. [3]

    Tim Brooks, Aleksander Holynski, and Alexei A. Efros. Instructpix2pix: Learning to follow image editing instructions. In CVPR, June 2023

  4. [4]

    nuscenes: A multimodal dataset for autonomous driving

    Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In CVPR, 2020

  5. [5]

    8 Sana-sprint: One-step diffusion with continuous-time consis- tency distillation.arXiv preprint arXiv:2503.09641, 2025

    Junsong Chen, Shuchen Xue, Yuyang Zhao, Jincheng Yu, Sayak Paul, Junyu Chen, Han Cai, Enze Xie, and Song Han. Sana-sprint: One-step diffusion with continuous-time consistency distillation. arXiv preprint arXiv:2503.09641, 2025

  6. [6]

    Boregowda, and R

    Ankit Dhiman, Manan Shah, Rishubh Parihar, Yash Bhalgat, Lokesh R. Boregowda, and R. Venkatesh Babu. Reflecting reality: Enabling diffusion models to produce faithful mirror reflections. ArXiv, 2024

  7. [7]

    Carla: An open urban driving simulator

    Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. Carla: An open urban driving simulator. In CoRL, 2017

  8. [8]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024

  9. [9]

    Splatad: Real-time lidar and camera rendering with 3d gaussian splatting for autonomous driving

    Georg Hess, Carl Lindström, Maryam Fatemi, Christoffer Petersson, and Lennart Svensson. Splatad: Real-time lidar and camera rendering with 3d gaussian splatting for autonomous driving. In CVPR, 2025

  10. [10]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS, 2017

  11. [11]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 2020

  12. [12]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 2022

  13. [13]

    Smartedit: Exploring complex instruction-based image editing with multimodal large language models

    Yuzhou Huang, Liangbin Xie, Xintao Wang, Ziyang Yuan, Xiaodong Cun, Yixiao Ge, Jiantao Zhou, Chao Dong, Rui Huang, Ruimao Zhang, et al. Smartedit: Exploring complex instruction-based image editing with multimodal large language models. In CVPR, 2024

  14. [14]

    3d gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. TOG, 2023

  15. [15]

    Photorealistic object insertion with diffusion-guided inverse rendering

    Ruofan Liang, Zan Gojcic, Merlin Nimier-David, David Acuna, Nandita Vijaykumar, Sanja Fidler, and Zian Wang. Photorealistic object insertion with diffusion-guided inverse rendering. In ECCV, 2024. 10

  16. [16]

    Are nerfs ready for autonomous driving? towards closing the real-to-simulation gap

    Carl Lindström, Georg Hess, Adam Lilja, Maryam Fatemi, Lars Hammarstrand, Christoffer Petersson, and Lennart Svensson. Are nerfs ready for autonomous driving? towards closing the real-to-simulation gap. In CVPRW, 2024

  17. [17]

    3dgs-enhancer: Enhancing unbounded 3d gaussian splatting with view-consistent 2d diffusion priors

    Xi Liu, Chaoyi Zhou, and Siyu Huang. 3dgs-enhancer: Enhancing unbounded 3d gaussian splatting with view-consistent 2d diffusion priors. In NeurIPS, 2024

  18. [18]

    Neuroncap: Photorealistic closed-loop safety testing for autonomous driving

    William Ljungbergh, Adam Tonderski, Joakim Johnander, Holger Caesar, Kalle Åström, Michael Felsberg, and Christoffer Petersson. Neuroncap: Photorealistic closed-loop safety testing for autonomous driving. In ECCV, 2025

  19. [19]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

  20. [20]

    SDEdit: Guided image synthesis and editing with stochastic differential equations

    Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In ICLR, 2022

  21. [21]

    Srinivasan, Matthew Tancik, Jonathan T

    Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020

  22. [22]

    Image generated using chatgpt with dalle model

    OpenAI. Image generated using chatgpt with dalle model. https://chat.openai.com, 2025. Accessed May 15, 2025

  23. [23]

    pix2gestalt: Amodal segmentation by synthesizing wholes

    Ege Ozguroglu, Ruoshi Liu, Dídac Sur´s, Dian Chen, Achal Dave, Pavel Tokmakov, and Carl V ondrick. pix2gestalt: Amodal segmentation by synthesizing wholes. CVPR, 2024

  24. [24]

    One-step image translation with text-to-image models,

    Gaurav Parmar, Taesung Park, Srinivasa Narasimhan, and Jun-Yan Zhu. One-step image translation with text-to-image models. arXiv preprint arXiv:2403.12036, 2024

  25. [25]

    Physically based rendering: From theory to implementa- tion

    Matt Pharr, Wenzel Jakob, and Greg Humphreys. Physically based rendering: From theory to implementa- tion. MIT Press, 2023

  26. [26]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023

  27. [27]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022

  28. [28]

    Film: Frame interpolation for large motion

    Fitsum Reda, Janne Kontkanen, Eric Tabellion, Deqing Sun, Caroline Pantofaru, and Brian Curless. Film: Frame interpolation for large motion. In ECCV, 2022

  29. [29]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022

  30. [30]

    Dream- booth: Fine tuning text-to-image diffusion models for subject-driven generation

    Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dream- booth: Fine tuning text-to-image diffusion models for subject-driven generation. In CVPR, 2023

  31. [31]

    Palette: Image-to-image diffusion models

    Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. In SIGGRAPH, 2022

  32. [32]

    Sara Mahdavi, Raphael Gontijo-Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Lit, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Raphael Gontijo-Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. In NeurIPS, 2022

  33. [33]

    Fast high-resolution image synthesis with latent adversarial diffusion distillation

    Axel Sauer, Frederic Boesel, Tim Dockhorn, Andreas Blattmann, Patrick Esser, and Robin Rombach. Fast high-resolution image synthesis with latent adversarial diffusion distillation. In SIGGRAPH, 2024

  34. [34]

    Airsim: High-fidelity visual and physical simulation for autonomous vehicles

    Shital Shah, Debadeepta Dey, Chris Lovett, and Ashish Kapoor. Airsim: High-fidelity visual and physical simulation for autonomous vehicles. In Field and Service Robotics: Results of the 11th International Conference, 2018

  35. [35]

    Gina-3d: Learning to generate implicit neural assets in the wild

    Bokui Shen, Xinchen Yan, Charles R Qi, Mahyar Najibi, Boyang Deng, Leonidas Guibas, Yin Zhou, and Dragomir Anguelov. Gina-3d: Learning to generate implicit neural assets in the wild. In CVPR, 2023

  36. [36]

    Emu edit: Precise image editing via recognition and generation tasks

    Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu edit: Precise image editing via recognition and generation tasks. In CVPR, 2024. 11

  37. [37]

    Scalability in perception for autonomous driving: Waymo open dataset

    Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Ettinger, Maxim Krivokon, Amy Gao, Aditya Joshi, Yu Zhang, Jonathon Shlens, Zhifeng Chen, and Dragomir Anguelov. Scalability in perception...

  38. [38]

    Neurad: Neural rendering for autonomous driving

    Adam Tonderski, Carl Lindström, Georg Hess, William Ljungbergh, Lennart Svensson, and Christoffer Petersson. Neurad: Neural rendering for autonomous driving. In CVPR, 2024

  39. [39]

    Suds: Scalable urban dynamic scenes

    Haithem Turki, Jason Y Zhang, Francesco Ferroni, and Deva Ramanan. Suds: Scalable urban dynamic scenes. In CVPR, 2023

  40. [40]

    Image quality assessment: from error visibility to structural similarity

    Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE TIP, 2004

  41. [41]

    Neural light field estimation for street scenes with differentiable virtual object insertion

    Zian Wang, Wenzheng Chen, David Acuna, Jan Kautz, and Sanja Fidler. Neural light field estimation for street scenes with differentiable virtual object insertion. In ECCV, 2022

  42. [42]

    Objectdrop: Bootstrapping counterfactuals for photorealistic object removal and insertion

    Daniel Winter, Matan Cohen, Shlomi Fruchter, Yael Pritch, Alex Rav-Acha, and Yedid Hoshen. Objectdrop: Bootstrapping counterfactuals for photorealistic object removal and insertion. In ECCV, 2024

  43. [43]

    Difix3d+: Improving 3d reconstructions with single-step diffusion models

    Jay Zhangjie Wu, Yuxuan Zhang, Haithem Turki, Xuanchi Ren, Jun Gao, Mike Zheng Shou, Sanja Fidler, Zan Gojcic, and Huan Ling. Difix3d+: Improving 3d reconstructions with single-step diffusion models. In CVPR, 2025

  44. [44]

    Amodal3r: Amodal 3d reconstruction from occluded 2d images

    Tianhao Wu, Chuanxia Zheng, Frank Guan, Andrea Vedaldi, and Tat-Jen Cham. Amodal3r: Amodal 3d reconstruction from occluded 2d images. arXiv preprint arXiv:2503.13439, 2025

  45. [45]

    Structured 3d latents for scalable and versatile 3d generation

    Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, and Jiaolong Yang. Structured 3d latents for scalable and versatile 3d generation. In CVPR, 2025

  46. [46]

    Pandaset: Advanced sensor suite dataset for autonomous driving

    Pengchuan Xiao, Zhenlei Shao, Steven Hao, Zishuo Zhang, Xiaolin Chai, Judy Jiao, Zesong Li, Jian Wu, Kai Sun, Kun Jiang, Yunlong Wang, and Diange Yang. Pandaset: Advanced sensor suite dataset for autonomous driving. In ITSC, 2021

  47. [47]

    Street gaussians for modeling dynamic urban scenes

    Yunzhi Yan, Haotong Lin, Chenxu Zhou, Weijie Wang, Haiyang Sun, Kun Zhan, Xianpeng Lang, Xiaowei Zhou, and Sida Peng. Street gaussians for modeling dynamic urban scenes. In ECCV, 2024

  48. [48]

    EmerneRF: Emergent spatial-temporal scene decomposition via self-supervision

    Jiawei Yang, Boris Ivanovic, Or Litany, Xinshuo Weng, Seung Wook Kim, Boyi Li, Tong Che, Danfei Xu, Sanja Fidler, Marco Pavone, and Yue Wang. EmerneRF: Emergent spatial-temporal scene decomposition via self-supervision. In ICLR, 2024

  49. [49]

    Unisim: A neural closed-loop sensor simulator

    Ze Yang, Yun Chen, Jingkang Wang, Sivabalan Manivasagam, Wei-Chiu Ma, Anqi Joyce Yang, and Raquel Urtasun. Unisim: A neural closed-loop sensor simulator. In CVPR, 2023

  50. [50]

    Scaling in-the-wild training for diffusion-based illumination harmonization and editing by imposing consistent light transport

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Scaling in-the-wild training for diffusion-based illumination harmonization and editing by imposing consistent light transport. In ICLR, 2025

  51. [51]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018

  52. [52]

    Make the cars in the image realistic with shadows and good lighting

    Xiaoyu Zhou, Zhiwei Lin, Xiaojun Shan, Yongtao Wang, Deqing Sun, and Ming-Hsuan Yang. Driving- gaussian: Composite gaussian splatting for surrounding dynamic autonomous driving scenes. In CVPR, 2024. 12 Supplementary material A Implementation details A.1 Model architecture As described in Sec. 3.2, R3D2 builds upon the work of [ 24], with key differences ...