R3D2: Realistic 3D Asset Insertion via Diffusion for Autonomous Driving Simulation

Adam Tonderski; Bernardo Taveira; Chensheng Peng; Christoffer Petersson; Fredrik Kahl; Kurt Keutzer; Masayoshi Tomizuka; Michael Felsberg; Wei Zhan; Wenzhao Zheng

arxiv: 2506.07826 · v2 · submitted 2025-06-09 · 💻 cs.CV · cs.LG· cs.RO

R3D2: Realistic 3D Asset Insertion via Diffusion for Autonomous Driving Simulation

William Ljungbergh , Bernardo Taveira , Wenzhao Zheng , Adam Tonderski , Chensheng Peng , Fredrik Kahl , Christoffer Petersson , Michael Felsberg

show 3 more authors

Kurt Keutzer Masayoshi Tomizuka Wei Zhan

This is my paper

Pith reviewed 2026-05-19 10:18 UTC · model grok-4.3

classification 💻 cs.CV cs.LGcs.RO

keywords 3D asset insertiondiffusion modelautonomous driving simulationneural rendering3D Gaussian Splattingshadow and lighting generationphotorealistic scene editing

0 comments

The pith

R3D2 uses a one-step diffusion model to insert complete 3D assets into driving scenes while generating matching shadows and lighting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces R3D2 to solve the problem that neural reconstructions like 3D Gaussian Splatting bake lighting and shadows into objects, making them difficult to move or reuse across scenes. R3D2 is a lightweight diffusion model trained on a synthetic dataset where 3D assets generated from driving data are placed into neural-rendered environments so the model learns to produce realistic integration effects. This approach allows real-time insertion of full 3D objects into existing scenes for autonomous driving simulation. A reader would care because it promises to turn limited real-world recordings into many varied, photorealistic test cases without manual modeling of every lighting condition.

Core claim

R3D2 is a lightweight one-step diffusion model trained on synthetic placements of 3DGS-generated assets into neural rendering environments; once trained it generates the shadows, lighting, and other rendering effects required to insert complete reusable 3D assets into existing driving scenes in real time.

What carries the argument

R3D2, a one-step diffusion model that predicts the additional rendering effects needed to blend inserted 3D assets into a scene.

If this is right

Enables text-to-3D asset creation followed by realistic insertion into any reconstructed scene.
Supports moving the same 3D object from one driving dataset or scene to another while preserving appearance consistency.
Increases the number of controllable safety-critical test scenarios that can be generated from a fixed set of base scenes.
Reduces reliance on per-scene manual adjustment of lighting when building simulation environments for autonomous driving validation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same insertion technique could be applied to create rare-event test cases by compositing objects recorded in separate real drives.
Extending the model to handle object motion across video frames would allow simulation of dynamic interactions such as overtaking or merging.
Releasing the trained model and dataset may let other groups adapt the approach to non-driving domains like indoor robotics or urban planning.

Load-bearing premise

The lighting and shadow patterns learned from synthetically placed 3D assets in neural environments match the patterns that appear when the same assets are inserted into real captured driving scenes.

What would settle it

Insert the same 3D assets into a set of real-world driving images and measure whether the generated shadows and illumination from R3D2 match the actual captured shadows and lighting; large mismatches would show the model has not learned the real distribution.

Figures

Figures reproduced from arXiv: 2506.07826 by Adam Tonderski, Bernardo Taveira, Chensheng Peng, Christoffer Petersson, Fredrik Kahl, Kurt Keutzer, Masayoshi Tomizuka, Michael Felsberg, Wei Zhan, Wenzhao Zheng, William Ljungbergh.

**Figure 2.** Figure 2: Complete 3D asset generated from an in-the-wild object observation using Amodal3R [ [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Two image pairs from R3D3. The top row shows the original image [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Overview of the R3D2 training pipeline. Amodal3R [ [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: Image editing results when applying several SotA image-editing tools to our setting. [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Qualitative results from R3D3. Realistically added effects from R3D2 are highlighted in [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: Qualitative examples of rendering effects generated with R3D2. We show results when [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

**Figure 8.** Figure 8: Additional samples of off-the-shelf image editing models applied to our use case. [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗

**Figure 9.** Figure 9: Multi-view images of partially reconstructed dynamic objects. [PITH_FULL_IMAGE:figures/full_fig_p015_9.png] view at source ↗

**Figure 10.** Figure 10: Example of text-to-3D assets generated with TRELLIS [45]. [PITH_FULL_IMAGE:figures/full_fig_p016_10.png] view at source ↗

**Figure 11.** Figure 11: Samples of R3D2 applied on reinserted actors. [PITH_FULL_IMAGE:figures/full_fig_p017_11.png] view at source ↗

read the original abstract

Validating autonomous driving (AD) systems requires diverse and safety-critical testing, making photorealistic virtual environments essential. Traditional simulation platforms, while controllable, are resource-intensive to scale and often suffer from a domain gap with real-world data. In contrast, neural reconstruction methods like 3D Gaussian Splatting (3DGS) offer a scalable solution for creating photorealistic digital twins of real-world driving scenes. However, they struggle with dynamic object manipulation and reusability as their per-scene optimization-based methodology tends to result in incomplete object models with integrated illumination effects. This paper introduces R3D2, a lightweight, one-step diffusion model designed to overcome these limitations and enable realistic insertion of complete 3D assets into existing scenes by generating plausible rendering effects-such as shadows and consistent lighting-in real time. This is achieved by training R3D2 on a novel dataset: 3DGS object assets are generated from in-the-wild AD data using an image-conditioned 3D generative model, and then synthetically placed into neural rendering-based virtual environments, allowing R3D2 to learn realistic integration. Quantitative and qualitative evaluations demonstrate that R3D2 significantly enhances the realism of inserted assets, enabling use-cases like text-to-3D asset insertion and cross-scene/dataset object transfer, allowing for true scalability in AD validation. To promote further research in scalable and realistic AD simulation, we release our code, see https://research.zenseact.com/publications/R3D2/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

R3D2 gives a practical one-step diffusion fix for dropping 3D assets into 3DGS driving scenes with shadows and lighting, but the synthetic training loop leaves open questions about real-world transfer.

read the letter

The key point with this paper is that it gives a lightweight one-step diffusion model for inserting complete 3D assets into 3D Gaussian Splatting reconstructions of driving scenes, generating effects like shadows and lighting on the fly. This could help scale up photorealistic simulations without rebuilding everything from scratch. The work does a few things well. It tackles a real bottleneck in autonomous driving validation by making it easier to reuse captured scenes and add arbitrary objects. The training setup uses image-conditioned 3D generative models to create assets from in-the-wild data and then places them synthetically into neural rendering environments. That lets the model learn integration effects. They also show applications like text-to-3D insertion and cross-scene transfers. Releasing the code is a plus for reproducibility. It's an incremental but sensible extension of diffusion-based editing techniques to the 3DGS and AD domain. No new theory, but the combination seems targeted at the problem. The soft spot is the reliance on synthetic training data. The assets and scenes are all generated or rendered in controlled virtual setups, which might miss the messiness of real-world illumination, inter-reflections, and sensor characteristics. If the model doesn't generalize beyond that closed loop, the claimed realism for actual driving scenes could be overstated. The abstract talks about quantitative and qualitative improvements, but I'd want to see the exact metrics, how they compare to baselines, and any ablations on the domain gap before being convinced the gains are robust. This paper is for people working on simulation for autonomous driving or neural rendering pipelines. Someone looking for a practical tool to increase diversity in test scenarios might find it useful, especially since the code is available. It has enough technical grounding to deserve a serious look from referees. I would recommend sending it out for peer review to get input on the evaluation and whether the synthetic-to-real transfer holds up.

Referee Report

2 major / 1 minor

Summary. The paper introduces R3D2, a lightweight one-step diffusion model for inserting complete 3D assets into existing neural-reconstructed driving scenes while generating plausible shadows, lighting, and other rendering effects in real time. Assets are created via image-conditioned 3D generative models from in-the-wild AD data, then placed synthetically into neural rendering environments to form the training set; the model is evaluated quantitatively and qualitatively on realism improvements and demonstrated on applications including text-to-3D insertion and cross-scene transfer. Code is released.

Significance. If the central claims hold, R3D2 would offer a practical route to scalable, controllable, and photorealistic AD simulation by overcoming the static-object limitation of 3DGS reconstructions. The explicit code release is a clear strength that supports reproducibility and community follow-up on synthetic-to-real generalization.

major comments (2)

[Abstract] Abstract: the statement that 'Quantitative and qualitative evaluations demonstrate that R3D2 significantly enhances the realism of inserted assets' is not supported by any named metrics (FID, LPIPS, user-study scores, etc.), baseline methods, or ablation tables. Without these details it is impossible to judge whether the reported gains are robust or merely reflect the synthetic training distribution.
[Dataset construction and §4] Dataset construction and §4 (Experiments): the training data are generated entirely inside a closed synthetic loop (3DGS assets placed into neural rendering environments). No quantitative analysis or held-out test on real captured driving scenes with ground-truth inserted objects is described, so the domain gap in illumination, inter-reflections, and sensor noise remains unmeasured. This assumption is load-bearing for the claim that R3D2 produces realistic insertions usable for downstream AD validation.

minor comments (1)

[Abstract / Conclusion] The link to the released code is given but the manuscript does not state whether the synthetic dataset generation pipeline and evaluation scripts are included; adding this clarification would strengthen reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive feedback on our work. We have addressed each of the major comments below and outline the revisions we plan to make to strengthen the manuscript.

read point-by-point responses

Referee: [Abstract] Abstract: the statement that 'Quantitative and qualitative evaluations demonstrate that R3D2 significantly enhances the realism of inserted assets' is not supported by any named metrics (FID, LPIPS, user-study scores, etc.), baseline methods, or ablation tables. Without these details it is impossible to judge whether the reported gains are robust or merely reflect the synthetic training distribution.

Authors: We appreciate this observation. The quantitative metrics, including FID and LPIPS scores, along with comparisons to baseline methods and ablation studies, are detailed in Section 4 (Experiments) of the manuscript. To improve clarity and allow immediate assessment of our results, we will revise the abstract to specify these metrics and baselines. This change will be incorporated in the revised version. revision: yes
Referee: [Dataset construction and §4] Dataset construction and §4 (Experiments): the training data are generated entirely inside a closed synthetic loop (3DGS assets placed into neural rendering environments). No quantitative analysis or held-out test on real captured driving scenes with ground-truth inserted objects is described, so the domain gap in illumination, inter-reflections, and sensor noise remains unmeasured. This assumption is load-bearing for the claim that R3D2 produces realistic insertions usable for downstream AD validation.

Authors: We agree that the training data is constructed synthetically, as this enables the creation of a large-scale dataset with controlled placements and ground-truth rendering effects, which would be impractical to obtain from real-world captures. In the experiments, we evaluate on held-out synthetic test sets quantitatively and provide qualitative results on real scenes to show the model's ability to generalize. We recognize the importance of quantifying the domain gap and will add a new subsection in the discussion or limitations to analyze potential gaps in illumination and sensor characteristics, along with suggestions for future real-world validation studies. However, performing quantitative tests with ground-truth insertions in real scenes is beyond the current scope due to the lack of such paired real data. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained via external synthetic data pipeline

full rationale

The paper describes training a one-step diffusion model on a dataset constructed by generating 3DGS assets via an image-conditioned generative model and inserting them into neural rendering environments. This pipeline relies on external components (3DGS, image-conditioned models, neural renderers) rather than defining the target realism metric in terms of the model's own outputs. No equations, fitted parameters renamed as predictions, or self-citation chains are presented that reduce the central claim (plausible shadow/lighting generation) to the training inputs by construction. The method is evaluated for enhancements in realism and use-cases like cross-scene transfer, with the synthetic data serving as a proxy rather than a closed definitional loop. This is a standard training-on-synthetic setup without load-bearing self-referential reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The method rests on the assumption that synthetic placement of 3DGS assets into neural scenes produces training pairs whose illumination statistics match real insertions; no new physical constants or entities are introduced.

axioms (2)

domain assumption Image-conditioned 3D generative models can produce complete object assets from in-the-wild driving images without baked-in scene illumination.
Invoked when creating the training assets from real AD data.
domain assumption Neural rendering environments provide a sufficiently accurate base scene for learning insertion effects.
Used to generate the synthetic training pairs.

pith-pipeline@v0.9.0 · 5850 in / 1449 out tokens · 20799 ms · 2026-05-19T10:18:44.663487+00:00 · methodology

discussion (0)

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ChopGrad: Pixel-Wise Losses for Latent Video Diffusion via Truncated Backpropagation
cs.CV 2026-03 unverdicted novelty 7.0

ChopGrad truncates backpropagation to local frame windows in video diffusion models, reducing memory from linear in frame count to constant while enabling pixel-wise loss fine-tuning.
CityRAG: Stepping Into a City via Spatially-Grounded Video Generation
cs.CV 2026-04 unverdicted novelty 6.0

CityRAG generates minutes-long 3D-consistent videos of real-world cities by grounding outputs in geo-registered data and using temporally unaligned training to disentangle fixed scenes from transient elements like weather.
Unposed-to-3D: Learning Simulation-Ready Vehicles from Real-World Images
cs.CV 2026-04 unverdicted novelty 5.0

Unposed-to-3D learns simulation-ready 3D vehicle models from unposed real images by predicting camera parameters for photometric self-supervision, then adding scale prediction and harmonization.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · cited by 3 Pith papers · 3 internal anchors

[1]

Zenseact open dataset: A large-scale and diverse multimodal dataset for autonomous driving

Mina Alibeigi, William Ljungbergh, Adam Tonderski, Georg Hess, Adam Lilja, Carl Lindström, Daria Motorniuk, Junsheng Fu, Jenny Widahl, and Christoffer Petersson. Zenseact open dataset: A large-scale and diverse multimodal dataset for autonomous driving. In ICCV, 2023

work page 2023
[2]

Tiny autoencoder for stable diffusion

Ollin Boer Bohan. Tiny autoencoder for stable diffusion. https://github.com/madebyollin/taesd, 2023

work page 2023
[3]

Tim Brooks, Aleksander Holynski, and Alexei A. Efros. Instructpix2pix: Learning to follow image editing instructions. In CVPR, June 2023

work page 2023
[4]

nuscenes: A multimodal dataset for autonomous driving

Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In CVPR, 2020

work page 2020
[5]

8 Sana-sprint: One-step diffusion with continuous-time consis- tency distillation.arXiv preprint arXiv:2503.09641, 2025

Junsong Chen, Shuchen Xue, Yuyang Zhao, Jincheng Yu, Sayak Paul, Junyu Chen, Han Cai, Enze Xie, and Song Han. Sana-sprint: One-step diffusion with continuous-time consistency distillation. arXiv preprint arXiv:2503.09641, 2025

work page arXiv 2025
[6]

Boregowda, and R

Ankit Dhiman, Manan Shah, Rishubh Parihar, Yash Bhalgat, Lokesh R. Boregowda, and R. Venkatesh Babu. Reflecting reality: Enabling diffusion models to produce faithful mirror reflections. ArXiv, 2024

work page 2024
[7]

Carla: An open urban driving simulator

Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. Carla: An open urban driving simulator. In CoRL, 2017

work page 2017
[8]

Scaling rectified flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024

work page 2024
[9]

Splatad: Real-time lidar and camera rendering with 3d gaussian splatting for autonomous driving

Georg Hess, Carl Lindström, Maryam Fatemi, Christoffer Petersson, and Lennart Svensson. Splatad: Real-time lidar and camera rendering with 3d gaussian splatting for autonomous driving. In CVPR, 2025

work page 2025
[10]

Gans trained by a two time-scale update rule converge to a local nash equilibrium

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS, 2017

work page 2017
[11]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 2020

work page 2020
[12]

Lora: Low-rank adaptation of large language models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 2022

work page 2022
[13]

Smartedit: Exploring complex instruction-based image editing with multimodal large language models

Yuzhou Huang, Liangbin Xie, Xintao Wang, Ziyang Yuan, Xiaodong Cun, Yixiao Ge, Jiantao Zhou, Chao Dong, Rui Huang, Ruimao Zhang, et al. Smartedit: Exploring complex instruction-based image editing with multimodal large language models. In CVPR, 2024

work page 2024
[14]

3d gaussian splatting for real-time radiance field rendering

Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. TOG, 2023

work page 2023
[15]

Photorealistic object insertion with diffusion-guided inverse rendering

Ruofan Liang, Zan Gojcic, Merlin Nimier-David, David Acuna, Nandita Vijaykumar, Sanja Fidler, and Zian Wang. Photorealistic object insertion with diffusion-guided inverse rendering. In ECCV, 2024. 10

work page 2024
[16]

Are nerfs ready for autonomous driving? towards closing the real-to-simulation gap

Carl Lindström, Georg Hess, Adam Lilja, Maryam Fatemi, Lars Hammarstrand, Christoffer Petersson, and Lennart Svensson. Are nerfs ready for autonomous driving? towards closing the real-to-simulation gap. In CVPRW, 2024

work page 2024
[17]

3dgs-enhancer: Enhancing unbounded 3d gaussian splatting with view-consistent 2d diffusion priors

Xi Liu, Chaoyi Zhou, and Siyu Huang. 3dgs-enhancer: Enhancing unbounded 3d gaussian splatting with view-consistent 2d diffusion priors. In NeurIPS, 2024

work page 2024
[18]

Neuroncap: Photorealistic closed-loop safety testing for autonomous driving

William Ljungbergh, Adam Tonderski, Joakim Johnander, Holger Caesar, Kalle Åström, Michael Felsberg, and Christoffer Petersson. Neuroncap: Photorealistic closed-loop safety testing for autonomous driving. In ECCV, 2025

work page 2025
[19]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[20]

SDEdit: Guided image synthesis and editing with stochastic differential equations

Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In ICLR, 2022

work page 2022
[21]

Srinivasan, Matthew Tancik, Jonathan T

Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020

work page 2020
[22]

Image generated using chatgpt with dalle model

OpenAI. Image generated using chatgpt with dalle model. https://chat.openai.com, 2025. Accessed May 15, 2025

work page 2025
[23]

pix2gestalt: Amodal segmentation by synthesizing wholes

Ege Ozguroglu, Ruoshi Liu, Dídac Sur´s, Dian Chen, Achal Dave, Pavel Tokmakov, and Carl V ondrick. pix2gestalt: Amodal segmentation by synthesizing wholes. CVPR, 2024

work page 2024
[24]

One-step image translation with text-to-image models,

Gaurav Parmar, Taesung Park, Srinivasa Narasimhan, and Jun-Yan Zhu. One-step image translation with text-to-image models. arXiv preprint arXiv:2403.12036, 2024

work page arXiv 2024
[25]

Physically based rendering: From theory to implementa- tion

Matt Pharr, Wenzel Jakob, and Greg Humphreys. Physically based rendering: From theory to implementa- tion. MIT Press, 2023

work page 2023
[26]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[27]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[28]

Film: Frame interpolation for large motion

Fitsum Reda, Janne Kontkanen, Eric Tabellion, Deqing Sun, Caroline Pantofaru, and Brian Curless. Film: Frame interpolation for large motion. In ECCV, 2022

work page 2022
[29]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022

work page 2022
[30]

Dream- booth: Fine tuning text-to-image diffusion models for subject-driven generation

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dream- booth: Fine tuning text-to-image diffusion models for subject-driven generation. In CVPR, 2023

work page 2023
[31]

Palette: Image-to-image diffusion models

Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. In SIGGRAPH, 2022

work page 2022
[32]

Sara Mahdavi, Raphael Gontijo-Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Lit, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Raphael Gontijo-Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. In NeurIPS, 2022

work page 2022
[33]

Fast high-resolution image synthesis with latent adversarial diffusion distillation

Axel Sauer, Frederic Boesel, Tim Dockhorn, Andreas Blattmann, Patrick Esser, and Robin Rombach. Fast high-resolution image synthesis with latent adversarial diffusion distillation. In SIGGRAPH, 2024

work page 2024
[34]

Airsim: High-fidelity visual and physical simulation for autonomous vehicles

Shital Shah, Debadeepta Dey, Chris Lovett, and Ashish Kapoor. Airsim: High-fidelity visual and physical simulation for autonomous vehicles. In Field and Service Robotics: Results of the 11th International Conference, 2018

work page 2018
[35]

Gina-3d: Learning to generate implicit neural assets in the wild

Bokui Shen, Xinchen Yan, Charles R Qi, Mahyar Najibi, Boyang Deng, Leonidas Guibas, Yin Zhou, and Dragomir Anguelov. Gina-3d: Learning to generate implicit neural assets in the wild. In CVPR, 2023

work page 2023
[36]

Emu edit: Precise image editing via recognition and generation tasks

Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu edit: Precise image editing via recognition and generation tasks. In CVPR, 2024. 11

work page 2024
[37]

Scalability in perception for autonomous driving: Waymo open dataset

Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Ettinger, Maxim Krivokon, Amy Gao, Aditya Joshi, Yu Zhang, Jonathon Shlens, Zhifeng Chen, and Dragomir Anguelov. Scalability in perception...

work page 2020
[38]

Neurad: Neural rendering for autonomous driving

Adam Tonderski, Carl Lindström, Georg Hess, William Ljungbergh, Lennart Svensson, and Christoffer Petersson. Neurad: Neural rendering for autonomous driving. In CVPR, 2024

work page 2024
[39]

Suds: Scalable urban dynamic scenes

Haithem Turki, Jason Y Zhang, Francesco Ferroni, and Deva Ramanan. Suds: Scalable urban dynamic scenes. In CVPR, 2023

work page 2023
[40]

Image quality assessment: from error visibility to structural similarity

Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE TIP, 2004

work page 2004
[41]

Neural light field estimation for street scenes with differentiable virtual object insertion

Zian Wang, Wenzheng Chen, David Acuna, Jan Kautz, and Sanja Fidler. Neural light field estimation for street scenes with differentiable virtual object insertion. In ECCV, 2022

work page 2022
[42]

Objectdrop: Bootstrapping counterfactuals for photorealistic object removal and insertion

Daniel Winter, Matan Cohen, Shlomi Fruchter, Yael Pritch, Alex Rav-Acha, and Yedid Hoshen. Objectdrop: Bootstrapping counterfactuals for photorealistic object removal and insertion. In ECCV, 2024

work page 2024
[43]

Difix3d+: Improving 3d reconstructions with single-step diffusion models

Jay Zhangjie Wu, Yuxuan Zhang, Haithem Turki, Xuanchi Ren, Jun Gao, Mike Zheng Shou, Sanja Fidler, Zan Gojcic, and Huan Ling. Difix3d+: Improving 3d reconstructions with single-step diffusion models. In CVPR, 2025

work page 2025
[44]

Amodal3r: Amodal 3d reconstruction from occluded 2d images

Tianhao Wu, Chuanxia Zheng, Frank Guan, Andrea Vedaldi, and Tat-Jen Cham. Amodal3r: Amodal 3d reconstruction from occluded 2d images. arXiv preprint arXiv:2503.13439, 2025

work page arXiv 2025
[45]

Structured 3d latents for scalable and versatile 3d generation

Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, and Jiaolong Yang. Structured 3d latents for scalable and versatile 3d generation. In CVPR, 2025

work page 2025
[46]

Pandaset: Advanced sensor suite dataset for autonomous driving

Pengchuan Xiao, Zhenlei Shao, Steven Hao, Zishuo Zhang, Xiaolin Chai, Judy Jiao, Zesong Li, Jian Wu, Kai Sun, Kun Jiang, Yunlong Wang, and Diange Yang. Pandaset: Advanced sensor suite dataset for autonomous driving. In ITSC, 2021

work page 2021
[47]

Street gaussians for modeling dynamic urban scenes

Yunzhi Yan, Haotong Lin, Chenxu Zhou, Weijie Wang, Haiyang Sun, Kun Zhan, Xianpeng Lang, Xiaowei Zhou, and Sida Peng. Street gaussians for modeling dynamic urban scenes. In ECCV, 2024

work page 2024
[48]

EmerneRF: Emergent spatial-temporal scene decomposition via self-supervision

Jiawei Yang, Boris Ivanovic, Or Litany, Xinshuo Weng, Seung Wook Kim, Boyi Li, Tong Che, Danfei Xu, Sanja Fidler, Marco Pavone, and Yue Wang. EmerneRF: Emergent spatial-temporal scene decomposition via self-supervision. In ICLR, 2024

work page 2024
[49]

Unisim: A neural closed-loop sensor simulator

Ze Yang, Yun Chen, Jingkang Wang, Sivabalan Manivasagam, Wei-Chiu Ma, Anqi Joyce Yang, and Raquel Urtasun. Unisim: A neural closed-loop sensor simulator. In CVPR, 2023

work page 2023
[50]

Scaling in-the-wild training for diffusion-based illumination harmonization and editing by imposing consistent light transport

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Scaling in-the-wild training for diffusion-based illumination harmonization and editing by imposing consistent light transport. In ICLR, 2025

work page 2025
[51]

The unreasonable effectiveness of deep features as a perceptual metric

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018

work page 2018
[52]

Make the cars in the image realistic with shadows and good lighting

Xiaoyu Zhou, Zhiwei Lin, Xiaojun Shan, Yongtao Wang, Deqing Sun, and Ming-Hsuan Yang. Driving- gaussian: Composite gaussian splatting for surrounding dynamic autonomous driving scenes. In CVPR, 2024. 12 Supplementary material A Implementation details A.1 Model architecture As described in Sec. 3.2, R3D2 builds upon the work of [ 24], with key differences ...

work page 2024

[1] [1]

Zenseact open dataset: A large-scale and diverse multimodal dataset for autonomous driving

Mina Alibeigi, William Ljungbergh, Adam Tonderski, Georg Hess, Adam Lilja, Carl Lindström, Daria Motorniuk, Junsheng Fu, Jenny Widahl, and Christoffer Petersson. Zenseact open dataset: A large-scale and diverse multimodal dataset for autonomous driving. In ICCV, 2023

work page 2023

[2] [2]

Tiny autoencoder for stable diffusion

Ollin Boer Bohan. Tiny autoencoder for stable diffusion. https://github.com/madebyollin/taesd, 2023

work page 2023

[3] [3]

Tim Brooks, Aleksander Holynski, and Alexei A. Efros. Instructpix2pix: Learning to follow image editing instructions. In CVPR, June 2023

work page 2023

[4] [4]

nuscenes: A multimodal dataset for autonomous driving

Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In CVPR, 2020

work page 2020

[5] [5]

8 Sana-sprint: One-step diffusion with continuous-time consis- tency distillation.arXiv preprint arXiv:2503.09641, 2025

Junsong Chen, Shuchen Xue, Yuyang Zhao, Jincheng Yu, Sayak Paul, Junyu Chen, Han Cai, Enze Xie, and Song Han. Sana-sprint: One-step diffusion with continuous-time consistency distillation. arXiv preprint arXiv:2503.09641, 2025

work page arXiv 2025

[6] [6]

Boregowda, and R

Ankit Dhiman, Manan Shah, Rishubh Parihar, Yash Bhalgat, Lokesh R. Boregowda, and R. Venkatesh Babu. Reflecting reality: Enabling diffusion models to produce faithful mirror reflections. ArXiv, 2024

work page 2024

[7] [7]

Carla: An open urban driving simulator

Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. Carla: An open urban driving simulator. In CoRL, 2017

work page 2017

[8] [8]

Scaling rectified flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024

work page 2024

[9] [9]

Splatad: Real-time lidar and camera rendering with 3d gaussian splatting for autonomous driving

Georg Hess, Carl Lindström, Maryam Fatemi, Christoffer Petersson, and Lennart Svensson. Splatad: Real-time lidar and camera rendering with 3d gaussian splatting for autonomous driving. In CVPR, 2025

work page 2025

[10] [10]

Gans trained by a two time-scale update rule converge to a local nash equilibrium

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS, 2017

work page 2017

[11] [11]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 2020

work page 2020

[12] [12]

Lora: Low-rank adaptation of large language models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 2022

work page 2022

[13] [13]

Smartedit: Exploring complex instruction-based image editing with multimodal large language models

Yuzhou Huang, Liangbin Xie, Xintao Wang, Ziyang Yuan, Xiaodong Cun, Yixiao Ge, Jiantao Zhou, Chao Dong, Rui Huang, Ruimao Zhang, et al. Smartedit: Exploring complex instruction-based image editing with multimodal large language models. In CVPR, 2024

work page 2024

[14] [14]

3d gaussian splatting for real-time radiance field rendering

Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. TOG, 2023

work page 2023

[15] [15]

Photorealistic object insertion with diffusion-guided inverse rendering

Ruofan Liang, Zan Gojcic, Merlin Nimier-David, David Acuna, Nandita Vijaykumar, Sanja Fidler, and Zian Wang. Photorealistic object insertion with diffusion-guided inverse rendering. In ECCV, 2024. 10

work page 2024

[16] [16]

Are nerfs ready for autonomous driving? towards closing the real-to-simulation gap

Carl Lindström, Georg Hess, Adam Lilja, Maryam Fatemi, Lars Hammarstrand, Christoffer Petersson, and Lennart Svensson. Are nerfs ready for autonomous driving? towards closing the real-to-simulation gap. In CVPRW, 2024

work page 2024

[17] [17]

3dgs-enhancer: Enhancing unbounded 3d gaussian splatting with view-consistent 2d diffusion priors

Xi Liu, Chaoyi Zhou, and Siyu Huang. 3dgs-enhancer: Enhancing unbounded 3d gaussian splatting with view-consistent 2d diffusion priors. In NeurIPS, 2024

work page 2024

[18] [18]

Neuroncap: Photorealistic closed-loop safety testing for autonomous driving

William Ljungbergh, Adam Tonderski, Joakim Johnander, Holger Caesar, Kalle Åström, Michael Felsberg, and Christoffer Petersson. Neuroncap: Photorealistic closed-loop safety testing for autonomous driving. In ECCV, 2025

work page 2025

[19] [19]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[20] [20]

SDEdit: Guided image synthesis and editing with stochastic differential equations

Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In ICLR, 2022

work page 2022

[21] [21]

Srinivasan, Matthew Tancik, Jonathan T

Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020

work page 2020

[22] [22]

Image generated using chatgpt with dalle model

OpenAI. Image generated using chatgpt with dalle model. https://chat.openai.com, 2025. Accessed May 15, 2025

work page 2025

[23] [23]

pix2gestalt: Amodal segmentation by synthesizing wholes

Ege Ozguroglu, Ruoshi Liu, Dídac Sur´s, Dian Chen, Achal Dave, Pavel Tokmakov, and Carl V ondrick. pix2gestalt: Amodal segmentation by synthesizing wholes. CVPR, 2024

work page 2024

[24] [24]

One-step image translation with text-to-image models,

Gaurav Parmar, Taesung Park, Srinivasa Narasimhan, and Jun-Yan Zhu. One-step image translation with text-to-image models. arXiv preprint arXiv:2403.12036, 2024

work page arXiv 2024

[25] [25]

Physically based rendering: From theory to implementa- tion

Matt Pharr, Wenzel Jakob, and Greg Humphreys. Physically based rendering: From theory to implementa- tion. MIT Press, 2023

work page 2023

[26] [26]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[27] [27]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[28] [28]

Film: Frame interpolation for large motion

Fitsum Reda, Janne Kontkanen, Eric Tabellion, Deqing Sun, Caroline Pantofaru, and Brian Curless. Film: Frame interpolation for large motion. In ECCV, 2022

work page 2022

[29] [29]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022

work page 2022

[30] [30]

Dream- booth: Fine tuning text-to-image diffusion models for subject-driven generation

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dream- booth: Fine tuning text-to-image diffusion models for subject-driven generation. In CVPR, 2023

work page 2023

[31] [31]

Palette: Image-to-image diffusion models

Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. In SIGGRAPH, 2022

work page 2022

[32] [32]

Sara Mahdavi, Raphael Gontijo-Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Lit, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Raphael Gontijo-Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. In NeurIPS, 2022

work page 2022

[33] [33]

Fast high-resolution image synthesis with latent adversarial diffusion distillation

Axel Sauer, Frederic Boesel, Tim Dockhorn, Andreas Blattmann, Patrick Esser, and Robin Rombach. Fast high-resolution image synthesis with latent adversarial diffusion distillation. In SIGGRAPH, 2024

work page 2024

[34] [34]

Airsim: High-fidelity visual and physical simulation for autonomous vehicles

Shital Shah, Debadeepta Dey, Chris Lovett, and Ashish Kapoor. Airsim: High-fidelity visual and physical simulation for autonomous vehicles. In Field and Service Robotics: Results of the 11th International Conference, 2018

work page 2018

[35] [35]

Gina-3d: Learning to generate implicit neural assets in the wild

Bokui Shen, Xinchen Yan, Charles R Qi, Mahyar Najibi, Boyang Deng, Leonidas Guibas, Yin Zhou, and Dragomir Anguelov. Gina-3d: Learning to generate implicit neural assets in the wild. In CVPR, 2023

work page 2023

[36] [36]

Emu edit: Precise image editing via recognition and generation tasks

Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu edit: Precise image editing via recognition and generation tasks. In CVPR, 2024. 11

work page 2024

[37] [37]

Scalability in perception for autonomous driving: Waymo open dataset

Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Ettinger, Maxim Krivokon, Amy Gao, Aditya Joshi, Yu Zhang, Jonathon Shlens, Zhifeng Chen, and Dragomir Anguelov. Scalability in perception...

work page 2020

[38] [38]

Neurad: Neural rendering for autonomous driving

Adam Tonderski, Carl Lindström, Georg Hess, William Ljungbergh, Lennart Svensson, and Christoffer Petersson. Neurad: Neural rendering for autonomous driving. In CVPR, 2024

work page 2024

[39] [39]

Suds: Scalable urban dynamic scenes

Haithem Turki, Jason Y Zhang, Francesco Ferroni, and Deva Ramanan. Suds: Scalable urban dynamic scenes. In CVPR, 2023

work page 2023

[40] [40]

Image quality assessment: from error visibility to structural similarity

Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE TIP, 2004

work page 2004

[41] [41]

Neural light field estimation for street scenes with differentiable virtual object insertion

Zian Wang, Wenzheng Chen, David Acuna, Jan Kautz, and Sanja Fidler. Neural light field estimation for street scenes with differentiable virtual object insertion. In ECCV, 2022

work page 2022

[42] [42]

Objectdrop: Bootstrapping counterfactuals for photorealistic object removal and insertion

Daniel Winter, Matan Cohen, Shlomi Fruchter, Yael Pritch, Alex Rav-Acha, and Yedid Hoshen. Objectdrop: Bootstrapping counterfactuals for photorealistic object removal and insertion. In ECCV, 2024

work page 2024

[43] [43]

Difix3d+: Improving 3d reconstructions with single-step diffusion models

Jay Zhangjie Wu, Yuxuan Zhang, Haithem Turki, Xuanchi Ren, Jun Gao, Mike Zheng Shou, Sanja Fidler, Zan Gojcic, and Huan Ling. Difix3d+: Improving 3d reconstructions with single-step diffusion models. In CVPR, 2025

work page 2025

[44] [44]

Amodal3r: Amodal 3d reconstruction from occluded 2d images

Tianhao Wu, Chuanxia Zheng, Frank Guan, Andrea Vedaldi, and Tat-Jen Cham. Amodal3r: Amodal 3d reconstruction from occluded 2d images. arXiv preprint arXiv:2503.13439, 2025

work page arXiv 2025

[45] [45]

Structured 3d latents for scalable and versatile 3d generation

Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, and Jiaolong Yang. Structured 3d latents for scalable and versatile 3d generation. In CVPR, 2025

work page 2025

[46] [46]

Pandaset: Advanced sensor suite dataset for autonomous driving

Pengchuan Xiao, Zhenlei Shao, Steven Hao, Zishuo Zhang, Xiaolin Chai, Judy Jiao, Zesong Li, Jian Wu, Kai Sun, Kun Jiang, Yunlong Wang, and Diange Yang. Pandaset: Advanced sensor suite dataset for autonomous driving. In ITSC, 2021

work page 2021

[47] [47]

Street gaussians for modeling dynamic urban scenes

Yunzhi Yan, Haotong Lin, Chenxu Zhou, Weijie Wang, Haiyang Sun, Kun Zhan, Xianpeng Lang, Xiaowei Zhou, and Sida Peng. Street gaussians for modeling dynamic urban scenes. In ECCV, 2024

work page 2024

[48] [48]

EmerneRF: Emergent spatial-temporal scene decomposition via self-supervision

Jiawei Yang, Boris Ivanovic, Or Litany, Xinshuo Weng, Seung Wook Kim, Boyi Li, Tong Che, Danfei Xu, Sanja Fidler, Marco Pavone, and Yue Wang. EmerneRF: Emergent spatial-temporal scene decomposition via self-supervision. In ICLR, 2024

work page 2024

[49] [49]

Unisim: A neural closed-loop sensor simulator

Ze Yang, Yun Chen, Jingkang Wang, Sivabalan Manivasagam, Wei-Chiu Ma, Anqi Joyce Yang, and Raquel Urtasun. Unisim: A neural closed-loop sensor simulator. In CVPR, 2023

work page 2023

[50] [50]

Scaling in-the-wild training for diffusion-based illumination harmonization and editing by imposing consistent light transport

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Scaling in-the-wild training for diffusion-based illumination harmonization and editing by imposing consistent light transport. In ICLR, 2025

work page 2025

[51] [51]

The unreasonable effectiveness of deep features as a perceptual metric

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018

work page 2018

[52] [52]

Make the cars in the image realistic with shadows and good lighting

Xiaoyu Zhou, Zhiwei Lin, Xiaojun Shan, Yongtao Wang, Deqing Sun, and Ming-Hsuan Yang. Driving- gaussian: Composite gaussian splatting for surrounding dynamic autonomous driving scenes. In CVPR, 2024. 12 Supplementary material A Implementation details A.1 Model architecture As described in Sec. 3.2, R3D2 builds upon the work of [ 24], with key differences ...

work page 2024