Recognition: no theorem link
HorizonWeaver: Generalizable Multi-Level Semantic Editing for Driving Scenes
Pith reviewed 2026-05-10 19:35 UTC · model grok-4.3
The pith
HorizonWeaver enables photorealistic editing of dense driving scenes from language instructions at multiple levels of granularity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HorizonWeaver tackles multi-level granularity, rich semantics, and domain shifts in driving scene editing by generating a large paired real/synthetic dataset from driving sources, introducing language-guided masks that incorporate semantic information for precise control, and applying joint losses that enforce content preservation alongside instruction alignment. The result is a 255K-image collection spanning 13 editing categories and a method that improves on prior editors in standard image metrics and downstream task accuracy.
What carries the argument
Language-guided masks enriched with semantic prompts direct fine-grained edits, while joint content-preservation and instruction-alignment losses maintain scene coherence.
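A minimal sketch of what such a joint objective could look like follows; the weighting terms and the specific loss choices (pixel reconstruction outside the language-guided mask, CLIP-style alignment between the edited image and the instruction embedding) are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def joint_editing_loss(edited, source, edit_mask, instr_emb, image_encoder,
                       lambda_content=1.0, lambda_instr=0.5):
    """Hypothetical joint objective: preserve content outside the language-guided
    mask, align the edited region with the instruction embedding inside it."""
    # Content preservation: pixels outside the edit mask should match the source.
    keep = 1.0 - edit_mask                      # (B, 1, H, W); 1 = region to preserve
    content_loss = F.l1_loss(edited * keep, source * keep)

    # Instruction alignment: the edited image's embedding should be close to the
    # instruction (text) embedding, here via cosine similarity of CLIP-style features.
    img_emb = image_encoder(edited)             # (B, D)
    instr_loss = 1.0 - F.cosine_similarity(img_emb, instr_emb, dim=-1).mean()

    return lambda_content * content_loss + lambda_instr * instr_loss
```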
If this is right
- Edits remain coherent at both individual object and full scene scales inside crowded traffic environments.
- Performance improves on image-similarity measures such as L1 distance, CLIP alignment, and DINO features relative to earlier editors (a metric sketch follows this list).
- Downstream bird's-eye-view segmentation accuracy rises by a substantial margin on edited data.
- User studies show markedly higher preference for the resulting images over those from prior approaches.
- The method supports generation of controllable scenes for safety validation beyond what real recordings provide.
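For concreteness, here is a hedged sketch of how such similarity scores are typically computed between a source image and its edit. Papers differ on whether CLIP similarity is measured against the source image or the target caption; this sketch uses image-to-image similarity, and the encoder handles clip_encode and dino_encode stand in for whichever checkpoints the paper actually uses.

```python
import torch
import torch.nn.functional as F

def editing_similarity_metrics(source, edited, clip_encode, dino_encode):
    """Typical image-similarity metrics for instruction-based editing.
    source, edited: (B, 3, H, W) tensors in [0, 1]; the two encoders return
    (B, D) feature vectors and are assumed, not taken from the paper."""
    # L1: mean absolute pixel difference (lower = more of the scene preserved).
    l1 = (source - edited).abs().mean().item()

    # CLIP / DINO: cosine similarity between image embeddings
    # (higher = the edit stays semantically / structurally close to the source).
    clip_sim = F.cosine_similarity(clip_encode(source), clip_encode(edited), dim=-1).mean().item()
    dino_sim = F.cosine_similarity(dino_encode(source), dino_encode(edited), dim=-1).mean().item()

    return {"L1": l1, "CLIP": clip_sim, "DINO": dino_sim}
```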
Where Pith is reading between the lines
- The same pairing and masking strategy could be adapted to generate synthetic training data for other perception tasks such as object detection or lane following.
- Temporal extensions might add frame-to-frame consistency so the technique applies to video sequences of driving scenes.
- The emphasis on domain-shift handling suggests the framework could transfer to editing other dense real-world imagery such as pedestrian crowds or construction sites.
Load-bearing premise
A paired real/synthetic dataset built from multiple driving sources together with language-guided masks and joint losses can produce coherent multi-level edits that generalize to new climates, layouts, and traffic without major artifacts or overfitting.
What would settle it
Evaluating the edited outputs on a held-out collection of driving scenes from previously unseen regions or weather conditions and checking whether photorealism, instruction match, and any downstream segmentation gains hold; a clear drop would indicate the central claim does not generalize.
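One concrete form this check could take is a leave-one-source-out protocol: train the editor on two of the driving datasets and score the held-out third with the same metrics. The sketch below only illustrates the protocol; the helper names (train_editor, evaluate_editor) and the returned metric set are assumptions, not the paper's interface.

```python
# Hypothetical leave-one-source-out generalization check.
SOURCES = ["boreas", "nuscenes", "argoverse2"]

def leave_one_source_out(pairs_by_source, train_editor, evaluate_editor):
    """pairs_by_source maps dataset name -> list of (source_img, instruction, edited_img)."""
    results = {}
    for held_out in SOURCES:
        train_pairs = [p for name in SOURCES if name != held_out
                       for p in pairs_by_source[name]]
        editor = train_editor(train_pairs)                      # assumed training entry point
        results[held_out] = evaluate_editor(editor, pairs_by_source[held_out])
    return results  # e.g. {"boreas": {"L1": ..., "CLIP": ..., "BEV_IoU": ...}, ...}
```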
Original abstract
Ensuring safety in autonomous driving requires scalable generation of realistic, controllable driving scenes beyond what real-world testing provides. Yet existing instruction guided image editors, trained on object-centric or artistic data, struggle with dense, safety-critical driving layouts. We propose HorizonWeaver, which tackles three fundamental challenges in driving scene editing: (1) multi-level granularity, requiring coherent object- and scene-level edits in dense environments; (2) rich high-level semantics, preserving diverse objects while following detailed instructions; and (3) ubiquitous domain shifts, handling changes in climate, layout, and traffic across unseen environments. The core of HorizonWeaver is a set of complementary contributions across data, model, and training: (1) Data: Large-scale dataset generation, where we build a paired real/synthetic dataset from Boreas, nuScenes, and Argoverse2 to improve generalization; (2) Model: Language-Guided Masks for fine-grained editing, where semantics-enriched masks and prompts enable precise, language-guided edits; and (3) Training: Content preservation and instruction alignment, where joint losses enforce scene consistency and instruction fidelity. Together, HorizonWeaver provides a scalable framework for photorealistic, instruction-driven editing of complex driving scenes, collecting 255K images across 13 editing categories and outperforming prior methods in L1, CLIP, and DINO metrics, achieving +46.4% user preference and improving BEV segmentation IoU by +33%. Project page: https://msoroco.github.io/horizonweaver/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes HorizonWeaver, a framework for photorealistic instruction-driven editing of complex driving scenes. It tackles multi-level granularity, rich semantics, and domain shifts via three contributions: a paired real/synthetic dataset of 255K images across 13 editing categories constructed from Boreas, nuScenes, and Argoverse2; language-guided masks enriched with semantics and prompts; and joint losses enforcing content preservation and instruction alignment. The method claims to outperform prior editors on L1, CLIP, and DINO metrics, achieve +46.4% user preference, and deliver +33% gains in downstream BEV segmentation IoU.
Significance. If the generalization and editing coherence claims hold, the work would be significant for autonomous driving research, offering a scalable way to generate controllable, safety-critical scene variations beyond real-world collection limits. The scale of the constructed dataset and the downstream IoU improvement on BEV segmentation are notable strengths that could support broader use in simulation and testing pipelines.
Major comments (2)
- [Abstract and Experiments] The central claim of generalization to unseen climates, layouts, and traffic (abstract) depends on the 3-source paired dataset enabling robust domain-shift handling, yet no explicit cross-dataset hold-out splits, climate-specific test sets, or quantitative domain-gap metrics (e.g., FID or distribution distances between train and test) are described. Without these, the reported L1/CLIP/DINO margins and +33% BEV IoU could be attributable to reduced distributional shift rather than the language-guided masks or joint losses.
- [Evaluation] The quantitative support for the +46.4% user preference and +33% BEV IoU gains lacks sufficient detail on baseline implementations, data splits, ablation studies isolating the joint losses and mask generation, or controls for post-hoc selection. This makes it difficult to verify that the gains are load-bearing on the proposed model components rather than dataset construction alone.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important areas for clarifying our experimental design and evaluation. We address each major comment below and will revise the manuscript to incorporate the requested details.
Point-by-point responses
-
Referee: [Abstract and Experiments] The central claim of generalization to unseen climates, layouts, and traffic (abstract) depends on the 3-source paired dataset enabling robust domain-shift handling, yet no explicit cross-dataset hold-out splits, climate-specific test sets, or quantitative domain-gap metrics (e.g., FID or distribution distances between train and test) are described. Without these, the reported L1/CLIP/DINO margins and +33% BEV IoU could be attributable to reduced distributional shift rather than the language-guided masks or joint losses.
Authors: We agree that the manuscript would benefit from explicit documentation of these elements to better isolate the contributions of our components. The paired dataset was constructed from Boreas, nuScenes, and Argoverse2 specifically to introduce diversity in climates, layouts, and traffic during training, with evaluation on held-out portions intended to test generalization. However, cross-dataset hold-out splits, climate-specific test sets, and quantitative metrics such as FID are not detailed. In the revision, we will add: (1) a clear description of the train/test splits across the three sources with no scene overlap, (2) performance breakdowns on climate-specific subsets where feasible, and (3) FID and distribution distance metrics between training and test distributions. These additions will help substantiate that the L1/CLIP/DINO and BEV IoU improvements arise from the language-guided masks and joint losses. revision: yes
-
Referee: [Evaluation] The quantitative support for the +46.4% user preference and +33% BEV IoU gains lacks sufficient detail on baseline implementations, data splits, ablation studies isolating the joint losses and mask generation, or controls for post-hoc selection. This makes it difficult to verify that the gains are load-bearing on the proposed model components rather than dataset construction alone.
Authors: We acknowledge that greater transparency is required to confirm the load-bearing role of our proposed components. The reported gains were measured against adapted prior editors on our 255K-image dataset using the described metrics and user study protocol. In the revised manuscript and supplementary material, we will provide: (1) detailed specifications of baseline implementations and any adaptations to driving scenes, (2) exact data splits and sample counts for each quantitative result, (3) full ablation studies isolating the joint losses and language-guided mask generation, and (4) controls such as results on randomly sampled (non-cherry-picked) outputs to address post-hoc selection. These changes will enable verification that the improvements are attributable to the model rather than dataset construction. revision: yes
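The FID-based domain-gap measurement promised in the first response could be implemented roughly as below. This is a hedged sketch using the torchmetrics FrechetInceptionDistance class; the loader setup and feature dimension are assumptions rather than details from the paper.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

def train_test_fid(train_loader, test_loader, device="cuda"):
    """Estimate the domain gap between training and held-out test images.
    Loaders are assumed to yield uint8 image batches of shape (B, 3, H, W)."""
    fid = FrechetInceptionDistance(feature=2048, normalize=False).to(device)
    for imgs in train_loader:
        fid.update(imgs.to(device), real=True)    # training images as the reference set
    for imgs in test_loader:
        fid.update(imgs.to(device), real=False)   # held-out images as the comparison set
    return fid.compute().item()                   # larger value = larger distribution shift
```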
Circularity Check
No significant circularity in HorizonWeaver derivation
Full rationale
The paper presents an empirical method with three contributions: constructing a paired real/synthetic dataset from public sources (Boreas, nuScenes, Argoverse2), language-guided masks for editing, and joint losses for consistency. Results are reported on external metrics (L1, CLIP, DINO) plus user study and BEV IoU on a newly collected 255K-image dataset across 13 categories. No equations, predictions, or claims reduce by construction to fitted parameters, self-definitions, or self-citation chains; all load-bearing steps rely on independent benchmarks and hold-out evaluations outside the training process.
Axiom & Free-Parameter Ledger
Free parameters (1)
- Hyperparameters for the joint losses and mask generation
Axioms (2)
- Domain assumption: Language-enriched masks and prompts can enable precise, coherent edits at multiple granularities in dense scenes.
- Domain assumption: Joint losses can simultaneously enforce scene consistency and instruction fidelity without trade-offs that degrade photorealism.
Reference graph
Works this paper leans on
- [1] Hassan Abu Alhaija, Jose Alvarez, Maciej Bala, Tiffany Cai, Tianshi Cao, Liz Cha, Joshua Chen, Mike Chen, Francesco Ferroni, Sanja Fidler, et al. Cosmos-Transfer1: Conditional world generation with adaptive multimodal control. arXiv preprint arXiv:2503.14492, 2025.
- [2] James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra, Prafulla Dhariwal, Casey Chu, Yunxin Jiao, and Aditya Ramesh. Improving image generation with better captions.
- [3] Tim Brooks, Aleksander Holynski, and Alexei A. Efros. InstructPix2Pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023.
- [4] Tim Brooks, Aleksander Holynski, and Alexei A. Efros. InstructPix2Pix: Learning to follow image editing instructions.
- [5] Keenan Burnett, David J. Yoon, Yuchen Wu, Andrew Z. Li, Haowei Zhang, Shichen Lu, Jingxing Qian, Wei-Kang Tseng, Andrew Lambert, Keith Y. K. Leung, Angela P. Schoellig, and Timothy D. Barfoot. Boreas: A multi-season autonomous driving dataset. The International Journal of Robotics Research, 42(1-2):33–42, 2023.
- [6] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
- [7] Holger Caesar, Juraj Kabzan, Kok Seang Tan, Whye Kit Fong, Eric Wolff, Alex Lang, Luke Fletcher, Oscar Beijbom, and Sammy Omari. nuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles. In CVPR ADP3 Workshop, 2021.
- [8] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021.
- [9] Wenhu Chen, Hexiang Hu, Yandong Li, Nataniel Ruiz, Xuhui Jia, Ming-Wei Chang, and William W. Cohen. Subject-driven text-to-image generation via apprenticeship learning, 2023.
- [10] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024.
- [11] Ziyu Chen, Jiawei Yang, Jiahui Huang, Riccardo de Lutio, Janick Martinez Esturo, Boris Ivanovic, Or Litany, Zan Gojcic, Sanja Fidler, Marco Pavone, et al. OmniRe: Omni urban scene reconstruction. arXiv preprint arXiv:2408.16760, 2024.
- [12] Katherine Crowson, Stella Biderman, Daniel Kornis, Dashiell Stander, Eric Hallahan, Louis Castricato, and Edward Raff. VQGAN-CLIP: Open domain image generation and editing with natural language guidance. In Computer Vision - ECCV 2022, Part XXXVII, pages 88–105. Springer, 2022.
- [13] Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan. Emerging properties in unified multimodal pretraining, 2025.
- [14] Ruiyuan Gao, Kai Chen, Enze Xie, Lanqing Hong, Zhenguo Li, Dit-Yan Yeung, and Qiang Xu. MagicDrive: Street view generation with diverse 3D geometry control. arXiv preprint arXiv:2310.02601, 2023.
- [15] Ruiyuan Gao, Kai Chen, Bo Xiao, Lanqing Hong, Zhenguo Li, and Qiang Xu. MagicDriveDiT: High-resolution long video generation for autonomous driving with adaptive control. arXiv preprint arXiv:2411.13807, 2024.
- [16] Zhi Gao, Yuntao Du, Xintong Zhang, Xiaojian Ma, Wenjuan Han, Song-Chun Zhu, and Qing Li. CLOVA: A closed-loop visual assistant with tool usage and update. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13258–13268, 2024.
- [17] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. CoRR, abs/2208.01626, 2022.
- [18] Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3D v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
- [19] Mude Hui, Siwei Yang, Bingchen Zhao, Yichun Shi, Heng Wang, Peng Wang, Yuyin Zhou, and Cihang Xie. HQ-Edit: A high-quality dataset for instruction-based image editing. arXiv preprint arXiv:2404.09990, 2024.
- [20] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. CoRR, abs/2210.09276, 2022.
- [21] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, et al. Segment Anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023.
- [22] Max Ku, Dongfu Jiang, Cong Wei, Xiang Yue, and Wenhu Chen. VIEScore: Towards explainable metrics for conditional image synthesis evaluation, 2024.
- [23] Nupur Kumari, Sheng-Yu Wang, Nanxuan Zhao, Yotam Nitzan, Yuheng Li, Krishna Kumar Singh, Richard Zhang, Eli Shechtman, Jun-Yan Zhu, and Xun Huang. Learning an image editing model without image editing pairs, 2025.
- [24] Bohan Li, Jiazhe Guo, Hongsi Liu, Yingshuang Zou, Yikang Ding, Xiwu Chen, Hu Zhu, Feiyang Tan, Chi Zhang, Tiancai Wang, Shuchang Zhou, Li Zhang, Xiaojuan Qi, Hao Zhao, Mu Yang, Wenjun Zeng, and Xin Jin. UniScene: Unified occupancy-centric driving scene generation, 2025.
- [25] Dongxu Li, Junnan Li, and Steven C. H. Hoi. BLIP-Diffusion: Pre-trained subject representation for controllable text-to-image generation and editing, 2023.
- [26] Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Qiao Yu, and Jifeng Dai. BEVFormer: Learning bird's-eye-view representation from LiDAR-camera via spatiotemporal transformers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
- [27] Yiyuan Liang, Zhiying Yan, Liqun Chen, Jiahuan Zhou, Luxin Yan, Sheng Zhong, and Xu Zou. DriveEditor: A unified 3D information-guided framework for controllable object editing in driving scenes. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5164–5172, 2025.
- [28] Haotong Lin, Sili Chen, Junhao Liew, Donny Y. Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth Anything 3: Recovering the visual space from any views.
- [29] Xihui Liu, Zhe Lin, Jianming Zhang, Handong Zhao, Quan Tran, Xiaogang Wang, and Hongsheng Li. Open-Edit: Open-domain image manipulation with open-vocabulary instructions. In Computer Vision - ECCV 2020, Part XI, pages 89–106. Springer, 2020.
- [30] Zhijian Liu, Haotian Tang, Alexander Amini, Xingyu Yang, Huizi Mao, Daniela Rus, and Song Han. BEVFusion: Multi-task multi-sensor fusion with unified bird's-eye view representation. In IEEE International Conference on Robotics and Automation (ICRA), 2023.
- [31] Yifan Lu, Xuanchi Ren, Jiawei Yang, Tianchang Shen, Zhangjie Wu, Jun Gao, Yue Wang, Siheng Chen, Mike Chen, Sanja Fidler, et al. InfiniCube: Unbounded and controllable dynamic 3D driving scene generation with world-guided video models. arXiv preprint arXiv:2412.03934.
- [32] Jian Ma, Junhao Liang, Chen Chen, and Haonan Lu. Subject-Diffusion: Open domain personalized text-to-image generation without test-time fine-tuning. arXiv preprint arXiv:2307.11410, 2023.
- [33] Marcel Aguirre Mehlhorn, Andreas Richter, and Yuri A. W. Shardt. Ruling the operational boundaries: A survey on operational design domains of autonomous driving systems. IFAC-PapersOnLine, 56(2):2202–2213, 2023.
- [34] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2022.
- [35] Matthias Minderer, Alexey Gritsenko, and Neil Houlsby. Scaling open-vocabulary object detection, 2024.
- [36] Sicheng Mo, Ziyang Leng, Leon Liu, Weizhen Wang, Honglin He, and Bolei Zhou. Dreamland: Controllable world creation with simulator and generative models, 2025.
- [37] Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6038–6047, 2023.
- [38] Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In International Conference on Machine Learning (ICML), pages 16784–16804. PMLR, 2022.
- [39] NVIDIA: Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, Prithvijit Chattopadhyay, Mike Chen, Yongxin Chen, Yu Chen, Shuai Cheng, Yin Cui, Jenna Diamond, Yifan Ding, Jiaojiao Fan, Linxi Fan, Liang Feng, Francesco Ferroni, Sanja Fidler, Xiao Fu, Ruiyuan Gao, Yunhao Ge, ...
- [40] Xichen Pan, Li Dong, Shaohan Huang, Zhiliang Peng, Wenhu Chen, and Furu Wei. Kosmos-G: Generating images in context with multimodal large language models, 2024.
- [41] Patrick Pérez, Michel Gangnet, and Andrew Blake. Poisson image editing. ACM Transactions on Graphics, 22(3):313–318, 2003.
- [42] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In Proceedings of the 38th International Conference on Machine Learning (ICML), pages 8821–8831. PMLR, 2021.
- [43] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. CoRR, abs/2204.06125, 2022.
- [44] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10674–10685. IEEE, 2022.
- [45] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22500–22510, 2023.
- [46] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Raphael Gontijo-Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. In Advances in Neural Information Processing Systems, 2022.
- [47] Jan Philipp Schneider, Pratik Singh Bisht, Ilya Chugunov, Andreas Kolb, Michael Moeller, and Felix Heide. Neural atlas graphs for dynamic scene decomposition and editing.
- [48] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5B: An open large-scale dataset for training next generation image-text models, 2022.
- [49] Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu Edit: Precise image editing via recognition and generation tasks. arXiv preprint arXiv:2311.10089, 2023.
- [50] Yichun Shi, Peng Wang, and Weilin Huang. SeedEdit: Align image re-generation to image editing. arXiv preprint arXiv:2411.06686, 2024.
- [51] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Ettinger, Maxim Krivokon, Amy Gao, Aditya Joshi, Yu Zhang, Jonathon Shlens, Zhifeng Chen, and Dragomir Anguelov. Scalability in perception for autonomous driving: Waymo Open Dataset, 2020.
- [52] Shanlin Sun, Bingbing Zhuang, Ziyu Jiang, Buyu Liu, Xiaohui Xie, and Manmohan Chandraker. LidaRF: Delving into LiDAR for neural radiance field on street scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19563–19572, 2024.
- [53] Alexander Swerdlow, Runsheng Xu, and Bolei Zhou. Street-view image generation from a bird's-eye view layout. IEEE Robotics and Automation Letters, 2024.
- [54] Adam Tonderski, Carl Lindström, Georg Hess, William Ljungbergh, Lennart Svensson, and Christoffer Petersson. NeuRAD: Neural rendering for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14895–14904, 2024.
- [55] Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation, 2022.
- [56] Su Wang, Chitwan Saharia, Ceslee Montgomery, Jordi Pont-Tuset, Shai Noy, Stefano Pellegrini, Yasumasa Onoe, Sarah Laszlo, David J. Fleet, Radu Soricut, et al. Imagen Editor and EditBench: Advancing and evaluating text-guided image inpainting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18359–18369, 2023.
- [57] Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiagang Zhu, and Jiwen Lu. DriveDreamer: Towards real-world-driven world models for autonomous driving. arXiv preprint arXiv:2309.09777, 2023.
- [58] Yuqing Wen, Yucheng Zhao, Yingfei Liu, Fan Jia, Yanhui Wang, Chong Luo, Chi Zhang, Tiancai Wang, Xiaoyan Sun, and Xiangyu Zhang. Panacea: Panoramic and controllable video generation for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
- [59] Benjamin Wilson, William Qi, Tanmay Agarwal, John Lambert, Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Ratnesh Kumar, Andrew Hartnett, Jhony Kaesemodel Pontes, et al. Argoverse 2: Next generation datasets for self-driving perception and forecasting. arXiv preprint arXiv:2301.00493, 2023.
- [60] Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Shengming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingk... Qwen-Image technical report.
- [61] Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, Ze Liu, Ziyi Xia, Chaofan Li, Haoge Deng, Jiahao Wang, Kun Luo, Bo Zhang, Defu Lian, Xinlong Wang, Zhongyuan Wang, Tiejun Huang, and Zheng Liu. OmniGen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.1887..., 2025.
- [62] Kairui Yang, Enhui Ma, Jibin Peng, Qing Guo, Di Lin, and Kaicheng Yu. BEVControl: Accurately controlling street-view elements with multi-perspective consistency via BEV sketch layout. arXiv preprint arXiv:2308.01661, 2023.
- [63] Ze Yang, Yun Chen, Jingkang Wang, Sivabalan Manivasagam, Wei-Chiu Ma, Anqi Joyce Yang, and Raquel Urtasun. UniSim: A neural closed-loop sensor simulator. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1389–1399, 2023.
- [64] Ze Yang, Jingkang Wang, Haowei Zhang, Sivabalan Manivasagam, Yun Chen, and Raquel Urtasun. GenAssets: Generating in-the-wild 3D assets in latent space. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22392–22403, 2025.
- [65] Manyi Yao, Bingbing Zhuang, Sparsh Garg, Amit Roy-Chowdhury, Christian Shelton, Manmohan Chandraker, and Abhishek Aich. iFinder: Structured zero-shot vision-based LLM grounding for dash-cam video reasoning. Advances in Neural Information Processing Systems, 2025.
- [66] Zongsheng Yue, Kang Liao, and Chen Change Loy. Arbitrary-steps image super-resolution via diffusion inversion, 2025.
- [67] Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. MagicBrush: A manually annotated dataset for instruction-guided image editing. Advances in Neural Information Processing Systems, 36, 2024.
- [68] Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. MagicBrush: A manually annotated dataset for instruction-guided image editing, 2024.
- [69] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models, 2023.
- [70] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric, 2018.
- [71] Shu Zhang, Xinyi Yang, Yihao Feng, Can Qin, Chia-Chih Chen, Ning Yu, Zeyuan Chen, Huan Wang, Silvio Savarese, Stefano Ermon, et al. HIVE: Harnessing human feedback for instructional visual editing. arXiv preprint arXiv:2303.09618, 2023.
- [72] Haozhe Zhao, Xiaojian Ma, Liang Chen, Shuzheng Si, Rujie Wu, Kaikai An, Peiyu Yu, Minjia Zhang, Qing Li, and Baobao Chang. UltraEdit: Instruction-based fine-grained image editing at scale, 2024.
- [73] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision (ICCV), pages 2242–2251. IEEE Computer Society, 2017.
- [74] HorizonWeaver: Generalizable Multi-Level Semantic Editing for Driving Scenes, Supplementary Material.
- [75] Dataset Collection Details, 6.1 Real-World Data Pairing: given a multi-season driving dataset with repeated routes and calibrated camera poses, unpaired recordings are converted into pose-aligned image pairs using a simple geometric matching rule (a pairing sketch follows this list). Let I_source be a frame with camera pose (x_s, φ_s, θ_s, ψ_s), where x_s ∈ R^3 is the camera position and (φ ...
- [76] An image-based vision-language model (VLM) [10] is used for a global interpretation of extremely fine-grained attributes; the VLM prompt is shown in Appendix 8.1.
- [77] To estimate object distances, a metric depth estimation model, Metric3D [18], is applied to the full image, producing a depth map whose values correspond to real-world distances. Instance-level semantic decomposition: after preparing the global description, the objects present in the scene are recorded.
- [78] A 2D object detector (OWLv2 [35]) returns, for each detected object, a bounding box, a class label (from the set 'ambulance', 'bicycle', 'traffic light', 'traffic cone', 'person', 'car', 'motorcycle', 'bus', 'building', 'fire truck'), and a unique object ID.
- [79] For each object, the global depth map (from the second step above) is cropped to its bounding box and refined with a binary mask from the Segment Anything Model (SAM [21]) to exclude background pixels; the object's distance is taken as the mean depth over this masked area (a distance sketch follows this list).
- [80] The VLM [10] is invoked on each object's bounding box to extract additional attributes, such as vehicle color or traffic-light state; an example annotation is shown in Sec. 8.1. 6.3 Global Editing Details defines three categories of global scene edits that capture the dominant real-world variations in driving environments: Weather: Sunny, Cloudy, Foggy, R...
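The pose-aligned pairing rule excerpted in [75] is cut off above, so the following sketch only illustrates the general idea it describes: pair each source frame with the frame from another season whose camera position is nearest and whose heading roughly agrees, subject to thresholds. The threshold values and field names are assumptions, not the paper's.

```python
import numpy as np

def pair_frames(source_frames, target_frames, max_dist=2.0, max_yaw_deg=10.0):
    """Hypothetical pose-aligned pairing across repeated routes.
    Each frame is a dict with 'position' (3-vector, metres) and 'yaw' (degrees)."""
    pairs = []
    tgt_pos = np.array([f["position"] for f in target_frames])
    for src in source_frames:
        d = np.linalg.norm(tgt_pos - np.asarray(src["position"]), axis=1)
        j = int(np.argmin(d))
        yaw_gap = abs((src["yaw"] - target_frames[j]["yaw"] + 180.0) % 360.0 - 180.0)
        if d[j] <= max_dist and yaw_gap <= max_yaw_deg:
            pairs.append((src, target_frames[j]))  # same place, different season/conditions
    return pairs
```

The per-object distance estimate in [79] reduces to cropping the depth map to the detector's box, masking with SAM, and averaging. A minimal sketch, with the depth map and mask taken as given (the actual Metric3D and SAM calls are not shown):

```python
import numpy as np

def object_distance(depth_map, bbox, instance_mask):
    """Mean metric depth over an object's masked pixels.
    depth_map: (H, W) metric depth from a monocular depth model (e.g. Metric3D).
    bbox: (x0, y0, x1, y1) pixel coordinates from the 2D detector.
    instance_mask: (H, W) boolean foreground mask, e.g. from SAM."""
    x0, y0, x1, y1 = bbox
    depth_crop = depth_map[y0:y1, x0:x1]
    mask_crop = instance_mask[y0:y1, x0:x1]
    if not mask_crop.any():            # fall back to the whole box if the mask is empty
        return float(depth_crop.mean())
    return float(depth_crop[mask_crop].mean())   # excludes background pixels
```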