Recognition: no theorem link
HorizonWeaver: Generalizable Multi-Level Semantic Editing for Driving Scenes
Pith reviewed 2026-05-10 19:35 UTC · model grok-4.3
The pith
HorizonWeaver enables photorealistic editing of dense driving scenes from language instructions at multiple levels of granularity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HorizonWeaver tackles multi-level granularity, rich semantics, and domain shifts in driving scene editing by generating a large paired real/synthetic dataset from driving sources, introducing language-guided masks that incorporate semantic information for precise control, and applying joint losses that enforce content preservation alongside instruction alignment. The result is a 255K-image collection spanning 13 editing categories and a method that improves on prior editors in standard image metrics and downstream task accuracy.
What carries the argument
Language-guided masks enriched with semantic prompts direct fine-grained edits, while joint content-preservation and instruction-alignment losses maintain scene coherence.
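A minimal sketch of what such a joint objective could look like follows; the weighting terms and the specific loss choices (pixel reconstruction outside the language-guided mask, CLIP-style alignment between the edited image and the instruction embedding) are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def joint_editing_loss(edited, source, edit_mask, instr_emb, image_encoder,
                       lambda_content=1.0, lambda_instr=0.5):
    """Hypothetical joint objective: preserve content outside the language-guided
    mask, align the edited region with the instruction embedding inside it."""
    # Content preservation: pixels outside the edit mask should match the source.
    keep = 1.0 - edit_mask                      # (B, 1, H, W); 1 = region to preserve
    content_loss = F.l1_loss(edited * keep, source * keep)

    # Instruction alignment: the edited image's embedding should be close to the
    # instruction (text) embedding, here via cosine similarity of CLIP-style features.
    img_emb = image_encoder(edited)             # (B, D)
    instr_loss = 1.0 - F.cosine_similarity(img_emb, instr_emb, dim=-1).mean()

    return lambda_content * content_loss + lambda_instr * instr_loss
```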
If this is right
- Edits remain coherent at both individual object and full scene scales inside crowded traffic environments.
- Performance improves on image-similarity measures such as L1 distance, CLIP alignment, and DINO features relative to earlier editors (a metric sketch follows this list).
- Downstream bird's-eye-view segmentation accuracy rises by a substantial margin on edited data.
- User studies show markedly higher preference for the resulting images over those from prior approaches.
- The method supports generation of controllable scenes for safety validation beyond what real recordings provide.
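For concreteness, here is a hedged sketch of how such similarity scores are typically computed between a source image and its edit. Papers differ on whether CLIP similarity is measured against the source image or the target caption; this sketch uses image-to-image similarity, and the encoder handles clip_encode and dino_encode stand in for whichever checkpoints the paper actually uses.

```python
import torch
import torch.nn.functional as F

def editing_similarity_metrics(source, edited, clip_encode, dino_encode):
    """Typical image-similarity metrics for instruction-based editing.
    source, edited: (B, 3, H, W) tensors in [0, 1]; the two encoders return
    (B, D) feature vectors and are assumed, not taken from the paper."""
    # L1: mean absolute pixel difference (lower = more of the scene preserved).
    l1 = (source - edited).abs().mean().item()

    # CLIP / DINO: cosine similarity between image embeddings
    # (higher = the edit stays semantically / structurally close to the source).
    clip_sim = F.cosine_similarity(clip_encode(source), clip_encode(edited), dim=-1).mean().item()
    dino_sim = F.cosine_similarity(dino_encode(source), dino_encode(edited), dim=-1).mean().item()

    return {"L1": l1, "CLIP": clip_sim, "DINO": dino_sim}
```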
Where Pith is reading between the lines
- The same pairing and masking strategy could be adapted to generate synthetic training data for other perception tasks such as object detection or lane following.
- Temporal extensions might add frame-to-frame consistency so the technique applies to video sequences of driving scenes.
- The emphasis on domain-shift handling suggests the framework could transfer to editing other dense real-world imagery such as pedestrian crowds or construction sites.
Load-bearing premise
A paired real/synthetic dataset built from multiple driving sources together with language-guided masks and joint losses can produce coherent multi-level edits that generalize to new climates, layouts, and traffic without major artifacts or overfitting.
What would settle it
Evaluating the edited outputs on a held-out collection of driving scenes from previously unseen regions or weather conditions and checking whether photorealism, instruction match, and any downstream segmentation gains hold; a clear drop would indicate the central claim does not generalize.
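One concrete form this check could take is a leave-one-source-out protocol: train the editor on two of the driving datasets and score the held-out third with the same metrics. The sketch below only illustrates the protocol; the helper names (train_editor, evaluate_editor) and the returned metric set are assumptions, not the paper's interface.

```python
# Hypothetical leave-one-source-out generalization check.
SOURCES = ["boreas", "nuscenes", "argoverse2"]

def leave_one_source_out(pairs_by_source, train_editor, evaluate_editor):
    """pairs_by_source maps dataset name -> list of (source_img, instruction, edited_img)."""
    results = {}
    for held_out in SOURCES:
        train_pairs = [p for name in SOURCES if name != held_out
                       for p in pairs_by_source[name]]
        editor = train_editor(train_pairs)                      # assumed training entry point
        results[held_out] = evaluate_editor(editor, pairs_by_source[held_out])
    return results  # e.g. {"boreas": {"L1": ..., "CLIP": ..., "BEV_IoU": ...}, ...}
```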
Original abstract
Ensuring safety in autonomous driving requires scalable generation of realistic, controllable driving scenes beyond what real-world testing provides. Yet existing instruction guided image editors, trained on object-centric or artistic data, struggle with dense, safety-critical driving layouts. We propose HorizonWeaver, which tackles three fundamental challenges in driving scene editing: (1) multi-level granularity, requiring coherent object- and scene-level edits in dense environments; (2) rich high-level semantics, preserving diverse objects while following detailed instructions; and (3) ubiquitous domain shifts, handling changes in climate, layout, and traffic across unseen environments. The core of HorizonWeaver is a set of complementary contributions across data, model, and training: (1) Data: Large-scale dataset generation, where we build a paired real/synthetic dataset from Boreas, nuScenes, and Argoverse2 to improve generalization; (2) Model: Language-Guided Masks for fine-grained editing, where semantics-enriched masks and prompts enable precise, language-guided edits; and (3) Training: Content preservation and instruction alignment, where joint losses enforce scene consistency and instruction fidelity. Together, HorizonWeaver provides a scalable framework for photorealistic, instruction-driven editing of complex driving scenes, collecting 255K images across 13 editing categories and outperforming prior methods in L1, CLIP, and DINO metrics, achieving +46.4% user preference and improving BEV segmentation IoU by +33%. Project page: https://msoroco.github.io/horizonweaver/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes HorizonWeaver, a framework for photorealistic instruction-driven editing of complex driving scenes. It tackles multi-level granularity, rich semantics, and domain shifts via three contributions: a paired real/synthetic dataset of 255K images across 13 editing categories constructed from Boreas, nuScenes, and Argoverse2; language-guided masks enriched with semantics and prompts; and joint losses enforcing content preservation and instruction alignment. The method claims to outperform prior editors on L1, CLIP, and DINO metrics, achieve +46.4% user preference, and deliver +33% gains in downstream BEV segmentation IoU.
Significance. If the generalization and editing coherence claims hold, the work would be significant for autonomous driving research, offering a scalable way to generate controllable, safety-critical scene variations beyond real-world collection limits. The scale of the constructed dataset and the downstream IoU improvement on BEV segmentation are notable strengths that could support broader use in simulation and testing pipelines.
Major comments (2)
- [Abstract and Experiments] The central claim of generalization to unseen climates, layouts, and traffic (abstract) depends on the 3-source paired dataset enabling robust domain-shift handling, yet no explicit cross-dataset hold-out splits, climate-specific test sets, or quantitative domain-gap metrics (e.g., FID or distribution distances between train and test) are described. Without these, the reported L1/CLIP/DINO margins and +33% BEV IoU could be attributable to reduced distributional shift rather than the language-guided masks or joint losses.
- [Evaluation] The quantitative support for the +46.4% user preference and +33% BEV IoU gains lacks sufficient detail on baseline implementations, data splits, ablation studies isolating the joint losses and mask generation, or controls for post-hoc selection. This makes it difficult to verify that the gains are load-bearing on the proposed model components rather than dataset construction alone.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important areas for clarifying our experimental design and evaluation. We address each major comment below and will revise the manuscript to incorporate the requested details.
Point-by-point responses
-
Referee: [Abstract and Experiments] The central claim of generalization to unseen climates, layouts, and traffic (abstract) depends on the 3-source paired dataset enabling robust domain-shift handling, yet no explicit cross-dataset hold-out splits, climate-specific test sets, or quantitative domain-gap metrics (e.g., FID or distribution distances between train and test) are described. Without these, the reported L1/CLIP/DINO margins and +33% BEV IoU could be attributable to reduced distributional shift rather than the language-guided masks or joint losses.
Authors: We agree that the manuscript would benefit from explicit documentation of these elements to better isolate the contributions of our components. The paired dataset was constructed from Boreas, nuScenes, and Argoverse2 specifically to introduce diversity in climates, layouts, and traffic during training, with evaluation on held-out portions intended to test generalization. However, cross-dataset hold-out splits, climate-specific test sets, and quantitative metrics such as FID are not detailed. In the revision, we will add: (1) a clear description of the train/test splits across the three sources with no scene overlap, (2) performance breakdowns on climate-specific subsets where feasible, and (3) FID and distribution distance metrics between training and test distributions. These additions will help substantiate that the L1/CLIP/DINO and BEV IoU improvements arise from the language-guided masks and joint losses. revision: yes
-
Referee: [Evaluation] The quantitative support for the +46.4% user preference and +33% BEV IoU gains lacks sufficient detail on baseline implementations, data splits, ablation studies isolating the joint losses and mask generation, or controls for post-hoc selection. This makes it difficult to verify that the gains are load-bearing on the proposed model components rather than dataset construction alone.
Authors: We acknowledge that greater transparency is required to confirm the load-bearing role of our proposed components. The reported gains were measured against adapted prior editors on our 255K-image dataset using the described metrics and user study protocol. In the revised manuscript and supplementary material, we will provide: (1) detailed specifications of baseline implementations and any adaptations to driving scenes, (2) exact data splits and sample counts for each quantitative result, (3) full ablation studies isolating the joint losses and language-guided mask generation, and (4) controls such as results on randomly sampled (non-cherry-picked) outputs to address post-hoc selection. These changes will enable verification that the improvements are attributable to the model rather than dataset construction. revision: yes
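The FID-based domain-gap measurement promised in the first response could be implemented roughly as below. This is a hedged sketch using the torchmetrics FrechetInceptionDistance class; the loader setup and feature dimension are assumptions rather than details from the paper.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

def train_test_fid(train_loader, test_loader, device="cuda"):
    """Estimate the domain gap between training and held-out test images.
    Loaders are assumed to yield uint8 image batches of shape (B, 3, H, W)."""
    fid = FrechetInceptionDistance(feature=2048, normalize=False).to(device)
    for imgs in train_loader:
        fid.update(imgs.to(device), real=True)    # training images as the reference set
    for imgs in test_loader:
        fid.update(imgs.to(device), real=False)   # held-out images as the comparison set
    return fid.compute().item()                   # larger value = larger distribution shift
```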
Circularity Check
No significant circularity in HorizonWeaver derivation
Full rationale
The paper presents an empirical method with three contributions: constructing a paired real/synthetic dataset from public sources (Boreas, nuScenes, Argoverse2), language-guided masks for editing, and joint losses for consistency. Results are reported on external metrics (L1, CLIP, DINO) plus user study and BEV IoU on a newly collected 255K-image dataset across 13 categories. No equations, predictions, or claims reduce by construction to fitted parameters, self-definitions, or self-citation chains; all load-bearing steps rely on independent benchmarks and hold-out evaluations outside the training process.
Axiom & Free-Parameter Ledger
Free parameters (1)
- Hyperparameters for the joint losses and mask generation
Axioms (2)
- Domain assumption: Language-enriched masks and prompts can enable precise, coherent edits at multiple granularities in dense scenes.
- Domain assumption: Joint losses can simultaneously enforce scene consistency and instruction fidelity without trade-offs that degrade photorealism.
Reference graph
Works this paper leans on
- [1] Hassan Abu Alhaija, Jose Alvarez, Maciej Bala, Tiffany Cai, Tianshi Cao, Liz Cha, Joshua Chen, Mike Chen, Francesco Ferroni, Sanja Fidler, et al. Cosmos-Transfer1: Conditional world generation with adaptive multimodal control. arXiv preprint arXiv:2503.14492, 2025.
- [2] James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra, Prafulla Dhariwal, Casey Chu, Yunxin Jiao, and Aditya Ramesh. Improving image generation with better captions.
- [3] Tim Brooks, Aleksander Holynski, and Alexei A. Efros. InstructPix2Pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023.
- [4] Tim Brooks, Aleksander Holynski, and Alexei A. Efros. InstructPix2Pix: Learning to follow image editing instructions.
- [5] Keenan Burnett, David J. Yoon, Yuchen Wu, Andrew Z. Li, Haowei Zhang, Shichen Lu, Jingxing Qian, Wei-Kang Tseng, Andrew Lambert, Keith Y. K. Leung, Angela P. Schoellig, and Timothy D. Barfoot. Boreas: A multi-season autonomous driving dataset. The International Journal of Robotics Research, 42(1-2):33–42, 2023.
- [6] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
- [7] Holger Caesar, Juraj Kabzan, Kok Seang Tan, Whye Kit Fong, Eric Wolff, Alex Lang, Luke Fletcher, Oscar Beijbom, and Sammy Omari. nuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles. In CVPR ADP3 Workshop, 2021.
- [8] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021.
- [9] Wenhu Chen, Hexiang Hu, Yandong Li, Nataniel Ruiz, Xuhui Jia, Ming-Wei Chang, and William W. Cohen. Subject-driven text-to-image generation via apprenticeship learning, 2023.
- [10] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024.
- [11] Ziyu Chen, Jiawei Yang, Jiahui Huang, Riccardo de Lutio, Janick Martinez Esturo, Boris Ivanovic, Or Litany, Zan Gojcic, Sanja Fidler, Marco Pavone, et al. OmniRe: Omni urban scene reconstruction. arXiv preprint arXiv:2408.16760, 2024.
- [12] Katherine Crowson, Stella Biderman, Daniel Kornis, Dashiell Stander, Eric Hallahan, Louis Castricato, and Edward Raff. VQGAN-CLIP: Open domain image generation and editing with natural language guidance. In Computer Vision - ECCV 2022, Part XXXVII, pages 88–105. Springer, 2022.
- [13] Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan. Emerging properties in unified multimodal pretraining, 2025.
- [14] Ruiyuan Gao, Kai Chen, Enze Xie, Lanqing Hong, Zhenguo Li, Dit-Yan Yeung, and Qiang Xu. MagicDrive: Street view generation with diverse 3D geometry control. arXiv preprint arXiv:2310.02601, 2023.
- [15] Ruiyuan Gao, Kai Chen, Bo Xiao, Lanqing Hong, Zhenguo Li, and Qiang Xu. MagicDriveDiT: High-resolution long video generation for autonomous driving with adaptive control. arXiv preprint arXiv:2411.13807, 2024.
- [16] Zhi Gao, Yuntao Du, Xintong Zhang, Xiaojian Ma, Wenjuan Han, Song-Chun Zhu, and Qing Li. CLOVA: A closed-loop visual assistant with tool usage and update. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13258–13268, 2024.
- [17] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. CoRR, abs/2208.01626, 2022.
- [18] Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3D v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
- [19] Mude Hui, Siwei Yang, Bingchen Zhao, Yichun Shi, Heng Wang, Peng Wang, Yuyin Zhou, and Cihang Xie. HQ-Edit: A high-quality dataset for instruction-based image editing. arXiv preprint arXiv:2404.09990, 2024.
- [20] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. CoRR, abs/2210.09276, 2022.
- [21] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, et al. Segment Anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023.
- [22] Max Ku, Dongfu Jiang, Cong Wei, Xiang Yue, and Wenhu Chen. VIEScore: Towards explainable metrics for conditional image synthesis evaluation, 2024.
- [23] Nupur Kumari, Sheng-Yu Wang, Nanxuan Zhao, Yotam Nitzan, Yuheng Li, Krishna Kumar Singh, Richard Zhang, Eli Shechtman, Jun-Yan Zhu, and Xun Huang. Learning an image editing model without image editing pairs, 2025.
- [24] Bohan Li, Jiazhe Guo, Hongsi Liu, Yingshuang Zou, Yikang Ding, Xiwu Chen, Hu Zhu, Feiyang Tan, Chi Zhang, Tiancai Wang, Shuchang Zhou, Li Zhang, Xiaojuan Qi, Hao Zhao, Mu Yang, Wenjun Zeng, and Xin Jin. UniScene: Unified occupancy-centric driving scene generation, 2025.
- [25] Dongxu Li, Junnan Li, and Steven C. H. Hoi. BLIP-Diffusion: Pre-trained subject representation for controllable text-to-image generation and editing, 2023.
- [26] Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Qiao Yu, and Jifeng Dai. BEVFormer: Learning bird's-eye-view representation from LiDAR-camera via spatiotemporal transformers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
- [27] Yiyuan Liang, Zhiying Yan, Liqun Chen, Jiahuan Zhou, Luxin Yan, Sheng Zhong, and Xu Zou. DriveEditor: A unified 3D information-guided framework for controllable object editing in driving scenes. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5164–5172, 2025.
- [28] Haotong Lin, Sili Chen, Junhao Liew, Donny Y. Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth Anything 3: Recovering the visual space from any views.
- [29] Xihui Liu, Zhe Lin, Jianming Zhang, Handong Zhao, Quan Tran, Xiaogang Wang, and Hongsheng Li. Open-Edit: Open-domain image manipulation with open-vocabulary instructions. In Computer Vision - ECCV 2020, Part XI, pages 89–106. Springer, 2020.
- [30] Zhijian Liu, Haotian Tang, Alexander Amini, Xingyu Yang, Huizi Mao, Daniela Rus, and Song Han. BEVFusion: Multi-task multi-sensor fusion with unified bird's-eye view representation. In IEEE International Conference on Robotics and Automation (ICRA), 2023.
- [31] Yifan Lu, Xuanchi Ren, Jiawei Yang, Tianchang Shen, Zhangjie Wu, Jun Gao, Yue Wang, Siheng Chen, Mike Chen, Sanja Fidler, et al. InfiniCube: Unbounded and controllable dynamic 3D driving scene generation with world-guided video models. arXiv preprint arXiv:2412.03934.
- [32] Jian Ma, Junhao Liang, Chen Chen, and Haonan Lu. Subject-Diffusion: Open domain personalized text-to-image generation without test-time fine-tuning. arXiv preprint arXiv:2307.11410, 2023.
- [33] Marcel Aguirre Mehlhorn, Andreas Richter, and Yuri A. W. Shardt. Ruling the operational boundaries: A survey on operational design domains of autonomous driving systems. IFAC-PapersOnLine, 56(2):2202–2213, 2023.
- [34] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2022.
- [35] Matthias Minderer, Alexey Gritsenko, and Neil Houlsby. Scaling open-vocabulary object detection, 2024.
- [36] Sicheng Mo, Ziyang Leng, Leon Liu, Weizhen Wang, Honglin He, and Bolei Zhou. Dreamland: Controllable world creation with simulator and generative models, 2025.
- [37] Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6038–6047, 2023.
- [38] Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In International Conference on Machine Learning (ICML), pages 16784–16804. PMLR, 2022.
- [39] NVIDIA: Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, Prithvijit Chattopadhyay, Mike Chen, Yongxin Chen, Yu Chen, Shuai Cheng, Yin Cui, Jenna Diamond, Yifan Ding, Jiaojiao Fan, Linxi Fan, Liang Feng, Francesco Ferroni, Sanja Fidler, Xiao Fu, Ruiyuan Gao, Yunhao Ge, ...
- [40] Xichen Pan, Li Dong, Shaohan Huang, Zhiliang Peng, Wenhu Chen, and Furu Wei. Kosmos-G: Generating images in context with multimodal large language models, 2024.
- [41] Patrick Pérez, Michel Gangnet, and Andrew Blake. Poisson image editing. ACM Transactions on Graphics, 22(3):313–318, 2003.
- [42] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In Proceedings of the 38th International Conference on Machine Learning (ICML), pages 8821–8831. PMLR, 2021.
- [43] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. CoRR, abs/2204.06125, 2022.
- [44] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10674–10685. IEEE, 2022.
- [45] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22500–22510, 2023.
- [46] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Raphael Gontijo-Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. In Advances in Neural Information Processing Systems, 2022.
- [47] Jan Philipp Schneider, Pratik Singh Bisht, Ilya Chugunov, Andreas Kolb, Michael Moeller, and Felix Heide. Neural atlas graphs for dynamic scene decomposition and editing.
- [48] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5B: An open large-scale dataset for training next generation image-text models, 2022.
- [49] Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu Edit: Precise image editing via recognition and generation tasks. arXiv preprint arXiv:2311.10089, 2023.
- [50] Yichun Shi, Peng Wang, and Weilin Huang. SeedEdit: Align image re-generation to image editing. arXiv preprint arXiv:2411.06686, 2024.
- [51] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Ettinger, Maxim Krivokon, Amy Gao, Aditya Joshi, Yu Zhang, Jonathon Shlens, Zhifeng Chen, and Dragomir Anguelov. Scalability in perception for autonomous driving: Waymo Open Dataset, 2020.
- [52] Shanlin Sun, Bingbing Zhuang, Ziyu Jiang, Buyu Liu, Xiaohui Xie, and Manmohan Chandraker. LidaRF: Delving into LiDAR for neural radiance field on street scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19563–19572, 2024.
- [53] Alexander Swerdlow, Runsheng Xu, and Bolei Zhou. Street-view image generation from a bird's-eye view layout. IEEE Robotics and Automation Letters, 2024.
- [54] Adam Tonderski, Carl Lindström, Georg Hess, William Ljungbergh, Lennart Svensson, and Christoffer Petersson. NeuRAD: Neural rendering for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14895–14904, 2024.
- [55] Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation, 2022.
- [56] Su Wang, Chitwan Saharia, Ceslee Montgomery, Jordi Pont-Tuset, Shai Noy, Stefano Pellegrini, Yasumasa Onoe, Sarah Laszlo, David J. Fleet, Radu Soricut, et al. Imagen Editor and EditBench: Advancing and evaluating text-guided image inpainting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18359–18369, 2023.
- [57] Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiagang Zhu, and Jiwen Lu. DriveDreamer: Towards real-world-driven world models for autonomous driving. arXiv preprint arXiv:2309.09777, 2023.
- [58] Yuqing Wen, Yucheng Zhao, Yingfei Liu, Fan Jia, Yanhui Wang, Chong Luo, Chi Zhang, Tiancai Wang, Xiaoyan Sun, and Xiangyu Zhang. Panacea: Panoramic and controllable video generation for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
- [59] Benjamin Wilson, William Qi, Tanmay Agarwal, John Lambert, Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Ratnesh Kumar, Andrew Hartnett, Jhony Kaesemodel Pontes, et al. Argoverse 2: Next generation datasets for self-driving perception and forecasting. arXiv preprint arXiv:2301.00493, 2023.
- [60] Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Shengming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingk... Qwen-Image technical report.
- [61] Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, Ze Liu, Ziyi Xia, Chaofan Li, Haoge Deng, Jiahao Wang, Kun Luo, Bo Zhang, Defu Lian, Xinlong Wang, Zhongyuan Wang, Tiejun Huang, and Zheng Liu. OmniGen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.1887..., 2025.
- [62] Kairui Yang, Enhui Ma, Jibin Peng, Qing Guo, Di Lin, and Kaicheng Yu. BEVControl: Accurately controlling street-view elements with multi-perspective consistency via BEV sketch layout. arXiv preprint arXiv:2308.01661, 2023.
- [63] Ze Yang, Yun Chen, Jingkang Wang, Sivabalan Manivasagam, Wei-Chiu Ma, Anqi Joyce Yang, and Raquel Urtasun. UniSim: A neural closed-loop sensor simulator. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1389–1399, 2023.
- [64] Ze Yang, Jingkang Wang, Haowei Zhang, Sivabalan Manivasagam, Yun Chen, and Raquel Urtasun. GenAssets: Generating in-the-wild 3D assets in latent space. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22392–22403, 2025.
- [65] Manyi Yao, Bingbing Zhuang, Sparsh Garg, Amit Roy-Chowdhury, Christian Shelton, Manmohan Chandraker, and Abhishek Aich. iFinder: Structured zero-shot vision-based LLM grounding for dash-cam video reasoning. Advances in Neural Information Processing Systems, 2025.
- [66] Zongsheng Yue, Kang Liao, and Chen Change Loy. Arbitrary-steps image super-resolution via diffusion inversion, 2025.
- [67] Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. MagicBrush: A manually annotated dataset for instruction-guided image editing. Advances in Neural Information Processing Systems, 36, 2024.
- [68] Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. MagicBrush: A manually annotated dataset for instruction-guided image editing, 2024.
- [69] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models, 2023.
- [70] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric, 2018.
- [71] Shu Zhang, Xinyi Yang, Yihao Feng, Can Qin, Chia-Chih Chen, Ning Yu, Zeyuan Chen, Huan Wang, Silvio Savarese, Stefano Ermon, et al. HIVE: Harnessing human feedback for instructional visual editing. arXiv preprint arXiv:2303.09618, 2023.
- [72] Haozhe Zhao, Xiaojian Ma, Liang Chen, Shuzheng Si, Rujie Wu, Kaikai An, Peiyu Yu, Minjia Zhang, Qing Li, and Baobao Chang. UltraEdit: Instruction-based fine-grained image editing at scale, 2024.
- [73] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision (ICCV), pages 2242–2251. IEEE Computer Society, 2017.
- [74] HorizonWeaver: Generalizable Multi-Level Semantic Editing for Driving Scenes, Supplementary Material.
- [75] Dataset Collection Details, 6.1 Real-World Data Pairing: given a multi-season driving dataset with repeated routes and calibrated camera poses, unpaired recordings are converted into pose-aligned image pairs using a simple geometric matching rule (a pairing sketch follows this list). Let I_source be a frame with camera pose (x_s, φ_s, θ_s, ψ_s), where x_s ∈ R^3 is the camera position and (φ ...
- [76] An image-based vision-language model (VLM) [10] is used for a global interpretation of extremely fine-grained attributes; the VLM prompt is shown in Appendix 8.1.
- [77] To estimate object distances, a metric depth estimation model, Metric3D [18], is applied to the full image, producing a depth map whose values correspond to real-world distances. Instance-level semantic decomposition: after preparing the global description, the objects present in the scene are recorded.
- [78] A 2D object detector (OWLv2 [35]) returns, for each detected object, a bounding box, a class label (from the set 'ambulance', 'bicycle', 'traffic light', 'traffic cone', 'person', 'car', 'motorcycle', 'bus', 'building', 'fire truck'), and a unique object ID.
- [79] For each object, the global depth map (from the second step above) is cropped to its bounding box and refined with a binary mask from the Segment Anything Model (SAM [21]) to exclude background pixels; the object's distance is taken as the mean depth over this masked area (a distance sketch follows this list).
- [80] The VLM [10] is invoked on each object's bounding box to extract additional attributes, such as vehicle color or traffic-light state; an example annotation is shown in Sec. 8.1. 6.3 Global Editing Details defines three categories of global scene edits that capture the dominant real-world variations in driving environments: Weather: Sunny, Cloudy, Foggy, R...
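The pose-aligned pairing rule excerpted in [75] is cut off above, so the following sketch only illustrates the general idea it describes: pair each source frame with the frame from another season whose camera position is nearest and whose heading roughly agrees, subject to thresholds. The threshold values and field names are assumptions, not the paper's.

```python
import numpy as np

def pair_frames(source_frames, target_frames, max_dist=2.0, max_yaw_deg=10.0):
    """Hypothetical pose-aligned pairing across repeated routes.
    Each frame is a dict with 'position' (3-vector, metres) and 'yaw' (degrees)."""
    pairs = []
    tgt_pos = np.array([f["position"] for f in target_frames])
    for src in source_frames:
        d = np.linalg.norm(tgt_pos - np.asarray(src["position"]), axis=1)
        j = int(np.argmin(d))
        yaw_gap = abs((src["yaw"] - target_frames[j]["yaw"] + 180.0) % 360.0 - 180.0)
        if d[j] <= max_dist and yaw_gap <= max_yaw_deg:
            pairs.append((src, target_frames[j]))  # same place, different season/conditions
    return pairs
```

The per-object distance estimate in [79] reduces to cropping the depth map to the detector's box, masking with SAM, and averaging. A minimal sketch, with the depth map and mask taken as given (the actual Metric3D and SAM calls are not shown):

```python
import numpy as np

def object_distance(depth_map, bbox, instance_mask):
    """Mean metric depth over an object's masked pixels.
    depth_map: (H, W) metric depth from a monocular depth model (e.g. Metric3D).
    bbox: (x0, y0, x1, y1) pixel coordinates from the 2D detector.
    instance_mask: (H, W) boolean foreground mask, e.g. from SAM."""
    x0, y0, x1, y1 = bbox
    depth_crop = depth_map[y0:y1, x0:x1]
    mask_crop = instance_mask[y0:y1, x0:x1]
    if not mask_crop.any():            # fall back to the whole box if the mask is empty
        return float(depth_crop.mean())
    return float(depth_crop[mask_crop].mean())   # excludes background pixels
```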