Recognition: 1 theorem link · Lean theorem
Towards Robust Sequential Decomposition for Complex Image Editing
Pith reviewed 2026-05-12 01:50 UTC · model grok-4.3
The pith
Finetuning on synthetic decomposed sequences makes sequential editing robust for complex image instructions and transferable to real photos.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By examining single-turn and sequential paradigms inside one in-context editing setup and training on a synthetic dataset of tasks with controlled complexity, the authors found that properly designed sequential decomposition produces robust gains even as the number of operations and inter-step dependencies increase. The decomposition abilities acquired from synthetic data further transfer to real images through co-training with real-world editing examples, showing that sim-to-real generalization is feasible for complex image editing.
What carries the argument
A synthetic data pipeline that generates editing tasks of graded complexity together with their correct decomposed sequences, used to finetune models under a unified in-context framework.
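The pipeline is only summarized here; as a rough, hypothetical sketch of the idea (the operation vocabulary, dependency scheme, and function names below are invented for illustration, not taken from the paper), graded-complexity tasks can be produced by composing atomic edits and recording their decomposition:

```python
import random

# Hypothetical atomic operations; the paper's actual vocabulary may differ.
ATOMIC_OPS = ["recolor", "move", "resize", "add_object", "remove_object"]

def make_task(num_ops, dependency_prob=0.5, seed=0):
    """Sample a composite editing task of controlled complexity.

    Returns the fused single-turn instruction and its decomposed step
    sequence, with optional inter-step dependencies (a step that refers
    to the object produced or modified by an earlier step).
    """
    rng = random.Random(seed)
    steps = []
    for i in range(num_ops):
        op = rng.choice(ATOMIC_OPS)
        if steps and rng.random() < dependency_prob:
            # Inter-step dependency: this edit targets the output
            # of a previously sampled step.
            dep = rng.randrange(len(steps))
            steps.append(f"{op} the object from step {dep + 1}")
        else:
            steps.append(f"{op} object_{i}")
    composite = ", then ".join(steps)  # single-turn instruction
    return composite, steps

instruction, decomposition = make_task(num_ops=3, seed=42)
assert len(decomposition) == 3
```

The point of knobs like `num_ops` and `dependency_prob` is that instruction complexity becomes a controllable dataset parameter, which is what allows studying how gains scale with it.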
If this is right
- Sequential decomposition scales with task complexity when editing paradigms are designed to limit error accumulation.
- Decomposition skills acquired on synthetic sequences generalize to real images through co-training with real editing data.
- A single unified framework can directly compare single-turn versus sequential approaches on identical inputs.
- Complex instructions involving multiple operations or dependent steps become more reliably executable.
Where Pith is reading between the lines
- The same synthetic-to-real transfer pattern could be tested on video editing or 3D scene manipulation where sequential dependencies are stronger.
- If the pipeline truly replicates real dependencies, analogous synthetic construction might improve other generative tasks such as following detailed text prompts.
- Evaluating the method on open-ended user instructions collected from actual editing sessions would provide a stronger test of practical utility.
Load-bearing premise
The synthetic data pipeline accurately constructs editing tasks that capture real inter-step dependencies and combinatorial complexity without introducing artifacts that do not appear in actual user instructions.
What would settle it
A benchmark of real user instructions with high combinatorial complexity where the co-trained sequential model shows no improvement or clear degradation relative to single-turn editing or baseline sequential methods.
Original abstract
Recent advances in visual generative models have enabled high-fidelity image editing guided by human instructions. However, these models often struggle with complex instructions involving combinatorial editing operations or inter-step dependencies. This difficulty stems from the limitations of two canonical paradigms: (1) single-turn editing, which attempts to apply all instructed edits in one pass, often fails to parse the complex instruction accurately and causes undesired edits; and (2) sequential editing can decompose the task into simpler steps but suffers from compounding errors introduced by the sequential execution, leading to low-fidelity results. To derive a robust solution for complex image editing, we examine editing behaviors of different paradigms under a unified in-context editing framework, and study how the benefits of sequential decomposition can be balanced against its error-accumulation drawbacks. We further develop a synthetic data pipeline that constructs editing tasks of varying instruction complexity, allowing us to curate a large-scale editing dataset with high-quality decomposed sequences. By finetuning on synthetic data, we discovered that with properly designed editing paradigms, sequential decomposition yields robust improvements even as task complexity increases. Furthermore, the decomposition skills learned from synthetic tasks can transfer to real images by co-training with real-world editing data, demonstrating the promise of sim-to-real generalization for tackling complex image editing across broader domains.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript examines limitations of single-turn and sequential editing paradigms for complex image instructions in generative models. It introduces a synthetic data pipeline to construct tasks of varying combinatorial complexity with high-quality decomposed sequences, finetunes models on this data within a unified in-context framework, and claims that properly designed sequential decomposition yields robust improvements that scale with task complexity. It further claims that decomposition skills transfer to real images via co-training with real-world editing data, enabling sim-to-real generalization.
Significance. If the empirical results hold with proper validation, the work would be significant for advancing reliable complex image editing, as it directly addresses error accumulation in sequential approaches while leveraging synthetic data for scalable training. The demonstrated transfer to real domains could influence practical applications in instruction-guided generative AI.
Major comments (2)
- [Synthetic data pipeline] The synthetic data pipeline description provides no validation of fidelity to real inter-step dependencies or combinatorial complexity (e.g., no human realism ratings, distributional comparisons to real editing logs, or pipeline ablations). This is load-bearing for the central claims of robust improvements and sim-to-real transfer, as unverified artifacts could mean the gains do not generalize beyond the synthetic setting.
- [Experiments] The manuscript reports 'robust improvements' and transfer but supplies no quantitative results, error bars, specific baseline comparisons, or measurement details to support these claims. Without such evidence, the strength of the empirical findings cannot be assessed.
Minor comments (1)
- [Abstract] The abstract would benefit from including at least one key quantitative result (e.g., a performance delta or scaling trend) to better convey the findings.
Simulated Author's Rebuttal
We thank the referee for the constructive review and for highlighting areas where additional evidence would strengthen the manuscript. We address each major comment below and will incorporate the suggested validations and quantitative details in the revised version.
Point-by-point responses
Referee: [Synthetic data pipeline] The synthetic data pipeline description provides no validation of fidelity to real inter-step dependencies or combinatorial complexity (e.g., no human realism ratings, distributional comparisons to real editing logs, or pipeline ablations). This is load-bearing for the central claims of robust improvements and sim-to-real transfer, as unverified artifacts could mean the gains do not generalize beyond the synthetic setting.
Authors: We agree that explicit validation of the synthetic pipeline is necessary to support the central claims. The current manuscript describes the pipeline for generating tasks of controlled combinatorial complexity but does not report human realism ratings, distributional comparisons against real editing logs, or component ablations. In the revision we will add these elements: (1) human ratings of sequence realism and inter-step dependency fidelity on a sampled subset, (2) statistical comparisons of generated editing distributions to real-world logs, and (3) ablations isolating the effect of each pipeline stage. These additions will directly address concerns about generalization beyond the synthetic setting.
Revision: yes
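One of the promised validations, comparing synthetic and real editing distributions, could in its simplest form be a divergence between operation-type histograms. A minimal sketch (the operation names are placeholders; the paper and rebuttal do not specify which statistic would be used):

```python
from collections import Counter

def total_variation(synthetic_ops, real_ops):
    """Total-variation distance between the empirical distributions of
    edit-operation types in two logs (0 = identical, 1 = disjoint)."""
    cs, cr = Counter(synthetic_ops), Counter(real_ops)
    ns, nr = len(synthetic_ops), len(real_ops)
    support = set(cs) | set(cr)  # Counter returns 0 for missing keys
    return 0.5 * sum(abs(cs[k] / ns - cr[k] / nr) for k in support)

synthetic = ["recolor", "move", "recolor", "resize"]
real = ["recolor", "move", "remove", "resize"]
print(total_variation(synthetic, real))  # 0.25
```

A small distance would support the claim that the pipeline matches real instruction statistics; a large one would flag exactly the artifact risk the referee raises.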
Referee: [Experiments] The manuscript reports 'robust improvements' and transfer but supplies no quantitative results, error bars, specific baseline comparisons, or measurement details to support these claims. Without such evidence, the strength of the empirical findings cannot be assessed.
Authors: We acknowledge that the experiments section currently emphasizes qualitative observations of robust improvements and sim-to-real transfer without sufficient quantitative backing. To enable proper assessment, the revision will include: specific numerical metrics with error bars from multiple runs, direct comparisons against single-turn and standard sequential baselines, and precise descriptions of evaluation protocols (e.g., how instruction adherence and edit fidelity are measured). These details will be added to both the synthetic and co-training experiments.
Revision: yes
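The error bars the rebuttal promises are standard practice; a minimal sketch of reporting mean plus standard error across seeds (the scores below are placeholders, not results from the paper):

```python
import statistics

def mean_with_stderr(scores):
    """Mean and standard error of a metric across independent runs."""
    m = statistics.mean(scores)
    se = statistics.stdev(scores) / len(scores) ** 0.5
    return m, se

# Placeholder edit-fidelity scores from, say, five random seeds.
runs = [0.71, 0.74, 0.69, 0.73, 0.72]
m, se = mean_with_stderr(runs)
print(f"{m:.3f} ± {se:.3f}")  # prints "0.718 ± 0.009"
```

Reporting the same statistic for single-turn and sequential baselines on identical inputs is what would make the claimed robustness checkable.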
Circularity Check
No circularity: empirical finetuning and data construction are self-contained
full rationale
The paper presents an empirical study: it examines editing paradigms under a unified framework, builds a synthetic data pipeline to generate tasks of varying complexity, curates a dataset, finetunes models, and reports observed improvements plus sim-to-real transfer via co-training. No equations, parameter fits, or derivations are described that reduce any claimed result to its own inputs by construction. The central claims rest on experimental outcomes rather than self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations that lack independent verification. This is a standard empirical ML pipeline whose validity can be checked against external real-image benchmarks, so the derivation chain is self-contained.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Matched passage: "synthetic data pipeline... Blender... atomic editing operations... inter-step dependencies... context-guided sequential editing (CGSE)"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
[1] Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, et al. FLUX.1 Kontext: Flow matching for in-context image generation and editing in latent space. arXiv preprint, 2025.
[2] James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra, Prafulla Dhariwal, Casey Chu, Yunxin Jiao, and Aditya Ramesh. Improving image generation with better captions.
[3] Tim Brooks, Aleksander Holynski, and Alexei A. Efros. InstructPix2Pix: Learning to follow image editing instructions. In CVPR, 2023.
[4] Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xiaohu Qie, and Yinqiang Zheng. MasaCtrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. In ICCV, 2023.
[5] Mingdeng Cao, Xuaner Zhang, Yinqiang Zheng, and Zhihao Xia. Instruction-based image manipulation by watching how things move. In CVPR, 2025.
[6] Di Chang, Mingdeng Cao, Yichun Shi, Bo Liu, Shengqu Cai, Shijie Zhou, Weilin Huang, Gordon Wetzstein, Mohammad Soleymani, and Peng Wang. ByteMorph: Benchmarking instruction-guided image editing with non-rigid motions. arXiv:2506.03107, 2025.
[7] Xi Chen, Lianghua Huang, Yu Liu, Yujun Shen, Deli Zhao, and Hengshuang Zhao. AnyDoor: Zero-shot object-level image customization. In CVPR, 2024.
[8] Xi Chen, Zhifei Zhang, He Zhang, Yuqian Zhou, Soo Ye Kim, Qing Liu, Yijun Li, Jianming Zhang, Nanxuan Zhao, Yilin Wang, et al. UniReal: Universal image generation and editing via learning real-world dynamics. In CVPR, 2025.
[9] Guillaume Couairon, Jakob Verbeek, Holger Schwenk, and Matthieu Cord. DiffEdit: Diffusion-based semantic image editing with mask guidance. arXiv:2210.11427, 2022.
[10] Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining. arXiv:2505.14683, 2025.
[11] Maximilian Denninger, Dominik Winkelbauer, Martin Sundermeyer, Wout Boerdijk, Markus Knauer, Klaus H. Strobl, Matthias Humt, and Rudolph Triebel. BlenderProc2: A procedural pipeline for photorealistic rendering. Journal of Open Source Software, 8(82):4901, 2023.
[12] Zhao Dong, Ka Chen, Zhaoyang Lv, Hong-Xing Yu, Yunzhi Zhang, Cheng Zhang, Yufeng Zhu, Stephen Tian, Zhengqin Li, Geordie Moffatt, Sean Christofferson, James Fort, Xiaqing Pan, Mingfei Yan, Jiajun Wu, Carl Yuheng Ren, and Richard Newcombe. Digital Twin Catalog: A large-scale photorealistic 3D object digital twin dataset. In CVPR, 2025.
[13] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024.
[14] Tsu-Jui Fu, Wenze Hu, Xianzhi Du, William Yang Wang, Yinfei Yang, and Zhe Gan. Guiding instruction-based image editing via multimodal large language models. arXiv:2309.17102, 2023.
[15] Poly Haven. https://polyhaven.com/.
[16] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-Prompt image editing with cross attention control. arXiv:2208.01626, 2022.
[17] Yuzhou Huang, Liangbin Xie, Xintao Wang, Ziyang Yuan, Xiaodong Cun, Yixiao Ge, Jiantao Zhou, Chao Dong, Rui Huang, Ruimao Zhang, et al. SmartEdit: Exploring complex instruction-based image editing with multimodal large language models. In CVPR, 2024.
[18] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. In CVPR, 2023.
[19] Ang Li, Charles L. Wang, Kaiyu Yue, Zikui Cai, Ollie Liu, Deqing Fu, Peng Guo, Wang Bill Zhu, Vatsal Sharan, Robin Jia, Willie Neiswanger, Furong Huang, Tom Goldstein, and Micah Goldblum. Zebra-CoT: A dataset for interleaved vision language reasoning. arXiv:2507.16746, 2025.
[20] Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. arXiv:2108.01073, 2021.
[21] Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In ICML, 2022.
[22] OpenAI. Addendum to GPT-4o system card: Native image generation. 2025.
[23] Yusu Qian, Eli Bocek-Rivele, Liangchen Song, Jialing Tong, Yinfei Yang, Jiasen Lu, Wenze Hu, and Zhe Gan. Pico-Banana-400K: A large-scale dataset for text-guided image editing. 2025.
[24] Leigang Qu, Feng Cheng, Ziyan Yang, Qi Zhao, Shanchuan Lin, Yichun Shi, Yicong Li, Wenjie Wang, Tat-Seng Chua, and Lu Jiang. VINCIE: Unlocking in-context image editing from video. arXiv:2506.10941, 2025.
[25] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In ICML, 2021.
[26] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv:2204.06125, 2022.
[27] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
[28] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L. Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, Seyedeh Sara Mahdavi, Raphael Gontijo Lopes, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. arXiv:2205.11487, 2022.
[29] Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu Edit: Precise image editing via recognition and generation tasks. In CVPR, 2024.
[30] Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu Edit: Precise image editing via recognition and generation tasks. In CVPR, 2024.
[31] Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michael Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Coupr… 2025.
[32] Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative multimodal models are in-context learners. In CVPR, 2024.
[33] Google Gemini Team. Gemini 2.5 Flash Image (Nano Banana). https://gemini.google.com/.
[34] CC0 Textures. https://cc0-textures.com/.
[35] Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. In CVPR, 2023.
[36] Cong Wei, Zheyang Xiong, Weiming Ren, Xinrun Du, Ge Zhang, and Wenhu Chen. OmniEdit: Building image editing generalist models through specialist supervision. arXiv:2411.07199, 2024.
[37] Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Shengming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, et al. Qwen-Image technical report. 2025.
[38] Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Shengming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, et al. arXiv, 2025.
[39] Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, and Zheng Liu. OmniGen: Unified image generation. In CVPR, 2025.
[40] Siwei Yang, Mude Hui, Bingchen Zhao, Yuyin Zhou, Nataniel Ruiz, and Cihang Xie. Complex-Edit: CoT-like instruction generation for complexity-controllable image editing benchmark. arXiv:2504.13143, 2025.
[41] Qifan Yu, Wei Chow, Zhongqi Yue, Kaihang Pan, Yang Wu, Xiaoyang Wan, Juncheng Li, Siliang Tang, Hanwang Zhang, and Yueting Zhuang. AnyEdit: Mastering unified high-quality image editing for any idea. arXiv:2411.15738, 2024.
[42] Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. MagicBrush: A manually annotated dataset for instruction-guided image editing. arXiv:2306.10012, 2023.
[43] Haozhe Zhao, Xiaojian Ma, Liang Chen, Shuzheng Si, Rujie Wu, Kaikai An, Peiyu Yu, Minjia Zhang, Qing Li, and Baobao Chang. UltraEdit: Instruction-based fine-grained image editing at scale. arXiv:2407.05282, 2024.
[44] Zijun Zhou, Yingying Deng, Xiangyu He, Weiming Dong, and Fan Tang. Multi-turn consistent image editing. arXiv:2505.04320, 2025.
Discussion (0)