pith. sign in

arxiv: 2605.17294 · v1 · pith:NHQ6XX5Anew · submitted 2026-05-17 · 💻 cs.CV

HierEdit: Region-Aware Hierarchical Diffusion for Efficient High-Resolution Editing

Pith reviewed 2026-05-20 14:35 UTC · model grok-4.3

classification 💻 cs.CV
keywords high-resolution image editingdiffusion modelshierarchical editingregion-aware processingefficient inference4K resolutionlocal window diffusion
0
0 comments X

The pith

HierEdit refines only edited regions at high resolution by first editing a low-resolution proxy to localize changes and supply guidance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that high-resolution image editing can be made efficient and scalable by breaking the task into a coarse low-resolution edit followed by targeted refinement of only the modified areas. A low-resolution proxy first applies an off-the-shelf editor to identify changed regions and generate structural signals. These signals then steer a local-window diffusion process that updates just those patches in the original high-resolution image while holding the rest fixed as conditioning. If the approach succeeds, professional and creative workflows could edit 4K images quickly on ordinary hardware without collecting or training on special high-resolution datasets.

Core claim

The central claim is that performing an initial edit on a low-resolution proxy localizes modified regions and supplies reliable structural guidance plus intermediate denoising supervision; a hierarchical Local-Window MMDiT then refines only the edited regions inside the full-resolution canvas while reusing unaltered areas as conditioning inputs, delivering fast high-fidelity results up to 4K resolution without any specialized high-resolution training data.

What carries the argument

The Local-Window MMDiT, a hierarchical diffusion model that restricts attention and refinement to localized edited windows while conditioning on both the original high-resolution unaltered regions and low-resolution proxy signals.

If this is right

  • Editing inference becomes substantially faster because full-resolution attention is never computed over the entire image.
  • The same model trained only on commodity-resolution data can be applied directly to 4K inputs.
  • Visual quality remains competitive with existing methods on standard-resolution benchmarks.
  • The framework extends without modification to images larger than 4K by scaling the local windows.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same proxy-plus-local-refinement pattern could be tested on video clips to enforce temporal consistency by propagating the low-resolution edit across frames.
  • Because the high-resolution stage only needs the low-resolution output as conditioning, the method might combine with any future low-resolution editor without retraining the local model.
  • Similar hierarchical localization could reduce memory and compute in related tasks such as high-resolution image synthesis or texture generation.

Load-bearing premise

The low-resolution proxy must correctly identify which regions changed and must supply accurate structural guidance plus reliable intermediate denoising steps for the high-resolution local refinements.

What would settle it

Apply HierEdit to a 4K test image whose edit mask is known in advance, then compare the output to a full-canvas high-resolution baseline: visible artifacts or semantic drift in the refined patches, or unintended changes in the untouched areas, would contradict the claim.

Figures

Figures reproduced from arXiv: 2605.17294 by Alexander Huang-Menders, Yu-Wing Tai, Yuyao Zhang.

Figure 1
Figure 1. Figure 1: Demo of HierEdit . Our method enables efficient, high-fidelity local editing at ultra-high resolution (2K and 4K here) without the need of 4K data for training. We support both instruction editing and both text- and image-guided inpainting. On commodity resolution (1K) we are much faster. Abstract High-resolution image editing is essential for professional and creative applications, yet existing multimodal… view at source ↗
Figure 2
Figure 2. Figure 2: Schematic of the HierEdit framework. We employ editing of a downsampled input image and region bounding to identify the edited patches and obtain the low-resolution proxy. We then proceed to input concatenation, re-permutation, and positional encoding. Finally, we pass this input to our hierarchical local-window MMDiT model, which generates the high resolution edited results with less denoising steps as up… view at source ↗
Figure 3
Figure 3. Figure 3: The illustration of jointly denoising the integrated to [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative comparison for instructional editing. The asterisk [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative comparison of inpainting-based editing methods. The proposed HierEdit demonstrates better performance in preserv [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Comparison at 4K resolution for text guided editing, text-guided inpainting, and subject-guided inpainting. GPT-Image-1 was [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: The inappropriate bounding box will lead to issues in [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Comparison with low-resolution editing followed by upscaling. [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Our model also supports full image editing by removing [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Ablation on the number of timesteps skipped. Results [PITH_FULL_IMAGE:figures/full_fig_p014_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: The inappropriate bounding box will lead to issues in [PITH_FULL_IMAGE:figures/full_fig_p014_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: More examples on 4K in-the-wild data [PITH_FULL_IMAGE:figures/full_fig_p015_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: More examples on 4K synthetic data [PITH_FULL_IMAGE:figures/full_fig_p016_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: More examples on 2K synthetic data [PITH_FULL_IMAGE:figures/full_fig_p017_14.png] view at source ↗
read the original abstract

High-resolution image editing is essential for professional and creative applications, yet existing multimodal diffusion-based editors remain computationally inefficient and constrained to relatively low resolutions. Current approaches redundantly process the entire image canvas or rely on large-scale high-resolution datasets, resulting in substantial training and inference costs. We introduce HierEdit, a region-aware hierarchical diffusion framework designed for efficient and scalable high-resolution image editing. Our method first performs edits on a low-resolution proxy using an off-the-shelf editing model to generate a reference and to localize the modified regions. A hierarchical local-window diffusion model (\textbf{Local-Window MMDiT}) that refines only edited regions within the original high-res image, while reusing the unaltered regions as conditioning inputs. The low-resolution proxy further provides structural guidance and intermediate denoising supervision (\textbf{Inference Acceleration}) , ensuring consistent global semantics and stable generation without the need for full-resolution attention computation. This targeted and hierarchical design enables fast, high-fidelity editing of images up to 4K resolution without any specialized high-resolution training data. Extensive experiments demonstrate that HierEdit achieves competitive visual quality on commodity-resolution datasets while significantly accelerating inference and extending seamlessly to ultra-high-resolution 4K editing. Please check our {\href{https://peteryyzhang.github.io/HierEdit-page/}{\textbf{Project Page}}}.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces HierEdit, a region-aware hierarchical diffusion framework for efficient high-resolution image editing. It first applies an off-the-shelf editing model to a low-resolution proxy to generate a reference edit and localize modified regions. A Local-Window MMDiT then refines only the edited regions within the original high-resolution image while reusing unaltered regions as conditioning inputs; the low-resolution proxy additionally supplies structural guidance and intermediate denoising supervision to maintain global consistency without full-resolution attention. The targeted design is claimed to enable fast, high-fidelity editing up to 4K resolution without any specialized high-resolution training data, with experiments demonstrating competitive visual quality on commodity-resolution datasets and significant inference acceleration.

Significance. If the low-resolution localization and guidance transfer reliably to high resolution, the hierarchical construction could meaningfully lower the computational barrier for professional-grade 4K image editing by avoiding full-canvas diffusion and high-resolution training data requirements, offering a practical route to scalable editing on commodity hardware.

major comments (1)
  1. [Abstract and method overview] Abstract / hierarchical local-window diffusion model description: the central claim that the low-resolution proxy produces a mask that correctly identifies all modified pixels at native 4K scale and supplies reliable structural guidance plus intermediate denoising targets for the Local-Window MMDiT rests on an unverified transfer assumption; no analysis or bound is provided on localization error for edits whose effects appear only at high spatial frequencies, which directly undermines the fidelity and efficiency guarantees for ultra-high-resolution cases.
minor comments (2)
  1. [Abstract] The sentence introducing the Local-Window MMDiT is grammatically incomplete and should be rephrased for clarity.
  2. [Abstract] The parenthetical label '(Inference Acceleration)' is introduced without prior definition or cross-reference; consider moving the description to a dedicated subsection.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive comment on the localization transfer assumption. We address the concern directly below and commit to revisions that strengthen the empirical grounding of the method for high-resolution cases.

read point-by-point responses
  1. Referee: [Abstract and method overview] Abstract / hierarchical local-window diffusion model description: the central claim that the low-resolution proxy produces a mask that correctly identifies all modified pixels at native 4K scale and supplies reliable structural guidance plus intermediate denoising targets for the Local-Window MMDiT rests on an unverified transfer assumption; no analysis or bound is provided on localization error for edits whose effects appear only at high spatial frequencies, which directly undermines the fidelity and efficiency guarantees for ultra-high-resolution cases.

    Authors: We agree that the current manuscript lacks explicit analysis or quantitative bounds on how localization errors from the low-resolution proxy transfer to native 4K resolution, particularly when edit effects are confined to high spatial frequencies. The design relies on the low-resolution edit capturing the dominant semantic and structural modifications, with the Local-Window MMDiT then performing localized refinement on the high-resolution canvas while conditioning on unaltered regions and low-resolution structural guidance. This hierarchical reuse is intended to preserve global consistency even if the mask is imperfect at fine scales. To directly address the referee's point, we will revise the manuscript to include a new subsection with empirical evaluation of localization accuracy (e.g., pixel-level overlap metrics between low-res-derived masks and ground-truth high-res change maps) across edit categories, including synthetic high-frequency modifications. We will also discuss observed failure modes and the role of window overlap in mitigating boundary localization errors. revision: yes

Circularity Check

0 steps flagged

No circularity: engineering construction with external validation

full rationale

The paper presents HierEdit as a practical hierarchical framework: low-resolution proxy editing with an off-the-shelf model for localization and guidance, followed by local-window MMDiT refinement on high-resolution edited regions. This is an engineering design rather than a first-principles derivation or closed-form result. No equations, fitted parameters, or self-citations reduce the performance claims to quantities defined by the method itself. The approach relies on external models and empirical experiments on commodity datasets, remaining self-contained against external benchmarks with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The method relies on standard diffusion model assumptions plus the untested transfer of localization accuracy from low-res to high-res. No free parameters or invented entities are explicitly introduced in the abstract.

axioms (2)
  • domain assumption Off-the-shelf low-resolution editing model produces accurate region localization and structural guidance usable at high resolution.
    Invoked in the description of the first stage that generates the reference and localizes modified regions.
  • domain assumption Local-window attention in the MMDiT can maintain global consistency when conditioned on unaltered high-resolution regions.
    Central to the claim that only edited regions need refinement while reusing the rest of the image.

pith-pipeline@v0.9.0 · 5768 in / 1355 out tokens · 43480 ms · 2026-05-20T14:35:50.490152+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

86 extracted references · 86 canonical work pages · 10 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ah- mad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774,

  2. [2]

    Longformer: The Long-Document Transformer

    Iz Beltagy, Matthew E Peters, and Arman Cohan. Long- former: The long-document transformer.arXiv preprint arXiv:2004.05150, 2020. 3

  3. [3]

    Black Forest Labs. Flux. Black Forest Labs GitHub reposi- tory, 2024. Model release. 2, 4, 6, 1

  4. [4]

    FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space

    Black Forest Labs. Flux.1 kontext: Flow matching for in- context image generation and editing in latent space.arXiv preprint arXiv:2506.15742, 2025. 1, 2, 6

  5. [5]

    Hiflow: Training-free high-resolution image gen- eration with flow-aligned guidance

    Jiazi Bu, Pengyang Ling, Yujie Zhou, Pan Zhang, Tong Wu, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Dahua Lin, and Ji- aqi Wang. Hiflow: Training-free high-resolution image gen- eration with flow-aligned guidance. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. 3

  6. [6]

    Hidream-i1: An open-source high-efficient image genera- tive foundation model

    Qi Cai, Yehao Li, Yingwei Pan, Ting Yao, and Tao Mei. Hidream-i1: An open-source high-efficient image genera- tive foundation model. InACM International Conference on Multimedia (ACM MM), page 13636–13639, New York, NY , USA, 2025. Association for Computing Machinery. 2, 4

  7. [7]

    Attend-and-excite: Attention-based se- mantic guidance for text-to-image diffusion models.ACM Transactions on Graphics (TOG), 42(4):1–10, 2023

    Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based se- mantic guidance for text-to-image diffusion models.ACM Transactions on Graphics (TOG), 42(4):1–10, 2023. 2

  8. [8]

    Pixart-{\delta}: Fast and controllable image generation with latent consistency models.arXiv preprint arXiv:2401.05252, 2024

    Junsong Chen, Yue Wu, Simian Luo, Enze Xie, Sayak Paul, Ping Luo, Hang Zhao, and Zhenguo Li. Pixart-δ: Fast and controllable image generation with latent consistency mod- els.arXiv preprint arXiv:2401.05252, 2024. 2

  9. [9]

    Pixart-$\alpha$: Fast train- ing of diffusion transformer for photorealistic text-to-image synthesis

    Junsong Chen, Jincheng YU, Chongjian GE, Lewei Yao, Enze Xie, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-$\alpha$: Fast train- ing of diffusion transformer for photorealistic text-to-image synthesis. InInternational Conference on Learning Repre- sentations (ICLR), 2024. 2

  10. [10]

    Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus- pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811,

  11. [11]

    Flashattention-2: Faster attention with better par- allelism and work partitioning

    Tri Dao. Flashattention-2: Faster attention with better par- allelism and work partitioning. InInternational Conference on Learning Representations (ICLR), 2024. 2, 3

  12. [12]

    Fu, Stefano Ermon, Atri Rudra, and Christopher R ´e

    Tri Dao, Daniel Y . Fu, Stefano Ermon, Atri Rudra, and Christopher R ´e. Flashattention: Fast and memory-efficient exact attention with io-awareness. InAdvances in Neural Information Processing Systems (NeurIPS), pages 16344– 16359, 2022. 2, 3

  13. [13]

    How sparse attention approximates exact atten- tion? your attention is naturallyn c-sparse.arXiv preprint arXiv:2404.02690, 2025

    Yichuan Deng, Zhao Song, Jing Xiong, and Chiwun Yang. How sparse attention approximates exact atten- tion? your attention is naturallyn c-sparse.arXiv preprint arXiv:2404.02690, 2025. 3

  14. [14]

    Cogview: mastering text-to- image generation via transformers

    Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, and Jie Tang. Cogview: mastering text-to- image generation via transformers. InConference on Neural Information Processing Systems (NeurIPS), Red Hook, NY , USA, 2021. Curran Associates Inc. 2

  15. [15]

    I-max: Maximize the resolu- tion potential of pre-trained rectified flow transformers with projected flow.arXiv preprint arXiv:2410.07536, 2024

    Ruoyi Du, Dongyang Liu, Le Zhuo, Qin Qi, Hongsheng Li, Zhanyu Ma, and Peng Gao. I-max: Maximize the resolu- tion potential of pre-trained rectified flow transformers with projected flow.arXiv preprint arXiv:2410.07536, 2024. 3

  16. [16]

    An image is worth one word: Personalizing text-to-image gen- eration using textual inversion

    Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image gen- eration using textual inversion. InInternational Conference on Learning Representations (ICLR), 2023. 2

  17. [17]

    Seedream 3.0 Technical Report

    Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, et al. Seedream 3.0 technical report.arXiv preprint arXiv:2504.11346, 2025. 2, 4

  18. [18]

    Make a cheap scaling: A self- cascade diffusion model for higher-resolution adaptation

    Lanqing Guo, Yingqing He, Haoxin Chen, Menghan Xia, Xi- aodong Cun, Yufei Wang, Siyu Huang, Yong Zhang, Xintao Wang, Qifeng Chen, et al. Make a cheap scaling: A self- cascade diffusion model for higher-resolution adaptation. In European Conference on Computer Vision (ECCV), pages 39–55. Springer, 2024. 3

  19. [19]

    Bridging the divide: Reconsidering softmax and linear atten- tion

    Dongchen Han, Yifan Pu, Zhuofan Xia, Yizeng Han, Xu- ran Pan, Xiu Li, Jiwen Lu, Shiji Song, and Gao Huang. Bridging the divide: Reconsidering softmax and linear atten- tion. InAdvances in Neural Information Processing Systems (NeurIPS), pages 79221–79245, 2024. 3

  20. [20]

    sim- ple diffusion: End-to-end diffusion for high resolution im- ages

    Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. sim- ple diffusion: End-to-end diffusion for high resolution im- ages. InInternational Conference on Machine Learning (ICML), pages 13213–13232. PMLR, 2023. 3

  21. [21]

    Fouriscale: A frequency perspective on training-free high-resolution im- age synthesis

    Linjiang Huang, Rongyao Fang, Aiping Zhang, Guanglu Song, Si Liu, Yu Liu, and Hongsheng Li. Fouriscale: A frequency perspective on training-free high-resolution im- age synthesis. InEuropean Conference on Computer Vision (ECCV), pages 196–212. Springer, 2024. 3

  22. [22]

    Compbench: Benchmarking complex instruction-guided im- age editing.arXiv preprint arXiv:2505.12200, 2025

    Bohan Jia, Wenxuan Huang, Yuntian Tang, Junbo Qiao, Jincheng Liao, Shaosheng Cao, Fei Zhao, Zhaopeng Feng, Zhouhong Gu, Zhenfei Yin, Lei Bai, Wanli Ouyang, Lin Chen, Fei Zhao, Zihan Wang, Yuan Xie, and Shaohui Lin. Compbench: Benchmarking complex instruction-guided im- age editing.arXiv preprint arXiv:2505.12200, 2025. 1

  23. [23]

    Diffusehigh: Training-free progressive high- resolution image synthesis through structure guidance

    Younghyun Kim, Geunmin Hwang, Junyu Zhang, and Eun- byung Park. Diffusehigh: Training-free progressive high- resolution image synthesis through structure guidance. In Proceedings of the AAAI Conference on Artificial Intelli- gence, pages 4338–4346, 2025. 3

  24. [24]

    Flexprefill: A context-aware sparse attention mecha- nism for efficient long-sequence inference

    Xunhao Lai, Jianqiao Lu, Yao Luo, Yiyuan Ma, and Xun Zhou. Flexprefill: A context-aware sparse attention mecha- nism for efficient long-sequence inference. InInternational Conference on Learning Representations (ICLR), 2025. 3

  25. [25]

    Playground v2.5: Three Insights towards Enhancing Aesthetic Quality in Text-to-Image Generation

    Daiqing Li, Aleks Kamko, Ehsan Akhgari, Ali Sabet, Lin- miao Xu, and Suhail Doshi. Playground v2. 5: Three in- sights towards enhancing aesthetic quality in text-to-image generation.arXiv preprint arXiv:2402.17245, 2024. 2

  26. [26]

    Gligen: Open-set grounded text-to-image generation

    Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jian- wei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22511–22521, 2023. 2

  27. [27]

    Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding

    Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, et al. Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chi- nese understanding.arXiv preprint arXiv:2405.08748, 2024. 2

  28. [28]

    Dynamic sparse attention for scalable transformer acceleration.IEEE Transactions on Computers (TC), 71(12):3165–3178, 2022

    Liu Liu, Zheng Qu, Zhaodong Chen, Fengbin Tu, Yufei Ding, and Yuan Xie. Dynamic sparse attention for scalable transformer acceleration.IEEE Transactions on Computers (TC), 71(12):3165–3178, 2022. 3

  29. [29]

    Linfusion: 1 gpu, 1 minute, 16k image.arXiv preprint arXiv:2409.02097, 2024

    Songhua Liu, Weihao Yu, Zhenxiong Tan, and Xinchao Wang. Linfusion: 1 gpu, 1 minute, 16k image.arXiv preprint arXiv:2409.02097, 2024. 3

  30. [30]

    Flow straight and fast: Learning to generate and transfer data with rectified flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. InInternational Conference on Learning Rep- resentations (ICLR), 2023. 4

  31. [31]

    Hiprompt: Tuning-free higher-resolution gen- eration with hierarchical mllm prompts

    Xinyu Liu, Yingqing He, Lanqing Guo, Xiang Li, Bu Jin, Peng Li, Yan Li, Chi-Min Chan, Qifeng Chen, Wei Xue, et al. Hiprompt: Tuning-free higher-resolution gen- eration with hierarchical mllm prompts.arXiv preprint arXiv:2409.02919, 2024. 3

  32. [32]

    Swin transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021. 3

  33. [33]

    I2ebench: A comprehensive benchmark for instruction- based image editing

    Yiwei Ma, Jiayi Ji, Ke Ye, Weihuang Lin, Zhibin Wang, Yonghan Zheng, Qiang Zhou, Xiaoshuai Sun, and Rongrong Ji. I2ebench: A comprehensive benchmark for instruction- based image editing. InAdvances in Neural Information Pro- cessing Systems (NeurIPS), pages 41494–41516. Curran As- sociates, Inc., 2024. 1

  34. [34]

    ACE++: Instruction-Based Image Creation and Editing via Context- Aware Content Filling

    Chaojie Mao, Jingfeng Zhang, Yulin Pan, Zeyinzi Jiang, Zhen Han, Yu Liu, and Jingren Zhou. Ace++: Instruction- based image creation and editing via context-aware content filling.arXiv preprint arXiv:2501.02487, 2025. 6

  35. [35]

    SDEdit: Guided image synthesis and editing with stochastic differential equa- tions

    Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jia- jun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equa- tions. InInternational Conference on Learning Representa- tions (ICLR), 2022. 6

  36. [36]

    Polaformer: Polarity-aware linear attention for vision transformers

    Weikang Meng, Yadan Luo, Xin Li, Dongmei Jiang, and Zheng Zhang. Polaformer: Polarity-aware linear attention for vision transformers. InInternational Conference on Learning Representations (ICLR), 2025. 3

  37. [37]

    Glide: Towards photorealis- tic image generation and editing with text-guided diffusion models

    Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob Mcgrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealis- tic image generation and editing with text-guided diffusion models. InInternational Conference on Machine Learning (ICML), pages 16784–16804. PMLR, 2022. 2

  38. [38]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF Inter- national Conference on Computer Vision, pages 4195–4205,

  39. [39]

    Sdxl: Improving latent diffusion models for high-resolution image synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. InInternational Con- ference on Learning Representations (ICLR), pages 1862– 1874, 2024. 1, 2

  40. [40]

    Freescale: Unleashing the resolution of diffusion models via tuning-free scale fusion

    Haonan Qiu, Shiwei Zhang, Yujie Wei, Ruihang Chu, Hangjie Yuan, Xiang Wang, Yingya Zhang, and Ziwei Liu. Freescale: Unleashing the resolution of diffusion models via tuning-free scale fusion. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 16893– 16903, 2025. 3

  41. [41]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image gener- ation with clip latents.arXiv preprint arXiv:2204.06125, 1 (2):3, 2022. 2

  42. [42]

    Ultrapixel: Advancing ultra high-resolution image synthesis to new peaks

    Jingjing Ren, Wenbo Li, Haoyu Chen, Renjing Pei, Bin Shao, Yong Guo, Long Peng, Fenglong Song, and Lei Zhu. Ultrapixel: Advancing ultra high-resolution image synthesis to new peaks. InAdvances in Neural Information Process- ing Systems (NeurIPS), pages 111131–111171. Curran As- sociates, Inc., 2024. 3

  43. [43]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022. 1, 2

  44. [44]

    Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation

    Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500– 22510, 2023. 2

  45. [45]

    Photorealistic text-to-image diffusion models with deep language understanding.Advances in Neural Information Processing Systems (NeurIPS), 35:36479–36494, 2022

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding.Advances in Neural Information Processing Systems (NeurIPS), 35:36479–36494, 2022. 2

  46. [46]

    Flashattention-3: Fast and accurate attention with asynchrony and low-precision

    Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. Flashattention-3: Fast and accurate attention with asynchrony and low-precision. InAd- vances in Neural Information Processing Systems (NeurIPS), pages 68658–68685, 2024. 3

  47. [47]

    Emu edit: Precise image editing via recognition and gen- eration tasks

    Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu edit: Precise image editing via recognition and gen- eration tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8871– 8879, 2024. 1

  48. [48]

    Trainable dynamic mask sparse attention.arXiv preprint arXiv:2508.02124, 2025

    Jingze Shi, Yifan Wu, Bingheng Wu, Yiran Peng, Liangdong Wang, Guang Liu, and Yuyu Luo. Trainable dynamic mask sparse attention.arXiv preprint arXiv:2508.02124, 2025. 1

  49. [49]

    Resmaster: Mastering high- resolution image generation via structural and fine-grained guidance

    Shuwei Shi, Wenbo Li, Yuechen Zhang, Jingwen He, Biao Gong, and Yinqiang Zheng. Resmaster: Mastering high- resolution image generation via structural and fine-grained guidance. InProceedings of the AAAI Conference on Artifi- cial Intelligence, pages 6887–6895, 2025. 3

  50. [50]

    Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063,

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063,

  51. [51]

    Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

    Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation.arXiv preprint arXiv:2406.06525, 2024. 2

  52. [52]

    Ominicontrol: Minimal and univer- sal control for diffusion transformer

    Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue, and Xinchao Wang. Ominicontrol: Minimal and univer- sal control for diffusion transformer. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 14940–14950, 2025. 3

  53. [53]

    Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue, and Xinchao Wang

    Zhenxiong Tan, Qiaochu Xue, Xingyi Yang, Songhua Liu, and Xinchao Wang. Ominicontrol2: Efficient conditioning for diffusion transformers.arXiv preprint arXiv:2503.08280,

  54. [54]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean- Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023. 6

  55. [55]

    Relay diffusion: Unifying diffusion process across resolutions for image syn- thesis

    Jiayan Teng, Wendi Zheng, Ming Ding, Wenyi Hong, Jian- qiao Wangni, Zhuoyi Yang, and Jie Tang. Relay diffusion: Unifying diffusion process across resolutions for image syn- thesis. InInternational Conference on Learning Representa- tions (ICLR), 2024. 3

  56. [56]

    Megafusion: Extend dif- fusion models towards higher-resolution image generation without further tuning

    Haoning Wu, Shaocheng Shen, Qiang Hu, Xiaoyun Zhang, Ya Zhang, and Yanfeng Wang. Megafusion: Extend dif- fusion models towards higher-resolution image generation without further tuning. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3944–3953. IEEE, 2025. 3

  57. [57]

    Sparse videogen: Accelerating video diffusion transformers with spatial-temporal sparsity

    Haocheng Xi, Shuo Yang, Yilong Zhao, Chenfeng Xu, Muyang Li, Xiuyu Li, Yujun Lin, Han Cai, Jintao Zhang, Dacheng Li, et al. Sparse videogen: Accelerating video diffusion transformers with spatial-temporal sparsity. InIn- ternational Conference on Machine Learning (ICML), pages 68208–68224, 2025. 3

  58. [58]

    Difffit: Un- locking transferability of large diffusion models via sim- ple parameter-efficient fine-tuning

    Enze Xie, Lewei Yao, Han Shi, Zhili Liu, Daquan Zhou, Zhaoqiang Liu, Jiawei Li, and Zhenguo Li. Difffit: Un- locking transferability of large diffusion models via sim- ple parameter-efficient fine-tuning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4230–4239, 2023. 3

  59. [59]

    SANA: Efficient high-resolution text-to-image synthesis with linear diffusion transformers

    Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, and Song Han. SANA: Efficient high-resolution text-to-image synthesis with linear diffusion transformers. InInternational Conference on Learning Representations (ICLR), 2025. 2, 3

  60. [60]

    SANA 1.5: Efficient scaling of training-time and inference- time compute in linear diffusion transformer

    Enze Xie, Junsong Chen, Yuyang Zhao, Jincheng YU, Ligeng Zhu, Yujun Lin, Zhekai Zhang, Muyang Li, Junyu Chen, Han Cai, Bingchen Liu, Daquan Zhou, and Song Han. SANA 1.5: Efficient scaling of training-time and inference- time compute in linear diffusion transformer. InInterna- tional Conference on Machine Learning (ICML), 2025. 2, 3

  61. [61]

    Show-o: One single transformer to unify multimodal understanding and generation

    Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. InInternational Conference on Learning Rep- resentations (ICLR), 2025. 2

  62. [62]

    Xattention: Block sparse attention with an- tidiagonal scoring

    Ruyi Xu, Guangxuan Xiao, Haofeng Huang, Junxian Guo, and Song Han. Xattention: Block sparse attention with an- tidiagonal scoring. InInternational Conference on Machine Learning (ICML), pages 69819–69831, 2025. 3

  63. [63]

    Raphael: text-to-image generation via large mixture of diffusion paths

    Zeyue Xue, Guanglu Song, Qiushan Guo, Boxiao Liu, Zhuo- fan Zong, Yu Liu, and Ping Luo. Raphael: text-to-image generation via large mixture of diffusion paths. InConfer- ence on Neural Information Processing Systems (NeurIPS), Red Hook, NY , USA, 2023. Curran Associates Inc. 2

  64. [64]

    Sparse videogen2: Accelerate video generation with sparse attention via semantic-aware permutation

    Shuo Yang, Haocheng Xi, Yilong Zhao, Muyang Li, Jintao Zhang, Han Cai, Yujun Lin, Xiuyu Li, Chenfeng Xu, Kelly Peng, et al. Sparse videogen2: Accelerate video generation with sparse attention via semantic-aware permutation. InAd- vances in Neural Information Processing Systems (NeurIPS),

  65. [65]

    Imgedit: A uni- fied image editing dataset and benchmark

    Yang Ye, Xianyi He, Zongjian Li, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Bohan Hou, and Li Yuan. Imgedit: A uni- fied image editing dataset and benchmark. InNeural Infor- mation Processing Systems Datasets and Benchmarks Track (NeurIPS D&B), 2025. 1

  66. [66]

    Ultra-resolution adaptation with ease

    Ruonan Yu, Songhua Liu, Zhenxiong Tan, and Xinchao Wang. Ultra-resolution adaptation with ease. InInterna- tional Conference on Machine Learning (ICML), 2025. 3

  67. [67]

    Ditfastattn: Attention compression for diffusion transformer models

    Zhihang Yuan, Hanling Zhang, Lu Pu, Xuefei Ning, Lin- feng Zhang, Tianchen Zhao, Shengen Yan, Guohao Dai, and Yu Wang. Ditfastattn: Attention compression for diffusion transformer models. InAdvances in Neural Information Pro- cessing Systems (NeurIPS), pages 1196–1219, 2024. 3

  68. [68]

    Ditfastattnv2: Head-wise attention compression for multi-modality diffusion transformers

    Hanling Zhang, Rundong Su, Zhihang Yuan, Pengtao Chen, Mingzhu Shen, Yibo Fan, Shengen Yan, Guohao Dai, and Yu Wang. Ditfastattnv2: Head-wise attention compression for multi-modality diffusion transformers. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 16399–16409, 2025. 3

  69. [69]

    Sageattention2: Efficient attention with thorough outlier smoothing and per-thread int4 quan- tization

    Jintao Zhang, Haofeng Huang, Pengle Zhang, Jia Wei, Jun Zhu, and Jianfei Chen. Sageattention2: Efficient attention with thorough outlier smoothing and per-thread int4 quan- tization. InInternational Conference on Machine Learning (ICML), 2025. 3

  70. [70]

    Diffusion-4k: Ultra-high-resolution image syn- thesis with latent diffusion models

    Jinjin Zhang, Qiuyu Huang, Junjie Liu, Xiefan Guo, and Di Huang. Diffusion-4k: Ultra-high-resolution image syn- thesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23464–23473, 2025. 3

  71. [71]

    Sageattention: Accurate 8-bit attention for plug-and- play inference acceleration

    Jintao Zhang, Jia Wei, Pengle Zhang, Jun Zhu, and Jianfei Chen. Sageattention: Accurate 8-bit attention for plug-and- play inference acceleration. InInternational Conference on Learning Representations (ICLR), 2025. 3

  72. [72]

    Spargeattn: Accu- rate sparse attention accelerating any model inference

    Jintao Zhang, Chendong Xiang, Haofeng Huang, Jia Wei, Haocheng Xi, Jun Zhu, and Jianfei Chen. Spargeattn: Accu- rate sparse attention accelerating any model inference. InIn- ternational Conference on Machine Learning (ICML), pages 76397–76413, 2025. 3

  73. [73]

    Designing a practical degradation model for deep blind image super-resolution

    Kai Zhang, Jingyun Liang, Luc Van Gool, and Radu Timo- fte. Designing a practical degradation model for deep blind image super-resolution. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4791– 4800, 2021. 1

  74. [74]

    Adding conditional control to text-to-image diffusion models

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023. 2

  75. [75]

    The hedgehog & the porcupine: Expres- sive linear attentions with softmax mimicry

    Michael Zhang, Kush Bhatia, Hermann Kumbong, and Christopher R ´e. The hedgehog & the porcupine: Expres- sive linear attentions with softmax mimicry. InInternational Conference on Learning Representations (ICLR), 2024. 3

  76. [76]

    Scale-dit: Ultra-high- resolution image generation with hierarchical local attention

    Yuyao Zhang and Yu-Wing Tai. Scale-dit: Ultra-high- resolution image generation with hierarchical local attention. arXiv preprint arXiv:2510.16325, 2025. 3

  77. [77]

    Layercraft: En- hancing text-to-image generation with cot reasoning and lay- ered object integration

    Yuyao Zhang, Jinghao Li, and Yu-Wing Tai. Layercraft: En- hancing text-to-image generation with cot reasoning and lay- ered object integration. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. 1

  78. [78]

    Easycontrol: Adding efficient and flexible control for diffusion transformer

    Yuxuan Zhang, Yirui Yuan, Yiren Song, Haofan Wang, and Jiaming Liu. Easycontrol: Adding efficient and flexible control for diffusion transformer. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 19513–19524, 2025. 6

  79. [79]

    Any-size- diffusion: Toward efficient text-driven synthesis for any-size hd images

    Qingping Zheng, Yuanfan Guo, Jiankang Deng, Jianhua Han, Ying Li, Songcen Xu, and Hang Xu. Any-size- diffusion: Toward efficient text-driven synthesis for any-size hd images. InProceedings of the AAAI Conference on Arti- ficial Intelligence, pages 7571–7578, 2024. 3

  80. [80]

    Dig: Scal- able and efficient diffusion models with gated linear atten- tion

    Lianghui Zhu, Zilong Huang, Bencheng Liao, Jun Hao Liew, Hanshu Yan, Jiashi Feng, and Xinggang Wang. Dig: Scal- able and efficient diffusion models with gated linear atten- tion. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 7664–7674,

Showing first 80 references.