HierEdit: Region-Aware Hierarchical Diffusion for Efficient High-Resolution Editing

Alexander Huang-Menders; Yu-Wing Tai; Yuyao Zhang

arxiv: 2605.17294 · v1 · pith:NHQ6XX5Anew · submitted 2026-05-17 · 💻 cs.CV

HierEdit: Region-Aware Hierarchical Diffusion for Efficient High-Resolution Editing

Yuyao Zhang , Alexander Huang-Menders , Yu-Wing Tai This is my paper

Pith reviewed 2026-05-20 14:35 UTC · model grok-4.3

classification 💻 cs.CV

keywords high-resolution image editingdiffusion modelshierarchical editingregion-aware processingefficient inference4K resolutionlocal window diffusion

0 comments

The pith

HierEdit refines only edited regions at high resolution by first editing a low-resolution proxy to localize changes and supply guidance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that high-resolution image editing can be made efficient and scalable by breaking the task into a coarse low-resolution edit followed by targeted refinement of only the modified areas. A low-resolution proxy first applies an off-the-shelf editor to identify changed regions and generate structural signals. These signals then steer a local-window diffusion process that updates just those patches in the original high-resolution image while holding the rest fixed as conditioning. If the approach succeeds, professional and creative workflows could edit 4K images quickly on ordinary hardware without collecting or training on special high-resolution datasets.

Core claim

The central claim is that performing an initial edit on a low-resolution proxy localizes modified regions and supplies reliable structural guidance plus intermediate denoising supervision; a hierarchical Local-Window MMDiT then refines only the edited regions inside the full-resolution canvas while reusing unaltered areas as conditioning inputs, delivering fast high-fidelity results up to 4K resolution without any specialized high-resolution training data.

What carries the argument

The Local-Window MMDiT, a hierarchical diffusion model that restricts attention and refinement to localized edited windows while conditioning on both the original high-resolution unaltered regions and low-resolution proxy signals.

If this is right

Editing inference becomes substantially faster because full-resolution attention is never computed over the entire image.
The same model trained only on commodity-resolution data can be applied directly to 4K inputs.
Visual quality remains competitive with existing methods on standard-resolution benchmarks.
The framework extends without modification to images larger than 4K by scaling the local windows.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same proxy-plus-local-refinement pattern could be tested on video clips to enforce temporal consistency by propagating the low-resolution edit across frames.
Because the high-resolution stage only needs the low-resolution output as conditioning, the method might combine with any future low-resolution editor without retraining the local model.
Similar hierarchical localization could reduce memory and compute in related tasks such as high-resolution image synthesis or texture generation.

Load-bearing premise

The low-resolution proxy must correctly identify which regions changed and must supply accurate structural guidance plus reliable intermediate denoising steps for the high-resolution local refinements.

What would settle it

Apply HierEdit to a 4K test image whose edit mask is known in advance, then compare the output to a full-canvas high-resolution baseline: visible artifacts or semantic drift in the refined patches, or unintended changes in the untouched areas, would contradict the claim.

Figures

Figures reproduced from arXiv: 2605.17294 by Alexander Huang-Menders, Yu-Wing Tai, Yuyao Zhang.

**Figure 1.** Figure 1: Demo of HierEdit . Our method enables efficient, high-fidelity local editing at ultra-high resolution (2K and 4K here) without the need of 4K data for training. We support both instruction editing and both text- and image-guided inpainting. On commodity resolution (1K) we are much faster. Abstract High-resolution image editing is essential for professional and creative applications, yet existing multimodal… view at source ↗

**Figure 2.** Figure 2: Schematic of the HierEdit framework. We employ editing of a downsampled input image and region bounding to identify the edited patches and obtain the low-resolution proxy. We then proceed to input concatenation, re-permutation, and positional encoding. Finally, we pass this input to our hierarchical local-window MMDiT model, which generates the high resolution edited results with less denoising steps as up… view at source ↗

**Figure 3.** Figure 3: The illustration of jointly denoising the integrated to [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Qualitative comparison for instructional editing. The asterisk [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Qualitative comparison of inpainting-based editing methods. The proposed HierEdit demonstrates better performance in preserv [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Comparison at 4K resolution for text guided editing, text-guided inpainting, and subject-guided inpainting. GPT-Image-1 was [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: The inappropriate bounding box will lead to issues in [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

**Figure 8.** Figure 8: Comparison with low-resolution editing followed by upscaling. [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗

**Figure 9.** Figure 9: Our model also supports full image editing by removing [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗

**Figure 10.** Figure 10: Ablation on the number of timesteps skipped. Results [PITH_FULL_IMAGE:figures/full_fig_p014_10.png] view at source ↗

**Figure 11.** Figure 11: The inappropriate bounding box will lead to issues in [PITH_FULL_IMAGE:figures/full_fig_p014_11.png] view at source ↗

**Figure 12.** Figure 12: More examples on 4K in-the-wild data [PITH_FULL_IMAGE:figures/full_fig_p015_12.png] view at source ↗

**Figure 13.** Figure 13: More examples on 4K synthetic data [PITH_FULL_IMAGE:figures/full_fig_p016_13.png] view at source ↗

**Figure 14.** Figure 14: More examples on 2K synthetic data [PITH_FULL_IMAGE:figures/full_fig_p017_14.png] view at source ↗

read the original abstract

High-resolution image editing is essential for professional and creative applications, yet existing multimodal diffusion-based editors remain computationally inefficient and constrained to relatively low resolutions. Current approaches redundantly process the entire image canvas or rely on large-scale high-resolution datasets, resulting in substantial training and inference costs. We introduce HierEdit, a region-aware hierarchical diffusion framework designed for efficient and scalable high-resolution image editing. Our method first performs edits on a low-resolution proxy using an off-the-shelf editing model to generate a reference and to localize the modified regions. A hierarchical local-window diffusion model (\textbf{Local-Window MMDiT}) that refines only edited regions within the original high-res image, while reusing the unaltered regions as conditioning inputs. The low-resolution proxy further provides structural guidance and intermediate denoising supervision (\textbf{Inference Acceleration}) , ensuring consistent global semantics and stable generation without the need for full-resolution attention computation. This targeted and hierarchical design enables fast, high-fidelity editing of images up to 4K resolution without any specialized high-resolution training data. Extensive experiments demonstrate that HierEdit achieves competitive visual quality on commodity-resolution datasets while significantly accelerating inference and extending seamlessly to ultra-high-resolution 4K editing. Please check our {\href{https://peteryyzhang.github.io/HierEdit-page/}{\textbf{Project Page}}}.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

HierEdit's low-res proxy plus local-window high-res refinement is a practical engineering pattern for 4K editing without high-res training data, though it rests on the proxy mask and guidance transferring cleanly.

read the letter

Hi colleague, the main thing here is that HierEdit first runs an off-the-shelf edit at low resolution to localize changes and generate guidance, then refines only those regions at native high resolution with a Local-Window MMDiT that conditions directly on the unaltered high-res parts of the image. The low-res proxy also supplies structural cues and intermediate denoising targets to keep global consistency without full-resolution attention or any specialized high-res training set. This targeted hierarchy is the actual new piece; prior diffusion editors either process the whole canvas or require heavy high-res data, so the combination of proxy localization with windowed conditioning on fixed regions looks like a useful adaptation of existing MMDiT blocks. The reported results show competitive quality on standard-resolution benchmarks plus clean extension to 4K, which matters for anyone who needs fast edits in creative pipelines. The experiments apparently include speed and visual comparisons that support the efficiency claim. The soft spot is exactly the one the stress-test flags: the low-res proxy has to produce a mask and guidance that still work once upsampled to 4K. If an edit involves fine textures, thin lines, or subtle lighting that disappears under downsampling, the localization can shift or miss pixels, and the local refinement may then either leave artifacts or invent content outside the true edit area. I would want to see ablations on high-frequency edit types and some quantification of boundary consistency or failure cases to judge how bounded the error really is. The math is standard diffusion adapted to local windows, with no obvious circularity or invented entities. This paper is for engineers building practical high-res editing tools rather than theorists. A reader who cares about scalable inference in diffusion models will get concrete value from the hierarchical design and the 4K results. It shows clear thinking on the efficiency bottlenecks in earlier work, so it deserves a serious referee to pressure-test the proxy transfer assumption. I would send it out for review.

Referee Report

1 major / 2 minor

Summary. The paper introduces HierEdit, a region-aware hierarchical diffusion framework for efficient high-resolution image editing. It first applies an off-the-shelf editing model to a low-resolution proxy to generate a reference edit and localize modified regions. A Local-Window MMDiT then refines only the edited regions within the original high-resolution image while reusing unaltered regions as conditioning inputs; the low-resolution proxy additionally supplies structural guidance and intermediate denoising supervision to maintain global consistency without full-resolution attention. The targeted design is claimed to enable fast, high-fidelity editing up to 4K resolution without any specialized high-resolution training data, with experiments demonstrating competitive visual quality on commodity-resolution datasets and significant inference acceleration.

Significance. If the low-resolution localization and guidance transfer reliably to high resolution, the hierarchical construction could meaningfully lower the computational barrier for professional-grade 4K image editing by avoiding full-canvas diffusion and high-resolution training data requirements, offering a practical route to scalable editing on commodity hardware.

major comments (1)

[Abstract and method overview] Abstract / hierarchical local-window diffusion model description: the central claim that the low-resolution proxy produces a mask that correctly identifies all modified pixels at native 4K scale and supplies reliable structural guidance plus intermediate denoising targets for the Local-Window MMDiT rests on an unverified transfer assumption; no analysis or bound is provided on localization error for edits whose effects appear only at high spatial frequencies, which directly undermines the fidelity and efficiency guarantees for ultra-high-resolution cases.

minor comments (2)

[Abstract] The sentence introducing the Local-Window MMDiT is grammatically incomplete and should be rephrased for clarity.
[Abstract] The parenthetical label '(Inference Acceleration)' is introduced without prior definition or cross-reference; consider moving the description to a dedicated subsection.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive comment on the localization transfer assumption. We address the concern directly below and commit to revisions that strengthen the empirical grounding of the method for high-resolution cases.

read point-by-point responses

Referee: [Abstract and method overview] Abstract / hierarchical local-window diffusion model description: the central claim that the low-resolution proxy produces a mask that correctly identifies all modified pixels at native 4K scale and supplies reliable structural guidance plus intermediate denoising targets for the Local-Window MMDiT rests on an unverified transfer assumption; no analysis or bound is provided on localization error for edits whose effects appear only at high spatial frequencies, which directly undermines the fidelity and efficiency guarantees for ultra-high-resolution cases.

Authors: We agree that the current manuscript lacks explicit analysis or quantitative bounds on how localization errors from the low-resolution proxy transfer to native 4K resolution, particularly when edit effects are confined to high spatial frequencies. The design relies on the low-resolution edit capturing the dominant semantic and structural modifications, with the Local-Window MMDiT then performing localized refinement on the high-resolution canvas while conditioning on unaltered regions and low-resolution structural guidance. This hierarchical reuse is intended to preserve global consistency even if the mask is imperfect at fine scales. To directly address the referee's point, we will revise the manuscript to include a new subsection with empirical evaluation of localization accuracy (e.g., pixel-level overlap metrics between low-res-derived masks and ground-truth high-res change maps) across edit categories, including synthetic high-frequency modifications. We will also discuss observed failure modes and the role of window overlap in mitigating boundary localization errors. revision: yes

Circularity Check

0 steps flagged

No circularity: engineering construction with external validation

full rationale

The paper presents HierEdit as a practical hierarchical framework: low-resolution proxy editing with an off-the-shelf model for localization and guidance, followed by local-window MMDiT refinement on high-resolution edited regions. This is an engineering design rather than a first-principles derivation or closed-form result. No equations, fitted parameters, or self-citations reduce the performance claims to quantities defined by the method itself. The approach relies on external models and empirical experiments on commodity datasets, remaining self-contained against external benchmarks with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The method relies on standard diffusion model assumptions plus the untested transfer of localization accuracy from low-res to high-res. No free parameters or invented entities are explicitly introduced in the abstract.

axioms (2)

domain assumption Off-the-shelf low-resolution editing model produces accurate region localization and structural guidance usable at high resolution.
Invoked in the description of the first stage that generates the reference and localizes modified regions.
domain assumption Local-window attention in the MMDiT can maintain global consistency when conditioned on unaltered high-resolution regions.
Central to the claim that only edited regions need refinement while reusing the rest of the image.

pith-pipeline@v0.9.0 · 5768 in / 1355 out tokens · 43480 ms · 2026-05-20T14:35:50.490152+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

This targeted and hierarchical design enables fast, high-fidelity editing of images up to 4K resolution without any specialized high-resolution training data

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

86 extracted references · 86 canonical work pages · 10 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ah- mad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Longformer: The Long-Document Transformer

Iz Beltagy, Matthew E Peters, and Arman Cohan. Long- former: The long-document transformer.arXiv preprint arXiv:2004.05150, 2020. 3

work page internal anchor Pith review Pith/arXiv arXiv 2004
[3]

Black Forest Labs. Flux. Black Forest Labs GitHub reposi- tory, 2024. Model release. 2, 4, 6, 1

work page 2024
[4]

FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space

Black Forest Labs. Flux.1 kontext: Flow matching for in- context image generation and editing in latent space.arXiv preprint arXiv:2506.15742, 2025. 1, 2, 6

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

Hiflow: Training-free high-resolution image gen- eration with flow-aligned guidance

Jiazi Bu, Pengyang Ling, Yujie Zhou, Pan Zhang, Tong Wu, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Dahua Lin, and Ji- aqi Wang. Hiflow: Training-free high-resolution image gen- eration with flow-aligned guidance. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. 3

work page 2025
[6]

Hidream-i1: An open-source high-efficient image genera- tive foundation model

Qi Cai, Yehao Li, Yingwei Pan, Ting Yao, and Tao Mei. Hidream-i1: An open-source high-efficient image genera- tive foundation model. InACM International Conference on Multimedia (ACM MM), page 13636–13639, New York, NY , USA, 2025. Association for Computing Machinery. 2, 4

work page 2025
[7]

Attend-and-excite: Attention-based se- mantic guidance for text-to-image diffusion models.ACM Transactions on Graphics (TOG), 42(4):1–10, 2023

Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based se- mantic guidance for text-to-image diffusion models.ACM Transactions on Graphics (TOG), 42(4):1–10, 2023. 2

work page 2023
[8]

Pixart-{\delta}: Fast and controllable image generation with latent consistency models.arXiv preprint arXiv:2401.05252, 2024

Junsong Chen, Yue Wu, Simian Luo, Enze Xie, Sayak Paul, Ping Luo, Hang Zhao, and Zhenguo Li. Pixart-δ: Fast and controllable image generation with latent consistency mod- els.arXiv preprint arXiv:2401.05252, 2024. 2

work page arXiv 2024
[9]

Pixart-$\alpha$: Fast train- ing of diffusion transformer for photorealistic text-to-image synthesis

Junsong Chen, Jincheng YU, Chongjian GE, Lewei Yao, Enze Xie, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-$\alpha$: Fast train- ing of diffusion transformer for photorealistic text-to-image synthesis. InInternational Conference on Learning Repre- sentations (ICLR), 2024. 2

work page 2024
[10]

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus- pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Flashattention-2: Faster attention with better par- allelism and work partitioning

Tri Dao. Flashattention-2: Faster attention with better par- allelism and work partitioning. InInternational Conference on Learning Representations (ICLR), 2024. 2, 3

work page 2024
[12]

Fu, Stefano Ermon, Atri Rudra, and Christopher R ´e

Tri Dao, Daniel Y . Fu, Stefano Ermon, Atri Rudra, and Christopher R ´e. Flashattention: Fast and memory-efficient exact attention with io-awareness. InAdvances in Neural Information Processing Systems (NeurIPS), pages 16344– 16359, 2022. 2, 3

work page 2022
[13]

How sparse attention approximates exact atten- tion? your attention is naturallyn c-sparse.arXiv preprint arXiv:2404.02690, 2025

Yichuan Deng, Zhao Song, Jing Xiong, and Chiwun Yang. How sparse attention approximates exact atten- tion? your attention is naturallyn c-sparse.arXiv preprint arXiv:2404.02690, 2025. 3

work page arXiv 2025
[14]

Cogview: mastering text-to- image generation via transformers

Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, and Jie Tang. Cogview: mastering text-to- image generation via transformers. InConference on Neural Information Processing Systems (NeurIPS), Red Hook, NY , USA, 2021. Curran Associates Inc. 2

work page 2021
[15]

I-max: Maximize the resolu- tion potential of pre-trained rectified flow transformers with projected flow.arXiv preprint arXiv:2410.07536, 2024

Ruoyi Du, Dongyang Liu, Le Zhuo, Qin Qi, Hongsheng Li, Zhanyu Ma, and Peng Gao. I-max: Maximize the resolu- tion potential of pre-trained rectified flow transformers with projected flow.arXiv preprint arXiv:2410.07536, 2024. 3

work page arXiv 2024
[16]

An image is worth one word: Personalizing text-to-image gen- eration using textual inversion

Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image gen- eration using textual inversion. InInternational Conference on Learning Representations (ICLR), 2023. 2

work page 2023
[17]

Seedream 3.0 Technical Report

Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, et al. Seedream 3.0 technical report.arXiv preprint arXiv:2504.11346, 2025. 2, 4

work page internal anchor Pith review Pith/arXiv arXiv 2025
[18]

Make a cheap scaling: A self- cascade diffusion model for higher-resolution adaptation

Lanqing Guo, Yingqing He, Haoxin Chen, Menghan Xia, Xi- aodong Cun, Yufei Wang, Siyu Huang, Yong Zhang, Xintao Wang, Qifeng Chen, et al. Make a cheap scaling: A self- cascade diffusion model for higher-resolution adaptation. In European Conference on Computer Vision (ECCV), pages 39–55. Springer, 2024. 3

work page 2024
[19]

Bridging the divide: Reconsidering softmax and linear atten- tion

Dongchen Han, Yifan Pu, Zhuofan Xia, Yizeng Han, Xu- ran Pan, Xiu Li, Jiwen Lu, Shiji Song, and Gao Huang. Bridging the divide: Reconsidering softmax and linear atten- tion. InAdvances in Neural Information Processing Systems (NeurIPS), pages 79221–79245, 2024. 3

work page 2024
[20]

sim- ple diffusion: End-to-end diffusion for high resolution im- ages

Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. sim- ple diffusion: End-to-end diffusion for high resolution im- ages. InInternational Conference on Machine Learning (ICML), pages 13213–13232. PMLR, 2023. 3

work page 2023
[21]

Fouriscale: A frequency perspective on training-free high-resolution im- age synthesis

Linjiang Huang, Rongyao Fang, Aiping Zhang, Guanglu Song, Si Liu, Yu Liu, and Hongsheng Li. Fouriscale: A frequency perspective on training-free high-resolution im- age synthesis. InEuropean Conference on Computer Vision (ECCV), pages 196–212. Springer, 2024. 3

work page 2024
[22]

Compbench: Benchmarking complex instruction-guided im- age editing.arXiv preprint arXiv:2505.12200, 2025

Bohan Jia, Wenxuan Huang, Yuntian Tang, Junbo Qiao, Jincheng Liao, Shaosheng Cao, Fei Zhao, Zhaopeng Feng, Zhouhong Gu, Zhenfei Yin, Lei Bai, Wanli Ouyang, Lin Chen, Fei Zhao, Zihan Wang, Yuan Xie, and Shaohui Lin. Compbench: Benchmarking complex instruction-guided im- age editing.arXiv preprint arXiv:2505.12200, 2025. 1

work page arXiv 2025
[23]

Diffusehigh: Training-free progressive high- resolution image synthesis through structure guidance

Younghyun Kim, Geunmin Hwang, Junyu Zhang, and Eun- byung Park. Diffusehigh: Training-free progressive high- resolution image synthesis through structure guidance. In Proceedings of the AAAI Conference on Artificial Intelli- gence, pages 4338–4346, 2025. 3

work page 2025
[24]

Flexprefill: A context-aware sparse attention mecha- nism for efficient long-sequence inference

Xunhao Lai, Jianqiao Lu, Yao Luo, Yiyuan Ma, and Xun Zhou. Flexprefill: A context-aware sparse attention mecha- nism for efficient long-sequence inference. InInternational Conference on Learning Representations (ICLR), 2025. 3

work page 2025
[25]

Playground v2.5: Three Insights towards Enhancing Aesthetic Quality in Text-to-Image Generation

Daiqing Li, Aleks Kamko, Ehsan Akhgari, Ali Sabet, Lin- miao Xu, and Suhail Doshi. Playground v2. 5: Three in- sights towards enhancing aesthetic quality in text-to-image generation.arXiv preprint arXiv:2402.17245, 2024. 2

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

Gligen: Open-set grounded text-to-image generation

Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jian- wei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22511–22521, 2023. 2

work page 2023
[27]

Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding

Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, et al. Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chi- nese understanding.arXiv preprint arXiv:2405.08748, 2024. 2

work page internal anchor Pith review Pith/arXiv arXiv 2024
[28]

Dynamic sparse attention for scalable transformer acceleration.IEEE Transactions on Computers (TC), 71(12):3165–3178, 2022

Liu Liu, Zheng Qu, Zhaodong Chen, Fengbin Tu, Yufei Ding, and Yuan Xie. Dynamic sparse attention for scalable transformer acceleration.IEEE Transactions on Computers (TC), 71(12):3165–3178, 2022. 3

work page 2022
[29]

Linfusion: 1 gpu, 1 minute, 16k image.arXiv preprint arXiv:2409.02097, 2024

Songhua Liu, Weihao Yu, Zhenxiong Tan, and Xinchao Wang. Linfusion: 1 gpu, 1 minute, 16k image.arXiv preprint arXiv:2409.02097, 2024. 3

work page arXiv 2024
[30]

Flow straight and fast: Learning to generate and transfer data with rectified flow

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. InInternational Conference on Learning Rep- resentations (ICLR), 2023. 4

work page 2023
[31]

Hiprompt: Tuning-free higher-resolution gen- eration with hierarchical mllm prompts

Xinyu Liu, Yingqing He, Lanqing Guo, Xiang Li, Bu Jin, Peng Li, Yan Li, Chi-Min Chan, Qifeng Chen, Wei Xue, et al. Hiprompt: Tuning-free higher-resolution gen- eration with hierarchical mllm prompts.arXiv preprint arXiv:2409.02919, 2024. 3

work page arXiv 2024
[32]

Swin transformer: Hierarchical vision transformer using shifted windows

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021. 3

work page 2021
[33]

I2ebench: A comprehensive benchmark for instruction- based image editing

Yiwei Ma, Jiayi Ji, Ke Ye, Weihuang Lin, Zhibin Wang, Yonghan Zheng, Qiang Zhou, Xiaoshuai Sun, and Rongrong Ji. I2ebench: A comprehensive benchmark for instruction- based image editing. InAdvances in Neural Information Pro- cessing Systems (NeurIPS), pages 41494–41516. Curran As- sociates, Inc., 2024. 1

work page 2024
[34]

ACE++: Instruction-Based Image Creation and Editing via Context- Aware Content Filling

Chaojie Mao, Jingfeng Zhang, Yulin Pan, Zeyinzi Jiang, Zhen Han, Yu Liu, and Jingren Zhou. Ace++: Instruction- based image creation and editing via context-aware content filling.arXiv preprint arXiv:2501.02487, 2025. 6

work page arXiv 2025
[35]

SDEdit: Guided image synthesis and editing with stochastic differential equa- tions

Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jia- jun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equa- tions. InInternational Conference on Learning Representa- tions (ICLR), 2022. 6

work page 2022
[36]

Polaformer: Polarity-aware linear attention for vision transformers

Weikang Meng, Yadan Luo, Xin Li, Dongmei Jiang, and Zheng Zhang. Polaformer: Polarity-aware linear attention for vision transformers. InInternational Conference on Learning Representations (ICLR), 2025. 3

work page 2025
[37]

Glide: Towards photorealis- tic image generation and editing with text-guided diffusion models

Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob Mcgrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealis- tic image generation and editing with text-guided diffusion models. InInternational Conference on Machine Learning (ICML), pages 16784–16804. PMLR, 2022. 2

work page 2022
[38]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF Inter- national Conference on Computer Vision, pages 4195–4205,

work page
[39]

Sdxl: Improving latent diffusion models for high-resolution image synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. InInternational Con- ference on Learning Representations (ICLR), pages 1862– 1874, 2024. 1, 2

work page 2024
[40]

Freescale: Unleashing the resolution of diffusion models via tuning-free scale fusion

Haonan Qiu, Shiwei Zhang, Yujie Wei, Ruihang Chu, Hangjie Yuan, Xiang Wang, Yingya Zhang, and Ziwei Liu. Freescale: Unleashing the resolution of diffusion models via tuning-free scale fusion. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 16893– 16903, 2025. 3

work page 2025
[41]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image gener- ation with clip latents.arXiv preprint arXiv:2204.06125, 1 (2):3, 2022. 2

work page internal anchor Pith review Pith/arXiv arXiv 2022
[42]

Ultrapixel: Advancing ultra high-resolution image synthesis to new peaks

Jingjing Ren, Wenbo Li, Haoyu Chen, Renjing Pei, Bin Shao, Yong Guo, Long Peng, Fenglong Song, and Lei Zhu. Ultrapixel: Advancing ultra high-resolution image synthesis to new peaks. InAdvances in Neural Information Process- ing Systems (NeurIPS), pages 111131–111171. Curran As- sociates, Inc., 2024. 3

work page 2024
[43]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022. 1, 2

work page 2022
[44]

Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500– 22510, 2023. 2

work page 2023
[45]

Photorealistic text-to-image diffusion models with deep language understanding.Advances in Neural Information Processing Systems (NeurIPS), 35:36479–36494, 2022

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding.Advances in Neural Information Processing Systems (NeurIPS), 35:36479–36494, 2022. 2

work page 2022
[46]

Flashattention-3: Fast and accurate attention with asynchrony and low-precision

Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. Flashattention-3: Fast and accurate attention with asynchrony and low-precision. InAd- vances in Neural Information Processing Systems (NeurIPS), pages 68658–68685, 2024. 3

work page 2024
[47]

Emu edit: Precise image editing via recognition and gen- eration tasks

Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu edit: Precise image editing via recognition and gen- eration tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8871– 8879, 2024. 1

work page 2024
[48]

Trainable dynamic mask sparse attention.arXiv preprint arXiv:2508.02124, 2025

Jingze Shi, Yifan Wu, Bingheng Wu, Yiran Peng, Liangdong Wang, Guang Liu, and Yuyu Luo. Trainable dynamic mask sparse attention.arXiv preprint arXiv:2508.02124, 2025. 1

work page arXiv 2025
[49]

Resmaster: Mastering high- resolution image generation via structural and fine-grained guidance

Shuwei Shi, Wenbo Li, Yuechen Zhang, Jingwen He, Biao Gong, and Yinqiang Zheng. Resmaster: Mastering high- resolution image generation via structural and fine-grained guidance. InProceedings of the AAAI Conference on Artifi- cial Intelligence, pages 6887–6895, 2025. 3

work page 2025
[50]

Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063,

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063,

work page
[51]

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation.arXiv preprint arXiv:2406.06525, 2024. 2

work page internal anchor Pith review Pith/arXiv arXiv 2024
[52]

Ominicontrol: Minimal and univer- sal control for diffusion transformer

Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue, and Xinchao Wang. Ominicontrol: Minimal and univer- sal control for diffusion transformer. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 14940–14950, 2025. 3

work page 2025
[53]

Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue, and Xinchao Wang

Zhenxiong Tan, Qiaochu Xue, Xingyi Yang, Songhua Liu, and Xinchao Wang. Ominicontrol2: Efficient conditioning for diffusion transformers.arXiv preprint arXiv:2503.08280,

work page arXiv
[54]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean- Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023. 6

work page internal anchor Pith review Pith/arXiv arXiv 2023
[55]

Relay diffusion: Unifying diffusion process across resolutions for image syn- thesis

Jiayan Teng, Wendi Zheng, Ming Ding, Wenyi Hong, Jian- qiao Wangni, Zhuoyi Yang, and Jie Tang. Relay diffusion: Unifying diffusion process across resolutions for image syn- thesis. InInternational Conference on Learning Representa- tions (ICLR), 2024. 3

work page 2024
[56]

Megafusion: Extend dif- fusion models towards higher-resolution image generation without further tuning

Haoning Wu, Shaocheng Shen, Qiang Hu, Xiaoyun Zhang, Ya Zhang, and Yanfeng Wang. Megafusion: Extend dif- fusion models towards higher-resolution image generation without further tuning. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3944–3953. IEEE, 2025. 3

work page 2025
[57]

Sparse videogen: Accelerating video diffusion transformers with spatial-temporal sparsity

Haocheng Xi, Shuo Yang, Yilong Zhao, Chenfeng Xu, Muyang Li, Xiuyu Li, Yujun Lin, Han Cai, Jintao Zhang, Dacheng Li, et al. Sparse videogen: Accelerating video diffusion transformers with spatial-temporal sparsity. InIn- ternational Conference on Machine Learning (ICML), pages 68208–68224, 2025. 3

work page 2025
[58]

Difffit: Un- locking transferability of large diffusion models via sim- ple parameter-efficient fine-tuning

Enze Xie, Lewei Yao, Han Shi, Zhili Liu, Daquan Zhou, Zhaoqiang Liu, Jiawei Li, and Zhenguo Li. Difffit: Un- locking transferability of large diffusion models via sim- ple parameter-efficient fine-tuning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4230–4239, 2023. 3

work page 2023
[59]

SANA: Efficient high-resolution text-to-image synthesis with linear diffusion transformers

Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, and Song Han. SANA: Efficient high-resolution text-to-image synthesis with linear diffusion transformers. InInternational Conference on Learning Representations (ICLR), 2025. 2, 3

work page 2025
[60]

SANA 1.5: Efficient scaling of training-time and inference- time compute in linear diffusion transformer

Enze Xie, Junsong Chen, Yuyang Zhao, Jincheng YU, Ligeng Zhu, Yujun Lin, Zhekai Zhang, Muyang Li, Junyu Chen, Han Cai, Bingchen Liu, Daquan Zhou, and Song Han. SANA 1.5: Efficient scaling of training-time and inference- time compute in linear diffusion transformer. InInterna- tional Conference on Machine Learning (ICML), 2025. 2, 3

work page 2025
[61]

Show-o: One single transformer to unify multimodal understanding and generation

Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. InInternational Conference on Learning Rep- resentations (ICLR), 2025. 2

work page 2025
[62]

Xattention: Block sparse attention with an- tidiagonal scoring

Ruyi Xu, Guangxuan Xiao, Haofeng Huang, Junxian Guo, and Song Han. Xattention: Block sparse attention with an- tidiagonal scoring. InInternational Conference on Machine Learning (ICML), pages 69819–69831, 2025. 3

work page 2025
[63]

Raphael: text-to-image generation via large mixture of diffusion paths

Zeyue Xue, Guanglu Song, Qiushan Guo, Boxiao Liu, Zhuo- fan Zong, Yu Liu, and Ping Luo. Raphael: text-to-image generation via large mixture of diffusion paths. InConfer- ence on Neural Information Processing Systems (NeurIPS), Red Hook, NY , USA, 2023. Curran Associates Inc. 2

work page 2023
[64]

Sparse videogen2: Accelerate video generation with sparse attention via semantic-aware permutation

Shuo Yang, Haocheng Xi, Yilong Zhao, Muyang Li, Jintao Zhang, Han Cai, Yujun Lin, Xiuyu Li, Chenfeng Xu, Kelly Peng, et al. Sparse videogen2: Accelerate video generation with sparse attention via semantic-aware permutation. InAd- vances in Neural Information Processing Systems (NeurIPS),

work page
[65]

Imgedit: A uni- fied image editing dataset and benchmark

Yang Ye, Xianyi He, Zongjian Li, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Bohan Hou, and Li Yuan. Imgedit: A uni- fied image editing dataset and benchmark. InNeural Infor- mation Processing Systems Datasets and Benchmarks Track (NeurIPS D&B), 2025. 1

work page 2025
[66]

Ultra-resolution adaptation with ease

Ruonan Yu, Songhua Liu, Zhenxiong Tan, and Xinchao Wang. Ultra-resolution adaptation with ease. InInterna- tional Conference on Machine Learning (ICML), 2025. 3

work page 2025
[67]

Ditfastattn: Attention compression for diffusion transformer models

Zhihang Yuan, Hanling Zhang, Lu Pu, Xuefei Ning, Lin- feng Zhang, Tianchen Zhao, Shengen Yan, Guohao Dai, and Yu Wang. Ditfastattn: Attention compression for diffusion transformer models. InAdvances in Neural Information Pro- cessing Systems (NeurIPS), pages 1196–1219, 2024. 3

work page 2024
[68]

Ditfastattnv2: Head-wise attention compression for multi-modality diffusion transformers

Hanling Zhang, Rundong Su, Zhihang Yuan, Pengtao Chen, Mingzhu Shen, Yibo Fan, Shengen Yan, Guohao Dai, and Yu Wang. Ditfastattnv2: Head-wise attention compression for multi-modality diffusion transformers. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 16399–16409, 2025. 3

work page 2025
[69]

Sageattention2: Efficient attention with thorough outlier smoothing and per-thread int4 quan- tization

Jintao Zhang, Haofeng Huang, Pengle Zhang, Jia Wei, Jun Zhu, and Jianfei Chen. Sageattention2: Efficient attention with thorough outlier smoothing and per-thread int4 quan- tization. InInternational Conference on Machine Learning (ICML), 2025. 3

work page 2025
[70]

Diffusion-4k: Ultra-high-resolution image syn- thesis with latent diffusion models

Jinjin Zhang, Qiuyu Huang, Junjie Liu, Xiefan Guo, and Di Huang. Diffusion-4k: Ultra-high-resolution image syn- thesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23464–23473, 2025. 3

work page 2025
[71]

Sageattention: Accurate 8-bit attention for plug-and- play inference acceleration

Jintao Zhang, Jia Wei, Pengle Zhang, Jun Zhu, and Jianfei Chen. Sageattention: Accurate 8-bit attention for plug-and- play inference acceleration. InInternational Conference on Learning Representations (ICLR), 2025. 3

work page 2025
[72]

Spargeattn: Accu- rate sparse attention accelerating any model inference

Jintao Zhang, Chendong Xiang, Haofeng Huang, Jia Wei, Haocheng Xi, Jun Zhu, and Jianfei Chen. Spargeattn: Accu- rate sparse attention accelerating any model inference. InIn- ternational Conference on Machine Learning (ICML), pages 76397–76413, 2025. 3

work page 2025
[73]

Designing a practical degradation model for deep blind image super-resolution

Kai Zhang, Jingyun Liang, Luc Van Gool, and Radu Timo- fte. Designing a practical degradation model for deep blind image super-resolution. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4791– 4800, 2021. 1

work page 2021
[74]

Adding conditional control to text-to-image diffusion models

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023. 2

work page 2023
[75]

The hedgehog & the porcupine: Expres- sive linear attentions with softmax mimicry

Michael Zhang, Kush Bhatia, Hermann Kumbong, and Christopher R ´e. The hedgehog & the porcupine: Expres- sive linear attentions with softmax mimicry. InInternational Conference on Learning Representations (ICLR), 2024. 3

work page 2024
[76]

Scale-dit: Ultra-high- resolution image generation with hierarchical local attention

Yuyao Zhang and Yu-Wing Tai. Scale-dit: Ultra-high- resolution image generation with hierarchical local attention. arXiv preprint arXiv:2510.16325, 2025. 3

work page arXiv 2025
[77]

Layercraft: En- hancing text-to-image generation with cot reasoning and lay- ered object integration

Yuyao Zhang, Jinghao Li, and Yu-Wing Tai. Layercraft: En- hancing text-to-image generation with cot reasoning and lay- ered object integration. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. 1

work page 2025
[78]

Easycontrol: Adding efficient and flexible control for diffusion transformer

Yuxuan Zhang, Yirui Yuan, Yiren Song, Haofan Wang, and Jiaming Liu. Easycontrol: Adding efficient and flexible control for diffusion transformer. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 19513–19524, 2025. 6

work page 2025
[79]

Any-size- diffusion: Toward efficient text-driven synthesis for any-size hd images

Qingping Zheng, Yuanfan Guo, Jiankang Deng, Jianhua Han, Ying Li, Songcen Xu, and Hang Xu. Any-size- diffusion: Toward efficient text-driven synthesis for any-size hd images. InProceedings of the AAAI Conference on Arti- ficial Intelligence, pages 7571–7578, 2024. 3

work page 2024
[80]

Dig: Scal- able and efficient diffusion models with gated linear atten- tion

Lianghui Zhu, Zilong Huang, Bencheng Liao, Jun Hao Liew, Hanshu Yan, Jiashi Feng, and Xinggang Wang. Dig: Scal- able and efficient diffusion models with gated linear atten- tion. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 7664–7674,

work page

Showing first 80 references.

[1] [1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ah- mad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Longformer: The Long-Document Transformer

Iz Beltagy, Matthew E Peters, and Arman Cohan. Long- former: The long-document transformer.arXiv preprint arXiv:2004.05150, 2020. 3

work page internal anchor Pith review Pith/arXiv arXiv 2004

[3] [3]

Black Forest Labs. Flux. Black Forest Labs GitHub reposi- tory, 2024. Model release. 2, 4, 6, 1

work page 2024

[4] [4]

FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space

Black Forest Labs. Flux.1 kontext: Flow matching for in- context image generation and editing in latent space.arXiv preprint arXiv:2506.15742, 2025. 1, 2, 6

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

Hiflow: Training-free high-resolution image gen- eration with flow-aligned guidance

Jiazi Bu, Pengyang Ling, Yujie Zhou, Pan Zhang, Tong Wu, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Dahua Lin, and Ji- aqi Wang. Hiflow: Training-free high-resolution image gen- eration with flow-aligned guidance. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. 3

work page 2025

[6] [6]

Hidream-i1: An open-source high-efficient image genera- tive foundation model

Qi Cai, Yehao Li, Yingwei Pan, Ting Yao, and Tao Mei. Hidream-i1: An open-source high-efficient image genera- tive foundation model. InACM International Conference on Multimedia (ACM MM), page 13636–13639, New York, NY , USA, 2025. Association for Computing Machinery. 2, 4

work page 2025

[7] [7]

Attend-and-excite: Attention-based se- mantic guidance for text-to-image diffusion models.ACM Transactions on Graphics (TOG), 42(4):1–10, 2023

Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based se- mantic guidance for text-to-image diffusion models.ACM Transactions on Graphics (TOG), 42(4):1–10, 2023. 2

work page 2023

[8] [8]

Pixart-{\delta}: Fast and controllable image generation with latent consistency models.arXiv preprint arXiv:2401.05252, 2024

Junsong Chen, Yue Wu, Simian Luo, Enze Xie, Sayak Paul, Ping Luo, Hang Zhao, and Zhenguo Li. Pixart-δ: Fast and controllable image generation with latent consistency mod- els.arXiv preprint arXiv:2401.05252, 2024. 2

work page arXiv 2024

[9] [9]

Pixart-$\alpha$: Fast train- ing of diffusion transformer for photorealistic text-to-image synthesis

Junsong Chen, Jincheng YU, Chongjian GE, Lewei Yao, Enze Xie, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-$\alpha$: Fast train- ing of diffusion transformer for photorealistic text-to-image synthesis. InInternational Conference on Learning Repre- sentations (ICLR), 2024. 2

work page 2024

[10] [10]

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus- pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811,

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

Flashattention-2: Faster attention with better par- allelism and work partitioning

Tri Dao. Flashattention-2: Faster attention with better par- allelism and work partitioning. InInternational Conference on Learning Representations (ICLR), 2024. 2, 3

work page 2024

[12] [12]

Fu, Stefano Ermon, Atri Rudra, and Christopher R ´e

Tri Dao, Daniel Y . Fu, Stefano Ermon, Atri Rudra, and Christopher R ´e. Flashattention: Fast and memory-efficient exact attention with io-awareness. InAdvances in Neural Information Processing Systems (NeurIPS), pages 16344– 16359, 2022. 2, 3

work page 2022

[13] [13]

How sparse attention approximates exact atten- tion? your attention is naturallyn c-sparse.arXiv preprint arXiv:2404.02690, 2025

Yichuan Deng, Zhao Song, Jing Xiong, and Chiwun Yang. How sparse attention approximates exact atten- tion? your attention is naturallyn c-sparse.arXiv preprint arXiv:2404.02690, 2025. 3

work page arXiv 2025

[14] [14]

Cogview: mastering text-to- image generation via transformers

Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, and Jie Tang. Cogview: mastering text-to- image generation via transformers. InConference on Neural Information Processing Systems (NeurIPS), Red Hook, NY , USA, 2021. Curran Associates Inc. 2

work page 2021

[15] [15]

I-max: Maximize the resolu- tion potential of pre-trained rectified flow transformers with projected flow.arXiv preprint arXiv:2410.07536, 2024

Ruoyi Du, Dongyang Liu, Le Zhuo, Qin Qi, Hongsheng Li, Zhanyu Ma, and Peng Gao. I-max: Maximize the resolu- tion potential of pre-trained rectified flow transformers with projected flow.arXiv preprint arXiv:2410.07536, 2024. 3

work page arXiv 2024

[16] [16]

An image is worth one word: Personalizing text-to-image gen- eration using textual inversion

Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image gen- eration using textual inversion. InInternational Conference on Learning Representations (ICLR), 2023. 2

work page 2023

[17] [17]

Seedream 3.0 Technical Report

Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, et al. Seedream 3.0 technical report.arXiv preprint arXiv:2504.11346, 2025. 2, 4

work page internal anchor Pith review Pith/arXiv arXiv 2025

[18] [18]

Make a cheap scaling: A self- cascade diffusion model for higher-resolution adaptation

Lanqing Guo, Yingqing He, Haoxin Chen, Menghan Xia, Xi- aodong Cun, Yufei Wang, Siyu Huang, Yong Zhang, Xintao Wang, Qifeng Chen, et al. Make a cheap scaling: A self- cascade diffusion model for higher-resolution adaptation. In European Conference on Computer Vision (ECCV), pages 39–55. Springer, 2024. 3

work page 2024

[19] [19]

Bridging the divide: Reconsidering softmax and linear atten- tion

Dongchen Han, Yifan Pu, Zhuofan Xia, Yizeng Han, Xu- ran Pan, Xiu Li, Jiwen Lu, Shiji Song, and Gao Huang. Bridging the divide: Reconsidering softmax and linear atten- tion. InAdvances in Neural Information Processing Systems (NeurIPS), pages 79221–79245, 2024. 3

work page 2024

[20] [20]

sim- ple diffusion: End-to-end diffusion for high resolution im- ages

Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. sim- ple diffusion: End-to-end diffusion for high resolution im- ages. InInternational Conference on Machine Learning (ICML), pages 13213–13232. PMLR, 2023. 3

work page 2023

[21] [21]

Fouriscale: A frequency perspective on training-free high-resolution im- age synthesis

Linjiang Huang, Rongyao Fang, Aiping Zhang, Guanglu Song, Si Liu, Yu Liu, and Hongsheng Li. Fouriscale: A frequency perspective on training-free high-resolution im- age synthesis. InEuropean Conference on Computer Vision (ECCV), pages 196–212. Springer, 2024. 3

work page 2024

[22] [22]

Compbench: Benchmarking complex instruction-guided im- age editing.arXiv preprint arXiv:2505.12200, 2025

Bohan Jia, Wenxuan Huang, Yuntian Tang, Junbo Qiao, Jincheng Liao, Shaosheng Cao, Fei Zhao, Zhaopeng Feng, Zhouhong Gu, Zhenfei Yin, Lei Bai, Wanli Ouyang, Lin Chen, Fei Zhao, Zihan Wang, Yuan Xie, and Shaohui Lin. Compbench: Benchmarking complex instruction-guided im- age editing.arXiv preprint arXiv:2505.12200, 2025. 1

work page arXiv 2025

[23] [23]

Diffusehigh: Training-free progressive high- resolution image synthesis through structure guidance

Younghyun Kim, Geunmin Hwang, Junyu Zhang, and Eun- byung Park. Diffusehigh: Training-free progressive high- resolution image synthesis through structure guidance. In Proceedings of the AAAI Conference on Artificial Intelli- gence, pages 4338–4346, 2025. 3

work page 2025

[24] [24]

Flexprefill: A context-aware sparse attention mecha- nism for efficient long-sequence inference

Xunhao Lai, Jianqiao Lu, Yao Luo, Yiyuan Ma, and Xun Zhou. Flexprefill: A context-aware sparse attention mecha- nism for efficient long-sequence inference. InInternational Conference on Learning Representations (ICLR), 2025. 3

work page 2025

[25] [25]

Playground v2.5: Three Insights towards Enhancing Aesthetic Quality in Text-to-Image Generation

Daiqing Li, Aleks Kamko, Ehsan Akhgari, Ali Sabet, Lin- miao Xu, and Suhail Doshi. Playground v2. 5: Three in- sights towards enhancing aesthetic quality in text-to-image generation.arXiv preprint arXiv:2402.17245, 2024. 2

work page internal anchor Pith review Pith/arXiv arXiv 2024

[26] [26]

Gligen: Open-set grounded text-to-image generation

Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jian- wei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22511–22521, 2023. 2

work page 2023

[27] [27]

Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding

Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, et al. Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chi- nese understanding.arXiv preprint arXiv:2405.08748, 2024. 2

work page internal anchor Pith review Pith/arXiv arXiv 2024

[28] [28]

Dynamic sparse attention for scalable transformer acceleration.IEEE Transactions on Computers (TC), 71(12):3165–3178, 2022

Liu Liu, Zheng Qu, Zhaodong Chen, Fengbin Tu, Yufei Ding, and Yuan Xie. Dynamic sparse attention for scalable transformer acceleration.IEEE Transactions on Computers (TC), 71(12):3165–3178, 2022. 3

work page 2022

[29] [29]

Linfusion: 1 gpu, 1 minute, 16k image.arXiv preprint arXiv:2409.02097, 2024

Songhua Liu, Weihao Yu, Zhenxiong Tan, and Xinchao Wang. Linfusion: 1 gpu, 1 minute, 16k image.arXiv preprint arXiv:2409.02097, 2024. 3

work page arXiv 2024

[30] [30]

Flow straight and fast: Learning to generate and transfer data with rectified flow

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. InInternational Conference on Learning Rep- resentations (ICLR), 2023. 4

work page 2023

[31] [31]

Hiprompt: Tuning-free higher-resolution gen- eration with hierarchical mllm prompts

Xinyu Liu, Yingqing He, Lanqing Guo, Xiang Li, Bu Jin, Peng Li, Yan Li, Chi-Min Chan, Qifeng Chen, Wei Xue, et al. Hiprompt: Tuning-free higher-resolution gen- eration with hierarchical mllm prompts.arXiv preprint arXiv:2409.02919, 2024. 3

work page arXiv 2024

[32] [32]

Swin transformer: Hierarchical vision transformer using shifted windows

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021. 3

work page 2021

[33] [33]

I2ebench: A comprehensive benchmark for instruction- based image editing

Yiwei Ma, Jiayi Ji, Ke Ye, Weihuang Lin, Zhibin Wang, Yonghan Zheng, Qiang Zhou, Xiaoshuai Sun, and Rongrong Ji. I2ebench: A comprehensive benchmark for instruction- based image editing. InAdvances in Neural Information Pro- cessing Systems (NeurIPS), pages 41494–41516. Curran As- sociates, Inc., 2024. 1

work page 2024

[34] [34]

ACE++: Instruction-Based Image Creation and Editing via Context- Aware Content Filling

Chaojie Mao, Jingfeng Zhang, Yulin Pan, Zeyinzi Jiang, Zhen Han, Yu Liu, and Jingren Zhou. Ace++: Instruction- based image creation and editing via context-aware content filling.arXiv preprint arXiv:2501.02487, 2025. 6

work page arXiv 2025

[35] [35]

SDEdit: Guided image synthesis and editing with stochastic differential equa- tions

Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jia- jun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equa- tions. InInternational Conference on Learning Representa- tions (ICLR), 2022. 6

work page 2022

[36] [36]

Polaformer: Polarity-aware linear attention for vision transformers

Weikang Meng, Yadan Luo, Xin Li, Dongmei Jiang, and Zheng Zhang. Polaformer: Polarity-aware linear attention for vision transformers. InInternational Conference on Learning Representations (ICLR), 2025. 3

work page 2025

[37] [37]

Glide: Towards photorealis- tic image generation and editing with text-guided diffusion models

Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob Mcgrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealis- tic image generation and editing with text-guided diffusion models. InInternational Conference on Machine Learning (ICML), pages 16784–16804. PMLR, 2022. 2

work page 2022

[38] [38]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF Inter- national Conference on Computer Vision, pages 4195–4205,

work page

[39] [39]

Sdxl: Improving latent diffusion models for high-resolution image synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. InInternational Con- ference on Learning Representations (ICLR), pages 1862– 1874, 2024. 1, 2

work page 2024

[40] [40]

Freescale: Unleashing the resolution of diffusion models via tuning-free scale fusion

Haonan Qiu, Shiwei Zhang, Yujie Wei, Ruihang Chu, Hangjie Yuan, Xiang Wang, Yingya Zhang, and Ziwei Liu. Freescale: Unleashing the resolution of diffusion models via tuning-free scale fusion. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 16893– 16903, 2025. 3

work page 2025

[41] [41]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image gener- ation with clip latents.arXiv preprint arXiv:2204.06125, 1 (2):3, 2022. 2

work page internal anchor Pith review Pith/arXiv arXiv 2022

[42] [42]

Ultrapixel: Advancing ultra high-resolution image synthesis to new peaks

Jingjing Ren, Wenbo Li, Haoyu Chen, Renjing Pei, Bin Shao, Yong Guo, Long Peng, Fenglong Song, and Lei Zhu. Ultrapixel: Advancing ultra high-resolution image synthesis to new peaks. InAdvances in Neural Information Process- ing Systems (NeurIPS), pages 111131–111171. Curran As- sociates, Inc., 2024. 3

work page 2024

[43] [43]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022. 1, 2

work page 2022

[44] [44]

Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500– 22510, 2023. 2

work page 2023

[45] [45]

Photorealistic text-to-image diffusion models with deep language understanding.Advances in Neural Information Processing Systems (NeurIPS), 35:36479–36494, 2022

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding.Advances in Neural Information Processing Systems (NeurIPS), 35:36479–36494, 2022. 2

work page 2022

[46] [46]

Flashattention-3: Fast and accurate attention with asynchrony and low-precision

Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. Flashattention-3: Fast and accurate attention with asynchrony and low-precision. InAd- vances in Neural Information Processing Systems (NeurIPS), pages 68658–68685, 2024. 3

work page 2024

[47] [47]

Emu edit: Precise image editing via recognition and gen- eration tasks

Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu edit: Precise image editing via recognition and gen- eration tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8871– 8879, 2024. 1

work page 2024

[48] [48]

Trainable dynamic mask sparse attention.arXiv preprint arXiv:2508.02124, 2025

Jingze Shi, Yifan Wu, Bingheng Wu, Yiran Peng, Liangdong Wang, Guang Liu, and Yuyu Luo. Trainable dynamic mask sparse attention.arXiv preprint arXiv:2508.02124, 2025. 1

work page arXiv 2025

[49] [49]

Resmaster: Mastering high- resolution image generation via structural and fine-grained guidance

Shuwei Shi, Wenbo Li, Yuechen Zhang, Jingwen He, Biao Gong, and Yinqiang Zheng. Resmaster: Mastering high- resolution image generation via structural and fine-grained guidance. InProceedings of the AAAI Conference on Artifi- cial Intelligence, pages 6887–6895, 2025. 3

work page 2025

[50] [50]

Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063,

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063,

work page

[51] [51]

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation.arXiv preprint arXiv:2406.06525, 2024. 2

work page internal anchor Pith review Pith/arXiv arXiv 2024

[52] [52]

Ominicontrol: Minimal and univer- sal control for diffusion transformer

Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue, and Xinchao Wang. Ominicontrol: Minimal and univer- sal control for diffusion transformer. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 14940–14950, 2025. 3

work page 2025

[53] [53]

Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue, and Xinchao Wang

Zhenxiong Tan, Qiaochu Xue, Xingyi Yang, Songhua Liu, and Xinchao Wang. Ominicontrol2: Efficient conditioning for diffusion transformers.arXiv preprint arXiv:2503.08280,

work page arXiv

[54] [54]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean- Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023. 6

work page internal anchor Pith review Pith/arXiv arXiv 2023

[55] [55]

Relay diffusion: Unifying diffusion process across resolutions for image syn- thesis

Jiayan Teng, Wendi Zheng, Ming Ding, Wenyi Hong, Jian- qiao Wangni, Zhuoyi Yang, and Jie Tang. Relay diffusion: Unifying diffusion process across resolutions for image syn- thesis. InInternational Conference on Learning Representa- tions (ICLR), 2024. 3

work page 2024

[56] [56]

Megafusion: Extend dif- fusion models towards higher-resolution image generation without further tuning

Haoning Wu, Shaocheng Shen, Qiang Hu, Xiaoyun Zhang, Ya Zhang, and Yanfeng Wang. Megafusion: Extend dif- fusion models towards higher-resolution image generation without further tuning. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3944–3953. IEEE, 2025. 3

work page 2025

[57] [57]

Sparse videogen: Accelerating video diffusion transformers with spatial-temporal sparsity

Haocheng Xi, Shuo Yang, Yilong Zhao, Chenfeng Xu, Muyang Li, Xiuyu Li, Yujun Lin, Han Cai, Jintao Zhang, Dacheng Li, et al. Sparse videogen: Accelerating video diffusion transformers with spatial-temporal sparsity. InIn- ternational Conference on Machine Learning (ICML), pages 68208–68224, 2025. 3

work page 2025

[58] [58]

Difffit: Un- locking transferability of large diffusion models via sim- ple parameter-efficient fine-tuning

Enze Xie, Lewei Yao, Han Shi, Zhili Liu, Daquan Zhou, Zhaoqiang Liu, Jiawei Li, and Zhenguo Li. Difffit: Un- locking transferability of large diffusion models via sim- ple parameter-efficient fine-tuning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4230–4239, 2023. 3

work page 2023

[59] [59]

SANA: Efficient high-resolution text-to-image synthesis with linear diffusion transformers

Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, and Song Han. SANA: Efficient high-resolution text-to-image synthesis with linear diffusion transformers. InInternational Conference on Learning Representations (ICLR), 2025. 2, 3

work page 2025

[60] [60]

SANA 1.5: Efficient scaling of training-time and inference- time compute in linear diffusion transformer

Enze Xie, Junsong Chen, Yuyang Zhao, Jincheng YU, Ligeng Zhu, Yujun Lin, Zhekai Zhang, Muyang Li, Junyu Chen, Han Cai, Bingchen Liu, Daquan Zhou, and Song Han. SANA 1.5: Efficient scaling of training-time and inference- time compute in linear diffusion transformer. InInterna- tional Conference on Machine Learning (ICML), 2025. 2, 3

work page 2025

[61] [61]

Show-o: One single transformer to unify multimodal understanding and generation

Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. InInternational Conference on Learning Rep- resentations (ICLR), 2025. 2

work page 2025

[62] [62]

Xattention: Block sparse attention with an- tidiagonal scoring

Ruyi Xu, Guangxuan Xiao, Haofeng Huang, Junxian Guo, and Song Han. Xattention: Block sparse attention with an- tidiagonal scoring. InInternational Conference on Machine Learning (ICML), pages 69819–69831, 2025. 3

work page 2025

[63] [63]

Raphael: text-to-image generation via large mixture of diffusion paths

Zeyue Xue, Guanglu Song, Qiushan Guo, Boxiao Liu, Zhuo- fan Zong, Yu Liu, and Ping Luo. Raphael: text-to-image generation via large mixture of diffusion paths. InConfer- ence on Neural Information Processing Systems (NeurIPS), Red Hook, NY , USA, 2023. Curran Associates Inc. 2

work page 2023

[64] [64]

Sparse videogen2: Accelerate video generation with sparse attention via semantic-aware permutation

Shuo Yang, Haocheng Xi, Yilong Zhao, Muyang Li, Jintao Zhang, Han Cai, Yujun Lin, Xiuyu Li, Chenfeng Xu, Kelly Peng, et al. Sparse videogen2: Accelerate video generation with sparse attention via semantic-aware permutation. InAd- vances in Neural Information Processing Systems (NeurIPS),

work page

[65] [65]

Imgedit: A uni- fied image editing dataset and benchmark

Yang Ye, Xianyi He, Zongjian Li, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Bohan Hou, and Li Yuan. Imgedit: A uni- fied image editing dataset and benchmark. InNeural Infor- mation Processing Systems Datasets and Benchmarks Track (NeurIPS D&B), 2025. 1

work page 2025

[66] [66]

Ultra-resolution adaptation with ease

Ruonan Yu, Songhua Liu, Zhenxiong Tan, and Xinchao Wang. Ultra-resolution adaptation with ease. InInterna- tional Conference on Machine Learning (ICML), 2025. 3

work page 2025

[67] [67]

Ditfastattn: Attention compression for diffusion transformer models

Zhihang Yuan, Hanling Zhang, Lu Pu, Xuefei Ning, Lin- feng Zhang, Tianchen Zhao, Shengen Yan, Guohao Dai, and Yu Wang. Ditfastattn: Attention compression for diffusion transformer models. InAdvances in Neural Information Pro- cessing Systems (NeurIPS), pages 1196–1219, 2024. 3

work page 2024

[68] [68]

Ditfastattnv2: Head-wise attention compression for multi-modality diffusion transformers

Hanling Zhang, Rundong Su, Zhihang Yuan, Pengtao Chen, Mingzhu Shen, Yibo Fan, Shengen Yan, Guohao Dai, and Yu Wang. Ditfastattnv2: Head-wise attention compression for multi-modality diffusion transformers. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 16399–16409, 2025. 3

work page 2025

[69] [69]

Sageattention2: Efficient attention with thorough outlier smoothing and per-thread int4 quan- tization

Jintao Zhang, Haofeng Huang, Pengle Zhang, Jia Wei, Jun Zhu, and Jianfei Chen. Sageattention2: Efficient attention with thorough outlier smoothing and per-thread int4 quan- tization. InInternational Conference on Machine Learning (ICML), 2025. 3

work page 2025

[70] [70]

Diffusion-4k: Ultra-high-resolution image syn- thesis with latent diffusion models

Jinjin Zhang, Qiuyu Huang, Junjie Liu, Xiefan Guo, and Di Huang. Diffusion-4k: Ultra-high-resolution image syn- thesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23464–23473, 2025. 3

work page 2025

[71] [71]

Sageattention: Accurate 8-bit attention for plug-and- play inference acceleration

Jintao Zhang, Jia Wei, Pengle Zhang, Jun Zhu, and Jianfei Chen. Sageattention: Accurate 8-bit attention for plug-and- play inference acceleration. InInternational Conference on Learning Representations (ICLR), 2025. 3

work page 2025

[72] [72]

Spargeattn: Accu- rate sparse attention accelerating any model inference

Jintao Zhang, Chendong Xiang, Haofeng Huang, Jia Wei, Haocheng Xi, Jun Zhu, and Jianfei Chen. Spargeattn: Accu- rate sparse attention accelerating any model inference. InIn- ternational Conference on Machine Learning (ICML), pages 76397–76413, 2025. 3

work page 2025

[73] [73]

Designing a practical degradation model for deep blind image super-resolution

Kai Zhang, Jingyun Liang, Luc Van Gool, and Radu Timo- fte. Designing a practical degradation model for deep blind image super-resolution. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4791– 4800, 2021. 1

work page 2021

[74] [74]

Adding conditional control to text-to-image diffusion models

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023. 2

work page 2023

[75] [75]

The hedgehog & the porcupine: Expres- sive linear attentions with softmax mimicry

Michael Zhang, Kush Bhatia, Hermann Kumbong, and Christopher R ´e. The hedgehog & the porcupine: Expres- sive linear attentions with softmax mimicry. InInternational Conference on Learning Representations (ICLR), 2024. 3

work page 2024

[76] [76]

Scale-dit: Ultra-high- resolution image generation with hierarchical local attention

Yuyao Zhang and Yu-Wing Tai. Scale-dit: Ultra-high- resolution image generation with hierarchical local attention. arXiv preprint arXiv:2510.16325, 2025. 3

work page arXiv 2025

[77] [77]

Layercraft: En- hancing text-to-image generation with cot reasoning and lay- ered object integration

Yuyao Zhang, Jinghao Li, and Yu-Wing Tai. Layercraft: En- hancing text-to-image generation with cot reasoning and lay- ered object integration. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. 1

work page 2025

[78] [78]

Easycontrol: Adding efficient and flexible control for diffusion transformer

Yuxuan Zhang, Yirui Yuan, Yiren Song, Haofan Wang, and Jiaming Liu. Easycontrol: Adding efficient and flexible control for diffusion transformer. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 19513–19524, 2025. 6

work page 2025

[79] [79]

Any-size- diffusion: Toward efficient text-driven synthesis for any-size hd images

Qingping Zheng, Yuanfan Guo, Jiankang Deng, Jianhua Han, Ying Li, Songcen Xu, and Hang Xu. Any-size- diffusion: Toward efficient text-driven synthesis for any-size hd images. InProceedings of the AAAI Conference on Arti- ficial Intelligence, pages 7571–7578, 2024. 3

work page 2024

[80] [80]

Dig: Scal- able and efficient diffusion models with gated linear atten- tion

Lianghui Zhu, Zilong Huang, Bencheng Liao, Jun Hao Liew, Hanshu Yan, Jiashi Feng, and Xinggang Wang. Dig: Scal- able and efficient diffusion models with gated linear atten- tion. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 7664–7674,

work page