PipeFusion: Patch-level Pipeline Parallelism for Diffusion Transformers Inference

Aoyu Li; Jiannan Wang; Jiarui Fang; Jinzhe Pan; Xibo Sun

arxiv: 2405.14430 · v4 · submitted 2024-05-23 · 💻 cs.CV · cs.AI· cs.PF

PipeFusion: Patch-level Pipeline Parallelism for Diffusion Transformers Inference

Jiarui Fang , Jinzhe Pan , Aoyu Li , Xibo Sun , Jiannan Wang This is my paper

Pith reviewed 2026-05-24 01:15 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.PF

keywords diffusion transformerspipeline parallelismpatch partitioninginference optimizationmulti-GPUimage generationcommunication reduction

0 comments

The pith

PipeFusion partitions diffusion transformer images into patches and reuses one-step-old feature maps to cut communication costs during multi-GPU inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PipeFusion to lower the latency of high-resolution image generation with diffusion transformers on multiple GPUs. It splits both the image into patches and the model layers across devices, then runs a patch-level pipeline that overlaps computation with reduced data movement. Because inputs change only gradually across diffusion steps, the method reuses feature maps computed one step earlier to supply context for the current step. This reuse lowers the volume of inter-GPU transfers relative to tensor parallelism, sequence parallelism, and DistriFusion while also spreading parameters to improve memory efficiency on large models. Experiments on eight L40 GPUs report faster runtimes for PixArt, Stable Diffusion 3, and Flux.1.

Core claim

PipeFusion achieves state-of-the-art inference speed for diffusion transformers by applying patch-level pipeline parallelism that reuses one-step stale feature maps, thereby reducing communication volume compared with prior parallelism schemes while distributing parameters for better memory use on large models.

What carries the argument

Patch-level pipeline parallelism that reuses one-step stale feature maps to supply context for the current pipeline step.

If this is right

Larger DiT models such as Flux.1 become practical on PCIe GPU clusters because parameters are distributed rather than replicated.
Communication volume drops below that of tensor, sequence, and DistriFusion baselines, directly lowering end-to-end latency.
Memory footprint per device shrinks, allowing higher-resolution generation or batch sizes on the same hardware.
The pipeline schedule can be applied to any DiT variant that exhibits gradual input change across denoising steps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same reuse idea could be tested on video or 3D diffusion models where temporal or spatial coherence is even stronger.
If the similarity assumption weakens at very late denoising steps, a hybrid schedule that switches to fresh maps near the end might preserve quality while retaining most of the speedup.
Extending the patch pipeline to heterogeneous GPU clusters would require only adjustments to the communication pattern, not to the core reuse logic.

Load-bearing premise

High similarity between inputs from successive diffusion steps lets one-step-old feature maps supply adequate context without materially harming final image quality.

What would settle it

Run the same models with and without the stale-map reuse on the eight-GPU setup and compare both wall-clock time and standard image-quality metrics such as FID; a large quality drop would falsify the central claim.

Figures

Figures reproduced from arXiv: 2405.14430 by Aoyu Li, Jiannan Wang, Jiarui Fang, Jinzhe Pan, Xibo Sun.

**Figure 1.** Figure 1: Workflow of DiTs inference. Input Temporal Redundancy: Diffusion Model entails the iterative prediction of noise from the input image or video. Recent research has highlighted the concept of input temporal redundancy, 2 [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗

**Figure 2.** Figure 2: Comparison of DistriFusion with Sequence Parallelism methods (DeepSpeed-Ulysses and [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Above: partitioning strategy for Input and DiTs backbone network. Below: Workflow of [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: The fresh part of activations during diffusion timestep [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: Latency on Pixart of various parallel approaches on two image generation tasks with the [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Latency on SD3-medium of various parallel approaches on two image generation tasks [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: Latency on Flux.1-dev of various parallel approaches on three image generation tasks with [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗

**Figure 8.** Figure 8: Scalability Analysis of Three DiTs on 8×L40: Leftmost figures depict Pixart performance on 1024px, 2048px, and 4096px images. The second from the right shows SD3 performance on 1024px. The rightmost figure illustrates Flux.1-dev performance on 1024px. The speedup to 1 GPU baseline is labeled on the points of pipefusion curves. Stable Diffusion 3 [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗

**Figure 9.** Figure 9: Comparison of GPU memory usage across different models and resolutions. [PITH_FULL_IMAGE:figures/full_fig_p009_9.png] view at source ↗

**Figure 10.** Figure 10: Above: Showcases for 1024px Generation Images of PipeFusion and DistriFusion using [PITH_FULL_IMAGE:figures/full_fig_p021_10.png] view at source ↗

**Figure 11.** Figure 11: Latency of PipeFusion (without warmup) for various patch numbers [PITH_FULL_IMAGE:figures/full_fig_p022_11.png] view at source ↗

read the original abstract

This paper presents PipeFusion, an innovative parallel methodology to tackle the high latency issues associated with generating high-resolution images using diffusion transformers (DiTs) models. PipeFusion partitions images into patches and the model layers across multiple GPUs. It employs a patch-level pipeline parallel strategy to orchestrate communication and computation efficiently. By capitalizing on the high similarity between inputs from successive diffusion steps, PipeFusion reuses one-step stale feature maps to provide context for the current pipeline step. This approach notably reduces communication costs compared to existing DiTs inference parallelism, including tensor parallel, sequence parallel and DistriFusion. PipeFusion enhances memory efficiency through parameter distribution across devices, ideal for large DiTs like Flux.1. Experimental results demonstrate that PipeFusion achieves state-of-the-art performance on 8$\times$L40 PCIe GPUs for Pixart, Stable-Diffusion 3, and Flux.1 models. Our source code is available at https://github.com/xdit-project/xDiT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PipeFusion adapts pipeline parallelism to DiT inference by reusing one-step stale features to cut communication, but the quality impact of that reuse is not clearly quantified in the reported results.

read the letter

The core idea is to split an image into patches, distribute layers across GPUs, and run a pipeline where each stage gets feature maps computed on the previous diffusion timestep instead of fresh ones. This exploits the gradual change between denoising steps to avoid sending full activations every time. The paper positions this as better than tensor parallel, sequence parallel, or DistriFusion on 8 L40 PCIe GPUs for Pixart, SD3, and Flux.1, with code released on GitHub. That combination of patch-level pipelining plus explicit stale reuse for diffusion is the concrete addition beyond the cited prior methods. The engineering focus on memory distribution for large models like Flux.1 and the public implementation are the parts that actually help practitioners. The approach itself is standard parallelism adapted to the diffusion setting, with no hidden parameters or circular claims. The soft spot is the central assumption that one-step stale maps supply enough context without hurting output. The abstract states SOTA latency and communication savings but gives no numbers on FID, CLIP score, or human preference when the reuse is turned on versus off. If the full paper contains those ablations they are not foregrounded, and the concern about later denoising steps or Flux.1's larger features remains open. Without that evidence the claimed performance cannot be fully assessed. This is for people who need to run large DiTs across multiple GPUs where PCIe bandwidth is the limiter. A reader building inference systems would find the implementation details and baseline comparisons useful even if they have to re-verify the quality numbers themselves. It should go to peer review because the method is implemented, tested on current models, and released, so referees can check the missing ablations directly.

Referee Report

2 major / 1 minor

Summary. The paper proposes PipeFusion, a patch-level pipeline parallelism technique for Diffusion Transformer (DiT) inference. It partitions both image patches and model layers across GPUs and reuses one-step-stale feature maps (exploiting similarity between successive diffusion steps) to reduce inter-GPU communication relative to tensor parallelism, sequence parallelism, and DistriFusion. The method is evaluated on PixArt, Stable Diffusion 3, and Flux.1, claiming state-of-the-art latency on 8×L40 PCIe GPUs while preserving output quality; source code is released.

Significance. If the quality-preservation claim holds, PipeFusion would offer a practical route to lower-latency, memory-efficient inference for large DiTs on commodity multi-GPU hardware. The open-source release is a concrete strength that supports reproducibility and follow-on work.

major comments (2)

[Experimental results / Evaluation] The central performance claim rests on the unverified assumption that one-step stale feature maps supply adequate context without materially degrading image quality. No ablation isolating the quality impact of this reuse (FID, CLIP score, or human preference with vs. without stale maps) is referenced in the evaluation description, which is load-bearing for the communication-savings argument.
[Abstract and Experimental results] The abstract asserts SOTA latency on 8×L40 GPUs for PixArt, SD3, and Flux.1 yet supplies no quantitative numbers, baselines, or quality metrics. Without these data the SOTA claim cannot be assessed.

minor comments (1)

Notation for patch partitioning and pipeline stages should be defined once with a consistent symbol table or figure.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses

Referee: [Experimental results / Evaluation] The central performance claim rests on the unverified assumption that one-step stale feature maps supply adequate context without materially degrading image quality. No ablation isolating the quality impact of this reuse (FID, CLIP score, or human preference with vs. without stale maps) is referenced in the evaluation description, which is load-bearing for the communication-savings argument.

Authors: We agree that an explicit ablation isolating the quality impact of one-step stale feature reuse would provide stronger evidence. The current manuscript reports overall quality metrics (FID, CLIP) for PipeFusion versus baselines but does not include a controlled with/without-stale comparison. We will add this ablation study in the revised version. revision: yes
Referee: [Abstract and Experimental results] The abstract asserts SOTA latency on 8×L40 GPUs for PixArt, SD3, and Flux.1 yet supplies no quantitative numbers, baselines, or quality metrics. Without these data the SOTA claim cannot be assessed.

Authors: We acknowledge that the abstract currently lacks specific numerical results. In the revision we will expand the abstract to include key latency figures, the compared baselines, and quality metrics that support the SOTA claim on 8×L40 GPUs. revision: yes

Circularity Check

0 steps flagged

No circularity: engineering method with external benchmarks and no self-referential derivations

full rationale

The paper describes a patch-level pipeline parallelism technique for DiT inference that reuses one-step stale feature maps based on observed input similarity across diffusion steps. This is presented as an empirical engineering choice, not a derived prediction or first-principles result. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the load-bearing claims. Performance is benchmarked against external baselines (tensor parallel, sequence parallel, DistriFusion) on Pixart, SD3, and Flux.1, with code released for reproduction. The similarity assumption is stated explicitly but does not reduce any result to its own inputs by construction. This is a standard non-circular systems paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption of sufficient feature similarity across diffusion steps and standard assumptions about GPU interconnect costs; no free parameters or new entities are introduced in the abstract.

axioms (1)

domain assumption Inputs from successive diffusion steps have high similarity allowing reuse of one-step stale feature maps
Invoked to justify reduced communication while preserving pipeline correctness.

pith-pipeline@v0.9.0 · 5711 in / 1178 out tokens · 31182 ms · 2026-05-24T01:15:25.711684+00:00 · methodology

discussion (0)

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ChunkFlow: Communication-Aware Chunked Prefetching for Layerwise Offloading in Distributed Diffusion Transformer Inference
cs.DC 2026-05 unverdicted novelty 6.0

ChunkFlow achieves up to 1.28x step-time speedup and up to 49% lower peak GPU memory for DiT inference by using a first-order model to guide communication-aware chunked prefetching.
CoCoDiff: Optimizing Collective Communications for Distributed Diffusion Transformer Inference Under Ulysses Sequence Parallelism
cs.DC 2026-04 unverdicted novelty 6.0

CoCoDiff achieves 3.6x average and 8.4x peak speedup for distributed DiT inference on up to 96 GPU tiles via tile-aware all-to-all, V-first scheduling, and selective V communication.
GENSERVE: Efficient Co-Serving of Heterogeneous Diffusion Model Workloads
cs.DC 2026-04 unverdicted novelty 6.0

GENSERVE improves SLO attainment by up to 44% for co-serving heterogeneous T2I and T2V diffusion workloads via step-level preemption, elastic parallelism, and joint scheduling.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · cited by 3 Pith papers · 14 internal anchors

[1]

Announcing black forest labs

BlackForestLabs. Announcing black forest labs. https://blackforestlabs.ai/ announcing-black-forest-labs/, 2024. Accessed: [2024.10]

work page 2024
[2]

Scaling rectified flow transform- ers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transform- ers for high-resolution image synthesis. InForty-first International Conference on Machine Learning, 2024

work page 2024
[3]

Pixart- \sigma: Weak-to-strong training of diffusion transformer for 4k text-to-image generation.arXiv preprint arXiv:2403.04692, 2024

Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart- \sigma: Weak-to-strong training of diffusion transformer for 4k text-to-image generation.arXiv preprint arXiv:2403.04692, 2024

work page arXiv 2024
[4]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023

work page 2023
[5]

Video generation models as world simulators

OpenAI. Video generation models as world simulators. https://openai.com/index/ video-generation-models-as-world-simulators/, 2024. Accessed: May 2024

work page 2024
[6]

Movie gen: A cast of media foundation models

MetaAI. Movie gen: A cast of media foundation models. https: // ai. meta. com/ static-resource/ movie-gen-research-paper, 2024. Accessed: 2024-10-14

work page 2024
[7]

U-net: Convolutional networks for biomedical image segmentation

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. InMedical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234–241. Springer, 2015

work page 2015
[8]

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism.arXiv preprint arXiv:1909.08053, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1909
[9]

DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Leon Song, Samyam Rajbhandari, and Yuxiong He. Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models.arXiv preprint arXiv:2309.14509, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[10]

Ring Attention with Blockwise Transformers for Near-Infinite Context

Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring attention with blockwise transformers for near-infinite context.arXiv preprint arXiv:2310.01889, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[11]

Frdiff: Feature reuse for exquisite zero-shot acceleration of diffusion models.CoRR, abs/2312.03517, 2023

Junhyuk So, Jungwon Lee, and Eunhyeok Park. Frdiff: Feature reuse for exquisite zero-shot acceleration of diffusion models.CoRR, abs/2312.03517, 2023

work page arXiv 2023
[12]

Deepcache: Accelerating diffusion models for free

Xinyin Ma, Gongfan Fang, and Xinchao Wang. Deepcache: Accelerating diffusion models for free. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15762–15772, 2024

work page 2024
[13]

Learning-to-cache: Accelerating diffusion transformer via layer caching.arXiv preprint arXiv:2406.01733, 2024

Xinyin Ma, Gongfan Fang, Michael Bi Mi, and Xinchao Wang. Learning-to-cache: Accelerating diffusion transformer via layer caching.arXiv preprint arXiv:2406.01733, 2024

work page arXiv 2024
[14]

Ditfastattn: Attention compression for diffusion transformer models.arXiv preprint arXiv:2406.08552, 2024

Zhihang Yuan, Pu Lu, Hanling Zhang, Xuefei Ning, Linfeng Zhang, Tianchen Zhao, Shengen Yan, Guohao Dai, and Yu Wang. Ditfastattn: Attention compression for diffusion transformer models.arXiv preprint arXiv:2406.08552, 2024

work page arXiv 2024
[15]

Real-time video generation with pyramid attention broad- cast.arXiv preprint arXiv:2408.12588, 2024

Xuanlei Zhao, Xiaolong Jin, Kai Wang, and Yang You. Real-time video generation with pyramid attention broadcast.arXiv preprint arXiv:2408.12588, 2024

work page arXiv 2024
[16]

Distrifusion: Distributed parallel inference for high-resolution diffusion models

Muyang Li, Tianle Cai, Jiaxin Cao, Qinsheng Zhang, Han Cai, Junjie Bai, Yangqing Jia, Ming- Yu Liu, Kai Li, and Song Han. Distrifusion: Distributed parallel inference for high-resolution diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024
[17]

Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models.arXiv preprint arXiv:2010.02502, 2020. 11

work page internal anchor Pith review Pith/arXiv arXiv 2010
[18]

Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps.Advances in Neural Information Processing Systems, 35:5775–5787, 2022

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps.Advances in Neural Information Processing Systems, 35:5775–5787, 2022

work page 2022
[19]

Auto-Encoding Variational Bayes

Diederik P Kingma and Max Welling. Auto-encoding variational bayes.arXiv preprint arXiv:1312.6114, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[20]

Megascale: Scaling large language model training to more than 10,000 gpus.arXiv preprint arXiv:2402.15627, 2024

Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, et al. Megascale: Scaling large language model training to more than 10,000 gpus.arXiv preprint arXiv:2402.15627, 2024

work page arXiv 2024
[21]

PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. Pixart- α: Fast training of diffusion transformer for photorealistic text-to-image synthesis.arXiv preprint arXiv:2310.00426, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[22]

Latte: Latent Diffusion Transformer for Video Generation

Xin Ma, Yaohui Wang, Gengyun Jia, Xinyuan Chen, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Latte: Latent diffusion transformer for video generation.arXiv preprint arXiv:2401.03048, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

Attention is all you need.Advances in neural information processing systems, 30, 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

work page 2017
[24]

Pengtao Chen, Mingzhu Shen, Peng Ye, Jianjian Cao, Chongjun Tu, Christos-Savvas Bouganis, Yiren Zhao, and Tao Chen.\delta-dit: A training-free acceleration method tailored for diffusion transformers.arXiv preprint arXiv:2406.01125, 2024

work page arXiv 2024
[25]

A unified sequence parallelism approach for long context generative ai.arXiv preprint arXiv:2405.07719, 2024

Jiarui Fang and Shangchun Zhao. A unified sequence parallelism approach for long context generative ai.arXiv preprint arXiv:2405.07719, 2024

work page arXiv 2024
[26]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[27]

Gpipe: Efficient training of giant neural networks using pipeline parallelism.Advances in neural information processing systems, 32, 2019

Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, Hy- oukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism.Advances in neural information processing systems, 32, 2019

work page 2019
[28]

Terapipe: Token-level pipeline parallelism for training large-scale language models

Zhuohan Li, Siyuan Zhuang, Shiyuan Guo, Danyang Zhuo, Hao Zhang, Dawn Song, and Ion Stoica. Terapipe: Token-level pipeline parallelism for training large-scale language models. In International Conference on Machine Learning, pages 6543–6552. PMLR, 2021

work page 2021
[29]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[30]

Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017

work page 2017
[31]

black-forest-labs/flux.1-schnell · hugging face

work page
[32]

One-step diffusion with distribution matching distillation

Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6613–6623, 2024

work page 2024
[33]

Mean Flows for One-step Generative Modeling

Zhengyang Geng, Mingyang Deng, Xingjian Bai, J Zico Kolter, and Kaiming He. Mean flows for one-step generative modeling.arXiv preprint arXiv:2505.13447, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025. 12

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[36]

CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers.arXiv preprint arXiv:2205.15868, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[37]

Microsoft COCO Captions: Data Collection and Evaluation Server

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server.arXiv preprint arXiv:1504.00325, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[38]

Limitations

Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On aliased resizing and surprising subtleties in gan evaluation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11410–11420, 2022. NeurIPS Paper Checklist 1.Claims Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s cont...

work page 2022
[39]

Guidelines: 19 • The answer NA means that the paper does not involve crowdsourcing nor research with human subjects

Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...

work page 2025
[40]

To mitigate the performance degradation caused by warmup, we can separate the warmup steps from the remaining working steps and allocate different computational resources to them

In other related work, it is common to use 50 steps [16] or even 100 steps [12], which helps to alleviate the overhead of warmup. To mitigate the performance degradation caused by warmup, we can separate the warmup steps from the remaining working steps and allocate different computational resources to them. The output feature maps after the warmup steps ...

work page 2014

[1] [1]

Announcing black forest labs

BlackForestLabs. Announcing black forest labs. https://blackforestlabs.ai/ announcing-black-forest-labs/, 2024. Accessed: [2024.10]

work page 2024

[2] [2]

Scaling rectified flow transform- ers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transform- ers for high-resolution image synthesis. InForty-first International Conference on Machine Learning, 2024

work page 2024

[3] [3]

Pixart- \sigma: Weak-to-strong training of diffusion transformer for 4k text-to-image generation.arXiv preprint arXiv:2403.04692, 2024

Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart- \sigma: Weak-to-strong training of diffusion transformer for 4k text-to-image generation.arXiv preprint arXiv:2403.04692, 2024

work page arXiv 2024

[4] [4]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023

work page 2023

[5] [5]

Video generation models as world simulators

OpenAI. Video generation models as world simulators. https://openai.com/index/ video-generation-models-as-world-simulators/, 2024. Accessed: May 2024

work page 2024

[6] [6]

Movie gen: A cast of media foundation models

MetaAI. Movie gen: A cast of media foundation models. https: // ai. meta. com/ static-resource/ movie-gen-research-paper, 2024. Accessed: 2024-10-14

work page 2024

[7] [7]

U-net: Convolutional networks for biomedical image segmentation

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. InMedical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234–241. Springer, 2015

work page 2015

[8] [8]

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism.arXiv preprint arXiv:1909.08053, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1909

[9] [9]

DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Leon Song, Samyam Rajbhandari, and Yuxiong He. Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models.arXiv preprint arXiv:2309.14509, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[10] [10]

Ring Attention with Blockwise Transformers for Near-Infinite Context

Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring attention with blockwise transformers for near-infinite context.arXiv preprint arXiv:2310.01889, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[11] [11]

Frdiff: Feature reuse for exquisite zero-shot acceleration of diffusion models.CoRR, abs/2312.03517, 2023

Junhyuk So, Jungwon Lee, and Eunhyeok Park. Frdiff: Feature reuse for exquisite zero-shot acceleration of diffusion models.CoRR, abs/2312.03517, 2023

work page arXiv 2023

[12] [12]

Deepcache: Accelerating diffusion models for free

Xinyin Ma, Gongfan Fang, and Xinchao Wang. Deepcache: Accelerating diffusion models for free. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15762–15772, 2024

work page 2024

[13] [13]

Learning-to-cache: Accelerating diffusion transformer via layer caching.arXiv preprint arXiv:2406.01733, 2024

Xinyin Ma, Gongfan Fang, Michael Bi Mi, and Xinchao Wang. Learning-to-cache: Accelerating diffusion transformer via layer caching.arXiv preprint arXiv:2406.01733, 2024

work page arXiv 2024

[14] [14]

Ditfastattn: Attention compression for diffusion transformer models.arXiv preprint arXiv:2406.08552, 2024

Zhihang Yuan, Pu Lu, Hanling Zhang, Xuefei Ning, Linfeng Zhang, Tianchen Zhao, Shengen Yan, Guohao Dai, and Yu Wang. Ditfastattn: Attention compression for diffusion transformer models.arXiv preprint arXiv:2406.08552, 2024

work page arXiv 2024

[15] [15]

Real-time video generation with pyramid attention broad- cast.arXiv preprint arXiv:2408.12588, 2024

Xuanlei Zhao, Xiaolong Jin, Kai Wang, and Yang You. Real-time video generation with pyramid attention broadcast.arXiv preprint arXiv:2408.12588, 2024

work page arXiv 2024

[16] [16]

Distrifusion: Distributed parallel inference for high-resolution diffusion models

Muyang Li, Tianle Cai, Jiaxin Cao, Qinsheng Zhang, Han Cai, Junjie Bai, Yangqing Jia, Ming- Yu Liu, Kai Li, and Song Han. Distrifusion: Distributed parallel inference for high-resolution diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024

[17] [17]

Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models.arXiv preprint arXiv:2010.02502, 2020. 11

work page internal anchor Pith review Pith/arXiv arXiv 2010

[18] [18]

Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps.Advances in Neural Information Processing Systems, 35:5775–5787, 2022

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps.Advances in Neural Information Processing Systems, 35:5775–5787, 2022

work page 2022

[19] [19]

Auto-Encoding Variational Bayes

Diederik P Kingma and Max Welling. Auto-encoding variational bayes.arXiv preprint arXiv:1312.6114, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013

[20] [20]

Megascale: Scaling large language model training to more than 10,000 gpus.arXiv preprint arXiv:2402.15627, 2024

Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, et al. Megascale: Scaling large language model training to more than 10,000 gpus.arXiv preprint arXiv:2402.15627, 2024

work page arXiv 2024

[21] [21]

PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. Pixart- α: Fast training of diffusion transformer for photorealistic text-to-image synthesis.arXiv preprint arXiv:2310.00426, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[22] [22]

Latte: Latent Diffusion Transformer for Video Generation

Xin Ma, Yaohui Wang, Gengyun Jia, Xinyuan Chen, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Latte: Latent diffusion transformer for video generation.arXiv preprint arXiv:2401.03048, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[23] [23]

Attention is all you need.Advances in neural information processing systems, 30, 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

work page 2017

[24] [24]

Pengtao Chen, Mingzhu Shen, Peng Ye, Jianjian Cao, Chongjun Tu, Christos-Savvas Bouganis, Yiren Zhao, and Tao Chen.\delta-dit: A training-free acceleration method tailored for diffusion transformers.arXiv preprint arXiv:2406.01125, 2024

work page arXiv 2024

[25] [25]

A unified sequence parallelism approach for long context generative ai.arXiv preprint arXiv:2405.07719, 2024

Jiarui Fang and Shangchun Zhao. A unified sequence parallelism approach for long context generative ai.arXiv preprint arXiv:2405.07719, 2024

work page arXiv 2024

[26] [26]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[27] [27]

Gpipe: Efficient training of giant neural networks using pipeline parallelism.Advances in neural information processing systems, 32, 2019

Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, Hy- oukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism.Advances in neural information processing systems, 32, 2019

work page 2019

[28] [28]

Terapipe: Token-level pipeline parallelism for training large-scale language models

Zhuohan Li, Siyuan Zhuang, Shiyuan Guo, Danyang Zhuo, Hao Zhang, Dawn Song, and Ion Stoica. Terapipe: Token-level pipeline parallelism for training large-scale language models. In International Conference on Machine Learning, pages 6543–6552. PMLR, 2021

work page 2021

[29] [29]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[30] [30]

Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017

work page 2017

[31] [31]

black-forest-labs/flux.1-schnell · hugging face

work page

[32] [32]

One-step diffusion with distribution matching distillation

Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6613–6623, 2024

work page 2024

[33] [33]

Mean Flows for One-step Generative Modeling

Zhengyang Geng, Mingyang Deng, Xingjian Bai, J Zico Kolter, and Kaiming He. Mean flows for one-step generative modeling.arXiv preprint arXiv:2505.13447, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[34] [34]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025. 12

work page internal anchor Pith review Pith/arXiv arXiv 2025

[35] [35]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[36] [36]

CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers.arXiv preprint arXiv:2205.15868, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[37] [37]

Microsoft COCO Captions: Data Collection and Evaluation Server

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server.arXiv preprint arXiv:1504.00325, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[38] [38]

Limitations

Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On aliased resizing and surprising subtleties in gan evaluation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11410–11420, 2022. NeurIPS Paper Checklist 1.Claims Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s cont...

work page 2022

[39] [39]

Guidelines: 19 • The answer NA means that the paper does not involve crowdsourcing nor research with human subjects

Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...

work page 2025

[40] [40]

To mitigate the performance degradation caused by warmup, we can separate the warmup steps from the remaining working steps and allocate different computational resources to them

In other related work, it is common to use 50 steps [16] or even 100 steps [12], which helps to alleviate the overhead of warmup. To mitigate the performance degradation caused by warmup, we can separate the warmup steps from the remaining working steps and allocate different computational resources to them. The output feature maps after the warmup steps ...

work page 2014