pith. sign in

arxiv: 2606.19163 · v1 · pith:R7CCYAVMnew · submitted 2026-06-17 · 💻 cs.DC

Pulse: Training Acceleration for Large Diffusion Models with Automatic Pipeline Parallelism

Pith reviewed 2026-06-26 19:33 UTC · model grok-4.3

classification 💻 cs.DC
keywords diffusion modelspipeline parallelismskip connectionsautomatic partitioningILP schedulingUNet architectureGPU training
0
0 comments X

The pith

PULSE collocates skip-connected layers on the same device to cut communication volume by 89 percent in diffusion model training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that conventional pipeline parallelism struggles with UNet-style diffusion models because long-range skip connections force large activations and gradients to cross many device boundaries. PULSE treats skip locality as a primary constraint by forcing collocation of each encoder-decoder pair and caching the activations locally. It achieves this placement through a skip-aware dynamic-programming partitioner, an ILP schedule synthesizer, and a hybrid parallelism tuner that together keep stages balanced and bubbles low. A sympathetic reader cares because communication is the main limiter when scaling these models across GPU clusters.

Core claim

PULSE eliminates skip-induced communication by collocating skip-connected encoder-decoder layers on the same device and caching skip activations locally for backpropagation. It realizes this through a skip-aware dynamic-programming partitioner that balances heterogeneous stage workloads under symmetric collocation constraints, an ILP-based schedule synthesizer that generates bubble-efficient wave schedules, and a hybrid parallelism tuner that selects pipeline and data-parallel degrees under memory and network limits. Experiments show an 89 percent reduction in communication volume and up to 2.3x higher training throughput on communication-bound hardware compared with prior parallelism strate

What carries the argument

Skip-aware dynamic-programming partitioner that balances heterogeneous workloads while strictly enforcing collocation of every skip-connected encoder-decoder pair.

Load-bearing premise

A skip-aware dynamic-programming partitioner combined with an ILP scheduler can produce balanced stages and low-bubble schedules while strictly enforcing collocation of every skip-connected pair.

What would settle it

A concrete case where the partitioner returns no feasible balanced partitioning for a standard UNet diffusion model on a given device count without violating collocation or memory constraints.

Figures

Figures reproduced from arXiv: 2606.19163 by Bo Li, Boran Sun, Chen Chen, Fangming Liu, Guoyong Jiang, Huaxi Gu, Jieling Yu, Lin Zhang, Shan Chang, Yuechen Tao, Zhishu Che.

Figure 1
Figure 1. Figure 1: The latent diffusion architecture used by diffusion [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 4
Figure 4. Figure 4: Illustration of inter-device communication in UNet-like [PITH_FULL_IMAGE:figures/full_fig_p003_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: System overview. Our model-aware auto-parallelism [PITH_FULL_IMAGE:figures/full_fig_p004_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Per-block forward time in SDv2 model with 25 blocks, [PITH_FULL_IMAGE:figures/full_fig_p005_6.png] view at source ↗
Figure 8
Figure 8. Figure 8: 1F1B schedule with 4 devices, 4 pipeline stages, and 4 [PITH_FULL_IMAGE:figures/full_fig_p006_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Wave schedule with 4 devices, 8 stages, and 4 micro [PITH_FULL_IMAGE:figures/full_fig_p007_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Evaluation Results on 2-node V100 cluster and 8-node Ascend cluster. Results show that P [PITH_FULL_IMAGE:figures/full_fig_p009_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Per-sample time breakdown on a 2-node V100 cluster. [PITH_FULL_IMAGE:figures/full_fig_p009_11.png] view at source ↗
Figure 14
Figure 14. Figure 14: Ablation study on hybrid parallelism configurations. [PITH_FULL_IMAGE:figures/full_fig_p010_14.png] view at source ↗
read the original abstract

Diffusion models are now a dominant approach for high-fidelity image and video generation, yet scaling their training across GPU clusters remains challenging. Unlike transformer-only architectures, diffusion backbones commonly adopt UNet-style encoder-decoder structures with heterogeneous layers and long-range skip connections. Under conventional pipeline parallelism, these non-local dependencies force large skip activations and their gradients to traverse multiple pipeline boundaries, making peer-to-peer (P2P) communication a dominant bottleneck and substantially reducing pipeline efficiency. In this paper, we present PULSE, an automatic pipeline-parallel training strategy that makes skip locality a first-class optimization objective. PULSE eliminates skip-induced communication by collocating skip-connected encoder-decoder layers on the same device and caching skip activations locally for later use in backpropagation. To realize this placement while maintaining high pipeline utilization, PULSE co-designs: (1) a skip-aware dynamic-programming partitioner that balances heterogeneous stage workloads under symmetric collocation constraints, (2) an ILP-based schedule synthesizer that generates bubble-efficient wave schedules for the resulting stage-to-device mapping, and (3) a hybrid parallelism tuner that selects pipeline/data-parallel degrees and microbatch sizes under memory and network constraints. Our extensive experiments show that the volume of communication can be reduced by 89 percent, and the training throughput can be increased by up to 2.3x on communication-bound hardware, compared with state-of-the-art parallelism strategies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper presents PULSE, an automatic pipeline-parallel training system for large diffusion models using UNet-style encoder-decoder backbones. It co-designs three components to enforce locality of skip connections: (1) a skip-aware dynamic-programming partitioner that balances heterogeneous stage workloads under symmetric collocation constraints, (2) an ILP-based schedule synthesizer for bubble-efficient wave schedules, and (3) a hybrid parallelism tuner for pipeline/data-parallel degrees and microbatch sizes. The central claims are an 89% reduction in communication volume and up to 2.3x training throughput improvement versus state-of-the-art parallelism strategies on communication-bound hardware.

Significance. If the reported gains hold under the collocation constraints, the work addresses a practical scaling bottleneck for diffusion models that is not present in transformer-only architectures. The explicit co-design of placement, scheduling, and tuning for skip locality is a targeted contribution that could inform future pipeline-parallel systems for heterogeneous backbones.

major comments (1)
  1. [Abstract] Abstract: The 89% communication reduction and 2.3x throughput claims rest on the skip-aware DP partitioner producing balanced stages (max stage time close to min stage time) while strictly collocating every skip-connected encoder-decoder pair. No quantitative evidence is supplied on achieved balance ratios, partition quality, or resulting pipeline bubble sizes for the evaluated models; without this, it is impossible to verify that the collocation constraint does not force uneven workloads in heterogeneous UNets and thereby undermine the throughput gains.
minor comments (1)
  1. [Abstract] The abstract asserts concrete percentage improvements but supplies no experimental setup, hardware details, baseline implementations, or error bars, making the central performance claims difficult to evaluate from the provided text alone.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive comment on the abstract claims. We address the request for quantitative evidence on stage balance, partition quality, and bubble sizes below, and will incorporate the requested data into the revised manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The 89% communication reduction and 2.3x throughput claims rest on the skip-aware DP partitioner producing balanced stages (max stage time close to min stage time) while strictly collocating every skip-connected encoder-decoder pair. No quantitative evidence is supplied on achieved balance ratios, partition quality, or resulting pipeline bubble sizes for the evaluated models; without this, it is impossible to verify that the collocation constraint does not force uneven workloads in heterogeneous UNets and thereby undermine the throughput gains.

    Authors: We agree that explicit metrics would allow readers to directly verify that the collocation constraints do not produce pathological imbalance. The skip-aware DP partitioner is formulated to minimize the maximum stage time subject to the symmetric collocation constraints; the ILP scheduler then produces wave schedules whose bubble fraction is bounded by the number of stages. In the revised version we will add a dedicated subsection (or table in Section 4) that reports, for every evaluated model and pipeline degree: (i) the achieved max/min stage-time ratio, (ii) the partition-quality metric (sum of squared deviations from ideal balance), and (iii) the analytically computed pipeline bubble size under the generated schedule. These numbers will confirm that balance ratios remain within 5-10% and bubble overhead stays below 8% even for the most heterogeneous UNet configurations, thereby supporting the reported end-to-end gains. revision: yes

Circularity Check

0 steps flagged

No circularity; performance claims are measured experimental outcomes with no derivation chain

full rationale

The paper describes a co-designed system of algorithms (skip-aware DP partitioner, ILP scheduler, hybrid tuner) and reports measured throughput and communication reductions from experiments. No equations, fitted parameters renamed as predictions, self-definitional steps, or load-bearing self-citations appear in the derivation of any result. The 89% and 2.3x figures are framed as direct measurements on hardware, not outputs of any mathematical reduction to inputs. The collocation constraint is an explicit design choice, not a hidden tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The abstract describes three algorithmic components but does not introduce new physical constants, fitted parameters, or postulated entities; the only background assumptions are standard properties of dynamic programming and ILP solvers.

axioms (1)
  • domain assumption An off-the-shelf ILP solver returns a feasible low-bubble schedule within practical time limits for the stage counts arising in diffusion models.
    Invoked by the ILP-based schedule synthesizer component.

pith-pipeline@v0.9.1-grok · 5813 in / 1272 out tokens · 19088 ms · 2026-06-26T19:33:52.849035+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

35 extracted references · 7 linked inside Pith

  1. [1]

    High-resolution image synthesis with latent diffusion models,

    R. Rombach, A. Blattmannet al., “High-resolution image synthesis with latent diffusion models,” inin Proc. of the IEEE/CVF CVPR, 2022, pp. 10 684–10 695

  2. [2]

    Denoising diffusion probabilistic models,

    J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems, vol. 33, pp. 6840– 6851, 2020

  3. [3]

    Photorealistic text-to-image diffusion mod- els with deep language understanding,

    C. Saharia, W. Chanet al., “Photorealistic text-to-image diffusion mod- els with deep language understanding,”Advances in neural information processing systems, vol. 35, pp. 36 479–36 494, 2022

  4. [4]

    Hierarchical text-conditional image generation with clip latents,

    A. Ramesh, P. Dhariwalet al., “Hierarchical text-conditional image generation with clip latents,”arXiv preprint arXiv:2204.06125, vol. 1, no. 2, p. 3, 2022

  5. [5]

    ediff-i: Text-to-image diffusion models with an ensemble of expert denoisers,

    Y . Balaji, S. Nahet al., “ediff-i: Text-to-image diffusion models with an ensemble of expert denoisers,”arXiv preprint arXiv:2211.01324, 2022

  6. [6]

    Scalable diffusion models with transformers,

    W. Peebles and S. Xie, “Scalable diffusion models with transformers,” inin Proc. of the IEEE/CVF ICCV, 2023, pp. 4195–4205

  7. [7]

    Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding,

    Z. Li, J. Zhanget al., “Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding,”arXiv preprint arXiv:2405.08748, 2024

  8. [8]

    All are worth words: A vit backbone for diffusion models,

    F. Bao, S. Nieet al., “All are worth words: A vit backbone for diffusion models,” inin Proc. of the IEEE/CVF CVPR, 2023, pp. 22 669–22 679

  9. [9]

    Blended diffusion for text- driven editing of natural images,

    O. Avrahami, D. Lischinski, and O. Fried, “Blended diffusion for text- driven editing of natural images,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 18 208–18 218

  10. [10]

    Stable video diffusion: Scal- ing latent video diffusion models to large datasets,

    A. Blattmann, T. Dockhornet al., “Stable video diffusion: Scal- ing latent video diffusion models to large datasets,”arXiv preprint arXiv:2311.15127, 2023

  11. [11]

    Scaling rectified flow transformers for high- resolution image synthesis,

    P. Esser, S. Kulalet al., “Scaling rectified flow transformers for high- resolution image synthesis,” inForty-first International Conference on Machine Learning, 2024

  12. [12]

    Flux.1 kontext: Flow matching for in- context image generation and editing in latent space,

    B. F. Labs, S. Batifolet al., “Flux.1 kontext: Flow matching for in- context image generation and editing in latent space,” 2025

  13. [13]

    Zero: Memory optimizations toward training trillion parameter models,

    S. Rajbhandari, J. Rasley, O. Ruwase, and Y . He, “Zero: Memory optimizations toward training trillion parameter models,” inSC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2020, pp. 1–16

  14. [14]

    Pytorch fsdp: experiences on scaling fully sharded data parallel,

    Y . Zhao, A. Guet al., “Pytorch fsdp: experiences on scaling fully sharded data parallel,”arXiv preprint arXiv:2304.11277, 2023

  15. [15]

    Efficient large-scale language model training on gpu clusters using megatron-lm,

    D. Narayanan, M. Shoeybiet al., “Efficient large-scale language model training on gpu clusters using megatron-lm,” inProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021, pp. 1–15

  16. [16]

    Gpipe: Efficient training of giant neural networks using pipeline parallelism,

    Y . Huang, Y . Chenget al., “Gpipe: Efficient training of giant neural networks using pipeline parallelism,”Advances in neural information processing systems, vol. 32, 2019

  17. [17]

    U-net: Convolutional networks for biomedical image segmentation,

    O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” inMedical image computing and computer-assisted intervention–MICCAI 2015: 18th international con- ference, Munich, Germany, October 5-9, 2015, proceedings, part III 18. Springer, 2015, pp. 234–241

  18. [18]

    Attention is all you need,

    A. Vaswani, “Attention is all you need,”Advances in Neural Information Processing Systems, 2017

  19. [19]

    Deep residual learning for image recognition,

    K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778

  20. [20]

    Pipedream: Generalized pipeline parallelism for dnn training,

    D. Narayanan, A. Harlapet al., “Pipedream: Generalized pipeline parallelism for dnn training,” inProceedings of the 27th ACM symposium on operating systems principles, 2019, pp. 1–15

  21. [21]

    Dapple: A pipelined data parallel approach for training large models,

    S. Fan, Y . Ronget al., “Dapple: A pipelined data parallel approach for training large models,” inProceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2021, pp. 431–445

  22. [22]

    Megatron-lm: Training multi-billion parameter language models using model parallelism,

    M. Shoeybi, M. Patwaryet al., “Megatron-lm: Training multi-billion parameter language models using model parallelism,”arXiv preprint arXiv:1909.08053, 2019

  23. [23]

    Chimera: efficiently training large-scale neural net- works with bidirectional pipelines,

    S. Li and T. Hoefler, “Chimera: efficiently training large-scale neural net- works with bidirectional pipelines,” inProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021, pp. 1–14

  24. [24]

    Hanayo: Harnessing wave- like pipeline parallelism for enhanced large model training efficiency,

    Z. Liu, S. Cheng, H. Zhou, and Y . You, “Hanayo: Harnessing wave- like pipeline parallelism for enhanced large model training efficiency,” inProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2023, pp. 1–13

  25. [25]

    Alpa: Automating inter-and{Intra-Operator} parallelism for distributed deep learning,

    L. Zheng, Z. Liet al., “Alpa: Automating inter-and{Intra-Operator} parallelism for distributed deep learning,” in16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), 2022, pp. 559–578

  26. [26]

    Tessel: Boosting distributed execution of large dnn models via flexible schedule search,

    Z. Lin, Y . Miaoet al., “Tessel: Boosting distributed execution of large dnn models via flexible schedule search,” in2024 IEEE Interna- tional Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2024, pp. 803–816

  27. [27]

    Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion models,

    F. Bao, C. Xianget al., “Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion models,”arXiv preprint arXiv:2405.04233, 2024

  28. [28]

    Transfusion: Predict the next token and diffuse images with one multi-modal model,

    C. Zhou, L. Yuet al., “Transfusion: Predict the next token and diffuse images with one multi-modal model,”arXiv preprint arXiv:2408.11039, 2024

  29. [29]

    torchgpipe: On-the-fly pipeline parallelism for training giant models,

    C. Kim, H. Leeet al., “torchgpipe: On-the-fly pipeline parallelism for training giant models,”arXiv preprint arXiv:2004.09910, 2020

  30. [30]

    Pippy: Pipeline parallelism for pytorch,

    R. James, B. Pavelet al., “Pippy: Pipeline parallelism for pytorch,” https://github.com/pytorch/PiPPy, 2022

  31. [31]

    DISTMM: Accelerating distributed multi- modal model training,

    J. Huang, Z. Zhanget al., “DISTMM: Accelerating distributed multi- modal model training,” inin Proc. of NSDI, 2024, pp. 1157–1171

  32. [32]

    Disttrain: Addressing model and data heterogeneity with disaggregated training for multimodal large language models,

    Z. Zhang, Y . Zhonget al., “Disttrain: Addressing model and data heterogeneity with disaggregated training for multimodal large language models,”arXiv preprint arXiv:2408.04275, 2024

  33. [33]

    Efficient multi-task large model train- ing via data heterogeneity-aware model management,

    Y . Wang, S. Zhuet al., “Efficient multi-task large model train- ing via data heterogeneity-aware model management,”arXiv preprint arXiv:2409.03365, 2024

  34. [34]

    Diffusionpipe: Training large diffusion models with efficient pipelines,

    Y . Tian, Z. Jiaet al., “Diffusionpipe: Training large diffusion models with efficient pipelines,” vol. 6, 2024, pp. 101–113

  35. [35]

    Graphpipe: Improving performance and scal- ability of dnn training with graph pipeline parallelism,

    B. Jeon, M. Wuet al., “Graphpipe: Improving performance and scal- ability of dnn training with graph pipeline parallelism,”arXiv preprint arXiv:2406.17145, 2024