pith. machine review for the scientific record.

arxiv: 2604.14561 · v2 · submitted 2026-04-16 · 💻 cs.DC


CoCoDiff: Optimizing Collective Communications for Distributed Diffusion Transformer Inference Under Ulysses Sequence Parallelism

Bin Ma, Dong Li, Pengfei Su, Tekin Bicer, Xingjian Ding


Pith reviewed 2026-05-10 10:19 UTC · model grok-4.3

classification 💻 cs.DC
keywords Diffusion Transformers · Ulysses sequence parallelism · collective communication · distributed inference · all-to-all optimization · temporal redundancy

The pith

CoCoDiff reduces latency for distributed Diffusion Transformer inference by overlapping V communication with Q/K computation and using selective transmission based on tensor similarity across steps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that Ulysses sequence parallelism for DiTs incurs high latency from repeated all-to-all collectives, which cannot be easily overlapped due to data dependencies and large message sizes. It shows that the lighter computation required for V projections compared to Q/K, combined with similarity between adjacent denoising steps, creates exploitable redundancy. CoCoDiff introduces mechanisms to decompose collectives in a topology-aware way, schedule V communication first, and transmit only active V parts on slower links. If correct, this would allow multi-node DiT inference to scale with far less communication overhead, making larger models and resolutions practical on existing hardware clusters.

Core claim

CoCoDiff is a distributed DiT inference engine that exploits two properties: V needs only a linear projection while Q/K additionally require normalization and RoPE, and adjacent denoising steps exhibit temporal redundancy. It combines Tile-Aware Parallel All-to-all (TAPA) decomposition, V-First scheduling that hides V communication behind Q/K work, and V-Major selective communication that limits data on slow interconnects, delivering an average speedup of 3.6x and a peak of 8.4x across 1-8 nodes with up to 96 GPU tiles.
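The decomposition idea behind TAPA is easiest to see in miniature. The sketch below is a hypothetical illustration, not the paper's algorithm: rank layout, message granularity, and function names are all assumptions. It splits one flat all-to-all over (node, tile) ranks into an intra-node phase on fast links followed by an inter-node phase on slow links, and checks that the same data is delivered.

```python
def flat_all_to_all(msgs, ranks):
    """Reference semantics: every rank ends up with each message addressed to it."""
    return {r: {s: msgs[s][r] for s in ranks} for r in ranks}

def two_phase_all_to_all(msgs, n_nodes, tiles_per_node):
    """Illustrative two-phase (topology-aligned) decomposition of one all-to-all."""
    ranks = [(n, t) for n in range(n_nodes) for t in range(tiles_per_node)]
    # Phase 1 (intra-node, fast links): tile (n, t) gathers every message
    # produced inside node n whose destination has tile index t.
    staged = {(n, t): {(src, dst): msgs[src][dst]
                       for src in ranks if src[0] == n
                       for dst in ranks if dst[1] == t}
              for (n, t) in ranks}
    # Phase 2 (inter-node, slow links): ranks sharing tile index t exchange
    # across nodes; each rank pulls the messages addressed to itself.
    return {(n, t): {src: staged[(src[0], t)][(src, (n, t))] for src in ranks}
            for (n, t) in ranks}

ranks = [(n, t) for n in range(2) for t in range(3)]        # 2 nodes x 3 tiles
msgs = {s: {d: f"{s}->{d}" for d in ranks} for s in ranks}  # toy payloads
assert two_phase_all_to_all(msgs, 2, 3) == flat_all_to_all(msgs, ranks)
```

The payoff of such a split, as the abstract and Figure 4 caption describe, is that each phase can run multiple smaller collectives in parallel on the link tier suited to it.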

What carries the argument

V-First scheduling paired with V-Major selective communication, which overlaps V's all-to-all behind Q/K computation and transmits only active projections by leveraging similarity between consecutive denoising steps.
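A back-of-envelope critical-path model makes the overlap concrete. This is an assumption-laden sketch, not the paper's cost model: the per-step timings are hypothetical, and pipelining inside the all-to-all itself is ignored.

```python
def step_latency(t_v_proj, t_qk_compute, t_v_comm, t_qk_comm, overlap):
    """Toy critical-path model for one attention step (illustrative only)."""
    if not overlap:
        # Serialized: V's all-to-all sits fully on the critical path.
        return t_v_proj + t_v_comm + t_qk_compute + t_qk_comm
    # V-First: V is projected first (it needs only the linear projection),
    # its all-to-all is launched, and Q/K projection + normalization + RoPE
    # proceed underneath it; only the exposed remainder adds latency.
    return t_v_proj + max(t_qk_compute, t_v_comm) + t_qk_comm

base = step_latency(1.0, 4.0, 3.0, 3.0, overlap=False)    # 11.0
vfirst = step_latency(1.0, 4.0, 3.0, 3.0, overlap=True)   # 8.0
assert vfirst < base
```

The model also shows the failure mode the Figure 6 caption discusses: if V's communication time exceeds the Q/K compute window, the excess is exposed on the critical path.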

Load-bearing premise

Adjacent denoising steps produce sufficiently similar tensors to allow selective V communication without degrading output quality, and the data dependencies permit reliable overlap of V communication behind Q/K computation.
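One way to operationalize that premise in code (hedged: the paper's actual selection criterion, chunking, and threshold are not given in this summary; cosine similarity over per-chunk vectors is an illustrative stand-in):

```python
import math

def cosine(a, b):
    """Cosine similarity between two flat vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def active_chunks(v_prev, v_curr, threshold=0.99):
    """Mark a V chunk 'active' (must cross the slow link) only when it has
    drifted from the previous denoising step's copy; otherwise the receiver
    reuses its cached chunk. Threshold and chunking are hypothetical."""
    return [i for i, (p, c) in enumerate(zip(v_prev, v_curr))
            if cosine(p, c) < threshold]

v_prev = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
v_curr = [[1.0, 0.01], [1.0, 0.0], [0.5, 0.5]]  # only chunk 1 changed much
assert active_chunks(v_prev, v_curr) == [1]
```

Under this reading, the quality risk is exactly the untested case the referee flags: a model whose consecutive steps produce dissimilar V tensors would mark most chunks active, erasing the bandwidth savings, or silently reuse stale chunks if the threshold is too loose.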

What would settle it

Measure image-quality degradation, or failure of the communication overlap, when CoCoDiff is applied to a DiT model whose consecutive denoising steps produce dissimilar V tensors, on hardware with asymmetric interconnect bandwidth.

Figures

Figures reproduced from arXiv: 2604.14561 by Bin Ma, Dong Li, Pengfei Su, Tekin Bicer, Xingjian Ding.

Figure 1. Communication overhead in Qwen-Image (20B) inference.
Figure 2. QKV processing in the attention block. (Surrounding text: tensors change little across denoising steps, so selectively reusing them reduces latency with minor accuracy impact; this temporal redundancy underlies caching schemes such as DeepCache [16], ∆-DiT [17], Learning-to-Cache [18], and AttMEMO [19].)
Figure 4. CoCoDiff overview. Phase 1 happens within each GPU using higher bandwidth; Phase 2 happens across GPUs using lower bandwidth and especially benefits from the decomposition, because it allows two 6-rank all-to-all communications to occur in parallel.
Figure 5. Execution timeline comparison. (a) Baseline: flat all-to-all.
Figure 6. Tile-aware parallel all-to-all. (Surrounding text: at large batch sizes, V's projection time scales linearly and T(V p1) may not be fully overlapped with T(Q, K), exposing it on the critical path.)
Figure 7. End-to-end speedups of TAPA (blue) and CoCoDiff (orange) over the Flat baseline across three image resolutions.
Figure 8. Speedup breakdown of CoCoDiff on SD3.5.
Figure 9. Inpainting results of FLUX.1-dev with CoCoDiff.
Figure 10. Inpainting quality vs. speedup for SD3.5.
Figure 11. All-to-all communication latency on a single Aurora node.
read the original abstract

Diffusion Transformers (DiTs) are increasingly adopted in scientific computing, yet growing model sizes and resolutions make distributed multi-GPU inference essential. Ulysses sequence parallelism scales DiT inference but introduces frequent all-to-all collectives that dominate latency. Overlapping these with computation is difficult due to tight data dependencies, large message volumes, and asymmetric interconnect bandwidths. We introduce CoCoDiff, a distributed DiT inference engine exploiting two observations: (1) V requires only linear projection while Q/K need additional normalization and RoPE, creating opportunities to overlap V's communication with Q/K computation; (2) adjacent denoising steps produce similar tensors, yielding temporal redundancy. CoCoDiff introduces three mechanisms: Tile-Aware Parallel All-to-all (TAPA) decomposes collectives into topology-aligned phases; V-First scheduling hides V's communication behind Q/K computation; and V-Major selective communication transmits only active projections on slow interconnects. On the Aurora supercomputer with four DiT models across 1-8 nodes (up to 96 Intel GPU tiles), CoCoDiff achieves an average speedup of 3.6x, peaking at 8.4x.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces CoCoDiff, a distributed inference engine for Diffusion Transformers (DiTs) under Ulysses sequence parallelism. It exploits two observations—V projections require only linear projection while Q/K need normalization and RoPE, and adjacent denoising steps yield similar tensors—to propose three mechanisms: Tile-Aware Parallel All-to-all (TAPA) for topology-aligned collectives, V-First scheduling to overlap V all-to-all behind Q/K computation, and V-Major selective communication that transmits only active V projections over slow links. On the Aurora supercomputer with four DiT models across 1-8 nodes (up to 96 Intel GPU tiles), it reports an average 3.6x speedup peaking at 8.4x.

Significance. If the performance claims are substantiated with quality-preserving ablations and schedulability analysis, CoCoDiff would address a key latency bottleneck in scaling DiT inference for scientific computing on systems with asymmetric interconnects. The combination of computation-communication asymmetry exploitation and temporal redundancy is a practical contribution that could extend to other transformer workloads; the concrete supercomputer measurements are a positive aspect.

major comments (2)
  1. [Abstract] Abstract: The headline speedups (3.6x average, 8.4x peak) rest on V-Major selective communication and V-First overlap, yet the abstract supplies neither the selection threshold for 'active' V projections, quality-ablation results (e.g., FID or perceptual metrics showing statistical indistinguishability from full communication), nor dependency analysis confirming that normalization/RoPE asymmetry permits reliable overlap without stalls.
  2. [Abstract] Abstract: No baselines, error bars, ablation studies, or experimental configuration details (model sizes, exact tile/node mappings, interconnect bandwidths) are provided to allow verification of the reported speedups or to isolate the contribution of each mechanism (TAPA, V-First, V-Major).
minor comments (1)
  1. The abstract refers to 'four DiT models' without naming the specific architectures or parameter counts, which would help readers assess generalizability.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments. We address each major comment on the abstract below and will revise the abstract to supply additional context while preserving its brevity.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The headline speedups (3.6x average, 8.4x peak) rest on V-Major selective communication and V-First overlap, yet the abstract supplies neither the selection threshold for 'active' V projections, quality-ablation results (e.g., FID or perceptual metrics showing statistical indistinguishability from full communication), nor dependency analysis confirming that normalization/RoPE asymmetry permits reliable overlap without stalls.

    Authors: We agree the abstract would benefit from more specifics. The selection threshold for active V projections and the dependency analysis confirming reliable overlap (due to V requiring only linear projection while Q/K require normalization and RoPE) are detailed in Sections 3 and 4 of the manuscript. We will revise the abstract to briefly summarize these elements and the basis for stall-free overlap. The manuscript does not include quality-ablation results such as FID or perceptual metrics. revision: partial

  2. Referee: [Abstract] Abstract: No baselines, error bars, ablation studies, or experimental configuration details (model sizes, exact tile/node mappings, interconnect bandwidths) are provided to allow verification of the reported speedups or to isolate the contribution of each mechanism (TAPA, V-First, V-Major).

    Authors: We acknowledge that the abstract omits these details. The paper uses the standard Ulysses sequence parallelism as baseline, with ablations isolating TAPA, V-First, and V-Major presented in the evaluation section. Experimental configuration (four DiT models, 1-8 nodes up to 96 GPU tiles on Aurora) is described in Section 5, including interconnect details. We will revise the abstract to reference the baseline, experimental scale, and ablations. Error bars from repeated runs are not reported in the current manuscript. revision: yes

standing simulated objections not resolved
  • The manuscript does not include quality-ablation results (e.g., FID or perceptual metrics) or error bars; while the mechanisms are designed around temporal redundancy and computation asymmetry to preserve quality and enable overlap, these specific quantifications are absent.

Circularity Check

0 steps flagged

No circularity: purely empirical system measurements with no derivations or self-referential reductions

full rationale

The paper contains no equations, mathematical derivations, or predictive models. Its core contributions are three engineering mechanisms (TAPA, V-First scheduling, V-Major selective communication) motivated by two stated observations about tensor properties and data dependencies. Performance claims consist exclusively of direct wall-clock measurements on Aurora hardware across multiple DiT models and node counts. These results do not reduce to fitted parameters, self-citations, or inputs by construction; they are independent experimental outcomes. The absence of any load-bearing derivation chain makes circularity analysis inapplicable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on standard assumptions about DiT attention computation order and interconnect asymmetry; no free parameters, new entities, or ad-hoc axioms are introduced in the abstract.

axioms (1)
  • domain assumption: Ulysses sequence parallelism for DiT inference introduces frequent all-to-all collectives that dominate latency
    Directly stated as the scaling method and source of the performance problem.

pith-pipeline@v0.9.0 · 5513 in / 1111 out tokens · 29669 ms · 2026-05-10T10:19:12.178635+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

57 extracted references · 32 canonical work pages · 7 internal anchors

  1. [1]

    Scaling rectified flow transformers for high-resolution image synthesis,

P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y. Levi, D. Lorenz, A. Sauer, F. Boesel, D. Podell, T. Dockhorn, Z. English, and R. Rombach, "Scaling rectified flow transformers for high-resolution image synthesis," in Forty-first International Conference on Machine Learning, ICML 2024, ser. Proceedings of Machine Learning Research. ...

  2. [2]

    Black Forest Labs, “FLUX,” https://github.com/black-forest-labs/flux, 2024

  3. [3]

    Qwen2.5-VL Technical Report

    A. Yang, B. Li, B. Yanget al., “Qwen2.5-VL technical report,” arXiv preprint arXiv:2502.13923, 2025. [Online]. Available: https: //doi.org/10.48550/arXiv.2502.13923

  4. [4]

    Video generation models as world simulators,

    T. Brooks, B. Peebles, C. Holmes, W. DePue, Y . Guo, L. Jing, D. Schnurr, J. Taylor, T. Luhman, E. Luhman, C. Ng, R. Wang, and A. Ramesh, “Video generation models as world simulators,” OpenAI Technical Report, 2024. [Online]. Available: https://openai.com/index/ video-generation-models-as-world-simulators/

  5. [5]

    AERIS: Argonne earth systems model for reliable and skillful predictions,

V. Hatanpää, E. Ku, J. Stock, M. Emani, S. Foreman, C. Jung, S. Madireddy, T. Nguyen, V. Sastry, R. A. O. Sinurat, H. Zheng, S. Wheeler, T. Arcomano, V. Vishwanath, and R. Kotamarthi, "AERIS: Argonne earth systems model for reliable and skillful predictions," in Proceedings of the International Conference for High Performance Computing, Networking, S...

  6. [6]

USP: A unified sequence parallelism approach for long context generative AI

    J. Fang and S. Zhao, “USP: A unified sequence parallelism approach for long context generative AI,”CoRR, vol. abs/2405.07719, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2405.07719

  7. [7]

    Ring Attention with blockwise transformers for near-infinite context,

    H. Liu, M. Zaharia, and P. Abbeel, “Ring Attention with blockwise transformers for near-infinite context,” inThe Twelfth International Conference on Learning Representations, ICLR 2024. OpenReview.net,

  8. [8]

    Ring Attention with Blockwise Transformers for Near-Infinite Context

    [Online]. Available: https://doi.org/10.48550/arXiv.2310.01889

  9. [9]

Runtime Concurrency Control and Operation Scheduling for High Performance Neural Network Training,

    J. Liu, D. Li, G. Kestor, and J. Vetter, “Runtime Concurrency Control and Operation Scheduling for High Performance Neural Network Train- ing,” inInternational Parallel and Distributed Processing Symposium (IPDPS), 2019

  10. [10]

    Intel Data Center GPU Max Series product brief,

    Intel Corporation, “Intel Data Center GPU Max Series product brief,” https://www.intel.com/content/www/us/en/products/sku/232873/ intel-data-center-gpu-max-1550/specifications.html, 2023

  11. [11]

    Aurora system overview,

    Argonne Leadership Computing Facility, “Aurora system overview,” https://docs.alcf.anl.gov/aurora/getting-started-on-aurora/, 2024

  12. [12]

    DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

    S. A. Jacobs, M. Tanaka, C. Zhang, M. Zhang, S. L. Song, S. Rajbhandari, and Y . He, “Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models,” arXiv preprint arXiv:2309.14509, 2023. [Online]. Available: https: //doi.org/10.48550/arXiv.2309.14509

  13. [13]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro, “Megatron-LM: Training multi-billion parameter language models using model parallelism,”arXiv preprint arXiv:1909.08053,

  14. [14]
  15. [15]

    Efficient memory management for large language model serving with pagedattention

    W. Kwon, Z. Li, S. Zhuang, Y . Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica, “Efficient memory management for large language model serving with PagedAttention,” inProceedings of the 29th Symposium on Operating Systems Principles, 2023, pp. 611–626. [Online]. Available: https://doi.org/10.1145/3600006.3613165

  16. [16]

Accelerate: Training and inference at scale made simple, efficient and adaptable,

    S. Gugger, L. Debut, T. Wolfet al., “Accelerate: Training and infer- ence at scale made simple, efficient and adaptable,” https://github.com/ huggingface/accelerate, 2022

  17. [17]

RoFormer: Enhanced transformer with rotary position embedding,

    J. Su, M. Ahmed, Y . Lu, S. Pan, W. Bo, and Y . Liu, “RoFormer: En- hanced transformer with rotary position embedding,”Neurocomputing, vol. 568, p. 127063, 2024

  18. [18]

DeepCache: Accelerating diffusion models for free

    X. Ma, G. Fang, and X. Wang, “DeepCache: Accelerating diffusion models for free,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 15 762– 15 772. [Online]. Available: https://doi.org/10.1109/CVPR52733.2024. 01492

  19. [19]

∆-DiT: A training-free acceleration method tailored for diffusion transformers

    P. Chen, M. Shen, P. Ye, J. Cao, C. Tu, C. Bouganis, Y . Zhao, and T. Chen, “∆-dit: A training-free acceleration method tailored for diffusion transformers,”CoRR, vol. abs/2406.01125, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2406.01125

  20. [20]

    Learning-to-cache: Accelerating diffusion transformer via layer caching,

    X. Ma, G. Fang, M. B. Mi, and X. Wang, “Learning-to-cache: Accelerating diffusion transformer via layer caching,” inAdvances in Neural Information Processing Systems 37 (NeurIPS), 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2406.01733

  21. [21]

    AttMEMO : Accelerating Transformers with Memoization on Big Memory Systems,

    Y . Feng, H. Jeon, F. Blagojevic, C. Guyot, Q. Li, and D. Li, “AttMEMO : Accelerating Transformers with Memoization on Big Memory Systems,”

  22. [22]

    Available: https://arxiv.org/abs/2301.09262

    [Online]. Available: https://arxiv.org/abs/2301.09262

  23. [23]

    NCCL: NVIDIA collective communications library,

    NVIDIA, “NCCL: NVIDIA collective communications library,” https: //github.com/NVIDIA/nccl, 2025

  24. [24]

    Optimization of collective communication operations in MPICH,

    R. Thakur, R. Rabenseifner, and W. Gropp, “Optimization of collective communication operations in MPICH,”The International Journal of High Performance Computing Applications, vol. 19, no. 1, pp. 49–66,

  25. [25]

    Available: https://doi.org/10.1177/1094342005051521

    [Online]. Available: https://doi.org/10.1177/1094342005051521

  26. [26]

    Blueconnect: Decomposing all-reduce for deep learning on heterogeneous network hierarchy,

    M. Cho, U. Finkler, D. S. Kung, and H. C. Hunter, “Blueconnect: Decomposing all-reduce for deep learning on heterogeneous network hierarchy,” inProceedings of Machine Learning and Systems, SysML

  27. [27]

    [Online]

    mlsys.org, 2019. [Online]. Available: https://doi.org/10.1147/ JRD.2019.2947013

  28. [28]

    MPICH: A high performance and widely portable implementation of the MPI standard,

    MPICH Team, “MPICH: A high performance and widely portable implementation of the MPI standard,” https://www.mpich.org/, 2025

  29. [29]

    Intel oneAPI Collective Communications Library (oneCCL),

    Intel Corporation, “Intel oneAPI Collective Communications Library (oneCCL),” https://github.com/oneapi-src/oneCCL, 2025

  30. [30]

    CCCL: Node-Spanning GPU Collectives with CXL Memory Pooling

    D. Xu, H. Meng, X. Chen, D. Zhu, W. Tang, F. Liu, L. Xie, W. Xiang, R. Shi, Y . Li, H. Hu, H. Zhang, J. Jiang, and D. Li, “Cccl: Node-spanning gpu collectives with cxl memory pooling,” 2026. [Online]. Available: https://arxiv.org/abs/2602.22457

  31. [31]

cmpi: Using CXL memory sharing for MPI one-sided and two-sided inter-node communications,

    X. Wang, B. Ma, J. Kim, B. Koh, H. Kim, and D. Li, “cmpi: Using cxl memory sharing for mpi one-sided and two-sided inter-node com- munications,” inProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2025

  32. [32]

    Efficient Tensor Offloading for Large Deep-Learning Model Training based on Compute Express Link,

    D. Xu, Y . Feng, K. Shin, D. Kim, H. Jeon, and D. Li, “Efficient Tensor Offloading for Large Deep-Learning Model Training based on Compute Express Link,” in36th ACM/IEEE International Conference for High Performance Computing, Performance Measurement, Modeling and Tools (SC), 2024

  33. [33]

    Timestep embedding tells: It’s time to cache for video diffusion model,

    F. Liu, S. Zhang, X. Wang, Y . Wei, H. Qiu, Y . Zhao, Y . Zhang, Q. Ye, and F. Wan, “Timestep embedding tells: It’s time to cache for video diffusion model,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 7353–

  34. [34]
  35. [35]

    Machine Learning-Guided Memory Optimization for DLRM Inference on Tiered Memory,

    B. Ma, J. Ren, S. Yang, B. Francis, E. Ardestani, M. Si, and D. Li, “Machine Learning-Guided Memory Optimization for DLRM Inference on Tiered Memory,” inInternational Symposium on High Performance Computer Architecture (HPCA), 2025

  36. [36]

    Sentinel: Efficient Tensor Migration and Allocation on Heterogeneous Memory Systems for Deep Learning,

    J. Ren, J. Luo, K. Wu, M. Zhang, H. Jeon, and D. Li, “Sentinel: Efficient Tensor Migration and Allocation on Heterogeneous Memory Systems for Deep Learning,” inInternational Symposium on High Performance Computer Architecture (HPCA), 2020

  37. [37]

xDiT: An inference engine for diffusion transformers (DiTs) with massive parallelism,

    J. Fang, J. Pan, J. Wang, A. Li, and X. Sun, “xDiT: An inference engine for diffusion transformers (DiTs) with massive parallelism,” arXiv preprint arXiv:2411.01738, 2024. [Online]. Available: https: //doi.org/10.48550/arXiv.2411.01738

  38. [38]

    Diffusers: State-of-the-art diffusion models,

    P. von Platen, S. Patil, A. Lozhkovet al., “Diffusers: State-of-the-art diffusion models,”GitHub, 2022

  39. [39]

    Black Forest Labs, “FLUX.2,” https://blackforestlabs.ai/flux-2/, 2025

  40. [40]

Image quality metrics: PSNR vs. SSIM

    A. Hore and D. Ziou, “Image quality metrics: PSNR vs. SSIM,” in International Conference on Pattern Recognition (ICPR), 2010, pp. 2366–2369. [Online]. Available: https://doi.org/10.1109/ICPR.2010.579

  41. [41]

Image quality assessment: From error visibility to structural similarity

    Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: From error visibility to structural similarity,”IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004. [Online]. Available: https://doi.org/10.1109/TIP.2003.819861 11

  42. [42]

mlr: Scalable laminography reconstruction based on memoization,

    B. Ma, V . Nikitin, X. Wang, T. Bicer, and D. Li, “mlr: Scalable laminography reconstruction based on memoization,” ser. SC ’25. New York, NY , USA: Association for Computing Machinery, 2025, p. 265–280. [Online]. Available: https://doi.org/10.1145/3712285.3759805

  43. [43]

DistriFusion: Distributed parallel inference for high-resolution diffusion models

    M. Li, T. Cai, J. Cao, Q. Zhang, H. Cai, J. Bai, Y . Jia, M. Liu, K. Li, and S. Han, “DistriFusion: Distributed parallel inference for high-resolution diffusion models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 7183–7193. [Online]. Available: https://doi.org/10.1109/CVPR52733.2024.00686

  44. [44]

    PipeFusion: Patch-level Pipeline Parallelism for Diffusion Transformers Inference

    J. Wang, J. Fang, J. Pan, A. Li, and P. Yang, “PipeFusion: Displaced patch pipeline parallelism for inference of diffusion transformer models,”arXiv preprint arXiv:2405.14430, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2405.14430

  45. [45]

    AsyncDiff: Parallelizing diffusion models by asynchronous denoising,

    Z. Chen, X. Ma, G. Fang, Z. Tan, and X. Wang, “AsyncDiff: Parallelizing diffusion models by asynchronous denoising,”arXiv preprint arXiv:2406.06911, 2024. [Online]. Available: https://doi.org/ 10.48550/arXiv.2406.06911

  46. [46]

ScaleFusion: Scalable inference of spatial-temporal diffusion transformers for high-resolution long video generation,

    H. Chen, Z. He, R. Chen, Z. Ye, J. Xue, C. Zhuo, and L. Ma, “Scale- Fusion: Scalable inference of spatial-temporal diffusion transformers for high-resolution long video generation,” inProceedings of Machine Learning and Systems (MLSys), 2025

  47. [47]

StreamFusion: Scalable sequence parallelism for distributed inference of diffusion transformers on GPUs

    J. Luo, Y . Li, and C. Zhang, “StreamFusion: Scalable sequence parallelism for distributed inference of diffusion transformers on GPUs,”arXiv preprint arXiv:2601.20273, 2026. [Online]. Available: https://doi.org/10.48550/arXiv.2601.20273

  48. [48]

    CacheQuant: Comprehensively accelerated diffusion models,

    J. Liu and Z. Wang, “CacheQuant: Comprehensively accelerated diffusion models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2503.01323

  49. [49]

AdaCache: Adaptive caching for faster video generation with diffusion transformers

    K. Kahatapitiya, H. Zheng, M. Jia, X. Zhang, M. S. Ryoo, and T. Xie, “AdaCache: Adaptive caching for faster video generation with diffusion transformers,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2411.02397

  50. [50]

From reusing to forecasting: Accelerating diffusion models with TaylorSeers

    F. Liu, S. Huang, H. Liu, Y . Tang, K. Han, and Y . Wang, “From reusing to forecasting: Accelerating diffusion models with TaylorSeers,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025. [Online]. Available: https: //doi.org/10.48550/arXiv.2503.06923

  51. [51]

    SpeCa: Accelerating diffusion transformers with speculative feature caching,

    J. Zhao, J. Pan, and H. Zhu, “SpeCa: Accelerating diffusion transformers with speculative feature caching,”arXiv preprint arXiv:2509.11628,

  52. [52]

    Available: https://doi.org/10.1145/3746027.3755331

    [Online]. Available: https://doi.org/10.1145/3746027.3755331

  53. [53]

    MD-HM: Memoization-based Molecular Dynamics Simulations on Big Memory System,

    Z. Xie, W. Dong, J. Liu, I. Peng, Y . Ma, and D. Li, “MD-HM: Memoization-based Molecular Dynamics Simulations on Big Memory System,” inInternational Conference on Supercomputing (ICS), 2021

  54. [54]

    Reducing activation recomputation in large transformer models, 2022

    V . A. Korthikanti, J. Casper, S. Lym, L. McAfee, M. Andersch, M. Shoeybi, and B. Catanzaro, “Reducing activation recomputation in large transformer models,” inProceedings of Machine Learning and Systems, vol. 5, 2023, pp. 341–353. [Online]. Available: https://doi.org/10.48550/arXiv.2205.05198

  55. [55]

Striped attention: Faster ring attention for causal transformers

    W. Brandon, A. Nrusimha, K. Qian, Z. Ankner, T. Jin, Z. Song, and J. Ragan-Kelley, “Striped attention: Faster ring attention for causal transformers,”CoRR, vol. abs/2311.09431, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2311.09431

  56. [56]

LightSeq: Sequence level parallelism for distributed training of long context transformers

    D. Li, R. Shao, A. Xie, E. P. Xing, J. E. Gonzalez, I. Stoica, X. Ma, and H. Zhang, “Lightseq: Sequence level parallelism for distributed training of long context transformers,”arXiv preprint arXiv:2310.03294, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2310.03294

  57. [57]

LoongTrain: Efficient training of long-sequence LLMs with head-context parallelism

    D. Gu, P. Sun, Q. Hu, T. Huang, X. Chen, Y . Xiong, G. Wang, Q. Chen, S. Zhao, J. Fang, Y . Wen, T. Zhang, X. Jin, and X. Liu, “Loongtrain: Efficient training of long-sequence LLMs with head- context parallelism,”arXiv preprint arXiv:2406.18485, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2406.18485 12