Piper: A Programmable Distributed Training System

Andy Ruan; Gilbert Bernstein; Mat Jacob; Megan Frisella; Parker Gustafson; Shubham Tiwari; Stephanie Wang; Yi Pan

arxiv: 2606.11169 · v1 · pith:FBEM6QHFnew · submitted 2026-06-09 · 💻 cs.DC · cs.AI

Piper: A Programmable Distributed Training System

Megan Frisella , Shubham Tiwari , Andy Ruan , Yi Pan , Parker Gustafson , Mat Jacob , Gilbert Bernstein , Stephanie Wang This is my paper

Pith reviewed 2026-06-27 11:32 UTC · model grok-4.3

classification 💻 cs.DC cs.AI

keywords distributed trainingparallelism strategiesDAG intermediate representationZeRODualPipecomposed parallelismuser annotationsscheduling directives

0 comments

The pith

Piper decouples distributed training strategies from the runtime by transforming a unified global DAG IR with user annotations and directives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Piper as a system that lets users declare parallelism strategies through a small set of model annotations and scheduling directives. These directives transform a single intermediate representation consisting of a global training DAG that encodes all computation and communication. From this IR the system generates per-device plans and runs them on a runtime that does not embed knowledge of any particular strategy. The design supports both standard approaches such as ZeRO and more intricate compositions such as DualPipe while keeping performance comparable on the former and adding efficiency on the latter. A sympathetic reader would care because the separation could reduce the engineering cost of trying new combinations of data, pipeline, expert, and memory parallelism.

Core claim

Piper is a user-controllable distributed training system that decouples the strategy from the runtime implementation. Users declare a comprehensive distributed training strategy with a small set of model annotations and scheduling directives. Each directive applies a transformation on Piper's intermediate representation, a unified global training DAG that represents all computation and communication. Using this IR, Piper compiles per-device execution plans and executes them with a distributed runtime agnostic to the strategy.

What carries the argument

The unified global training DAG IR, on which user annotations and scheduling directives apply transformations to produce strategy-specific execution plans while the runtime remains unchanged.

If this is right

The system maintains performance parity on commonly available strategies such as ZeRO.
Joint scheduling of compute and communication in composed strategies such as DualPipe produces additional performance and memory efficiency gains.
New strategies can be integrated by adding transformations to the IR without altering the runtime.
Compositions of data, pipeline, expert, and memory-saving parallelism become expressible through the same annotation mechanism.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Experimenting with novel parallelism combinations becomes feasible without writing new runtime code for each one.
The IR representation could support automated search over strategy spaces by treating directives as composable operators.
Similar decoupling might apply to inference or fine-tuning workloads that also combine multiple parallelism forms.

Load-bearing premise

A small set of model annotations and scheduling directives applied as transformations to the unified global training DAG IR is sufficient to express and efficiently schedule arbitrary compositions of data, pipeline, expert, and memory-saving parallelism without strategy-specific runtime changes.

What would settle it

A new parallelism composition that cannot be expressed or scheduled efficiently using only the provided annotations and directives on the DAG IR, requiring direct modifications to the distributed runtime instead.

Figures

Figures reproduced from arXiv: 2606.11169 by Andy Ruan, Gilbert Bernstein, Mat Jacob, Megan Frisella, Parker Gustafson, Shubham Tiwari, Stephanie Wang, Yi Pan.

**Figure 1.** Figure 1: DualPipe example combining PP-4 across layers with EP-2 for the expert layer and DP-2 for the non-expert attention layer. This placement uses PP stage interleaving; PP rank 0 gets the first and last layers, etc. We show a variant of DualPipe known as DualPipeV [35] that places each layer on one PP rank instead of two. Finally, we build an efficient and flexible distributed runtime to execute the combined … view at source ↗

**Figure 2.** Figure 2: DualPipeV schedule [35] for 4-way PP, 8 microbatches, 8 layers. The placement is shown in [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Different strategies for applying intra-device parallelism. (a) Serial forward-backward execution. (b) Interleaving a forward and backward microbatch can effectively hide all-to-alls. (c) Further microbatching within the forwards enables overlapping but reduces hardware efficiency. AR A2A Attn MLP A2A LN A2A A2A AR LN Compute stream Comm stream MLP(B) MLP(W) Attn(B) Attn(W) AR A2A Attn MLP A2A LN A2A A2A… view at source ↗

**Figure 4.** Figure 4: Different low-level execution strategies corresponding to an overlapped microbatch pair in DualPipe (Figures 1 and 2), now including the allreduce from using DP for Attn (dark blue AR boxes). (a) Using a separate stream for DP communication can add interference to all-to-alls. (b) Scheduling DP communication on the stream delays EP allto-alls. (c) Partitioning DP communication into finer-grained operati… view at source ↗

**Figure 5.** Figure 5: Piper overview. White rounded boxes are user inputs. Grey areas are system. into a distributed execution plan. Second, the runtime executes the plan on distributed workers in a way that is agnostic to the specific execution strategy. Together, these components let Piper represent and realize a rich set of training schedules without hard-coding specific parallelism strategy compositions or optimizations i… view at source ↗

**Figure 6.** Figure 6: Line 9 splits the model into two microbatches ((4) in Figure 6). Split duplicates the entire DAG because its filter matches everything, assigning each DAG copy a unique index within the new MB dimension. Line 10 assigns the microbatch ordering depicted in (5) in [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: PP x EP throughput for 1F1B, Interleaved 1F1B and DualPipeV schedules. variant [30]. Interleaved 1F1B can improve training throughput for large models compared to 1F1B due to better microbatch overlapping from finer-grained interleaved stages. TorchTitan allows users to provide a PP schedule, similar to Piper, and provides generic PP schedule-builders for 1F1B and interleaved 1F1B. We adapt the 1F1B and … view at source ↗

**Figure 8.** Figure 8: shows the results. Megatron, DeepSpeed, and other PP x ZeRO-1 variants OOM in all cases because they cannot fit even the smallest batch size. TorchTitan can fit the smaller batch sizes but has higher memory consumption than Piper because it does not properly re-shard weights and gradients, despite setting the provided flag for weights resharding after forwards. As a result, TorchTitan maintains a [PITH_F… view at source ↗

**Figure 9.** Figure 9: Piper PP x DP Scalability. schedule and 6% over Megatron’s interleaved schedule (Figure 7b. In this case, Megatron-Interleaved-1F1B’s improvement over Piper-Interleaved-1F1B comes from fused kernels. Note that Megatron does not fuse the EP all-to-all; thus we expect that Megatron’s fused kernels could be directly integrated into Piper for additional gains. 6.4 Scalability We evaluate Piper’s PP and DP s… view at source ↗

read the original abstract

Large-scale model training increasingly relies on composing multiple parallelism strategies, such as data, pipeline, and expert parallelism, together with memory-saving optimizations like ZeRO. Deployed systems for foundation model pretraining often rely on human experts to manually design a high-level parallelism strategy then implement the corresponding low-level execution strategy, making it difficult to adapt the system to new strategies. Meanwhile, many general-purpose frameworks are more flexible but their implementations are still tied to a fixed set of common parallelism strategies, making it challenging to integrate state-of-the-art strategies. We present Piper, a user-controllable distributed training system that decouples the strategy from the runtime implementation. Piper allows users to declare a comprehensive distributed training strategy with a small set of model annotations and scheduling directives. Each directive applies a transformation on Piper's intermediate representation (IR), a unified global training DAG that represents all computation and communication. Using this IR, Piper compiles per-device execution plans and executes them with a distributed runtime agnostic to the strategy. We show that the combined system maintains performance parity on commonly available strategies such as ZeRO, while also enabling additional performance and memory efficiency gains through joint scheduling of compute and communication in composed parallelism strategies such as DeepSeek-V3's DualPipe.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Piper's core pitch is a user annotation layer plus DAG transformations that supposedly lets you compose parallelism strategies without touching the runtime, but the abstract gives no evidence this actually works for cases like DualPipe.

read the letter

The paper's main move is to put a small set of model annotations and scheduling directives on top of a unified global training DAG IR. Those directives turn into transformations that produce per-device plans, and the runtime itself stays strategy-agnostic. The claim is that this keeps ZeRO-level performance while unlocking joint compute/communication schedules for composed strategies such as DualPipe.

What looks new is the concrete annotation language and directive set aimed at arbitrary compositions rather than just the usual fixed menu of data/pipeline/ZeRO. The framing of the problem is also clear: experts hand-craft strategies and general frameworks are locked to a few patterns.

The execution model itself is straightforward and the separation of concerns is a reasonable engineering goal. If the transformations really are small and general, it could cut the cost of trying new combinations.

The soft spot is that none of this is shown. The abstract asserts parity and gains but supplies no numbers, no IR node types, no example transformation for DualPipe, and no validation that the schedule matches hand-tuned versions. The stress-test assumption—that a fixed small annotation set plus DAG rewrites is enough for any composition without new primitives or runtime changes—remains untested on the information given.

This is for people who build or maintain distributed training systems and want a more programmable way to express parallelism. A reader who needs to evaluate whether the decoupling holds up should wait for the full paper with the actual transformations and measurements.

I would send it to peer review. The artifact addresses a real friction point and the design is coherent on its own terms; the authors just need to supply the missing evidence on generality and performance.

Referee Report

3 major / 1 minor

Summary. The paper presents Piper, a user-controllable distributed training system that decouples parallelism strategy from runtime implementation. Users declare strategies via a small set of model annotations and scheduling directives; each directive transforms a unified global training DAG IR representing all computation and communication. Piper then compiles per-device execution plans and runs them on a strategy-agnostic distributed runtime. The abstract claims performance parity with ZeRO and additional gains via joint compute/communication scheduling in composed strategies such as DeepSeek-V3's DualPipe.

Significance. If the central decoupling claim holds and the small annotation set plus DAG transformations can express arbitrary compositions without runtime modifications, the work would meaningfully advance flexibility in large-scale training frameworks, reducing reliance on hand-tuned implementations for new strategies.

major comments (3)

[Abstract, §3] Abstract and §3 (IR and transformations): the central claim that a fixed small set of annotations and directives suffices for arbitrary compositions (including DualPipe) without strategy-specific runtime changes is not supported by any concrete transformation definitions, IR node types, or derivation of the DualPipe schedule. The manuscript supplies no example showing how DualPipe is encoded as transformations on the global DAG.
[Abstract] Abstract: the claims of 'performance parity on commonly available strategies such as ZeRO' and 'additional performance and memory efficiency gains' are stated without any quantitative results, error bars, baselines, or experimental setup details, leaving the empirical support for the system unverified.
[§4] §4 (runtime): the assertion that the distributed runtime remains agnostic to the strategy rests on the IR producing per-device plans, but no evidence is given that the runtime requires no strategy-specific logic when executing the DualPipe schedule or other novel compositions.

minor comments (1)

[Abstract] The abstract would benefit from a brief mention of the number of annotations/directives in the 'small set' and the size of the IR.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications and indicate revisions to strengthen the presentation of the IR transformations, empirical claims, and runtime design.

read point-by-point responses

Referee: [Abstract, §3] Abstract and §3 (IR and transformations): the central claim that a fixed small set of annotations and directives suffices for arbitrary compositions (including DualPipe) without strategy-specific runtime changes is not supported by any concrete transformation definitions, IR node types, or derivation of the DualPipe schedule. The manuscript supplies no example showing how DualPipe is encoded as transformations on the global DAG.

Authors: We agree that an explicit worked example is needed to substantiate the claim. The revised manuscript will add to §3 a concrete derivation of the DualPipe schedule, specifying the IR node types, the sequence of annotations and directives, and the resulting transformations on the global DAG. This will demonstrate that the small annotation set encodes the composition without requiring changes to the runtime. revision: yes
Referee: [Abstract] Abstract: the claims of 'performance parity on commonly available strategies such as ZeRO' and 'additional performance and memory efficiency gains' are stated without any quantitative results, error bars, baselines, or experimental setup details, leaving the empirical support for the system unverified.

Authors: The abstract summarizes the key findings; the supporting quantitative results (including ZeRO parity numbers, DualPipe gains, error bars, baselines, and setup) appear in §5. We will revise the abstract to reference the evaluation section and, space permitting, include one or two representative metrics. revision: partial
Referee: [§4] §4 (runtime): the assertion that the distributed runtime remains agnostic to the strategy rests on the IR producing per-device plans, but no evidence is given that the runtime requires no strategy-specific logic when executing the DualPipe schedule or other novel compositions.

Authors: The runtime is intentionally strategy-agnostic because it only interprets the per-device plans emitted by the IR compiler. The revised §4 will add a description of the execution loop together with a reference to the DualPipe example (updated in §3) to show explicitly that no strategy-specific code paths are present. revision: yes

Circularity Check

0 steps flagged

No circularity; claim rests on implemented system, not derivation

full rationale

The paper is a systems description of Piper, which uses a unified DAG IR, model annotations, and scheduling directives to produce per-device plans. No equations, parameters, or mathematical derivations appear in the abstract or described content. The decoupling claim is justified by the runtime being strategy-agnostic and by empirical parity/gains on ZeRO and DualPipe, not by any step that reduces to its own inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked. This is a standard non-circular implementation claim for a programmable systems paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

The paper introduces one new entity (the unified global training DAG IR) whose purpose is to serve as the single representation that all directives transform; no free parameters or additional axioms are visible in the abstract. The IR is postulated without independent evidence outside the system itself.

invented entities (1)

unified global training DAG IR no independent evidence
purpose: single intermediate representation that all user directives transform and from which per-device plans are compiled
The abstract states that each directive applies a transformation on this IR and that the runtime is agnostic to the strategy; no external validation of the IR's completeness is provided.

pith-pipeline@v0.9.1-grok · 5761 in / 1369 out tokens · 17236 ms · 2026-06-27T11:32:32.415909+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

59 extracted references · 5 canonical work pages

[1]

DeepSpeed: Extreme-scale model training for every- one.https://www.microsoft.com/en-us/research/blog/deepspeed- extreme-scale-model-training-for-everyone

2025. DeepSpeed: Extreme-scale model training for every- one.https://www.microsoft.com/en-us/research/blog/deepspeed- extreme-scale-model-training-for-everyone

2025
[2]

torch.distributed.tensor.https://docs.pytorch.org/docs/stable/ distributed.tensor.html

2025. torch.distributed.tensor.https://docs.pytorch.org/docs/stable/ distributed.tensor.html. Package

2025
[3]

Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irv- ing, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: a system f...

2016
[4]

Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, Geeta Chauhan, Anjali Chourdia, Will Constable, Alban Desmaison, Zachary DeVito, Elias Ellison, Will Feng, Jiong Gong, Michael Gschwind, Brian Hirsh, Sherlock Huang, Kshiteej Kalam- barkar, Laurent Kirsch, Michael...

work page doi:10.1145/3620665.3640366 2024
[5]

Thekkath, and Yonghui Wu

Paul Barham, Aakanksha Chowdhery, Jeff Dean, Sanjay Ghemawat, Steven Hand, Dan Hurt, Michael Isard, Hyeontaek Lim, Ruom- ing Pang, Sudip Roy, Brennan Saeta, Parker Schuh, Ryan Sepa- ssi, Laurent El Shafey, Chandramohan A. Thekkath, and Yonghui Wu. 2022. Pathways: Asynchronous Distributed Dataflow for ML. arXiv:2203.12533 [cs.DC]https://arxiv.org/abs/2203.12533

arXiv 2022
[6]

James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, et al. 2021. Jax: Autograd and xla.Astrophysics Source Code Library(2021), ascl–2111

2021
[7]

Ruisheng Cao, Mouxiang Chen, Jiawei Chen, Zeyu Cui, Yunlong Feng, Binyuan Hui, Yuheng Jing, Kaixin Li, Mingze Li, Junyang Lin, Zeyao Ma, Kashun Shum, Xuwu Wang, Jinxi Wei, Jiaxi Yang, Jiajun Zhang, Lei Zhang, Zongmeng Zhang, Wenting Zhao, and Fan Zhou. 2026. Qwen3-Coder-Next Technical Report. arXiv:2603.00729 [cs.CL]https: //arxiv.org/abs/2603.00729

Pith/arXiv arXiv 2026
[8]

Li-Wen Chang, Wenlei Bao, Qi Hou, Chengquan Jiang, Ningxin Zheng, Yinmin Zhong, Xuanrun Zhang, Zuquan Song, Chengji Yao, Ziheng Jiang, et al. 2024. Flux: Fast software-based communication overlap on gpus through kernel fusion.arXiv preprint arXiv:2406.06858(2024)

arXiv 2024
[9]

Chang Chen, Xiuhong Li, Qianchao Zhu, Jiangfei Duan, Peng Sun, Xingcheng Zhang, and Chao Yang. 2024. Centauri: Enabling efficient scheduling for communication-computation overlap in large model training via communication partitioning. InProceedings of the 29th ACM International Conference on Architectural Support for Program- ming Languages and Operating ...

2024
[10]

Hongzheng Chen, Cody Hao Yu, Shuai Zheng, Zhen Zhang, Zhiru Zhang, and Yida Wang. 2024. Slapo: A schedule language for progres- sive optimization of large deep learning model training. InProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 1095–1111

2024
[11]

Yan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy

Tianqi Chen, Thierry Moreau, Ziheng Jiang, Haichen Shen, Eddie Q. Yan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: End-to-End Optimization Stack for Deep Learning.CoRRabs/1802.04799 (2018). arXiv:1802.04799http://arxiv. org/abs/1802.04799

Pith/arXiv arXiv 2018
[12]

Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Li, Hui Qu, J

DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, Huaj...

Pith/arXiv arXiv 2025
[13]

Megan Frisella, Arvin Oentoro, Xiangyu Gao, Gilbert Bernstein, and Stephanie Wang. 2025. Piper: Towards Flexible Pipeline Parallelism for PyTorch. InProceedings of the 4th Workshop on Practical Adop- tion Challenges of ML for Systems(Seoul, Republic of Korea)(PACMI ’25). Association for Computing Machinery, New York, NY, USA, 1–6. doi:10.1145/3766882.3767187

work page doi:10.1145/3766882.3767187 2025
[14]

Ahan Gupta, Zhihao Wang, Neel Dani, Masahiro Tanaka, Olatunji Ruwase, and Minjia Zhang. [n. d.]. AutoSP: Unlocking Long-Context LLM Training Via Compiler-Based Sequence Parallelism. InThe Four- teenth International Conference on Learning Representations
[15]

Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Nikhil Devanur, Greg Ganger, and Phil Gibbons. 2018. PipeDream: Fast and Efficient Pipeline Parallel DNN Training. arXiv:1806.03377 [cs.DC] https://arxiv.org/abs/1806.03377

Pith/arXiv arXiv 2018
[16]

Weifang Hu, Langshi Chen, Man Yuan, Youyang Yao, Xiulong Yuan, Li Tian, Yong Li, Wei Lin, Xuanhua Shi, Zhengping Qian, and Jingren Zhou. 2026. Tessera: A Holistic Pipeline Parallelism Framework for Trillion-Parameter Heterogeneous MoE Training. InProceedings of the 20th USENIX Symposium on Operating Systems Design and Imple- mentation (OSDI ’26). To appear. 13

2026
[17]

Le, Yonghui Wu, and Zhifeng Chen

Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Mia Xu Chen, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, and Zhifeng Chen. 2019.GPipe: efficient training of giant neural networks using pipeline parallelism. Curran Associates Inc., Red Hook, NY, USA

2019
[18]

Abhinav Jangda, Jun Huang, Guodong Liu, Amir Hossein Nodehi Sa- bet, Saeed Maleki, Youshan Miao, Madanlal Musuvathi, Todd Mytkow- icz, and Olli Saarikivi. 2022. Breaking the computation and communica- tion abstraction barrier in distributed machine learning workloads. In Proceedings of the 27th ACM International Conference on Architectural Support for Pro...

2022
[19]

Zhihao Jia, Matei Zaharia, and Alex Aiken. 2019. Beyond data and model parallelism for deep neural networks.Proceedings of Machine Learning and Systems1 (2019), 1–13

2019
[20]

Zhiquan Lai, Shengwei Li, Xudong Tang, Keshi Ge, Weijie Liu, Yabo Duan, Linbo Qiao, and Dongsheng Li. 2023. Merak: An efficient distributed dnn training framework with automated 3d parallelism for giant foundation models.IEEE Transactions on Parallel and Distributed Systems34, 5 (2023), 1466–1478

2023
[21]

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2020. Gshard: Scaling giant models with conditional computation and automatic sharding.arXiv preprint arXiv:2006.16668 (2020)

Pith/arXiv arXiv 2020
[22]

Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, and Soumith Chintala. 2020. PyTorch Distributed: Experiences on Accelerating Data Parallel Training. arXiv:2006.15704 [cs.DC]https: //arxiv.org/abs/2006.15704

Pith/arXiv arXiv 2020
[23]

Wanchao Liang, Tianyu Liu, Less Wright, Will Constable, Andrew Gu, Chien-Chin Huang, Iris Zhang, Wei Feng, Howard Huang, Junjie Wang, Sanket Purandare, Gokul Nadathur, and Stratos Idreos. 2025. TorchTitan: One-stop PyTorch native solution for production ready LLM pretraining. InThe Thirteenth International Conference on Learn- ing Representations.https://...

2025
[24]

Lightning.ai. [n. d.]. Faster PyTorch Training by Reducing Peak Mem- ory (combining backward pass + optimizer step) - Lightning AI — lightning.ai.https://lightning.ai/pages/community/tutorial/faster- pytorch-training-by-reducing-peak-memory/. [Accessed 23-04- 2026]

2026
[25]

Zhiqi Lin, Youshan Miao, Quanlu Zhang, Fan Yang, Yi Zhu, Cheng Li, Saeed Maleki, Xu Cao, Ning Shang, Yilei Yang, et al . 2024. {nnScaler}:{Constraint-Guided} Parallelization Plan Generation for Deep Learning Training. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 347–363

2024
[26]

Guodong Liu, Youshan Miao, Zhiqi Lin, Xiaoxiang Shi, Saeed Maleki, Fan Yang, Yungang Bao, and Sa Wang. 2024. Aceso: Efficient parallel DNN training through iterative bottleneck alleviation. InProceedings of the Nineteenth European Conference on Computer Systems. 163–181

2024
[27]

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual Instruction Tuning. InAdvances in Neural Information Pro- cessing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.), Vol. 36. Curran Associates, Inc., 34892– 34916.https://proceedings.neurips.cc/paper_files/paper/2023/file/ 6dcf277ea32ce3288914faf369fe6...

2023
[28]

Xupeng Miao, Yujie Wang, Youhe Jiang, Chunan Shi, Xiaonan Nie, Hailin Zhang, and Bin Cui. 2022. Galvatron: Efficient transformer training over multiple gpus using automatic parallelism.arXiv preprint arXiv:2211.13878(2022)

arXiv 2022
[29]

Deepak Narayanan, Amar Phanishayee, Kaiyu Shi, Xie Chen, and Matei Zaharia. 2021. Memory-Efficient Pipeline-Parallel DNN Train- ing. InProceedings of the 38th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 139), Marina Meila and Tong Zhang (Eds.). PMLR, 7937–7947.https://proceedings. mlr.press/v139/narayanan21a.html

2021
[30]

Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Anand Korthikanti, Dmitri Vain- brand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. 2021. Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. arXiv:2104.04473 [cs.CL]https://arxiv.org/abs/2104.04473

arXiv 2021
[31]

NVIDIA. [n. d.]. NVIDIA Collective Communications Library (NCCL). https://developer.nvidia.com/nccl
[32]

Yi Pan, Yile Gu, Jinbin Luo, Yibo Wu, Ziren Wang, Hongtao Zhang, Ziyi Xu, Shengkai Lin, Baris Kasikci, and Stephanie Wang. 2026. DynaFlow: Transparent and Flexible Intra-Device Parallelism via Programmable Operator Scheduling.Proceedings of Machine Learning and Systems. To appear

2026
[33]

Yanghua Peng, Yibo Zhu, Yangrui Chen, Yixin Bao, Bairen Yi, Chang Lan, Chuan Wu, and Chuanxiong Guo. 2019. A generic communication scheduler for distributed DNN training acceleration. InProceedings of the 27th ACM Symposium on Operating Systems Principles. 16–29

2019
[34]

Penghui Qi, Xinyi Wan, Guangxing Huang, and Min Lin. 2023. Zero Bubble Pipeline Parallelism.ArXivabs/2401.10241 (2023).https: //api.semanticscholar.org/CorpusID:267060979

arXiv 2023
[35]

Penghui Qi, Xinyi Wan, Guangxing Huang, and Min Lin. 2025. DualPipe could be better without the Dual.https://hackmd.io/ @ufotalent/r1lVXsa9Jg. Blog

2025
[36]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Su- pervision. InProceedings of the 38th International Conference on Ma- chine Learning (Proceedings of Mac...

2021
[37]

Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. 2013. Halide: a lan- guage and compiler for optimizing parallelism, locality, and recompu- tation in image processing pipelines.SIGPLAN Not.48, 6 (June 2013), 519–530. doi:10.1145/2499370.2462176

work page doi:10.1145/2499370.2462176 2013
[38]

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He
[39]

arXiv:1910.02054 http://arxiv.org/abs/1910.02054

ZeRO: Memory Optimization Towards Training A Trillion Parameter Models.CoRRabs/1910.02054 (2019). arXiv:1910.02054 http://arxiv.org/abs/1910.02054

Pith/arXiv arXiv 1910
[40]

James Reed, Pavel Belevich, Ke Wen, Howard Huang, and Will Con- stable. 2022. PiPPy: Pipeline Parallelism for PyTorch.https://github. com/pytorch/PiPPy

2022
[41]

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. CoRRabs/1909.08053 (2019). arXiv:1909.08053http://arxiv.org/abs/ 1909.08053

Pith/arXiv arXiv 2019
[42]

Zhenbo Sun, Huanqi Cao, Yuanwei Wang, Guanyu Feng, Shengqi Chen, Haojie Wang, and Wenguang Chen. 2024. Adapipe: Optimizing pipeline parallelism with adaptive recomputation and partitioning. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3. 86–100

2024
[43]

Masahiro Tanaka, Du Li, Umesh Chand, Ali Zafar, Haiying Shen, and Olatunji Ruwase. 2025. DeepCompile: A Compiler-Driven Approach to Optimizing Distributed Deep Learning Training.arXiv preprint arXiv:2504.09983(2025)

arXiv 2025
[44]

Jakub M Tarnawski, Deepak Narayanan, and Amar Phanishayee. 2021. Piper: Multidimensional planner for dnn parallelization.Advances in Neural Information Processing Systems34 (2021), 24829–24840

2021
[45]

Colin Unger, Zhihao Jia, Wei Wu, Sina Lin, Mandeep Baines, Carlos Efrain Quintero Narvaez, Vinay Ramakrishnaiah, Nirmal Prajapati, Pat 14 McCormick, Jamaludin Mohd-Yusof, et al. 2022. Unity: Accelerating {DNN} training through joint optimization of algebraic transforma- tions and parallelization. In16th USENIX Symposium on Operating Systems Design and Imp...

2022
[46]

Minjie Wang, Chien-chin Huang, and Jinyang Li. 2019. Supporting very large models using automatic dataflow graph partitioning. In Proceedings of the Fourteenth EuroSys Conference 2019. 1–17

2019
[47]

Shibo Wang, Jinliang Wei, Amit Sabne, Andy Davis, Berkin Ilbeyi, Blake Hechtman, Dehao Chen, Karthik Srinivasa Murthy, Marcello Maggioni, Qiao Zhang, et al . 2022. Overlap communication with dependent computation via decomposition in large deep learning models. InProceedings of the 28th ACM International Conference on Ar- chitectural Support for Programmi...

2022
[48]

Yifu Wang, Horace He, Less Wright, Luca Wehrstedt, Tianyu Liu, and Wanchao Liang. 2024. Distributed w/ TorchTi- tan: Introducing Async Tensor Parallelism in PyTorch. https://discuss.pytorch.org/t/distributed-w-torchtitan-introducing- async-tensor-parallelism-in-pytorch/209487. PyTorch Forum Post. Accessed: 2026-04-23

2024
[49]

Anxhelo Xhebraj, Sean Lee, Hanfeng Chen, and Vinod Grover. 2025. Scaling deep learning training with MPMD pipeline parallelism.Pro- ceedings of Machine Learning and Systems7 (2025)

2025
[50]

Yuanzhong Xu, HyoukJoong Lee, Dehao Chen, Blake Hechtman, Yanping Huang, Rahul Joshi, Maxim Krikun, Dmitry Lepikhin, Andy Ly, Marcello Maggioni, Ruoming Pang, Noam Shazeer, Shibo Wang, Tao Wang, Yonghui Wu, and Zhifeng Chen. 2021. GSPMD: General and Scalable Parallelization for ML Computation Graphs. arXiv:2105.04663 [cs.DC]https://arxiv.org/abs/2105.04663

Pith/arXiv arXiv 2021
[51]

Amy Yang, Jingyi Yang, Aya Ibrahim, Xinfeng Xie, Bangsheng Tang, Grigory Sizov, Jongsoo Park, and Jianyu Huang. 2025. Context Paral- lelism for Scalable Million-Token Inference. InProceedings of Machine Learning and Systems, M. Zaharia, G. Joshi, and Y. Lin (Eds.), Vol. 7. MLSys.https://proceedings.mlsys.org/paper_files/paper/2025/file/ 78834433edc3291f4c...

2025
[52]

Tailing Yuan, Yuliang Liu, Xucheng Ye, Shenglong Zhang, Jianchao Tan, Bin Chen, Chengru Song, and Di Zhang. 2024. Accelerating the training of large language models using efficient activation remate- rialization and optimal hybrid parallelism. In2024 USENIX Annual Technical Conference (USENIX ATC 24). 545–561

2024
[53]

Duzhen Zhang, Yahan Yu, Jiahua Dong, Chenxing Li, Dan Su, Chenhui Chu, and Dong Yu. 2024. MM-LLMs: Recent Advances in MultiModal Large Language Models. InFindings of the Association for Computa- tional Linguistics: ACL 2024, Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics, Bangkok, Thailand, 12401–12430. doi...

work page doi:10.18653/v1/2024.findings-acl.738 2024
[54]

Zili Zhang, Yinmin Zhong, Yimin Jiang, Hanpeng Hu, Jianjian Sun, Zheng Ge, Yibo Zhu, Daxin Jiang, and Xin Jin. 2025. DistTrain: Ad- dressing Model and Data Heterogeneity with Disaggregated Training for Multimodal Large Language Models. InProceedings of the ACM SIGCOMM 2025 Conference(São Francisco Convent, Coimbra, Por- tugal)(SIGCOMM ’25). Association fo...

work page doi:10.1145/3718958.3750472 2025
[55]

Chenggang Zhao, Shangyan Zhou, Liyue Zhang, Chengqi Deng, Zhean Xu, Yuxuan Liu, Kuai Yu, Jiashi Li, and Liang Zhao. 2025. DeepEP: an efficient expert-parallel communication library.https://github.com/ deepseek-ai/DeepEP

2025
[56]

Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Pritam Damania, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, Ajit Mathews, and Shen Li. 2023. Py- Torch FSDP: Experiences on Scaling Fully Sharded Data Parallel. arXiv:2304.11277 [cs.DC]https://arxiv...

Pith/arXiv arXiv 2023
[57]

Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P Xing, et al. 2022. Alpa: Automating inter-and {Intra-Operator} parallelism for distributed deep learning. In16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 559–578

2022
[58]

Zhanda Zhu, Christina Giannoula, Muralidhar Andoorveedu, Qidong Su, Karttikeya Mangalam, Bojian Zheng, and Gennady Pekhimenko
[59]

InProceedings of the Twentieth European Conference on Computer Systems

Mist: Efficient Distributed Training of Large Language Mod- els via Memory-Parallelism Co-Optimization. InProceedings of the Twentieth European Conference on Computer Systems. 1298–1316. 15

[1] [1]

DeepSpeed: Extreme-scale model training for every- one.https://www.microsoft.com/en-us/research/blog/deepspeed- extreme-scale-model-training-for-everyone

2025. DeepSpeed: Extreme-scale model training for every- one.https://www.microsoft.com/en-us/research/blog/deepspeed- extreme-scale-model-training-for-everyone

2025

[2] [2]

torch.distributed.tensor.https://docs.pytorch.org/docs/stable/ distributed.tensor.html

2025. torch.distributed.tensor.https://docs.pytorch.org/docs/stable/ distributed.tensor.html. Package

2025

[3] [3]

Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irv- ing, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: a system f...

2016

[4] [4]

Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, Geeta Chauhan, Anjali Chourdia, Will Constable, Alban Desmaison, Zachary DeVito, Elias Ellison, Will Feng, Jiong Gong, Michael Gschwind, Brian Hirsh, Sherlock Huang, Kshiteej Kalam- barkar, Laurent Kirsch, Michael...

work page doi:10.1145/3620665.3640366 2024

[5] [5]

Thekkath, and Yonghui Wu

Paul Barham, Aakanksha Chowdhery, Jeff Dean, Sanjay Ghemawat, Steven Hand, Dan Hurt, Michael Isard, Hyeontaek Lim, Ruom- ing Pang, Sudip Roy, Brennan Saeta, Parker Schuh, Ryan Sepa- ssi, Laurent El Shafey, Chandramohan A. Thekkath, and Yonghui Wu. 2022. Pathways: Asynchronous Distributed Dataflow for ML. arXiv:2203.12533 [cs.DC]https://arxiv.org/abs/2203.12533

arXiv 2022

[6] [6]

James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, et al. 2021. Jax: Autograd and xla.Astrophysics Source Code Library(2021), ascl–2111

2021

[7] [7]

Ruisheng Cao, Mouxiang Chen, Jiawei Chen, Zeyu Cui, Yunlong Feng, Binyuan Hui, Yuheng Jing, Kaixin Li, Mingze Li, Junyang Lin, Zeyao Ma, Kashun Shum, Xuwu Wang, Jinxi Wei, Jiaxi Yang, Jiajun Zhang, Lei Zhang, Zongmeng Zhang, Wenting Zhao, and Fan Zhou. 2026. Qwen3-Coder-Next Technical Report. arXiv:2603.00729 [cs.CL]https: //arxiv.org/abs/2603.00729

Pith/arXiv arXiv 2026

[8] [8]

Li-Wen Chang, Wenlei Bao, Qi Hou, Chengquan Jiang, Ningxin Zheng, Yinmin Zhong, Xuanrun Zhang, Zuquan Song, Chengji Yao, Ziheng Jiang, et al. 2024. Flux: Fast software-based communication overlap on gpus through kernel fusion.arXiv preprint arXiv:2406.06858(2024)

arXiv 2024

[9] [9]

Chang Chen, Xiuhong Li, Qianchao Zhu, Jiangfei Duan, Peng Sun, Xingcheng Zhang, and Chao Yang. 2024. Centauri: Enabling efficient scheduling for communication-computation overlap in large model training via communication partitioning. InProceedings of the 29th ACM International Conference on Architectural Support for Program- ming Languages and Operating ...

2024

[10] [10]

Hongzheng Chen, Cody Hao Yu, Shuai Zheng, Zhen Zhang, Zhiru Zhang, and Yida Wang. 2024. Slapo: A schedule language for progres- sive optimization of large deep learning model training. InProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 1095–1111

2024

[11] [11]

Yan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy

Tianqi Chen, Thierry Moreau, Ziheng Jiang, Haichen Shen, Eddie Q. Yan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: End-to-End Optimization Stack for Deep Learning.CoRRabs/1802.04799 (2018). arXiv:1802.04799http://arxiv. org/abs/1802.04799

Pith/arXiv arXiv 2018

[12] [12]

Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Li, Hui Qu, J

DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, Huaj...

Pith/arXiv arXiv 2025

[13] [13]

Megan Frisella, Arvin Oentoro, Xiangyu Gao, Gilbert Bernstein, and Stephanie Wang. 2025. Piper: Towards Flexible Pipeline Parallelism for PyTorch. InProceedings of the 4th Workshop on Practical Adop- tion Challenges of ML for Systems(Seoul, Republic of Korea)(PACMI ’25). Association for Computing Machinery, New York, NY, USA, 1–6. doi:10.1145/3766882.3767187

work page doi:10.1145/3766882.3767187 2025

[14] [14]

Ahan Gupta, Zhihao Wang, Neel Dani, Masahiro Tanaka, Olatunji Ruwase, and Minjia Zhang. [n. d.]. AutoSP: Unlocking Long-Context LLM Training Via Compiler-Based Sequence Parallelism. InThe Four- teenth International Conference on Learning Representations

[15] [15]

Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Nikhil Devanur, Greg Ganger, and Phil Gibbons. 2018. PipeDream: Fast and Efficient Pipeline Parallel DNN Training. arXiv:1806.03377 [cs.DC] https://arxiv.org/abs/1806.03377

Pith/arXiv arXiv 2018

[16] [16]

Weifang Hu, Langshi Chen, Man Yuan, Youyang Yao, Xiulong Yuan, Li Tian, Yong Li, Wei Lin, Xuanhua Shi, Zhengping Qian, and Jingren Zhou. 2026. Tessera: A Holistic Pipeline Parallelism Framework for Trillion-Parameter Heterogeneous MoE Training. InProceedings of the 20th USENIX Symposium on Operating Systems Design and Imple- mentation (OSDI ’26). To appear. 13

2026

[17] [17]

Le, Yonghui Wu, and Zhifeng Chen

Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Mia Xu Chen, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, and Zhifeng Chen. 2019.GPipe: efficient training of giant neural networks using pipeline parallelism. Curran Associates Inc., Red Hook, NY, USA

2019

[18] [18]

Abhinav Jangda, Jun Huang, Guodong Liu, Amir Hossein Nodehi Sa- bet, Saeed Maleki, Youshan Miao, Madanlal Musuvathi, Todd Mytkow- icz, and Olli Saarikivi. 2022. Breaking the computation and communica- tion abstraction barrier in distributed machine learning workloads. In Proceedings of the 27th ACM International Conference on Architectural Support for Pro...

2022

[19] [19]

Zhihao Jia, Matei Zaharia, and Alex Aiken. 2019. Beyond data and model parallelism for deep neural networks.Proceedings of Machine Learning and Systems1 (2019), 1–13

2019

[20] [20]

Zhiquan Lai, Shengwei Li, Xudong Tang, Keshi Ge, Weijie Liu, Yabo Duan, Linbo Qiao, and Dongsheng Li. 2023. Merak: An efficient distributed dnn training framework with automated 3d parallelism for giant foundation models.IEEE Transactions on Parallel and Distributed Systems34, 5 (2023), 1466–1478

2023

[21] [21]

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2020. Gshard: Scaling giant models with conditional computation and automatic sharding.arXiv preprint arXiv:2006.16668 (2020)

Pith/arXiv arXiv 2020

[22] [22]

Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, and Soumith Chintala. 2020. PyTorch Distributed: Experiences on Accelerating Data Parallel Training. arXiv:2006.15704 [cs.DC]https: //arxiv.org/abs/2006.15704

Pith/arXiv arXiv 2020

[23] [23]

Wanchao Liang, Tianyu Liu, Less Wright, Will Constable, Andrew Gu, Chien-Chin Huang, Iris Zhang, Wei Feng, Howard Huang, Junjie Wang, Sanket Purandare, Gokul Nadathur, and Stratos Idreos. 2025. TorchTitan: One-stop PyTorch native solution for production ready LLM pretraining. InThe Thirteenth International Conference on Learn- ing Representations.https://...

2025

[24] [24]

Lightning.ai. [n. d.]. Faster PyTorch Training by Reducing Peak Mem- ory (combining backward pass + optimizer step) - Lightning AI — lightning.ai.https://lightning.ai/pages/community/tutorial/faster- pytorch-training-by-reducing-peak-memory/. [Accessed 23-04- 2026]

2026

[25] [25]

Zhiqi Lin, Youshan Miao, Quanlu Zhang, Fan Yang, Yi Zhu, Cheng Li, Saeed Maleki, Xu Cao, Ning Shang, Yilei Yang, et al . 2024. {nnScaler}:{Constraint-Guided} Parallelization Plan Generation for Deep Learning Training. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 347–363

2024

[26] [26]

Guodong Liu, Youshan Miao, Zhiqi Lin, Xiaoxiang Shi, Saeed Maleki, Fan Yang, Yungang Bao, and Sa Wang. 2024. Aceso: Efficient parallel DNN training through iterative bottleneck alleviation. InProceedings of the Nineteenth European Conference on Computer Systems. 163–181

2024

[27] [27]

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual Instruction Tuning. InAdvances in Neural Information Pro- cessing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.), Vol. 36. Curran Associates, Inc., 34892– 34916.https://proceedings.neurips.cc/paper_files/paper/2023/file/ 6dcf277ea32ce3288914faf369fe6...

2023

[28] [28]

Xupeng Miao, Yujie Wang, Youhe Jiang, Chunan Shi, Xiaonan Nie, Hailin Zhang, and Bin Cui. 2022. Galvatron: Efficient transformer training over multiple gpus using automatic parallelism.arXiv preprint arXiv:2211.13878(2022)

arXiv 2022

[29] [29]

Deepak Narayanan, Amar Phanishayee, Kaiyu Shi, Xie Chen, and Matei Zaharia. 2021. Memory-Efficient Pipeline-Parallel DNN Train- ing. InProceedings of the 38th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 139), Marina Meila and Tong Zhang (Eds.). PMLR, 7937–7947.https://proceedings. mlr.press/v139/narayanan21a.html

2021

[30] [30]

Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Anand Korthikanti, Dmitri Vain- brand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. 2021. Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. arXiv:2104.04473 [cs.CL]https://arxiv.org/abs/2104.04473

arXiv 2021

[31] [31]

NVIDIA. [n. d.]. NVIDIA Collective Communications Library (NCCL). https://developer.nvidia.com/nccl

[32] [32]

Yi Pan, Yile Gu, Jinbin Luo, Yibo Wu, Ziren Wang, Hongtao Zhang, Ziyi Xu, Shengkai Lin, Baris Kasikci, and Stephanie Wang. 2026. DynaFlow: Transparent and Flexible Intra-Device Parallelism via Programmable Operator Scheduling.Proceedings of Machine Learning and Systems. To appear

2026

[33] [33]

Yanghua Peng, Yibo Zhu, Yangrui Chen, Yixin Bao, Bairen Yi, Chang Lan, Chuan Wu, and Chuanxiong Guo. 2019. A generic communication scheduler for distributed DNN training acceleration. InProceedings of the 27th ACM Symposium on Operating Systems Principles. 16–29

2019

[34] [34]

Penghui Qi, Xinyi Wan, Guangxing Huang, and Min Lin. 2023. Zero Bubble Pipeline Parallelism.ArXivabs/2401.10241 (2023).https: //api.semanticscholar.org/CorpusID:267060979

arXiv 2023

[35] [35]

Penghui Qi, Xinyi Wan, Guangxing Huang, and Min Lin. 2025. DualPipe could be better without the Dual.https://hackmd.io/ @ufotalent/r1lVXsa9Jg. Blog

2025

[36] [36]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Su- pervision. InProceedings of the 38th International Conference on Ma- chine Learning (Proceedings of Mac...

2021

[37] [37]

Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. 2013. Halide: a lan- guage and compiler for optimizing parallelism, locality, and recompu- tation in image processing pipelines.SIGPLAN Not.48, 6 (June 2013), 519–530. doi:10.1145/2499370.2462176

work page doi:10.1145/2499370.2462176 2013

[38] [38]

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He

[39] [39]

arXiv:1910.02054 http://arxiv.org/abs/1910.02054

ZeRO: Memory Optimization Towards Training A Trillion Parameter Models.CoRRabs/1910.02054 (2019). arXiv:1910.02054 http://arxiv.org/abs/1910.02054

Pith/arXiv arXiv 1910

[40] [40]

James Reed, Pavel Belevich, Ke Wen, Howard Huang, and Will Con- stable. 2022. PiPPy: Pipeline Parallelism for PyTorch.https://github. com/pytorch/PiPPy

2022

[41] [41]

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. CoRRabs/1909.08053 (2019). arXiv:1909.08053http://arxiv.org/abs/ 1909.08053

Pith/arXiv arXiv 2019

[42] [42]

Zhenbo Sun, Huanqi Cao, Yuanwei Wang, Guanyu Feng, Shengqi Chen, Haojie Wang, and Wenguang Chen. 2024. Adapipe: Optimizing pipeline parallelism with adaptive recomputation and partitioning. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3. 86–100

2024

[43] [43]

Masahiro Tanaka, Du Li, Umesh Chand, Ali Zafar, Haiying Shen, and Olatunji Ruwase. 2025. DeepCompile: A Compiler-Driven Approach to Optimizing Distributed Deep Learning Training.arXiv preprint arXiv:2504.09983(2025)

arXiv 2025

[44] [44]

Jakub M Tarnawski, Deepak Narayanan, and Amar Phanishayee. 2021. Piper: Multidimensional planner for dnn parallelization.Advances in Neural Information Processing Systems34 (2021), 24829–24840

2021

[45] [45]

Colin Unger, Zhihao Jia, Wei Wu, Sina Lin, Mandeep Baines, Carlos Efrain Quintero Narvaez, Vinay Ramakrishnaiah, Nirmal Prajapati, Pat 14 McCormick, Jamaludin Mohd-Yusof, et al. 2022. Unity: Accelerating {DNN} training through joint optimization of algebraic transforma- tions and parallelization. In16th USENIX Symposium on Operating Systems Design and Imp...

2022

[46] [46]

Minjie Wang, Chien-chin Huang, and Jinyang Li. 2019. Supporting very large models using automatic dataflow graph partitioning. In Proceedings of the Fourteenth EuroSys Conference 2019. 1–17

2019

[47] [47]

Shibo Wang, Jinliang Wei, Amit Sabne, Andy Davis, Berkin Ilbeyi, Blake Hechtman, Dehao Chen, Karthik Srinivasa Murthy, Marcello Maggioni, Qiao Zhang, et al . 2022. Overlap communication with dependent computation via decomposition in large deep learning models. InProceedings of the 28th ACM International Conference on Ar- chitectural Support for Programmi...

2022

[48] [48]

Yifu Wang, Horace He, Less Wright, Luca Wehrstedt, Tianyu Liu, and Wanchao Liang. 2024. Distributed w/ TorchTi- tan: Introducing Async Tensor Parallelism in PyTorch. https://discuss.pytorch.org/t/distributed-w-torchtitan-introducing- async-tensor-parallelism-in-pytorch/209487. PyTorch Forum Post. Accessed: 2026-04-23

2024

[49] [49]

Anxhelo Xhebraj, Sean Lee, Hanfeng Chen, and Vinod Grover. 2025. Scaling deep learning training with MPMD pipeline parallelism.Pro- ceedings of Machine Learning and Systems7 (2025)

2025

[50] [50]

Yuanzhong Xu, HyoukJoong Lee, Dehao Chen, Blake Hechtman, Yanping Huang, Rahul Joshi, Maxim Krikun, Dmitry Lepikhin, Andy Ly, Marcello Maggioni, Ruoming Pang, Noam Shazeer, Shibo Wang, Tao Wang, Yonghui Wu, and Zhifeng Chen. 2021. GSPMD: General and Scalable Parallelization for ML Computation Graphs. arXiv:2105.04663 [cs.DC]https://arxiv.org/abs/2105.04663

Pith/arXiv arXiv 2021

[51] [51]

Amy Yang, Jingyi Yang, Aya Ibrahim, Xinfeng Xie, Bangsheng Tang, Grigory Sizov, Jongsoo Park, and Jianyu Huang. 2025. Context Paral- lelism for Scalable Million-Token Inference. InProceedings of Machine Learning and Systems, M. Zaharia, G. Joshi, and Y. Lin (Eds.), Vol. 7. MLSys.https://proceedings.mlsys.org/paper_files/paper/2025/file/ 78834433edc3291f4c...

2025

[52] [52]

Tailing Yuan, Yuliang Liu, Xucheng Ye, Shenglong Zhang, Jianchao Tan, Bin Chen, Chengru Song, and Di Zhang. 2024. Accelerating the training of large language models using efficient activation remate- rialization and optimal hybrid parallelism. In2024 USENIX Annual Technical Conference (USENIX ATC 24). 545–561

2024

[53] [53]

Duzhen Zhang, Yahan Yu, Jiahua Dong, Chenxing Li, Dan Su, Chenhui Chu, and Dong Yu. 2024. MM-LLMs: Recent Advances in MultiModal Large Language Models. InFindings of the Association for Computa- tional Linguistics: ACL 2024, Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics, Bangkok, Thailand, 12401–12430. doi...

work page doi:10.18653/v1/2024.findings-acl.738 2024

[54] [54]

Zili Zhang, Yinmin Zhong, Yimin Jiang, Hanpeng Hu, Jianjian Sun, Zheng Ge, Yibo Zhu, Daxin Jiang, and Xin Jin. 2025. DistTrain: Ad- dressing Model and Data Heterogeneity with Disaggregated Training for Multimodal Large Language Models. InProceedings of the ACM SIGCOMM 2025 Conference(São Francisco Convent, Coimbra, Por- tugal)(SIGCOMM ’25). Association fo...

work page doi:10.1145/3718958.3750472 2025

[55] [55]

Chenggang Zhao, Shangyan Zhou, Liyue Zhang, Chengqi Deng, Zhean Xu, Yuxuan Liu, Kuai Yu, Jiashi Li, and Liang Zhao. 2025. DeepEP: an efficient expert-parallel communication library.https://github.com/ deepseek-ai/DeepEP

2025

[56] [56]

Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Pritam Damania, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, Ajit Mathews, and Shen Li. 2023. Py- Torch FSDP: Experiences on Scaling Fully Sharded Data Parallel. arXiv:2304.11277 [cs.DC]https://arxiv...

Pith/arXiv arXiv 2023

[57] [57]

Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P Xing, et al. 2022. Alpa: Automating inter-and {Intra-Operator} parallelism for distributed deep learning. In16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 559–578

2022

[58] [58]

Zhanda Zhu, Christina Giannoula, Muralidhar Andoorveedu, Qidong Su, Karttikeya Mangalam, Bojian Zheng, and Gennady Pekhimenko

[59] [59]

InProceedings of the Twentieth European Conference on Computer Systems

Mist: Efficient Distributed Training of Large Language Mod- els via Memory-Parallelism Co-Optimization. InProceedings of the Twentieth European Conference on Computer Systems. 1298–1316. 15