HexiSeq: Accommodating Long Context Training of LLMs over Heterogeneous Hardware
Pith reviewed 2026-05-11 02:56 UTC · model grok-4.3
The pith
HexiSeq supports asymmetric partitioning of sequences and attention heads to train long-context LLMs efficiently on mixed GPU clusters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HexiSeq introduces fully asymmetric CP-HP partitioning for heterogeneous GPU clusters by assigning sequence shards and attention heads according to each device's compute, memory, and communication capabilities. The allocation is formalized as a constrained optimization problem solved by an efficient hierarchical scheduler. On models from 3B to 70B parameters with contexts up to one million tokens, it delivers a 1.11× average throughput improvement on mixed H100–A100 hardware and 1.36× in simulations with 32–128 GPUs spanning up to four GPU models, approaching the performance of the strongest homogeneous baselines on FLOP-equivalent clusters.
What carries the argument
The hierarchical scheduler that solves the constrained optimization problem for asymmetric assignment of sequence shards and attention heads to heterogeneous devices.
Load-bearing premise
The performance model in the optimization accurately captures the execution time and communication costs across different GPU models and network links.
What would settle it
Running HexiSeq on a new mixed cluster of 64 GPUs with three GPU types and comparing the observed throughput to the 1.36× average gain predicted by the simulations.
read the original abstract
Long-context training of large language models (LLMs) is commonly distributed with Context Parallelism (CP) and Head Parallelism (HP), but existing training systems largely assume homogeneous GPU meshes. This paper extends CP and HP to heterogeneous GPU clusters with mixed GPU models and non-uniform network bandwidths, a common setting in production training. We introduce HexiSeq, a system that supports fully asymmetric CP--HP partitioning by assigning sequence shards and attention heads according to device compute, memory, and communication capabilities. We formalize heterogeneous CP--HP allocation as a constrained optimization problem and develop an efficient hierarchical scheduler for finding optimal schedules. We evaluate HexiSeq against state-of-the-art CP and HP baselines on both real and simulated heterogeneous clusters. Across models from 3B to 70B parameters and context lengths up to one million tokens, HexiSeq improves throughput by $1.11\times$ on average and up to $1.19\times$ on mixed H100--A100 testbeds, and by $1.36\times$ on average and up to $1.72\times$ in simulations with 32--128 GPUs spanning up to four GPU models. On FLOP-comparable pairs against homogeneous clusters, HexiSeq reaches throughput close to the strongest homogeneous baseline, showing that heterogeneous clusters can be used efficiently for long-context LLM training.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents HexiSeq, a system for long-context LLM training on heterogeneous GPU clusters. It extends Context Parallelism (CP) and Head Parallelism (HP) to fully asymmetric partitioning of sequence shards and attention heads according to per-device compute, memory, and communication capabilities. The allocation is formalized as a constrained optimization problem solved via an efficient hierarchical scheduler. Evaluations across 3B–70B models and contexts up to 1M tokens report average throughput gains of 1.11× (max 1.19×) on real mixed H100–A100 testbeds and 1.36× (max 1.72×) in simulations with 32–128 GPUs spanning up to four GPU models, with heterogeneous throughput approaching the strongest homogeneous baseline.
Significance. If the scheduler and performance model are sound, the result is significant for practical distributed training: heterogeneous clusters are common in production yet most CP/HP systems assume homogeneity. Demonstrating that device-aware asymmetric partitioning can deliver measurable gains without requiring uniform hardware would allow more efficient resource utilization and reduce the need for hardware homogenization.
major comments (3)
- [§3] §3: The constrained optimization formulation for heterogeneous CP–HP allocation is load-bearing for the central claim of near-optimal schedules, yet the manuscript provides only a high-level description without the explicit objective function, decision variables, or full set of constraints; this prevents assessment of whether the hierarchical scheduler actually produces near-optimal solutions or merely feasible ones.
- [§4.2] §4.2: The performance model that drives the scheduler (accounting for compute, memory, and non-uniform bandwidth) is described at a high level; without the concrete equations or calibration procedure, it is impossible to verify whether the model accurately reflects real device and network behavior, which directly affects the validity of the reported 1.11×–1.72× speedups.
- [§5.3] §5.3, Table 3: The simulation results for 32–128 GPUs claim up to 1.72× improvement, but the paper does not report the number of independent runs, variance, or statistical tests; without these, the magnitude of the gains cannot be distinguished from experimental noise or post-hoc schedule selection.
minor comments (3)
- The abstract and §2 cite 'state-of-the-art CP and HP baselines' but the main text does not explicitly name the exact implementations or versions used, making reproducibility difficult.
- Figure 4: Axis labels and legends are too small for comfortable reading; increasing font size would improve clarity.
- Notation for CP and HP shard sizes is introduced inconsistently between §3 and §4; a single table of symbols would help.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting areas where additional detail would strengthen the paper. We address each major comment below and will revise the manuscript accordingly to improve clarity and verifiability.
read point-by-point responses
-
Referee: [§3] The constrained optimization formulation for heterogeneous CP–HP allocation is load-bearing for the central claim of near-optimal schedules, yet the manuscript provides only a high-level description without the explicit objective function, decision variables, or full set of constraints; this prevents assessment of whether the hierarchical scheduler actually produces near-optimal solutions or merely feasible ones.
Authors: We agree that the explicit formulation is essential. In the revised manuscript we will add the full optimization problem in §3: the objective is to maximize the minimum per-device throughput (inverse of the critical-path time) subject to per-device memory capacity and aggregate communication volume constraints. Decision variables are the sequence-shard lengths assigned to each GPU and the integer number of attention heads per GPU. All constraints (compute balance, memory footprint, and non-uniform link bandwidths) will be stated mathematically. We will also clarify that the hierarchical scheduler combines a greedy initial allocation with a local-search refinement and will include a brief argument on why the produced schedules are near-optimal in practice. revision: yes
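The two-stage procedure described here (greedy initial allocation, then local-search refinement on the critical path) can be sketched as follows. This is a hypothetical illustration, not the paper's scheduler: the device FLOPS values, the toy per-device time model, and all function names are assumptions.

```python
# Hypothetical sketch: greedy proportional allocation of tokens and heads,
# followed by local search that moves attention heads off the slowest device
# whenever that lowers the critical-path (max per-device) time.

def greedy_allocation(seq_len, n_heads, flops):
    """Split tokens and heads roughly proportionally to each device's FLOPS."""
    total = sum(flops)
    tokens = [int(seq_len * f / total) for f in flops]
    heads = [max(1, round(n_heads * f / total)) for f in flops]
    # Push rounding remainders onto the fastest device.
    fastest = flops.index(max(flops))
    tokens[fastest] += seq_len - sum(tokens)
    heads[fastest] += n_heads - sum(heads)
    return tokens, heads

def step_time(tokens, heads, flops):
    """Toy per-device time: work grows with tokens * heads, scaled by speed."""
    return tokens * heads / flops

def local_search(tokens, heads, flops, iters=100):
    """Move one head from the slowest device to the fastest while that
    strictly reduces the maximum per-device time."""
    for _ in range(iters):
        times = [step_time(t, h, f) for t, h, f in zip(tokens, heads, flops)]
        slow, fast = times.index(max(times)), times.index(min(times))
        if slow == fast or heads[slow] <= 1:
            break
        trial = list(heads)
        trial[slow] -= 1
        trial[fast] += 1
        new_times = [step_time(t, h, f) for t, h, f in zip(tokens, trial, flops)]
        if max(new_times) >= max(times):
            break
        heads = trial
    return heads

flops = [989e12, 989e12, 312e12]  # e.g. two H100s and one A100 (BF16 peak, illustrative)
tokens, heads = greedy_allocation(131072, 32, flops)
heads = local_search(tokens, heads, flops)
```

The real scheduler is hierarchical and also enforces memory and bandwidth constraints; this sketch only captures the compute-balance core of the search.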
-
Referee: [§4.2] The performance model that drives the scheduler (accounting for compute, memory, and non-uniform bandwidth) is described at a high level; without the concrete equations or calibration procedure, it is impossible to verify whether the model accurately reflects real device and network behavior, which directly affects the validity of the reported 1.11×–1.72× speedups.
Authors: We will expand §4.2 with the concrete performance-model equations. Compute time for a shard is modeled as T_comp = (shard_tokens × heads × model_dim) / (device_FLOPS × utilization_factor). Memory usage sums activation, KV-cache, and parameter shards. Communication cost uses measured pairwise bandwidths between GPU models (H100–H100, H100–A100, etc.) and accounts for all-reduce and all-gather patterns. Calibration was performed via micro-benchmarks on the target testbed; we will report the measured constants and validation against end-to-end timings. revision: yes
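The quoted equations translate directly into code. A minimal sketch, assuming illustrative constants (peak FLOPS, a 0.45 utilization factor, placeholder pairwise bandwidths) rather than the paper's calibrated micro-benchmark values:

```python
# Hypothetical per-shard cost model following the equations in the response.
# All constants below are illustrative placeholders, not measured values.

def compute_time(shard_tokens, heads, model_dim, device_flops, utilization=0.45):
    """T_comp = (shard_tokens * heads * model_dim) / (device_FLOPS * utilization)."""
    return (shard_tokens * heads * model_dim) / (device_flops * utilization)

def comm_time(bytes_to_move, link_gbit_per_s):
    """Bandwidth-only model for one transfer over a given link."""
    return bytes_to_move / (link_gbit_per_s * 1e9 / 8)

# Non-uniform pairwise bandwidths (Gbit/s), as in the mixed H100-A100 setting.
bandwidth = {("H100", "H100"): 900, ("H100", "A100"): 400, ("A100", "A100"): 600}

t_h100 = compute_time(65536, 16, 8192, 989e12)  # large shard on the fast GPU
t_a100 = compute_time(16384, 4, 8192, 312e12)   # small shard on the slow GPU
# bf16 K and V blocks exchanged across the slow cross-model link.
t_link = comm_time(2 * 65536 * 8192 * 2, bandwidth[("H100", "A100")])
```

A scheduler built on such a model is only as good as its calibration, which is exactly why the referee asks for the measured constants and an end-to-end validation.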
-
Referee: [§5.3] The simulation results for 32–128 GPUs claim up to 1.72× improvement, but the paper does not report the number of independent runs, variance, or statistical tests; without these, the magnitude of the gains cannot be distinguished from experimental noise or post-hoc schedule selection.
Authors: The simulator is deterministic given fixed device parameters and the scheduler is optimization-based, so each reported schedule is the unique output of the algorithm for that configuration. We will add an explicit statement in §5.3 clarifying the deterministic nature and will include a sensitivity analysis by varying the performance-model parameters within measured noise ranges. Because the underlying model is deterministic, traditional statistical tests across independent runs are not applicable; we will instead report the range of throughput obtained under parameter perturbation. revision: partial
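The proposed alternative to run-to-run statistics (perturbing the performance-model parameters within a measured noise band and reporting the resulting throughput range) might look like the following sketch. The ±5% band, the toy throughput model, and the 6 × params × tokens FLOP rule of thumb are assumptions for illustration, not the paper's procedure.

```python
# Hypothetical sensitivity analysis: since the simulator is deterministic,
# perturb model parameters within a noise band and report the output range.
import random

def throughput(device_flops, utilization, params=8e9, tokens=4096):
    """Toy deterministic model: tokens/s for one step of an 8B model,
    using the common ~6 * params * tokens FLOPs-per-step rule of thumb."""
    work_flops = 6 * params * tokens
    return tokens * device_flops * utilization / work_flops

def sensitivity(device_flops, utilization, noise=0.05, trials=200, seed=0):
    """Sample throughput under independent +/-noise perturbations of both
    the FLOPS estimate and the utilization factor."""
    rng = random.Random(seed)
    samples = [
        throughput(device_flops * (1 + rng.uniform(-noise, noise)),
                   utilization * (1 + rng.uniform(-noise, noise)))
        for _ in range(trials)
    ]
    return min(samples), max(samples)

lo, hi = sensitivity(989e12, 0.45)
```

Reporting the [lo, hi] band per configuration would let readers judge whether the claimed 1.72× gain survives realistic calibration error.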
Circularity Check
No significant circularity; empirical system evaluation
full rationale
The paper introduces a constrained optimization formulation for heterogeneous CP-HP partitioning and an associated hierarchical scheduler, then reports direct throughput measurements against baselines on real mixed H100-A100 hardware and larger simulations. No derivation reduces by construction to fitted parameters defined from the same data, no self-citation chain is load-bearing for the central claim, and no ansatz or uniqueness result is smuggled in. The performance numbers are externally falsifiable via the described testbeds, making the evaluation self-contained against independent benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Existing training systems largely assume homogeneous GPU meshes.
- domain assumption Heterogeneous GPU clusters with mixed GPU models and non-uniform network bandwidths are a common setting in production training.
Reference graph
Works this paper leans on
-
[1]
What’s new in claude opus 4.7, 2026
Anthropic. What’s new in claude opus 4.7, 2026. URL https://platform.claude.com/docs/en/about-claude/models/whats-new-claude-4-7
work page 2026
-
[2]
Striped attention: Faster ring attention for causal transformers
William Brandon, Aniruddha Nrusimha, Kevin Qian, Zachary Ankner, Tian Jin, Zhiye Song, and Jonathan Ragan-Kelley. Striped attention: Faster ring attention for causal transformers. arXiv preprint arXiv:2311.09431, 2023
-
[3]
Qiaoling Chen, Diandian Gu, Guoteng Wang, Xun Chen, YingTong Xiong, Ting Huang, Qinghao Hu, Xin Jin, Yonggang Wen, Tianwei Zhang, et al. Internevo: Efficient long-sequence large language model training via hybrid parallelism and redundant sharding. arXiv preprint arXiv:2401.09149, 2024
-
[4]
Sppo: Efficient long-sequence llm training via adaptive sequence pipeline parallel offloading
Qiaoling Chen, Shenggui Li, Wei Gao, Peng Sun, Yonggang Wen, and Tianwei Zhang. Sppo: Efficient long-sequence llm training via adaptive sequence pipeline parallel offloading. arXiv preprint arXiv:2503.10377, 2025
-
[5]
Sirui Chen, Jingji Chen, Siqi Zhu, Ziheng Jiang, Yanghua Peng, and Xuehai Qian. Mesh-attention: A new communication-efficient distributed attention with improved data locality. arXiv preprint arXiv:2512.20968, 2025
-
[6]
Flashattention-2: Faster attention with better parallelism and work partitioning
Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. In 12th International Conference on Learning Representations, ICLR 2024, 2024
work page 2024
-
[7]
Flashattention: Fast and memory-efficient exact attention with io-awareness
Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in neural information processing systems, 35:16344–16359, 2022
work page 2022
-
[8]
Usp: A unified sequence parallelism approach for long context generative ai
Jiarui Fang and Shangchun Zhao. Usp: A unified sequence parallelism approach for long context generative ai. arXiv preprint arXiv:2405.07719, 2024
-
[9]
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020
work page 2020
-
[10]
Enabling parallelism hot switching for efficient training of large language models
Hao Ge, Fangcheng Fu, Haoyang Li, Xuanyu Wang, Sheng Lin, Yujie Wang, Xiaonan Nie, Hailin Zhang, Xupeng Miao, and Bin Cui. Enabling parallelism hot switching for efficient training of large language models. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, pages 178–194, 2024
work page 2024
-
[11]
Bytescale: Communication-efficient scaling of llm training with a 2048k context length on 16384 gpus
Hao Ge, Junda Feng, Qi Huang, Fangcheng Fu, Xiaonan Nie, Lei Zuo, Haibin Lin, Bin Cui, and Xin Liu. Bytescale: Communication-efficient scaling of llm training with a 2048k context length on 16384 gpus. In Proceedings of the ACM SIGCOMM 2025 Conference, pages 963–978, 2025
work page 2025
-
[12]
Google Cloud. Gemini 3.1 pro, 2026. URL https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/3-1-pro
work page 2026
-
[13]
Diandian Gu, Peng Sun, Qinghao Hu, Ting Huang, Xun Chen, Yingtong Xiong, Guoteng Wang, Qiaoling Chen, Shangchun Zhao, Jiarui Fang, et al. Loongtrain: Efficient training of long-sequence llms with head-context parallelism. arXiv preprint arXiv:2406.18485, 2024
-
[14]
Cephalo: Harnessing heterogeneous gpu clusters for training transformer models
Runsheng Benson Guo, Utkarsh Anand, Arthur Chen, and Khuzaima Daudjee. Cephalo: Harnessing heterogeneous gpu clusters for training transformer models. In Proceedings of the 39th ACM International Conference on Supercomputing, pages 368–383, 2025
work page 2025
-
[15]
Efficient pre-training of llms via topology-aware communication alignment on more than 9600 gpus
Guoliang He, Youhe Jiang, Wencong Xiao, Jiang Kaihua, Shuguang Wang, Jun Wang, Du Zixian, Zhuo Jiang, Xinlei Zhang, Binhang Yuan, et al. Efficient pre-training of llms via topology-aware communication alignment on more than 9600 gpus. Advances in Neural Information Processing Systems, 38:147100–147126, 2026
work page 2026
-
[16]
Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems, 32, 2019
work page 2019
-
[17]
System optimizations for enabling training of extreme long sequence transformer models
Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Reza Yazdani Aminadabi, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. System optimizations for enabling training of extreme long sequence transformer models. In Proceedings of the 43rd ACM Symposium on Principles of Distributed Computing, pages 121–130, 2024
work page 2024
-
[18]
Dcp: Addressing input dynamism in long-context training via dynamic context parallelism
Chenyu Jiang, Zhenkun Cai, Ye Tian, Zhen Jia, Yida Wang, and Chuan Wu. Dcp: Addressing input dynamism in long-context training via dynamic context parallelism. In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles, pages 221–236, 2025
work page 2025
-
[19]
Osdp: Optimal sharded data parallel for distributed deep learning
Youhe Jiang, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, and Bin Cui. Osdp: Optimal sharded data parallel for distributed deep learning. arXiv preprint arXiv:2209.13258, 2022
-
[20]
Hexgen: Generative inference of large language model over heterogeneous environment
Youhe Jiang, Ran Yan, Xiaozhe Yao, Yang Zhou, Beidi Chen, and Binhang Yuan. Hexgen: Generative inference of large language model over heterogeneous environment. arXiv preprint arXiv:2311.11514, 2023
-
[21]
Demystifying cost-efficiency in llm serving over heterogeneous gpus
Youhe Jiang, Fangcheng Fu, Xiaozhe Yao, Guoliang He, Xupeng Miao, Ana Klimovic, Bin Cui, Binhang Yuan, and Eiko Yoneki. Demystifying cost-efficiency in llm serving over heterogeneous gpus. arXiv preprint arXiv:2502.00722, 2025
-
[22]
Thunderserve: High-performance and cost-efficient llm serving in cloud environments
Youhe Jiang, Fangcheng Fu, Xiaozhe Yao, Taiyi Wang, Bin Cui, Ana Klimovic, and Eiko Yoneki. Thunderserve: High-performance and cost-efficient llm serving in cloud environments. Proceedings of Machine Learning and Systems, 7, 2025
work page 2025
-
[23]
Cascadia: An efficient cascade serving system for large language models
Youhe Jiang, Fangcheng Fu, Wanru Zhao, Stephan Rabanser, Jintao Zhang, Nicholas D Lane, and Binhang Yuan. Cascadia: An efficient cascade serving system for large language models. arXiv preprint arXiv:2506.04203, 2025
-
[24]
Hexgen-2: Disaggregated generative inference of llms in heterogeneous environment
Youhe Jiang, Ran Yan, and Binhang Yuan. Hexgen-2: Disaggregated generative inference of llms in heterogeneous environment. arXiv preprint arXiv:2502.07903, 2025
-
[25]
OServe: Accelerating LLM Serving via Spatial-Temporal Workload Orchestration
Youhe Jiang, Fangcheng Fu, Taiyi Wang, Guoliang He, and Eiko Yoneki. Oserve: Accelerating llm serving via spatial-temporal workload orchestration. arXiv preprint arXiv:2602.12151, 2026
work page 2026
-
[26]
Youhe Jiang, Fangcheng Fu, and Eiko Yoneki. Boute: Cost-efficient llm serving with heterogeneous llms and gpus via multi-objective bayesian optimization. arXiv preprint arXiv:2602.10729, 2026
-
[27]
Reducing activation recomputation in large transformer models
Vijay Anand Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems, 5:341–353, 2023
work page 2023
-
[28]
Distflashattn: Distributed memory-efficient attention for long-context llms training
Dacheng Li, Rulin Shao, Anze Xie, Eric P Xing, Xuezhe Ma, Ion Stoica, Joseph E Gonzalez, and Hao Zhang. Distflashattn: Distributed memory-efficient attention for long-context llms training. In First Conference on Language Modeling, 2024
work page 2024
-
[29]
Haoyang Li, Fangcheng Fu, Hao Ge, Sheng Lin, Xuanyu Wang, Jiawen Niu, Xupeng Miao, and Bin Cui. Hetu v2: A general and scalable deep learning system with hierarchical and heterogeneous single program multiple data annotations. arXiv preprint arXiv:2504.20490, 2025
-
[30]
Haoyang Li, Fangcheng Fu, Sheng Lin, Hao Ge, Xuanyu Wang, Jiawen Niu, Jinbao Xue, Yangyu Tao, Di Wang, Jie Jiang, et al. Hydraulis: Balancing large transformer model training via co-designing parallel strategies and data assignment. Proceedings of the ACM on Management of Data, 3(6):1–30, 2025
work page 2025
-
[31]
Pytorch distributed: experiences on accelerating data parallel training
Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, et al. Pytorch distributed: experiences on accelerating data parallel training. Proceedings of the VLDB Endowment, 13(12):3005–3018, 2020
work page 2020
-
[32]
Sequence parallelism: Long sequence training from system perspective
Shenggui Li, Fuzhao Xue, Chaitanya Baranwal, Yongbin Li, and Yang You. Sequence parallelism: Long sequence training from system perspective. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2391–2404, 2023
work page 2023
-
[33]
Terapipe: Token-level pipeline parallelism for training large-scale language models
Zhuohan Li, Siyuan Zhuang, Shiyuan Guo, Danyang Zhuo, Hao Zhang, Dawn Song, and Ion Stoica. Terapipe: Token-level pipeline parallelism for training large-scale language models. In International Conference on Machine Learning, pages 6543–6552. PMLR, 2021
work page 2021
-
[34]
Ringattention with blockwise transformers for near-infinite context
Hao Liu, Matei Zaharia, and Pieter Abbeel. Ringattention with blockwise transformers for near-infinite context. In The Twelfth International Conference on Learning Representations, 2024
work page 2024
-
[35]
Ziming Liu, Shaoyu Wang, Shenggan Cheng, Zhongkai Zhao, Kai Wang, Xuanlei Zhao, James Demmel, and Yang You. Startrail: Concentric ring sequence parallelism for efficient near-infinite-context transformer model training. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025
work page 2025
-
[36]
Mini-sequence transformers: Optimizing intermediate memory for long sequences training
Cheng Luo, Jiawei Zhao, Zhuoming Chen, Beidi Chen, and Anima Anandkumar. Mini-sequence transformers: Optimizing intermediate memory for long sequences training. Advances in Neural Information Processing Systems, 37:97299–97327, 2024
work page 2024
-
[37]
Galvatron: Efficient transformer training over multiple gpus using automatic parallelism
Xupeng Miao, Yujie Wang, Youhe Jiang, Chunan Shi, Xiaonan Nie, Hailin Zhang, and Bin Cui. Galvatron: Efficient transformer training over multiple gpus using automatic parallelism. arXiv preprint arXiv:2211.13878, 2022
-
[38]
Pipedream: Generalized pipeline parallelism for dnn training
Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and Matei Zaharia. Pipedream: Generalized pipeline parallelism for dnn training. In Proceedings of the 27th ACM symposium on operating systems principles, pages 1–15, 2019
work page 2019
-
[39]
Efficient large-scale language model training on gpu clusters using megatron-lm
Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, et al. Efficient large-scale language model training on gpu clusters using megatron-lm. In Proceedings of the international conference for high performance computing, networking,...
work page 2021
-
[40]
Nvidia’s new ampere data center GPU in full production, 2020
Nvidia. Nvidia’s new ampere data center GPU in full production, 2020. URL https://nvidianews.nvidia.com/news/nvidias-new-ampere-data-center-gpu-in-full-production
work page 2020
-
[41]
Nvidia announces hopper architecture, the next generation of accelerated computing, 2022
Nvidia. Nvidia announces hopper architecture, the next generation of accelerated computing, 2022. URL https://nvidianews.nvidia.com/news/nvidia-announces-hopper-architecture-the-next-generation-of-accelerated-computing
work page 2022
-
[42]
Nvidia blackwell platform arrives to power a new era of computing, 2024
Nvidia. Nvidia blackwell platform arrives to power a new era of computing, 2024. URL https://nvidianews.nvidia.com/news/nvidia-blackwell-platform-arrives-to-power-a-new-era-of-computing
work page 2024
-
[43]
Nvidia blackwell ultra ai factory platform paves way for age of ai reasoning, 2025
Nvidia. Nvidia blackwell ultra ai factory platform paves way for age of ai reasoning, 2025. URL https://nvidianews.nvidia.com/news/nvidia-blackwell-ultra-ai-factory-platform-paves-way-for-age-of-ai-reasoning
work page 2025
-
[44]
Heterogeneous low-bandwidth pre-training of llms
Yazan Obeidi, Amir Sarfi, Joel Lidin, Paul Janson, and Eugene Belilovsky. Heterogeneous low-bandwidth pre-training of llms. arXiv preprint arXiv:2601.02360, 2026
-
[45]
OpenAI. Gpt-5.5 model, 2026. URL https://developers.openai.com/api/docs/models/gpt-5.5
work page 2026
-
[46]
Hexgen-flow: Optimizing llm inference request scheduling for agentic text-to-sql
You Peng, Youhe Jiang, Wenqi Jiang, Chen Wang, and Binhang Yuan. Hexgen-flow: Optimizing llm inference request scheduling for agentic text-to-sql. arXiv preprint arXiv:2505.05286, 2025
-
[47]
Zero: Memory optimizations toward training trillion parameter models
Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In SC20: international conference for high performance computing, networking, storage and analysis, pages 1–16. IEEE, 2020
work page 2020
-
[48]
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019
work page 2019
-
[49]
Sailor: Automating distributed training over dynamic, heterogeneous, and geo-distributed clusters
Foteini Strati, Zhendong Zhang, George Manos, Ixeia Sánchez Périz, Qinghao Hu, Tiancheng Chen, Berk Buzcu, Song Han, Pamela Delgado, and Ana Klimovic. Sailor: Automating distributed training over dynamic, heterogeneous, and geo-distributed clusters. In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles, pages 204–220, 2025
work page 2025
-
[50]
Burstattention: An efficient distributed attention framework for extremely long sequences
Ao Sun, Weilin Zhao, Xu Han, Cheng Yang, Zhiyuan Liu, Chuan Shi, and Maosong Sun. Burstattention: An efficient distributed attention framework for extremely long sequences. arXiv preprint arXiv:2403.09347, 2024
-
[51]
H2: Towards efficient large-scale llm training on hyper-heterogeneous cluster over 1,000 chips
Ding Tang, Jiecheng Zhou, Jiakai Hu, Shengwei Li, Huihuang Zheng, Zhilin Pei, Hui Wang, and Xingcheng Zhang. H2: Towards efficient large-scale llm training on hyper-heterogeneous cluster over 1,000 chips. arXiv preprint arXiv:2505.17548, 2025
-
[52]
Parallax: Efficient llm inference service over decentralized environment
Chris Tong, Youhe Jiang, Gufeng Chen, Tianyi Zhao, Sibian Lu, Wenjie Qu, Eric Yang, Lynn Ai, and Binhang Yuan. Parallax: Efficient llm inference service over decentralized environment. arXiv preprint arXiv:2509.26182, 2025
-
[53]
Jinquan Wang, Xiaojian Liao, Xuzhao Liu, Jiashun Suo, Zhisheng Huo, Chenhao Zhang, Xiangrong Xu, Runnan Shen, Xilong Xie, and Limin Xiao. Deepcee: Efficient cross-region model distributed training system under heterogeneous gpus and networks. arXiv preprint arXiv:2505.15536, 2025
-
[54]
Yujie Wang, Youhe Jiang, Xupeng Miao, Fangcheng Fu, Shenhan Zhu, Xiaonan Nie, Yaofeng Tu, and Bin Cui. Improving automatic parallel training via balanced memory workload optimization. IEEE Transactions on Knowledge and Data Engineering, 36(8):3906–3920, 2024
work page 2024
-
[55]
Flexsp: Accelerating large language model training via flexible sequence parallelism
Yujie Wang, Shiju Wang, Shenhan Zhu, Fangcheng Fu, Xinyi Liu, Xuefeng Xiao, Huixia Li, Jiashi Li, Faming Wu, and Bin Cui. Flexsp: Accelerating large language model training via flexible sequence parallelism. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pages 421...
work page 2025
-
[56]
Hexiscale: Accommodating large language model training over heterogeneous environment
Ran Yan, Youhe Jiang, Xiaonan Nie, Fangcheng Fu, Bin Cui, and Binhang Yuan. Hexiscale: Accommodating large language model training over heterogeneous environment. arXiv preprint arXiv:2409.01143, 2024
work page 2024
-
[57]
Fsa: An alternative efficient implementation of native sparse attention kernel
Ran Yan, Youhe Jiang, Zhuoming Chen, Haohui Mai, Beidi Chen, and Binhang Yuan. Fsa: An alternative efficient implementation of native sparse attention kernel. arXiv preprint arXiv:2508.18224, 2025
-
[58]
Areal-hex: Accommodating asynchronous rl training over heterogeneous gpus
Ran Yan, Youhe Jiang, Tianyuan Wu, Jiaxuan Gao, Zhiyu Mei, Wei Fu, Haohui Mai, Wei Wang, Yi Wu, and Binhang Yuan. Areal-hex: Accommodating asynchronous rl training over heterogeneous gpus. arXiv preprint arXiv:2511.00796, 2025
-
[59]
Training ultra long context language model with fully pipelined distributed transformer
Jinghan Yao, Sam A Jacobs, Masahiro Tanaka, Olatunji Ruwase, Hari Subramoni, and Dhabaleswar Panda. Training ultra long context language model with fully pipelined distributed transformer. Proceedings of Machine Learning and Systems, 7, 2025
work page 2025
-
[60]
Efficient mixed-precision large language model inference with turbomind
Li Zhang, Youhe Jiang, Guoliang He, Xin Chen, Han Lv, Qian Yao, Fangcheng Fu, and Kai Chen. Efficient mixed-precision large language model inference with turbomind. arXiv preprint arXiv:2508.15601, 2025
-
[61]
Poplar: Efficient scaling of distributed dnn training on heterogeneous gpu clusters
WenZheng Zhang, Yang Hu, Jing Shi, and Xiaoying Bai. Poplar: Efficient scaling of distributed dnn training on heterogeneous gpu clusters. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 22587–22595, 2025
work page 2025
-
[62]
Memo: Fine-grained tensor management for ultra-long context llm training
Pinxue Zhao, Hailin Zhang, Fangcheng Fu, Xiaonan Nie, Qibin Liu, Fang Yang, Yuanbo Peng, Dian Jiao, Shuaipeng Li, Jinbao Xue, et al. Memo: Fine-grained tensor management for ultra-long context llm training. Proceedings of the ACM on Management of Data, 3(1):1–28, 2025
work page 2025
-
[63]
Dsp: Dynamic sequence parallelism for multi-dimensional transformers
Xuanlei Zhao, Shenggan Cheng, Chang Chen, Zangwei Zheng, Ziming Liu, Zheming Yang, and Yang You. Dsp: Dynamic sequence parallelism for multi-dimensional transformers. In International Conference on Machine Learning, pages 77390–77404. PMLR, 2025
work page 2025