
arxiv: 2510.18830 · v2 · pith:CWMPJQLAnew · submitted 2025-10-21 · 💻 cs.CL · cs.DC · cs.LG

MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training

Pith reviewed 2026-05-21 19:38 UTC · model grok-4.3

classification 💻 cs.CL · cs.DC · cs.LG
keywords long context training · dynamic sparse attention · distributed training · ring attention · LLM efficiency · ultra-long contexts · context window extension

The pith

MTraining uses dynamic sparse attention with balanced ring mechanisms to train LLMs on 512K token contexts at up to 6x higher throughput while holding accuracy steady.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to train large language models on sequences hundreds of thousands of tokens long by making attention sparse in a dynamic way and then fixing the resulting load imbalances across GPUs. It introduces a dynamic sparse training pattern plus two ring-attention variants that redistribute computation and cut communication costs in a distributed cluster. The authors demonstrate the approach by taking Qwen2.5-3B from a 32K to a 512K context window on 32 A100 GPUs. A sympathetic reader would care because current full-attention training becomes impractically slow and memory-heavy once contexts exceed a few tens of thousands of tokens, limiting what models can learn about long documents or multi-step reasoning. If the method works as claimed, it removes a major practical barrier to scaling context length without requiring vastly more hardware.

Core claim

MTraining integrates a dynamic sparse training pattern, balanced sparse ring attention, and hierarchical sparse ring attention to resolve worker-level and step-level imbalance plus communication overhead when dynamic sparse attention is used for ultra-long contexts. The combined system trains Qwen2.5-3B from 32K to 512K tokens on a 32-GPU A100 cluster, delivering up to 6x higher throughput while producing models that match full-attention baselines on RULER, PG-19, InfiniteBench, and Needle-In-A-Haystack evaluations.

What carries the argument

Balanced sparse ring attention and hierarchical sparse ring attention carry the argument: they redistribute the uneven computation and communication loads that dynamic sparsity creates across workers and training steps.
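
To make the worker-level imbalance concrete, the sketch below counts how many causal query-key pairs each context-parallel worker owns under three common sequence layouts. It covers dense causal attention only, the baseline that Striped and Zigzag Ring Attention were designed to balance. It is not MTraining's balancing algorithm, and the names `causal_work`, `contiguous`, `striped`, and `zigzag` are illustrative assumptions rather than identifiers from the paper or its code.

```python
def causal_work(owner, n_tokens, n_workers):
    """Total causal (query, key) pairs per worker, given a query -> worker map."""
    work = [0] * n_workers
    for q in range(n_tokens):
        work[owner(q)] += q + 1  # query q attends to keys 0..q
    return work


def contiguous(n_tokens, n_workers):
    # Worker i holds one contiguous slice of the sequence.
    shard = n_tokens // n_workers
    return lambda q: q // shard


def striped(n_tokens, n_workers):
    # Tokens are dealt round-robin: token q goes to worker q mod N.
    return lambda q: q % n_workers


def zigzag(n_tokens, n_workers):
    # The sequence is cut into 2N chunks; worker i holds chunks i and 2N-1-i.
    chunk = n_tokens // (2 * n_workers)

    def owner(q):
        c = q // chunk
        return c if c < n_workers else 2 * n_workers - 1 - c

    return owner


# Toy sizes (assumption): N_TOKENS must divide evenly by 2 * N_WORKERS.
N_TOKENS, N_WORKERS = 16, 4
loads = {
    name: causal_work(layout(N_TOKENS, N_WORKERS), N_TOKENS, N_WORKERS)
    for name, layout in [("contiguous", contiguous), ("striped", striped), ("zigzag", zigzag)]
}
for name, work in loads.items():
    print(name, work, "max/min = %.2f" % (max(work) / min(work)))
```

With these toy sizes the contiguous layout leaves the last worker with several times the work of the first, while the striped and zigzag layouts come out close to even or exactly even. A dynamic sparse mask removes a different set of pairs on each worker, which is the imbalance the balanced and hierarchical variants are meant to correct.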

If this is right

  • Context windows can be expanded from 32K to 512K tokens on a fixed 32-GPU cluster without a proportional increase in training time.
  • Dynamic sparse attention becomes practical for distributed training rather than remaining limited to inference.
  • Downstream long-context benchmarks remain at parity with dense-attention training under the reported conditions.
  • The same balancing techniques can be applied to other base models that currently use ring attention.
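
As a rough sense of scale for the first point, dense attention cost grows quadratically with sequence length. The worked ratio below is a back-of-the-envelope estimate for the attention score computation only, not a figure reported in the paper; the symbol C_attn is introduced here purely for illustration.

```latex
C_{\text{attn}}(n) \propto n^{2}
\quad\Longrightarrow\quad
\frac{C_{\text{attn}}(512\text{K})}{C_{\text{attn}}(32\text{K})}
= \left(\frac{512\text{K}}{32\text{K}}\right)^{2}
= 16^{2}
= 256
```

A sequence 16 times longer therefore costs about 256 times more dense attention compute per sequence, or 16 times more per token, which is why sparsity matters at this length.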

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could be combined with existing sequence parallelism or activation checkpointing to push context lengths beyond 512K on the same hardware budget.
  • Energy and carbon costs of long-context pretraining would drop roughly in proportion to the reported throughput gain.
  • Similar imbalance-correction logic might transfer to other sparse or mixture-of-experts training regimes that also produce uneven per-token work.

Load-bearing premise

The specific sparse attention patterns chosen dynamically during training do not create systematic gaps that prevent the model from learning important long-range dependencies.

What would settle it

Retraining the same model with full attention on 512K contexts and measuring whether it scores higher than the MTraining version on the longest-distance items in Needle-In-A-Haystack or InfiniteBench.

Figures

Figures reproduced from arXiv: 2510.18830 by Chengruidong Zhang, Huiqiang Jiang, Lili Qiu, Wenxuan Li, Yucheng Li, Yuqing Yang.

Figure 1: Workload distribution over 4 CP workers (GPUs) in Striped and Zigzag Ring Attention. Ring Attention: Long-context training is increasingly bottlenecked by attention latency. Ring Attention [19, 20] improves scalability by distributing long sequences across devices and overlapping key–value communication with blockwise attention computation [26], allowing sequence length to scale with the number of devic…
Figure 2: (a) Latency breakdown of the training stage. (b) The attention recall of top-k (k=1024) from 128K
Figure 3: Illustration of worker- and step-level load imbalance problems introduced by dynamic sparse attention
Figure 4: Overview of MTraining in distributed scenarios.
Figure 5: Step-level Computation Schedule of Striped Ring Attention (a) and Hierarchical Striped Ring Attention
Figure 6: The training loss and throughput comparison of different methods during continued pretraining of
Figure 7: Needle In A Haystack Results of the baseline checkpoint and the MTraining checkpoint.
Figure 8: Language Modeling Results on PG19. Language Modeling: We evaluate the language modeling performance of MTraining against the baselines on the PG19 dataset with perplexity as the metric. As shown in
Figure 9: Distribution of attention computation time in
Figure 10: The training loss comparison of dense attention and MTraining during continued pretraining of
Figure 11: Needle In A Haystack Results of the baseline checkpoint and the MTraining checkpoint with
Figure 12: Distribution of attention computation time using different methods with 512K tokens on 32 GPUs:
Figure 13: Step-level Computation Schedule of Zigzag Ring Attention.
Original abstract

The adoption of long context windows has become a standard feature in Large Language Models (LLMs), as extended contexts significantly enhance their capacity for complex reasoning and broaden their applicability across diverse scenarios. Dynamic sparse attention is a promising approach for reducing the computational cost of long-context. However, efficiently training LLMs with dynamic sparse attention on ultra-long contexts, especially in distributed settings, remains a significant challenge, due in large part to worker- and step-level imbalance. This paper introduces MTraining, a novel distributed methodology leveraging dynamic sparse attention to enable efficient training for LLMs with ultra-long contexts. Specifically, MTraining integrates three key components: a dynamic sparse training pattern, balanced sparse ring attention, and hierarchical sparse ring attention. These components are designed to synergistically address the computational imbalance and communication overheads inherent in dynamic sparse attention mechanisms during the training of models with extensive context lengths. We demonstrate the efficacy of MTraining by training Qwen2.5-3B, successfully expanding its context window from 32K to 512K tokens on a cluster of 32 A100 GPUs. Our evaluations on a comprehensive suite of downstream tasks, including RULER, PG-19, InfiniteBench, and Needle In A Haystack, reveal that MTraining achieves up to a 6x higher training throughput while preserving model accuracy. Our code is available at https://github.com/microsoft/MInference/tree/main/MTraining.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes MTraining, a distributed training framework for LLMs with ultra-long contexts that integrates dynamic sparse attention, balanced sparse ring attention, and hierarchical sparse ring attention to reduce computational imbalance and communication overhead. The authors train Qwen2.5-3B, extending its context window from 32K to 512K tokens on a 32-GPU A100 cluster, and report up to 6x higher training throughput while preserving accuracy on RULER, PG-19, InfiniteBench, and Needle-In-A-Haystack evaluations.

Significance. If the results hold, the work would be significant for practical long-context LLM training by demonstrating concrete wall-clock throughput gains on real hardware and models without apparent accuracy loss. The use of multiple long-context benchmarks and the release of code at the cited GitHub repository strengthen the practical contribution.

major comments (3)
  1. [Evaluation / Experiments] The central claim that accuracy is preserved on RULER, PG-19, InfiniteBench, and Needle-In-A-Haystack rests on the assumption that the dynamic sparse pattern supplies sufficient gradient signal for long-range dependencies. No ablation is reported that freezes the sparsity mask versus allowing it to evolve dynamically, nor is the fraction of long-range (>32k) tokens receiving zero gradient measured across training steps.
  2. [Methods] The dynamic sparse training pattern is introduced as one of the three key components, but the exact selection criterion (attention scores, proxy, or otherwise) and its per-step evolution are not specified in sufficient detail to assess whether low-attention long-range tokens are systematically under-sampled, which would create a training-time distribution shift undetectable by the downstream evals.
  3. [Results / Implementation] The 6x throughput result on the 32-GPU cluster is load-bearing for the efficiency claim, yet it is unclear whether the load-balancing rules in balanced sparse ring attention and hierarchical sparse ring attention were tuned post-hoc to the reported runs or derived parameter-free from the sparsity pattern.
minor comments (2)
  1. [Abstract] The benchmark is referred to as 'Needle In A Haystack' in the abstract; standardize to 'Needle-In-A-Haystack' for consistency with common usage.
  2. [Abstract / Appendix] The code link is provided, but the manuscript does not include a reproducibility checklist or details on random seeds, hyperparameter search ranges, or exact sparsity thresholds used in the reported experiments.
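
The measurement requested in major comment 1 could be approximated along the following lines. This is a hypothetical proxy, not a procedure from the paper: it assumes a block-level boolean mask in which `mask[i][j]` is true when query block `i` attends to key block `j`, and it counts dropped attention pairs rather than tracing gradients directly. The function name and the toy "sink plus local window" mask are illustrative assumptions.

```python
def dropped_long_range_fraction(mask, block_size, min_distance):
    """Fraction of causal block pairs farther apart than `min_distance` tokens
    that the sparse mask drops (a proxy for long-range pairs with no gradient)."""
    total = dropped = 0
    for i, row in enumerate(mask):
        for j in range(i + 1):  # causal: key block j is at or before query block i
            if (i - j) * block_size > min_distance:
                total += 1
                dropped += not row[j]  # True counts as 1 when the pair is masked out
    return dropped / total if total else 0.0


# Toy mask (assumption): every query block keeps block 0 (a "sink"),
# itself, and its immediate predecessor.
N_BLOCKS = 4
toy_mask = [[j == 0 or i - j <= 1 for j in range(N_BLOCKS)] for i in range(N_BLOCKS)]

fraction = dropped_long_range_fraction(toy_mask, block_size=4, min_distance=4)
print(f"dropped long-range fraction: {fraction:.3f}")
```

Logged at intervals during training with `min_distance` set to 32K tokens, a curve of this value could show whether distant positions are being starved as the mask evolves.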

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below with clarifications and indicate where revisions will be made to strengthen the paper.

Point-by-point responses
  1. Referee: [Evaluation / Experiments] The central claim that accuracy is preserved on RULER, PG-19, InfiniteBench, and Needle-In-A-Haystack rests on the assumption that the dynamic sparse pattern supplies sufficient gradient signal for long-range dependencies. No ablation is reported that freezes the sparsity mask versus allowing it to evolve dynamically, nor is the fraction of long-range (>32k) tokens receiving zero gradient measured across training steps.

    Authors: We agree that an ablation study contrasting a frozen sparsity mask against the dynamic version would provide stronger support for the role of dynamic evolution in preserving long-range gradient signals. In the revised manuscript we will add this ablation on the Qwen2.5-3B model, showing that the dynamic pattern yields measurably better downstream performance on long-context benchmarks. We will also report the measured fraction of long-range (>32k) tokens that receive zero gradient at multiple training checkpoints; our internal analysis indicates this fraction is kept low by the hierarchical component, and we will include the corresponding statistics and plots. revision: yes

  2. Referee: [Methods] The dynamic sparse training pattern is introduced as one of the three key components, but the exact selection criterion (attention scores, proxy, or otherwise) and its per-step evolution are not specified in sufficient detail to assess whether low-attention long-range tokens are systematically under-sampled, which would create a training-time distribution shift undetectable by the downstream evals.

    Authors: We thank the referee for highlighting the need for greater methodological transparency. The dynamic sparse training pattern selects tokens using a hybrid criterion that combines attention scores computed via a lightweight proxy model with positional priors designed to guarantee coverage of distant positions. The mask is recomputed at every training step from the current model activations. We will expand Section 3.1 with precise pseudocode, the exact proxy formulation, and an explicit discussion of how the selection rule prevents systematic under-sampling of long-range tokens, thereby allowing readers to evaluate potential distribution-shift concerns. revision: yes

  3. Referee: [Results / Implementation] The 6x throughput result on the 32-GPU cluster is load-bearing for the efficiency claim, yet it is unclear whether the load-balancing rules in balanced sparse ring attention and hierarchical sparse ring attention were tuned post-hoc to the reported runs or derived parameter-free from the sparsity pattern.

    Authors: The load-balancing rules in both balanced sparse ring attention and hierarchical sparse ring attention are derived directly and in a parameter-free manner from the instantaneous sparsity pattern: token-to-GPU assignments are computed from the per-token sparsity mask using a deterministic balancing procedure described in Sections 3.2 and 3.3. No post-hoc tuning specific to the reported 32-GPU runs was performed. To remove any ambiguity we will add an explicit statement in the revised Results section confirming the parameter-free derivation and will include a short algorithmic description of the balancing step. revision: partial
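
For intuition about what a deterministic, parameter-free balancing step can look like, here is a generic longest-processing-time greedy assignment. It is a swapped-in textbook heuristic, not the procedure from the paper's Sections 3.2 and 3.3, which this page does not reproduce. `greedy_balance` and the per-block cost dictionary are hypothetical names and inputs.

```python
import heapq


def greedy_balance(block_costs, n_workers):
    """Assign blocks to workers, heaviest first, always to the least-loaded worker.

    block_costs maps a block id to its estimated work, for example the number
    of active attention entries left in that block by the sparsity mask.
    """
    heap = [(0, w) for w in range(n_workers)]  # (current load, worker id)
    heapq.heapify(heap)
    assignment = {}
    # Sort by descending cost, then block id, so the result is deterministic.
    for block, cost in sorted(block_costs.items(), key=lambda kv: (-kv[1], kv[0])):
        load, worker = heapq.heappop(heap)
        assignment[block] = worker
        heapq.heappush(heap, (load + cost, worker))
    loads = [0] * n_workers
    for block, worker in assignment.items():
        loads[worker] += block_costs[block]
    return assignment, loads


# Toy per-block costs (assumption): uneven work left behind by a sparse mask.
costs = {0: 7, 1: 5, 2: 4, 3: 3, 4: 2, 5: 1}
assignment, loads = greedy_balance(costs, n_workers=2)
print(assignment, loads)
```

Because ties are broken by block id and worker id, the same sparsity pattern always yields the same assignment, and the procedure has no tunable constants.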

Circularity Check

0 steps flagged

No circularity; central claims rest on independent empirical measurements

full rationale

The paper reports measured wall-clock throughput gains (up to 6x) and downstream accuracy preservation on RULER, PG-19, InfiniteBench, and Needle-In-A-Haystack after training Qwen2.5-3B from 32K to 512K context on 32 A100 GPUs. These outcomes are obtained from direct experimental runs rather than quantities defined in terms of the same fitted parameters or self-citation chains. The described components (dynamic sparse training pattern, balanced sparse ring attention, hierarchical sparse ring attention) function as engineering choices whose efficacy is validated externally by the reported benchmarks, with no load-bearing step reducing a prediction to an input by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach relies on standard assumptions about attention sparsity not harming gradient flow and on the existence of efficient ring communication primitives for exchanging key-value blocks between workers; no new physical constants or ad-hoc fitted scalars are introduced in the abstract.

axioms (1)
  • Domain assumption: Dynamic sparse attention patterns chosen during training preserve sufficient gradient information for long-range dependency learning.
    Implicit in the claim that accuracy is preserved on downstream long-context tasks.

pith-pipeline@v0.9.0 · 5803 in / 1310 out tokens · 26466 ms · 2026-05-21T19:38:06.030306+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention

    cs.CL · 2026-05 · unverdicted · novelty 6.0

    DashAttention introduces differentiable adaptive sparse hierarchical attention via α-entmax block selection, achieving full-attention accuracy at 75% sparsity with improved Pareto performance over NSA and InfLLMv2.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · cited by 1 Pith paper · 15 internal anchors

  [1] Avi Caciularu, Matthew E Peters, Jacob Goldberger, Ido Dagan, and Arman Cohan. Peek across: Improving multi-document modeling via cross-document question-answering. arXiv preprint arXiv:2305.15387, 2023

  [2] Yubo Ma, Yuhang Zang, Liangyu Chen, Meiqi Chen, Yizhu Jiao, Xinze Li, Xinyuan Lu, Ziyu Liu, Yan Ma, Xiaoyi Dong, et al. Mmlongbench-doc: Benchmarking long-context document understanding with visualizations. arXiv preprint arXiv:2407.01523, 2024

  [3] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023

  [4] Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. In The Thirteenth International Conference on Learning Representations, 2025

  [5] OpenAI. Introducing deep research. https://openai.com/index/introducing-deep-research/. Accessed: Feb 2, 2025

  [6] Manus. Leave it to manus. https://manus.im/. Accessed: May 15, 2025

  [7] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  [8] Qwen2 Technical Report. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  [9] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024

  [10] An Yang, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoyan Huang, Jiandong Jiang, Jianhong Tu, Jianwei Zhang, Jingren Zhou, et al. Qwen2.5-1m technical report. arXiv preprint arXiv:2501.15383, 2025

  [11] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  [12] Tianyu Gao, Alexander Wettig, Howard Yen, and Danqi Chen. How to train long-context language models (effectively). arXiv preprint arXiv:2410.02660, 2024

  [13] Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han. QUEST: Query-aware sparsity for efficient long-context LLM inference. In Forty-first International Conference on Machine Learning, 2024

  [14] Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir Abdi, Dongsheng Li, Chin-Yew Lin, et al. Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention. Advances in Neural Information Processing Systems, 37:52481–52515, 2024

  [15] Xunhao Lai, Jianqiao Lu, Yao Luo, Yiyuan Ma, and Xun Zhou. Flexprefill: A context-aware sparse attention mechanism for efficient long-sequence inference. In The Thirteenth International Conference on Learning Representations, 2025

  [16] Luka Ribar, Ivan Chelombiev, Luke Hudlass-Galley, Charlie Blake, Carlo Luschi, and Douglas Orr. Sparq attention: Bandwidth-efficient LLM inference. In Forty-first International Conference on Machine Learning, 2024

  [17] Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, YX Wei, Lean Wang, Zhiping Xiao, et al. Native sparse attention: Hardware-aligned and natively trainable sparse attention. arXiv preprint arXiv:2502.11089, 2025

  [18] Enzhe Lu, Zhejun Jiang, Jingyuan Liu, Yulun Du, Tao Jiang, Chao Hong, Shaowei Liu, Weiran He, Enming Yuan, Yuzhi Wang, et al. Moba: Mixture of block attention for long-context llms. arXiv preprint arXiv:2502.13189, 2025

  [19] Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring attention with blockwise transformers for near-infinite context. In The Twelfth International Conference on Learning Representations, 2024

  [20] William Brandon, Aniruddha Nrusimha, Kevin Qian, Zachary Ankner, Tian Jin, Zhiye Song, and Jonathan Ragan-Kelley. Striped attention: Faster ring attention for causal transformers. arXiv preprint arXiv:2311.09431, 2023

  [21] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024

  [22] Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, and Boris Ginsburg. RULER: What’s the real context size of your long-context language models? In First Conference on Language Modeling, 2024

  [23] Greg Kamradt. Needle in a haystack - pressure testing llms, 2023

  [24] Xinrong Zhang, Yingfa Chen, Shengding Hu, Zihang Xu, Junhao Chen, Moo Hao, Xu Han, Zhen Thai, Shuo Wang, Zhiyuan Liu, and Maosong Sun. Infinitebench: Extending long context evaluation beyond 100K tokens. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volum...

  [25] Jack W Rae, Anna Potapenko, Siddhant M Jayakumar, and Timothy P Lillicrap. Compressive transformers for long-range sequence modelling. arXiv preprint arXiv:1911.05507, 2019

  [26] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in neural information processing systems, 35:16344–16359, 2022

  [27] Zilin Zhu. [Feature request] balancing computation with zigzag blocking. https://github.com/zhuzilin/ring-flash-attention/issues/2, February 2024. GitHub issue #2; accessed 13 May 2025

  [28] Ruyi Xu, Guangxuan Xiao, Haofeng Huang, Junxian Guo, and Song Han. Xattention: Block sparse attention with antidiagonal scoring. arXiv preprint arXiv:2503.16428, 2025

  [29] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

  [30] Diandian Gu, Peng Sun, Qinghao Hu, Ting Huang, Xun Chen, Yingtong Xiong, Guoteng Wang, Qiaoling Chen, Shangchun Zhao, Jiarui Fang, et al. Loongtrain: Efficient training of long-sequence llms with head-context parallelism. arXiv preprint arXiv:2406.18485, 2024

  [31] Zhiqi Lin, Youshan Miao, Quanlu Zhang, Fan Yang, Yi Zhu, Cheng Li, Saeed Maleki, Xu Cao, Ning Shang, Yilei Yang, et al. nnScaler: Constraint-Guided parallelization plan generation for deep learning training. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 347–363, 2024

  [32] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020

  [33] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems, 32, 2019

  [34] Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016

  [35] Junxian Guo, Haotian Tang, Shang Yang, Zhekai Zhang, Zhijian Liu, and Song Han. Block Sparse Attention. https://github.com/mit-han-lab/Block-Sparse-Attention, 2024

  [36] Ningxin Zheng, Huiqiang Jiang, Quanlu Zhang, Zhenhua Han, Lingxiao Ma, Yuqing Yang, Fan Yang, Chengruidong Zhang, Lili Qiu, Mao Yang, et al. Pit: Optimization of dynamic sparse deep learning models via permutation invariant transformation. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 331–347, 2023

  [37] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023

  [38] Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. YaRN: Efficient context window extension of large language models. In The Twelfth International Conference on Learning Representations, 2024

  [39] Vijay Anand Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems, 5:341–353, 2023

  [40] Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Reza Yazdani Aminadabi, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. System optimizations for enabling training of extreme long sequence transformer models. In Proceedings of the 43rd ACM Symposium on Principles of Distributed Computing, pages 121–130, 2024

  [41] Cheng Luo, Jiawei Zhao, Zhuoming Chen, Beidi Chen, and Anima Anandkumar. Mini-sequence transformers: Optimizing intermediate memory for long sequences training. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  [42] Jiarui Fang and Shangchun Zhao. Usp: A unified sequence parallelism approach for long context generative ai. arXiv preprint arXiv:2405.07719, 2024

  43. [43]

    Bytescale: Efficient scaling of llm training with a 2048k context length on more than 12,000 gpus, 2025

    Hao Ge, Junda Feng, Qi Huang, Fangcheng Fu, Xiaonan Nie, Lei Zuo, Haibin Lin, Bin Cui, and Xin Liu. Bytescale: Efficient scaling of llm training with a 2048k context length on more than 12,000 gpus.arXiv preprint arXiv:2502.21231, 2025

  44. [44]

    Wlb-llm: Workload-balanced 4d parallelism for large language model training.arXiv preprint arXiv:2503.17924, 2025

    Zheng Wang, Anna Cai, Xinfeng Xie, Zaifeng Pan, Yue Guan, Weiwei Chu, Jie Wang, Shikai Li, Jianyu Huang, Chris Cai, et al. Wlb-llm: Workload-balanced 4d parallelism for large language model training.arXiv preprint arXiv:2503.17924, 2025

  45. [45]

    Flexsp: Accelerating large language model training via flexible sequence parallelism

    Yujie Wang, Shiju Wang, Shenhan Zhu, Fangcheng Fu, Xinyi Liu, Xuefeng Xiao, Huixia Li, Jiashi Li, Faming Wu, and Bin Cui. Flexsp: Accelerating large language model training via flexible sequence parallelism. InProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pages 421–...

  46. [46]

    Efficient sequence packing without cross-contamination: Accelerating large language models without impacting performance

    Mario Michael Krell, Matej Kosec, Sergio P Perez, and Andrew Fitzgibbon. Efficient sequence packing without cross-contamination: Accelerating large language models without impacting performance. arXiv preprint arXiv:2107.02027, 2021

  47. [47]

    Magiattention: A distributed attention towards linear scalability for ultra-long context, heterogeneous mask training

    Tao Zewei and Huang Yunpeng. Magiattention: A distributed attention towards linear scalability for ultra-long context, heterogeneous mask training. https://github.com/SandAI-org/MagiAttention/, 2025

  48. [48]

    LongroPE: Extending LLM context window beyond 2 million tokens

    Yiran Ding, Li Lyna Zhang, Chengruidong Zhang, Yuanyuan Xu, Ning Shang, Jiahang Xu, Fan Yang, and Mao Yang. LongroPE: Extending LLM context window beyond 2 million tokens. In Forty-first International Conference on Machine Learning, 2024

  49. [49]

    A length-extrapolatable transformer

    Yutao Sun, Li Dong, Barun Patra, Shuming Ma, Shaohan Huang, Alon Benhaim, Vishrav Chaudhary, Xia Song, and Furu Wei. A length-extrapolatable transformer. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14590–14604, Toronto...

  50. [50]

    Extending Context Window of Large Language Models via Positional Interpolation

    Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595, 2023

  51. [51]

    Training-free long-context scaling of large language models

    Chenxin An, Fei Huang, Jun Zhang, Shansan Gong, Xipeng Qiu, Chang Zhou, and Lingpeng Kong. Training-free long-context scaling of large language models. In Forty-first International Conference on Machine Learning, 2024

  52. [53]

    Why does the effective context length of LLMs fall short?

    Chenxin An, Jun Zhang, Ming Zhong, Lei Li, Shansan Gong, Yao Luo, Jingjing Xu, and Lingpeng Kong. Why does the effective context length of LLMs fall short? In The Thirteenth International Conference on Learning Representations, 2025

  53. [54]

    KIVI: A tuning-free asymmetric 2bit quantization for KV cache

    Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. KIVI: A tuning-free asymmetric 2bit quantization for KV cache. In Forty-first International Conference on Machine Learning, 2024

  54. [55]

    Kvquant: Towards 10 million context length llm inference with kv cache quantization

    Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W Mahoney, Sophia Shao, Kurt Keutzer, and Amir Gholami. Kvquant: Towards 10 million context length llm inference with kv cache quantization. Advances in Neural Information Processing Systems, 37:1270–1303, 2024

  55. [56]

    You only cache once: Decoder-decoder architectures for language models

    Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, and Furu Wei. You only cache once: Decoder-decoder architectures for language models. Advances in Neural Information Processing Systems, 37:7339–7361, 2024

  56. [57]

    Goldfinch: High performance rwkv/transformer hybrid with linear pre-fill and extreme kv-cache compression

    Daniel Goldstein, Fares Obeid, Eric Alcaide, Guangyu Song, and Eugene Cheah. Goldfinch: High performance rwkv/transformer hybrid with linear pre-fill and extreme kv-cache compression. arXiv preprint arXiv:2407.12077, 2024

  57. [58]

    GQA: Training generalized multi-query transformer models from multi-head checkpoints

    Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4895–4901, Singapor...

  58. [59]

    DHA: Learning decoupled-head attention from transformer checkpoints via adaptive heads fusion

    Yilong Chen, Linhao Zhang, Junyuan Shang, Zhenyu Zhang, Tingwen Liu, Shuohuan Wang, and Yu Sun. DHA: Learning decoupled-head attention from transformer checkpoints via adaptive heads fusion. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  59. [60]

    LLM maybe longLM: Selfextend LLM context window without tuning

    Hongye Jin, Xiaotian Han, Jingfeng Yang, Zhimeng Jiang, Zirui Liu, Chia-Yuan Chang, Huiyuan Chen, and Xia Hu. LLM maybe longLM: Selfextend LLM context window without tuning. In Forty-first International Conference on Machine Learning, 2024

  60. [61]

    InfiniGen: Efficient generative inference of large language models with dynamic KV cache management

    Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. InfiniGen: Efficient generative inference of large language models with dynamic KV cache management. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 155–172, 2024

  61. [62]

    Longformer: The Long-Document Transformer

    Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020

  62. [63]

    Big bird: Transformers for longer sequences

    Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33:17283–17297, 2020

  63. [64]

    Reformer: The Efficient Transformer

    Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020

  64. [65]

    MMInference: Accelerating pre-filling for long-context visual language models via modality-aware permutation sparse attention

    Yucheng Li, Huiqiang Jiang, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Amir H. Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang, and Lili Qiu. MMInference: Accelerating pre-filling for long-context visual language models via modality-aware permutation sparse attention. In Forty-second International Conference on Machine Learning, 2025

  65. [66]

    Spargeattn: Accurate sparse attention accelerating any model inference

    Jintao Zhang, Chendong Xiang, Haofeng Huang, Jia Wei, Haocheng Xi, Jun Zhu, and Jianfei Chen. Spargeattn: Accurate sparse attention accelerating any model inference. arXiv preprint arXiv:2502.18137, 2025

  66. [67]

    MagicPIG: LSH sampling for efficient LLM generation

    Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye, Yang Zhou, Jianyu Zhang, Niklas Nolte, Yuandong Tian, Matthijs Douze, Leon Bottou, Zhihao Jia, and Beidi Chen. MagicPIG: LSH sampling for efficient LLM generation. In The Thirteenth International Conference on Learning Representations, 2025

  67. [68]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

    A Scalability of MTraining

    Figure 10 (x-axis: Training Iteration; y-axis: Training Loss; series: Dense, MTraining): The training loss comparison of dense attention and MTraining during continued pretraining of Llama-3.1-8B-Instruct on the ProLong dataset wi...

  68. [69]

    XAttention [28]. XAttention scores square blocks by summing elements at a fixed stride along their antidiagonals and retains only the high-scoring blocks, giving a plug-and-play, training-free block-sparse attention that accelerates prefill while matching dense accuracy. In our experiments, we use the following settings with granularity being 128 as the block siz...