MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training

Chengruidong Zhang; Huiqiang Jiang; Lili Qiu; Wenxuan Li; Yucheng Li; Yuqing Yang

arxiv: 2510.18830 · v2 · pith:CWMPJQLAnew · submitted 2025-10-21 · 💻 cs.CL · cs.DC· cs.LG

MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training

Wenxuan Li , Chengruidong Zhang , Huiqiang Jiang , Yucheng Li , Yuqing Yang , Lili Qiu This is my paper

Pith reviewed 2026-05-21 19:38 UTC · model grok-4.3

classification 💻 cs.CL cs.DCcs.LG

keywords long context trainingdynamic sparse attentiondistributed trainingring attentionLLM efficiencyultra-long contextscontext window extension

0 comments

The pith

MTraining uses dynamic sparse attention with balanced ring mechanisms to train LLMs on 512K token contexts at up to 6x higher throughput while holding accuracy steady.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to train large language models on sequences hundreds of thousands of tokens long by making attention sparse in a dynamic way and then fixing the resulting load imbalances across GPUs. It introduces a dynamic sparse training pattern plus two ring-attention variants that redistribute computation and cut communication costs in a distributed cluster. The authors demonstrate the approach by taking Qwen2.5-3B from a 32K to a 512K context window on 32 A100 GPUs. A sympathetic reader would care because current full-attention training becomes impractically slow and memory-heavy once contexts exceed a few tens of thousands of tokens, limiting what models can learn about long documents or multi-step reasoning. If the method works as claimed, it removes a major practical barrier to scaling context length without requiring vastly more hardware.

Core claim

MTraining integrates a dynamic sparse training pattern, balanced sparse ring attention, and hierarchical sparse ring attention to resolve worker-level and step-level imbalance plus communication overhead when dynamic sparse attention is used for ultra-long contexts. The combined system trains Qwen2.5-3B from 32K to 512K tokens on a 32-GPU A100 cluster, delivering up to 6x higher throughput while producing models that match full-attention baselines on RULER, PG-19, InfiniteBench, and Needle-In-A-Haystack evaluations.

What carries the argument

balanced sparse ring attention and hierarchical sparse ring attention, which redistribute uneven computation and communication loads created by dynamic sparsity across workers and training steps

If this is right

Context windows can be expanded from 32K to 512K tokens on a fixed 32-GPU cluster without a proportional increase in training time.
Dynamic sparse attention becomes practical for distributed training rather than remaining limited to inference.
Downstream long-context benchmarks remain at parity with dense-attention training under the reported conditions.
The same balancing techniques can be applied to other base models that currently use ring attention.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could be combined with existing sequence parallelism or activation checkpointing to push context lengths beyond 512K on the same hardware budget.
Energy and carbon costs of long-context pretraining would drop roughly in proportion to the reported throughput gain.
Similar imbalance-correction logic might transfer to other sparse or mixture-of-experts training regimes that also produce uneven per-token work.

Load-bearing premise

The specific sparse attention patterns chosen dynamically during training do not create systematic gaps that prevent the model from learning important long-range dependencies.

What would settle it

Retraining the same model with full attention on 512K contexts and measuring whether it scores higher than the MTraining version on the longest-distance items in Needle-In-A-Haystack or InfiniteBench.

Figures

Figures reproduced from arXiv: 2510.18830 by Chengruidong Zhang, Huiqiang Jiang, Lili Qiu, Wenxuan Li, Yucheng Li, Yuqing Yang.

**Figure 1.** Figure 1: Workload distribution over 4 CP workers (GPUs) in Striped and Zigzag Ring Attention. Ring Attention Long-context training is increasingly bottlenecked by attention latency. Ring Attention [19, 20] improves scalability by distributing long sequences across devices and overlapping key–value communication with blockwise attention computation [26], allowing sequence length to scale with the number of devic… view at source ↗

**Figure 2.** Figure 2: (a) Latency breakdown of the training stage. (b) The attention recall of top-k(k=1024) from 128K [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Illustration of worker- and step-level load imbalance problems introduced by dynamic sparse attention [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Overview of MTraining in distributed scenarios. [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗

**Figure 5.** Figure 5: Step-level Computation Schedule of Striped Ring Attention (a) and Hierarchical Striped Ring Attention [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: The training loss and throughput comparison of different methods during continued pretraining of [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: Needle In A Haystack Results of the baseline checkpoint and the MTraining checkpoint. [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

**Figure 8.** Figure 8: Language Modeling Results on PG19. Language Modeling We evaluate the language modeling performance of MTraining against the baselines on the PG19 dataset with perplexity as the metric. As shown in [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗

**Figure 9.** Figure 9: Distribution of attention computation time in [PITH_FULL_IMAGE:figures/full_fig_p009_9.png] view at source ↗

**Figure 10.** Figure 10: The training loss comparison of dense attention and MTrainig during continued pretraining of [PITH_FULL_IMAGE:figures/full_fig_p015_10.png] view at source ↗

**Figure 11.** Figure 11: Needle In A Haystack Results of the baseline checkpoint and the MTraining checkpoint with [PITH_FULL_IMAGE:figures/full_fig_p018_11.png] view at source ↗

**Figure 12.** Figure 12: Distribution of attention computation time using different methods with 512K tokens on 32 GPUs: [PITH_FULL_IMAGE:figures/full_fig_p018_12.png] view at source ↗

**Figure 13.** Figure 13: Step-level Computation Schedule of Zigzag Ring Attention. [PITH_FULL_IMAGE:figures/full_fig_p019_13.png] view at source ↗

read the original abstract

The adoption of long context windows has become a standard feature in Large Language Models (LLMs), as extended contexts significantly enhance their capacity for complex reasoning and broaden their applicability across diverse scenarios. Dynamic sparse attention is a promising approach for reducing the computational cost of long-context. However, efficiently training LLMs with dynamic sparse attention on ultra-long contexts-especially in distributed settings-remains a significant challenge, due in large part to worker- and step-level imbalance. This paper introduces MTraining, a novel distributed methodology leveraging dynamic sparse attention to enable efficient training for LLMs with ultra-long contexts. Specifically, MTraining integrates three key components: a dynamic sparse training pattern, balanced sparse ring attention, and hierarchical sparse ring attention. These components are designed to synergistically address the computational imbalance and communication overheads inherent in dynamic sparse attention mechanisms during the training of models with extensive context lengths. We demonstrate the efficacy of MTraining by training Qwen2.5-3B, successfully expanding its context window from 32K to 512K tokens on a cluster of 32 A100 GPUs. Our evaluations on a comprehensive suite of downstream tasks, including RULER, PG-19, InfiniteBench, and Needle In A Haystack, reveal that MTraining achieves up to a 6x higher training throughput while preserving model accuracy. Our code is available at https://github.com/microsoft/MInference/tree/main/MTraining.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MTraining gives a concrete distributed system for dynamic sparse attention that trains Qwen2.5-3B to 512K context at 6x throughput on 32 A100s, but the claim of preserved accuracy rests on thin evidence about the sparsity pattern itself.

read the letter

The main point is that they actually trained a model this way and measured real wall-clock gains plus downstream scores. The integration of a dynamic sparse training pattern with balanced sparse ring attention and hierarchical sparse ring attention looks like the practical fix for the imbalance problems that usually show up when you try sparse attention across workers and steps at this scale. They report expanding Qwen2.5-3B from 32K to 512K context and releasing code, which is more than a sketch.

Referee Report

3 major / 2 minor

Summary. The paper proposes MTraining, a distributed training framework for LLMs with ultra-long contexts that integrates dynamic sparse attention, balanced sparse ring attention, and hierarchical sparse ring attention to reduce computational imbalance and communication overhead. The authors train Qwen2.5-3B, extending its context window from 32K to 512K tokens on a 32-GPU A100 cluster, and report up to 6x higher training throughput while preserving accuracy on RULER, PG-19, InfiniteBench, and Needle-In-A-Haystack evaluations.

Significance. If the results hold, the work would be significant for practical long-context LLM training by demonstrating concrete wall-clock throughput gains on real hardware and models without apparent accuracy loss. The use of multiple long-context benchmarks and the release of code at the cited GitHub repository strengthen the practical contribution.

major comments (3)

[Evaluation / Experiments] The central claim that accuracy is preserved on RULER, PG-19, InfiniteBench, and Needle-In-A-Haystack rests on the assumption that the dynamic sparse pattern supplies sufficient gradient signal for long-range dependencies. No ablation is reported that freezes the sparsity mask versus allowing it to evolve dynamically, nor is the fraction of long-range (>32k) tokens receiving zero gradient measured across training steps.
[Methods] The dynamic sparse training pattern is introduced as one of the three key components, but the exact selection criterion (attention scores, proxy, or otherwise) and its per-step evolution are not specified in sufficient detail to assess whether low-attention long-range tokens are systematically under-sampled, which would create a training-time distribution shift undetectable by the downstream evals.
[Results / Implementation] The 6x throughput result on the 32-GPU cluster is load-bearing for the efficiency claim, yet it is unclear whether the load-balancing rules in balanced sparse ring attention and hierarchical sparse ring attention were tuned post-hoc to the reported runs or derived parameter-free from the sparsity pattern.

minor comments (2)

[Abstract] The benchmark is referred to as 'Needle In A Haystack' in the abstract; standardize to 'Needle-In-A-Haystack' for consistency with common usage.
[Abstract / Appendix] The code link is provided, but the manuscript does not include a reproducibility checklist or details on random seeds, hyperparameter search ranges, or exact sparsity thresholds used in the reported experiments.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below with clarifications and indicate where revisions will be made to strengthen the paper.

read point-by-point responses

Referee: [Evaluation / Experiments] The central claim that accuracy is preserved on RULER, PG-19, InfiniteBench, and Needle-In-A-Haystack rests on the assumption that the dynamic sparse pattern supplies sufficient gradient signal for long-range dependencies. No ablation is reported that freezes the sparsity mask versus allowing it to evolve dynamically, nor is the fraction of long-range (>32k) tokens receiving zero gradient measured across training steps.

Authors: We agree that an ablation study contrasting a frozen sparsity mask against the dynamic version would provide stronger support for the role of dynamic evolution in preserving long-range gradient signals. In the revised manuscript we will add this ablation on the Qwen2.5-3B model, showing that the dynamic pattern yields measurably better downstream performance on long-context benchmarks. We will also report the measured fraction of long-range (>32k) tokens that receive zero gradient at multiple training checkpoints; our internal analysis indicates this fraction is kept low by the hierarchical component, and we will include the corresponding statistics and plots. revision: yes
Referee: [Methods] The dynamic sparse training pattern is introduced as one of the three key components, but the exact selection criterion (attention scores, proxy, or otherwise) and its per-step evolution are not specified in sufficient detail to assess whether low-attention long-range tokens are systematically under-sampled, which would create a training-time distribution shift undetectable by the downstream evals.

Authors: We thank the referee for highlighting the need for greater methodological transparency. The dynamic sparse training pattern selects tokens using a hybrid criterion that combines attention scores computed via a lightweight proxy model with positional priors designed to guarantee coverage of distant positions. The mask is recomputed at every training step from the current model activations. We will expand Section 3.1 with precise pseudocode, the exact proxy formulation, and an explicit discussion of how the selection rule prevents systematic under-sampling of long-range tokens, thereby allowing readers to evaluate potential distribution-shift concerns. revision: yes
Referee: [Results / Implementation] The 6x throughput result on the 32-GPU cluster is load-bearing for the efficiency claim, yet it is unclear whether the load-balancing rules in balanced sparse ring attention and hierarchical sparse ring attention were tuned post-hoc to the reported runs or derived parameter-free from the sparsity pattern.

Authors: The load-balancing rules in both balanced sparse ring attention and hierarchical sparse ring attention are derived directly and in a parameter-free manner from the instantaneous sparsity pattern: token-to-GPU assignments are computed from the per-token sparsity mask using a deterministic balancing procedure described in Sections 3.2 and 3.3. No post-hoc tuning specific to the reported 32-GPU runs was performed. To remove any ambiguity we will add an explicit statement in the revised Results section confirming the parameter-free derivation and will include a short algorithmic description of the balancing step. revision: partial

Circularity Check

0 steps flagged

No circularity; central claims rest on independent empirical measurements

full rationale

The paper reports measured wall-clock throughput gains (up to 6x) and downstream accuracy preservation on RULER, PG-19, InfiniteBench, and Needle-In-A-Haystack after training Qwen2.5-3B from 32K to 512K context on 32 A100 GPUs. These outcomes are obtained from direct experimental runs rather than quantities defined in terms of the same fitted parameters or self-citation chains. The described components (dynamic sparse training pattern, balanced sparse ring attention, hierarchical sparse ring attention) function as engineering choices whose efficacy is validated externally by the reported benchmarks, with no load-bearing step reducing a prediction to an input by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach relies on standard assumptions about attention sparsity not harming gradient flow and on the existence of efficient ring-allreduce primitives; no new physical constants or ad-hoc fitted scalars are introduced in the abstract.

axioms (1)

domain assumption Dynamic sparse attention patterns chosen during training preserve sufficient gradient information for long-range dependency learning.
Implicit in the claim that accuracy is preserved on downstream long-context tasks.

pith-pipeline@v0.9.0 · 5803 in / 1310 out tokens · 26466 ms · 2026-05-21T19:38:06.030306+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

dynamic sparse training pattern... Vertical-Slash locality pattern... Balanced Sparse Ring Attention... Hierarchical Sparse Ring Attention
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Theorem 3.1... expectation of the attention weights after applying RoPE depends solely on the relative position n-m

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention
cs.CL 2026-05 unverdicted novelty 6.0

DashAttention introduces differentiable adaptive sparse hierarchical attention via α-entmax block selection, achieving full-attention accuracy at 75% sparsity with improved Pareto performance over NSA and InfLLMv2.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · cited by 1 Pith paper · 15 internal anchors

[1]

Peek across: Improving multi-document modeling via cross-document question-answering.arXiv preprint arXiv:2305.15387, 2023

Avi Caciularu, Matthew E Peters, Jacob Goldberger, Ido Dagan, and Arman Cohan. Peek across: Improving multi-document modeling via cross-document question-answering.arXiv preprint arXiv:2305.15387, 2023

work page arXiv 2023
[2]

Mmlongbench-doc: Benchmarking long-context document understanding with visualizations.arXiv preprint arXiv:2407.01523, 2024

Yubo Ma, Yuhang Zang, Liangyu Chen, Meiqi Chen, Yizhu Jiao, Xinze Li, Xinyuan Lu, Ziyu Liu, Yan Ma, Xiaoyi Dong, et al. Mmlongbench-doc: Benchmarking long-context document understanding with visualizations.arXiv preprint arXiv:2407.01523, 2024

work page arXiv 2024
[3]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?arXiv preprint arXiv:2310.06770, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

Livecodebench: Holistic and contamination free evaluation of large language models for code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Ar- mando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[5]

Introducing deep research

OpenAI. Introducing deep research. https://openai.com/index/ introducing-deep-research/. Accessed: Feb 2, 2025

work page 2025
[6]

Leave it to manus.https://manus.im/

Manus. Leave it to manus.https://manus.im/. Accessed: May 15, 2025

work page 2025
[7]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

Qwen2 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[10]

An Yang, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoyan Huang, Jiandong Jiang, Jianhong Tu, Jianwei Zhang, Jingren Zhou, et al. Qwen2. 5-1m technical report.arXiv preprint arXiv:2501.15383, 2025. 10

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

How to train long- context language models (effectively)

Tianyu Gao, Alexander Wettig, Howard Yen, and Danqi Chen. How to train long-context language models (effectively).arXiv preprint arXiv:2410.02660, 2024

work page arXiv 2024
[13]

QUEST: Query-aware sparsity for efficient long-context LLM inference

Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han. QUEST: Query-aware sparsity for efficient long-context LLM inference. InForty-first International Conference on Machine Learning, 2024

work page 2024
[14]

Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention.Advances in Neural Information Processing Systems, 37:52481–52515, 2024

Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir Abdi, Dongsheng Li, Chin-Yew Lin, et al. Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention.Advances in Neural Information Processing Systems, 37:52481–52515, 2024

work page 2024
[15]

Flexprefill: A context- aware sparse attention mechanism for efficient long-sequence inference

Xunhao Lai, Jianqiao Lu, Yao Luo, Yiyuan Ma, and Xun Zhou. Flexprefill: A context- aware sparse attention mechanism for efficient long-sequence inference. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[16]

Sparq attention: Bandwidth-efficient LLM inference

Luka Ribar, Ivan Chelombiev, Luke Hudlass-Galley, Charlie Blake, Carlo Luschi, and Douglas Orr. Sparq attention: Bandwidth-efficient LLM inference. InForty-first International Conference on Machine Learning, 2024

work page 2024
[17]

Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention

Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, YX Wei, Lean Wang, Zhiping Xiao, et al. Native sparse attention: Hardware-aligned and natively trainable sparse attention.arXiv preprint arXiv:2502.11089, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[18]

MoBA: Mixture of Block Attention for Long-Context LLMs

Enzhe Lu, Zhejun Jiang, Jingyuan Liu, Yulun Du, Tao Jiang, Chao Hong, Shaowei Liu, Weiran He, Enming Yuan, Yuzhi Wang, et al. Moba: Mixture of block attention for long-context llms. arXiv preprint arXiv:2502.13189, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

Ring attention with blockwise transformers for near-infinite context

Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring attention with blockwise transformers for near-infinite context. InThe Twelfth International Conference on Learning Representations, 2024

work page 2024
[20]

Striped attention: Faster ring attention for causal transformers

William Brandon, Aniruddha Nrusimha, Kevin Qian, Zachary Ankner, Tian Jin, Zhiye Song, and Jonathan Ragan-Kelley. Striped attention: Faster ring attention for causal transformers. arXiv preprint arXiv:2311.09431, 2023

work page arXiv 2023
[21]

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2. 5 technical report.arXiv preprint arXiv:2412.15115, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[22]

RULER: What’s the real context size of your long-context language models? InFirst Conference on Language Modeling, 2024

Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, and Boris Ginsburg. RULER: What’s the real context size of your long-context language models? InFirst Conference on Language Modeling, 2024

work page 2024
[23]

Needle in a haystack - pressure testing llms, 2023

Greg Kamradt. Needle in a haystack - pressure testing llms, 2023

work page 2023
[24]

Infinitebench: Extending long context evaluation beyond 100K tokens

Xinrong Zhang, Yingfa Chen, Shengding Hu, Zihang Xu, Junhao Chen, Moo Hao, Xu Han, Zhen Thai, Shuo Wang, Zhiyuan Liu, and Maosong Sun. Infinitebench: Extending long context evaluation beyond 100K tokens. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volum...

work page 2024
[25]

Compressive Transformers for Long-Range Sequence Modelling

Jack W Rae, Anna Potapenko, Siddhant M Jayakumar, and Timothy P Lillicrap. Compressive transformers for long-range sequence modelling.arXiv preprint arXiv:1911.05507, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1911
[26]

Flashattention: Fast and memory-efficient exact attention with io-awareness.Advances in neural information processing systems, 35:16344–16359, 2022

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness.Advances in neural information processing systems, 35:16344–16359, 2022. 11

work page 2022
[27]

[Feature request] balancing computation with zigzag blocking

Zilin Zhu. [Feature request] balancing computation with zigzag blocking. https://github. com/zhuzilin/ring-flash-attention/issues/2, February 2024. GitHub issue #2; ac- cessed 13 May 2025

work page 2024
[28]

Xattention: Block sparse attention with antidiagonal scoring.arXiv preprint arXiv:2503.16428,

Ruyi Xu, Guangxuan Xiao, Haofeng Huang, Junxian Guo, and Song Han. Xattention: Block sparse attention with antidiagonal scoring.arXiv preprint arXiv:2503.16428, 2025

work page arXiv 2025
[29]

Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

work page 2024
[30]

Loongtrain: Efficient training of long-sequence llms with head-context parallelism.arXiv preprint arXiv:2406.18485, 2024

Diandian Gu, Peng Sun, Qinghao Hu, Ting Huang, Xun Chen, Yingtong Xiong, Guoteng Wang, Qiaoling Chen, Shangchun Zhao, Jiarui Fang, et al. Loongtrain: Efficient training of long-sequence llms with head-context parallelism.arXiv preprint arXiv:2406.18485, 2024

work page arXiv 2024
[31]

{nnScaler}:{Constraint-Guided} parallelization plan generation for deep learning training

Zhiqi Lin, Youshan Miao, Quanlu Zhang, Fan Yang, Yi Zhu, Cheng Li, Saeed Maleki, Xu Cao, Ning Shang, Yilei Yang, et al. {nnScaler}:{Constraint-Guided} parallelization plan generation for deep learning training. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 347–363, 2024

work page 2024
[32]

Zero: Memory optimiza- tions toward training trillion parameter models

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimiza- tions toward training trillion parameter models. InSC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020

work page 2020
[33]

Gpipe: Efficient training of giant neural networks using pipeline parallelism.Advances in neural information processing systems, 32, 2019

Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, Hy- oukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism.Advances in neural information processing systems, 32, 2019

work page 2019
[34]

Training Deep Nets with Sublinear Memory Cost

Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost.arXiv preprint arXiv:1604.06174, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[35]

Block Sparse Attention

Junxian Guo, Haotian Tang, Shang Yang, Zhekai Zhang, Zhijian Liu, and Song Han. Block Sparse Attention. https://github.com/mit-han-lab/Block-Sparse-Attention , 2024

work page 2024
[36]

Pit: Optimization of dynamic sparse deep learning models via permutation invariant transformation

Ningxin Zheng, Huiqiang Jiang, Quanlu Zhang, Zhenhua Han, Lingxiao Ma, Yuqing Yang, Fan Yang, Chengruidong Zhang, Lili Qiu, Mao Yang, et al. Pit: Optimization of dynamic sparse deep learning models via permutation invariant transformation. InProceedings of the 29th Symposium on Operating Systems Principles, pages 331–347, 2023

work page 2023
[37]

Efficient memory management for large language model serving with pagedattention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023

work page 2023
[38]

YaRN: Efficient context window extension of large language models

Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. YaRN: Efficient context window extension of large language models. InThe Twelfth International Conference on Learning Representations, 2024

work page 2024
[39]

Reducing activation recomputation in large transformer models.Proceedings of Machine Learning and Systems, 5:341–353, 2023

Vijay Anand Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Ander- sch, Mohammad Shoeybi, and Bryan Catanzaro. Reducing activation recomputation in large transformer models.Proceedings of Machine Learning and Systems, 5:341–353, 2023

work page 2023
[40]

System optimizations for enabling training of extreme long sequence transformer models

Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Reza Yazdani Aminadabi, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. System optimizations for enabling training of extreme long sequence transformer models. InProceedings of the 43rd ACM Symposium on Principles of Distributed Computing, pages 121–130, 2024

work page 2024
[41]

Mini-sequence transformers: Optimizing intermediate memory for long sequences training

Cheng Luo, Jiawei Zhao, Zhuoming Chen, Beidi Chen, and Anima Anandkumar. Mini-sequence transformers: Optimizing intermediate memory for long sequences training. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

work page 2024
[42]

A unified sequence parallelism approach for long context generative ai.arXiv preprint arXiv:2405.07719, 2024

Jiarui Fang and Shangchun Zhao. Usp: A unified sequence parallelism approach for long context generative ai.arXiv preprint arXiv:2405.07719, 2024. 12

work page arXiv 2024
[43]

Bytescale: Efficient scaling of llm training with a 2048k context length on more than 12,000 gpus, 2025

Hao Ge, Junda Feng, Qi Huang, Fangcheng Fu, Xiaonan Nie, Lei Zuo, Haibin Lin, Bin Cui, and Xin Liu. Bytescale: Efficient scaling of llm training with a 2048k context length on more than 12,000 gpus.arXiv preprint arXiv:2502.21231, 2025

work page arXiv 2025
[44]

Wlb-llm: Workload-balanced 4d parallelism for large language model training.arXiv preprint arXiv:2503.17924, 2025

Zheng Wang, Anna Cai, Xinfeng Xie, Zaifeng Pan, Yue Guan, Weiwei Chu, Jie Wang, Shikai Li, Jianyu Huang, Chris Cai, et al. Wlb-llm: Workload-balanced 4d parallelism for large language model training.arXiv preprint arXiv:2503.17924, 2025

work page arXiv 2025
[45]

Flexsp: Accelerating large language model training via flexible sequence parallelism

Yujie Wang, Shiju Wang, Shenhan Zhu, Fangcheng Fu, Xinyi Liu, Xuefeng Xiao, Huixia Li, Jiashi Li, Faming Wu, and Bin Cui. Flexsp: Accelerating large language model training via flexible sequence parallelism. InProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pages 421–...

work page 2025
[46]

Kosec, S

Mario Michael Krell, Matej Kosec, Sergio P Perez, and Andrew Fitzgibbon. Efficient sequence packing without cross-contamination: Accelerating large language models without impacting performance.arXiv preprint arXiv:2107.02027, 2021

work page arXiv 2021
[47]

Magiattention: A distributed attention towards linear scalability for ultra-long context, heterogeneous mask training

Tao Zewei and Huang Yunpeng. Magiattention: A distributed attention towards linear scalability for ultra-long context, heterogeneous mask training. https://github.com/SandAI-org/ MagiAttention/, 2025

work page 2025
[48]

LongroPE: Extending LLM context window beyond 2 million tokens

Yiran Ding, Li Lyna Zhang, Chengruidong Zhang, Yuanyuan Xu, Ning Shang, Jiahang Xu, Fan Yang, and Mao Yang. LongroPE: Extending LLM context window beyond 2 million tokens. In Forty-first International Conference on Machine Learning, 2024

work page 2024
[49]

A length-extrapolatable transformer

Yutao Sun, Li Dong, Barun Patra, Shuming Ma, Shaohan Huang, Alon Benhaim, Vishrav Chaudhary, Xia Song, and Furu Wei. A length-extrapolatable transformer. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors,Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14590–14604, Toronto...

work page 2023
[50]

Extending Context Window of Large Language Models via Positional Interpolation

Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation.arXiv preprint arXiv:2306.15595, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[51]

Training-free long-context scaling of large language models

Chenxin An, Fei Huang, Jun Zhang, Shansan Gong, Xipeng Qiu, Chang Zhou, and Lingpeng Kong. Training-free long-context scaling of large language models. InForty-first International Conference on Machine Learning, 2024

work page 2024
[53]

Why does the effective context length of LLMs fall short? InThe Thirteenth International Conference on Learning Representations, 2025

Chenxin An, Jun Zhang, Ming Zhong, Lei Li, Shansan Gong, Yao Luo, Jingjing Xu, and Lingpeng Kong. Why does the effective context length of LLMs fall short? InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[54]

KIVI: A tuning-free asymmetric 2bit quantization for KV cache

Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. KIVI: A tuning-free asymmetric 2bit quantization for KV cache. In Forty-first International Conference on Machine Learning, 2024

work page 2024
[55]

Kvquant: Towards 10 million context length llm inference with kv cache quantization.Advances in Neural Information Processing Systems, 37:1270–1303, 2024

Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W Mahoney, Sophia Shao, Kurt Keutzer, and Amir Gholami. Kvquant: Towards 10 million context length llm inference with kv cache quantization.Advances in Neural Information Processing Systems, 37:1270–1303, 2024

work page 2024
[56]

You only cache once: Decoder-decoder architectures for language models.Advances in Neural Information Processing Systems, 37:7339–7361, 2024

Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, and Furu Wei. You only cache once: Decoder-decoder architectures for language models.Advances in Neural Information Processing Systems, 37:7339–7361, 2024

work page 2024
[57]

Goldfinch: High performance rwkv/transformer hybrid with linear pre-fill and extreme kv-cache compression

Daniel Goldstein, Fares Obeid, Eric Alcaide, Guangyu Song, and Eugene Cheah. Goldfinch: High performance rwkv/transformer hybrid with linear pre-fill and extreme kv-cache compres- sion.arXiv preprint arXiv:2407.12077, 2024. 13

work page arXiv 2024
[58]

GQA: Training generalized multi-query transformer models from multi-head checkpoints

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4895–4901, Singapor...

work page 2023
[59]

DHA: Learning decoupled-head attention from transformer checkpoints via adaptive heads fusion

Yilong Chen, Linhao Zhang, Junyuan Shang, Zhenyu Zhang, Tingwen Liu, Shuohuan Wang, and Yu Sun. DHA: Learning decoupled-head attention from transformer checkpoints via adaptive heads fusion. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

work page 2024
[60]

LLM maybe longLM: Selfextend LLM context window without tuning

Hongye Jin, Xiaotian Han, Jingfeng Yang, Zhimeng Jiang, Zirui Liu, Chia-Yuan Chang, Huiyuan Chen, and Xia Hu. LLM maybe longLM: Selfextend LLM context window without tuning. In Forty-first International Conference on Machine Learning, 2024

work page 2024
[61]

{InfiniGen}: Efficient generative inference of large language models with dynamic {KV} cache management

Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. {InfiniGen}: Efficient generative inference of large language models with dynamic {KV} cache management. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 155–172, 2024

work page 2024
[62]

Longformer: The Long-Document Transformer

Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2004
[63]

Big bird: Transformers for longer sequences.Advances in neural information processing systems, 33:17283–17297, 2020

Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santi- ago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big bird: Transformers for longer sequences.Advances in neural information processing systems, 33:17283–17297, 2020

work page 2020
[64]

Reformer: The Efficient Transformer

Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001
[65]

Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang, and Lili Qiu

Yucheng Li, Huiqiang Jiang, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Amir H. Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang, and Lili Qiu. MMInference: Accelerating pre-filling for long-context visual language models via modality-aware permutation sparse attention. InForty-second International Conference on Machine Learning, 2025

work page 2025
[66]

Spargeattn: Accurate sparse attention accelerating any model inference.arXiv preprint arXiv:2502.18137, 2025

Jintao Zhang, Chendong Xiang, Haofeng Huang, Jia Wei, Haocheng Xi, Jun Zhu, and Jianfei Chen. Spargeattn: Accurate sparse attention accelerating any model inference.arXiv preprint arXiv:2502.18137, 2025

work page arXiv 2025
[67]

MagicPIG: LSH sampling for efficient LLM generation

Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye, Yang Zhou, Jianyu Zhang, Niklas Nolte, Yuandong Tian, Matthijs Douze, Leon Bottou, Zhihao Jia, and Beidi Chen. MagicPIG: LSH sampling for efficient LLM generation. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[68]

Adam: A Method for Stochastic Optimization

Diederik P Kingma. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014. 14 A Scalability of MTraining 0 10 20 30 40 50 60 Training Iteration 1 2 4 6Training Loss Dense MTraining Figure 10: The training loss comparison of dense attention and MTrainig during continued pretraining of Llama-3.1-8B-Instruct on the ProLong dataset wi...

work page internal anchor Pith review Pith/arXiv arXiv 2014
[69]

XAttention [28]. XAttention score square blocks by summing every certain stride along their antidiagonals and retains only the high-score blocks, giving a plug-and-play, training-free block- sparse attention that accelerates prefill while matching dense accuracy. In our experiments, we use the following settings with granularity being 128 as the block siz...

work page

[1] [1]

Peek across: Improving multi-document modeling via cross-document question-answering.arXiv preprint arXiv:2305.15387, 2023

Avi Caciularu, Matthew E Peters, Jacob Goldberger, Ido Dagan, and Arman Cohan. Peek across: Improving multi-document modeling via cross-document question-answering.arXiv preprint arXiv:2305.15387, 2023

work page arXiv 2023

[2] [2]

Mmlongbench-doc: Benchmarking long-context document understanding with visualizations.arXiv preprint arXiv:2407.01523, 2024

Yubo Ma, Yuhang Zang, Liangyu Chen, Meiqi Chen, Yizhu Jiao, Xinze Li, Xinyuan Lu, Ziyu Liu, Yan Ma, Xiaoyi Dong, et al. Mmlongbench-doc: Benchmarking long-context document understanding with visualizations.arXiv preprint arXiv:2407.01523, 2024

work page arXiv 2024

[3] [3]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?arXiv preprint arXiv:2310.06770, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[4] [4]

Livecodebench: Holistic and contamination free evaluation of large language models for code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Ar- mando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025

[5] [5]

Introducing deep research

OpenAI. Introducing deep research. https://openai.com/index/ introducing-deep-research/. Accessed: Feb 2, 2025

work page 2025

[6] [6]

Leave it to manus.https://manus.im/

Manus. Leave it to manus.https://manus.im/. Accessed: May 15, 2025

work page 2025

[7] [7]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[8] [8]

Qwen2 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[9] [9]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[10] [10]

An Yang, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoyan Huang, Jiandong Jiang, Jianhong Tu, Jianwei Zhang, Jingren Zhou, et al. Qwen2. 5-1m technical report.arXiv preprint arXiv:2501.15383, 2025. 10

work page internal anchor Pith review Pith/arXiv arXiv 2025

[11] [11]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[12] [12]

How to train long- context language models (effectively)

Tianyu Gao, Alexander Wettig, Howard Yen, and Danqi Chen. How to train long-context language models (effectively).arXiv preprint arXiv:2410.02660, 2024

work page arXiv 2024

[13] [13]

QUEST: Query-aware sparsity for efficient long-context LLM inference

Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han. QUEST: Query-aware sparsity for efficient long-context LLM inference. InForty-first International Conference on Machine Learning, 2024

work page 2024

[14] [14]

Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention.Advances in Neural Information Processing Systems, 37:52481–52515, 2024

Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir Abdi, Dongsheng Li, Chin-Yew Lin, et al. Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention.Advances in Neural Information Processing Systems, 37:52481–52515, 2024

work page 2024

[15] [15]

Flexprefill: A context- aware sparse attention mechanism for efficient long-sequence inference

Xunhao Lai, Jianqiao Lu, Yao Luo, Yiyuan Ma, and Xun Zhou. Flexprefill: A context- aware sparse attention mechanism for efficient long-sequence inference. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025

[16] [16]

Sparq attention: Bandwidth-efficient LLM inference

Luka Ribar, Ivan Chelombiev, Luke Hudlass-Galley, Charlie Blake, Carlo Luschi, and Douglas Orr. Sparq attention: Bandwidth-efficient LLM inference. InForty-first International Conference on Machine Learning, 2024

work page 2024

[17] [17]

Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention

Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, YX Wei, Lean Wang, Zhiping Xiao, et al. Native sparse attention: Hardware-aligned and natively trainable sparse attention.arXiv preprint arXiv:2502.11089, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[18] [18]

MoBA: Mixture of Block Attention for Long-Context LLMs

Enzhe Lu, Zhejun Jiang, Jingyuan Liu, Yulun Du, Tao Jiang, Chao Hong, Shaowei Liu, Weiran He, Enming Yuan, Yuzhi Wang, et al. Moba: Mixture of block attention for long-context llms. arXiv preprint arXiv:2502.13189, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[19] [19]

Ring attention with blockwise transformers for near-infinite context

Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring attention with blockwise transformers for near-infinite context. InThe Twelfth International Conference on Learning Representations, 2024

work page 2024

[20] [20]

Striped attention: Faster ring attention for causal transformers

William Brandon, Aniruddha Nrusimha, Kevin Qian, Zachary Ankner, Tian Jin, Zhiye Song, and Jonathan Ragan-Kelley. Striped attention: Faster ring attention for causal transformers. arXiv preprint arXiv:2311.09431, 2023

work page arXiv 2023

[21] [21]

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2. 5 technical report.arXiv preprint arXiv:2412.15115, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[22] [22]

RULER: What’s the real context size of your long-context language models? InFirst Conference on Language Modeling, 2024

Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, and Boris Ginsburg. RULER: What’s the real context size of your long-context language models? InFirst Conference on Language Modeling, 2024

work page 2024

[23] [23]

Needle in a haystack - pressure testing llms, 2023

Greg Kamradt. Needle in a haystack - pressure testing llms, 2023

work page 2023

[24] [24]

Infinitebench: Extending long context evaluation beyond 100K tokens

Xinrong Zhang, Yingfa Chen, Shengding Hu, Zihang Xu, Junhao Chen, Moo Hao, Xu Han, Zhen Thai, Shuo Wang, Zhiyuan Liu, and Maosong Sun. Infinitebench: Extending long context evaluation beyond 100K tokens. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volum...

work page 2024

[25] [25]

Compressive Transformers for Long-Range Sequence Modelling

Jack W Rae, Anna Potapenko, Siddhant M Jayakumar, and Timothy P Lillicrap. Compressive transformers for long-range sequence modelling.arXiv preprint arXiv:1911.05507, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1911

[26] [26]

Flashattention: Fast and memory-efficient exact attention with io-awareness.Advances in neural information processing systems, 35:16344–16359, 2022

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness.Advances in neural information processing systems, 35:16344–16359, 2022. 11

work page 2022

[27] [27]

[Feature request] balancing computation with zigzag blocking

Zilin Zhu. [Feature request] balancing computation with zigzag blocking. https://github. com/zhuzilin/ring-flash-attention/issues/2, February 2024. GitHub issue #2; ac- cessed 13 May 2025

work page 2024

[28] [28]

Xattention: Block sparse attention with antidiagonal scoring.arXiv preprint arXiv:2503.16428,

Ruyi Xu, Guangxuan Xiao, Haofeng Huang, Junxian Guo, and Song Han. Xattention: Block sparse attention with antidiagonal scoring.arXiv preprint arXiv:2503.16428, 2025

work page arXiv 2025

[29] [29]

Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

work page 2024

[30] [30]

Loongtrain: Efficient training of long-sequence llms with head-context parallelism.arXiv preprint arXiv:2406.18485, 2024

Diandian Gu, Peng Sun, Qinghao Hu, Ting Huang, Xun Chen, Yingtong Xiong, Guoteng Wang, Qiaoling Chen, Shangchun Zhao, Jiarui Fang, et al. Loongtrain: Efficient training of long-sequence llms with head-context parallelism.arXiv preprint arXiv:2406.18485, 2024

work page arXiv 2024

[31] [31]

{nnScaler}:{Constraint-Guided} parallelization plan generation for deep learning training

Zhiqi Lin, Youshan Miao, Quanlu Zhang, Fan Yang, Yi Zhu, Cheng Li, Saeed Maleki, Xu Cao, Ning Shang, Yilei Yang, et al. {nnScaler}:{Constraint-Guided} parallelization plan generation for deep learning training. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 347–363, 2024

work page 2024

[32] [32]

Zero: Memory optimiza- tions toward training trillion parameter models

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimiza- tions toward training trillion parameter models. InSC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020

work page 2020

[33] [33]

Gpipe: Efficient training of giant neural networks using pipeline parallelism.Advances in neural information processing systems, 32, 2019

Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, Hy- oukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism.Advances in neural information processing systems, 32, 2019

work page 2019

[34] [34]

Training Deep Nets with Sublinear Memory Cost

Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost.arXiv preprint arXiv:1604.06174, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[35] [35]

Block Sparse Attention

Junxian Guo, Haotian Tang, Shang Yang, Zhekai Zhang, Zhijian Liu, and Song Han. Block Sparse Attention. https://github.com/mit-han-lab/Block-Sparse-Attention , 2024

work page 2024

[36] [36]

Pit: Optimization of dynamic sparse deep learning models via permutation invariant transformation

Ningxin Zheng, Huiqiang Jiang, Quanlu Zhang, Zhenhua Han, Lingxiao Ma, Yuqing Yang, Fan Yang, Chengruidong Zhang, Lili Qiu, Mao Yang, et al. Pit: Optimization of dynamic sparse deep learning models via permutation invariant transformation. InProceedings of the 29th Symposium on Operating Systems Principles, pages 331–347, 2023

work page 2023

[37] [37]

Efficient memory management for large language model serving with pagedattention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023

work page 2023

[38] [38]

YaRN: Efficient context window extension of large language models

Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. YaRN: Efficient context window extension of large language models. InThe Twelfth International Conference on Learning Representations, 2024

work page 2024

[39] [39]

Reducing activation recomputation in large transformer models.Proceedings of Machine Learning and Systems, 5:341–353, 2023

Vijay Anand Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Ander- sch, Mohammad Shoeybi, and Bryan Catanzaro. Reducing activation recomputation in large transformer models.Proceedings of Machine Learning and Systems, 5:341–353, 2023

work page 2023

[40] [40]

System optimizations for enabling training of extreme long sequence transformer models

Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Reza Yazdani Aminadabi, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. System optimizations for enabling training of extreme long sequence transformer models. InProceedings of the 43rd ACM Symposium on Principles of Distributed Computing, pages 121–130, 2024

work page 2024

[41] [41]

Mini-sequence transformers: Optimizing intermediate memory for long sequences training

Cheng Luo, Jiawei Zhao, Zhuoming Chen, Beidi Chen, and Anima Anandkumar. Mini-sequence transformers: Optimizing intermediate memory for long sequences training. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

work page 2024

[42] [42]

A unified sequence parallelism approach for long context generative ai.arXiv preprint arXiv:2405.07719, 2024

Jiarui Fang and Shangchun Zhao. Usp: A unified sequence parallelism approach for long context generative ai.arXiv preprint arXiv:2405.07719, 2024. 12

work page arXiv 2024

[43] [43]

Bytescale: Efficient scaling of llm training with a 2048k context length on more than 12,000 gpus, 2025

Hao Ge, Junda Feng, Qi Huang, Fangcheng Fu, Xiaonan Nie, Lei Zuo, Haibin Lin, Bin Cui, and Xin Liu. Bytescale: Efficient scaling of llm training with a 2048k context length on more than 12,000 gpus.arXiv preprint arXiv:2502.21231, 2025

work page arXiv 2025

[44] [44]

Wlb-llm: Workload-balanced 4d parallelism for large language model training.arXiv preprint arXiv:2503.17924, 2025

Zheng Wang, Anna Cai, Xinfeng Xie, Zaifeng Pan, Yue Guan, Weiwei Chu, Jie Wang, Shikai Li, Jianyu Huang, Chris Cai, et al. Wlb-llm: Workload-balanced 4d parallelism for large language model training.arXiv preprint arXiv:2503.17924, 2025

work page arXiv 2025

[45] [45]

Flexsp: Accelerating large language model training via flexible sequence parallelism

Yujie Wang, Shiju Wang, Shenhan Zhu, Fangcheng Fu, Xinyi Liu, Xuefeng Xiao, Huixia Li, Jiashi Li, Faming Wu, and Bin Cui. Flexsp: Accelerating large language model training via flexible sequence parallelism. InProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pages 421–...

work page 2025

[46] [46]

Kosec, S

Mario Michael Krell, Matej Kosec, Sergio P Perez, and Andrew Fitzgibbon. Efficient sequence packing without cross-contamination: Accelerating large language models without impacting performance.arXiv preprint arXiv:2107.02027, 2021

work page arXiv 2021

[47] [47]

Magiattention: A distributed attention towards linear scalability for ultra-long context, heterogeneous mask training

Tao Zewei and Huang Yunpeng. Magiattention: A distributed attention towards linear scalability for ultra-long context, heterogeneous mask training. https://github.com/SandAI-org/ MagiAttention/, 2025

work page 2025

[48] [48]

LongroPE: Extending LLM context window beyond 2 million tokens

Yiran Ding, Li Lyna Zhang, Chengruidong Zhang, Yuanyuan Xu, Ning Shang, Jiahang Xu, Fan Yang, and Mao Yang. LongroPE: Extending LLM context window beyond 2 million tokens. In Forty-first International Conference on Machine Learning, 2024

work page 2024

[49] [49]

A length-extrapolatable transformer

Yutao Sun, Li Dong, Barun Patra, Shuming Ma, Shaohan Huang, Alon Benhaim, Vishrav Chaudhary, Xia Song, and Furu Wei. A length-extrapolatable transformer. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors,Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14590–14604, Toronto...

work page 2023

[50] [50]

Extending Context Window of Large Language Models via Positional Interpolation

Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation.arXiv preprint arXiv:2306.15595, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[51] [51]

Training-free long-context scaling of large language models

Chenxin An, Fei Huang, Jun Zhang, Shansan Gong, Xipeng Qiu, Chang Zhou, and Lingpeng Kong. Training-free long-context scaling of large language models. InForty-first International Conference on Machine Learning, 2024

work page 2024

[52] [53]

Why does the effective context length of LLMs fall short? InThe Thirteenth International Conference on Learning Representations, 2025

Chenxin An, Jun Zhang, Ming Zhong, Lei Li, Shansan Gong, Yao Luo, Jingjing Xu, and Lingpeng Kong. Why does the effective context length of LLMs fall short? InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025

[53] [54]

KIVI: A tuning-free asymmetric 2bit quantization for KV cache

Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. KIVI: A tuning-free asymmetric 2bit quantization for KV cache. In Forty-first International Conference on Machine Learning, 2024

work page 2024

[54] [55]

Kvquant: Towards 10 million context length llm inference with kv cache quantization.Advances in Neural Information Processing Systems, 37:1270–1303, 2024

Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W Mahoney, Sophia Shao, Kurt Keutzer, and Amir Gholami. Kvquant: Towards 10 million context length llm inference with kv cache quantization.Advances in Neural Information Processing Systems, 37:1270–1303, 2024

work page 2024

[55] [56]

You only cache once: Decoder-decoder architectures for language models.Advances in Neural Information Processing Systems, 37:7339–7361, 2024

Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, and Furu Wei. You only cache once: Decoder-decoder architectures for language models.Advances in Neural Information Processing Systems, 37:7339–7361, 2024

work page 2024

[56] [57]

Goldfinch: High performance rwkv/transformer hybrid with linear pre-fill and extreme kv-cache compression

Daniel Goldstein, Fares Obeid, Eric Alcaide, Guangyu Song, and Eugene Cheah. Goldfinch: High performance rwkv/transformer hybrid with linear pre-fill and extreme kv-cache compres- sion.arXiv preprint arXiv:2407.12077, 2024. 13

work page arXiv 2024

[57] [58]

GQA: Training generalized multi-query transformer models from multi-head checkpoints

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4895–4901, Singapor...

work page 2023

[58] [59]

DHA: Learning decoupled-head attention from transformer checkpoints via adaptive heads fusion

Yilong Chen, Linhao Zhang, Junyuan Shang, Zhenyu Zhang, Tingwen Liu, Shuohuan Wang, and Yu Sun. DHA: Learning decoupled-head attention from transformer checkpoints via adaptive heads fusion. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

work page 2024

[59] [60]

LLM maybe longLM: Selfextend LLM context window without tuning

Hongye Jin, Xiaotian Han, Jingfeng Yang, Zhimeng Jiang, Zirui Liu, Chia-Yuan Chang, Huiyuan Chen, and Xia Hu. LLM maybe longLM: Selfextend LLM context window without tuning. In Forty-first International Conference on Machine Learning, 2024

work page 2024

[60] [61]

{InfiniGen}: Efficient generative inference of large language models with dynamic {KV} cache management

Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. {InfiniGen}: Efficient generative inference of large language models with dynamic {KV} cache management. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 155–172, 2024

work page 2024

[61] [62]

Longformer: The Long-Document Transformer

Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2004

[62] [63]

Big bird: Transformers for longer sequences.Advances in neural information processing systems, 33:17283–17297, 2020

Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santi- ago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big bird: Transformers for longer sequences.Advances in neural information processing systems, 33:17283–17297, 2020

work page 2020

[63] [64]

Reformer: The Efficient Transformer

Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001

[64] [65]

Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang, and Lili Qiu

Yucheng Li, Huiqiang Jiang, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Amir H. Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang, and Lili Qiu. MMInference: Accelerating pre-filling for long-context visual language models via modality-aware permutation sparse attention. InForty-second International Conference on Machine Learning, 2025

work page 2025

[65] [66]

Spargeattn: Accurate sparse attention accelerating any model inference.arXiv preprint arXiv:2502.18137, 2025

Jintao Zhang, Chendong Xiang, Haofeng Huang, Jia Wei, Haocheng Xi, Jun Zhu, and Jianfei Chen. Spargeattn: Accurate sparse attention accelerating any model inference.arXiv preprint arXiv:2502.18137, 2025

work page arXiv 2025

[66] [67]

MagicPIG: LSH sampling for efficient LLM generation

Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye, Yang Zhou, Jianyu Zhang, Niklas Nolte, Yuandong Tian, Matthijs Douze, Leon Bottou, Zhihao Jia, and Beidi Chen. MagicPIG: LSH sampling for efficient LLM generation. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025

[67] [68]

Adam: A Method for Stochastic Optimization

Diederik P Kingma. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014. 14 A Scalability of MTraining 0 10 20 30 40 50 60 Training Iteration 1 2 4 6Training Loss Dense MTraining Figure 10: The training loss comparison of dense attention and MTrainig during continued pretraining of Llama-3.1-8B-Instruct on the ProLong dataset wi...

work page internal anchor Pith review Pith/arXiv arXiv 2014

[68] [69]

XAttention [28]. XAttention score square blocks by summing every certain stride along their antidiagonals and retains only the high-score blocks, giving a plug-and-play, training-free block- sparse attention that accelerates prefill while matching dense accuracy. In our experiments, we use the following settings with granularity being 128 as the block siz...

work page