pith. machine review for the scientific record.

arxiv: 2605.11744 · v1 · submitted 2026-05-12 · 💻 cs.CL · cs.LG

Recognition: 2 theorem links · Lean Theorem

Training-Inference Consistent Segmented Execution for Long-Context LLMs

Jiang Li, Qianyi Cai, Xiangdong Su, Xianpeng Shang, Zehua Duo

Pith reviewed 2026-05-13 06:43 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords long-context LLMs · segmented execution · training-inference consistency · KV cache · gradient propagation · memory efficiency · long-range dependencies

The pith

Making training and inference use identical segment-level execution lets long-context LLMs match full attention performance with much lower memory costs at scale.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that the common practice of training under full-context attention while running inference in bounded segments creates a harmful mismatch in how states transition between segments. To fix this, the authors enforce the same segment-level forward semantics during training by allowing each attention head to access past KV states but restricting gradient flow to only the KV states carried over from the single immediately preceding segment. This produces benchmark results on par with full attention while delivering competitive speed and memory use against other efficient baselines and sharply better scaling at extreme lengths. A sympathetic reader would care because the approach removes the need to choose between capability and feasibility when extending context windows.

Core claim

The central contribution is a training-inference consistent segment-level generation framework. Training restricts gradient propagation to KV states carried over from the immediately preceding segment but permits head-specific access to past KV states during the forward pass without involving them in gradients. This consistency yields performance on par with full-context attention across long-context benchmarks and improves scalability, including roughly 6x lower peak prefill memory at 128K compared to full attention with FlashAttention.
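To give a sense of scale for that memory claim, here is a rough back-of-envelope written for this review rather than taken from the paper: it estimates only the fp16 KV cache that full-context prefill must keep resident at 128K on a 7B-class model. The layer, head, and head-dimension counts are assumptions typical of LLaMA2-7B, and the paper's 6x figure refers to total peak memory, whose exact breakdown is not reproduced here.

```python
# Rough, assumption-laden estimate of the fp16 KV cache alone at a 128K
# context for a 7B-class model (assumed: 32 layers, 32 KV heads, head_dim 128).
layers, kv_heads, head_dim, bytes_fp16 = 32, 32, 128, 2
seq_len = 128 * 1024
kv_cache_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_fp16  # K and V
print(f"{kv_cache_bytes / 2**30:.0f} GiB")  # ~64 GiB, before weights and activations
```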

What carries the argument

The segment-level forward execution in which gradient flow is limited to KV states from the immediately preceding segment only, while full historical KV access remains available in the forward pass.
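To make that rule concrete, here is a minimal PyTorch sketch written for this review, not taken from the paper: a toy single-head attention is run segment by segment, the KV carried from the immediately preceding segment stays on the autograd graph, and older KV is detached before entering a forward-only pool. Causal masking, positional encodings, and the head- and layer-sparse retrieval shown in Figure 3 are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def segmented_forward(x, w_q, w_k, w_v, seg_len):
    """Toy single-head attention run segment by segment.

    Each segment attends to (a) its own tokens, (b) the KV carried from the
    immediately preceding segment (left on the autograd graph), and (c) a pool
    of older KV states that is readable in the forward pass but detached, so
    no gradients cross more than one segment boundary. Illustrative only.
    """
    outputs = []
    carry_k = carry_v = None      # differentiable KV from segment i-1
    pool_k = pool_v = None        # older KV: forward access only (detached)

    for start in range(0, x.size(1), seg_len):
        seg = x[:, start:start + seg_len]                 # (batch, seg, dim)
        q, k, v = seg @ w_q, seg @ w_k, seg @ w_v

        keys, values = [k], [v]
        if carry_k is not None:                           # gradients may flow here
            keys.insert(0, carry_k); values.insert(0, carry_v)
        if pool_k is not None:                            # no gradients flow here
            keys.insert(0, pool_k); values.insert(0, pool_v)
        K, V = torch.cat(keys, dim=1), torch.cat(values, dim=1)

        attn = F.softmax(q @ K.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)
        outputs.append(attn @ V)

        # Retire the previous carry into the detached pool, then carry this
        # segment's KV forward still attached to the graph.
        if carry_k is not None:
            old_k, old_v = carry_k.detach(), carry_v.detach()
            pool_k = old_k if pool_k is None else torch.cat([pool_k, old_k], dim=1)
            pool_v = old_v if pool_v is None else torch.cat([pool_v, old_v], dim=1)
        carry_k, carry_v = k, v

    return torch.cat(outputs, dim=1)

# Toy usage: gradients reach the projections only through the current segment
# and the single carried segment, matching the one-segment rule.
dim, seg_len = 64, 16
x = torch.randn(2, 64, dim)
w_q, w_k, w_v = (torch.randn(dim, dim, requires_grad=True) for _ in range(3))
segmented_forward(x, w_q, w_k, w_v, seg_len).sum().backward()
```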

If this is right

  • Accuracy on long-context benchmarks stays comparable to full-context attention training.
  • Peak prefill memory drops substantially at very long lengths, reaching approximately 6x reduction at 128K.
  • Latency-memory trade-offs stay competitive with other strong inference-efficient methods.
  • Practical scaling to contexts well beyond current limits becomes feasible under fixed hardware budgets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same consistency rule could be tested with a gradient window of two preceding segments to check whether very-long-range tasks improve further without prohibitive extra cost (a rough sketch of such a configurable window follows this list).
  • The principle of aligning train and inference state transitions might extend directly to other bounded-context techniques such as sliding-window attention or ring attention.
  • Once trained under fixed segment size, the resulting model could support variable segment lengths at inference time to match different hardware memory constraints.
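To make that first extension concrete, here is a hypothetical helper, illustrative only and not from the paper, that generalizes the carry rule to a configurable truncation depth K in the TBPTT-with-depth-K framing of Figure 2: the KV of the last K segments stays differentiable and anything older is detached before joining the forward-only pool.

```python
from collections import deque
import torch

def update_carry_chain(chain: deque, k_new: torch.Tensor, v_new: torch.Tensor,
                       depth_k: int = 1):
    """Hypothetical generalization of the one-segment gradient rule.

    `chain` holds (K, V) pairs for the most recent segments, oldest first.
    The newest segment's KV is appended still attached to the autograd graph;
    segments that fall outside the `depth_k` window are detached and returned
    so the caller can append them to the forward-only KV pool.
    """
    chain.append((k_new, v_new))
    expelled = []
    while len(chain) > depth_k:
        k_old, v_old = chain.popleft()
        expelled.append((k_old.detach(), v_old.detach()))
    return expelled

# depth_k=1 reproduces the rule described in the review; depth_k=2 is the
# two-segment ablation suggested above.
```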

Load-bearing premise

Restricting gradient updates to KV states from only the immediately previous segment remains sufficient for the model to learn dependencies that span many segments.

What would settle it

Train the same base model with this segmented gradient rule versus full attention on a long-context task whose correct answers require information crossing more than one segment boundary, then measure whether accuracy falls by a large margin.
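One way such a probe could be constructed, sketched here as a hypothetical passkey-style task rather than anything the paper specifies: the answer-bearing fact is placed several segment lengths before the question, so a correct completion requires information that crossed more than one segment boundary.

```python
import random

def make_cross_segment_probe(seg_len: int = 4096, gap_segments: int = 3,
                             seed: int = 0):
    """Hypothetical cross-segment dependency probe (passkey style).

    The fact sits roughly `gap_segments` segment lengths (counted in
    whitespace tokens) before the question, so answering requires information
    that crossed more than one segment boundary during prefill.
    """
    rng = random.Random(seed)
    passkey = str(rng.randint(10000, 99999))
    fact = f"The pass key is {passkey}. Remember it."
    question = "What is the pass key? The pass key is"
    filler = " ".join(["the"] * (seg_len * gap_segments))
    return f"{fact} {filler} {question}", passkey

# Accuracy = fraction of probes whose completion contains the passkey,
# compared between the segmented-gradient model and a full-attention baseline
# trained on the same data.
prompt, answer = make_cross_segment_probe(seg_len=512, gap_segments=2)
```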

Figures

Figures reproduced from arXiv: 2605.11744 by Jiang Li, Qianyi Cai, Xiangdong Su, Xianpeng Shang, Zehua Duo.

Figure 1
Figure 1: Peak GPU memory consumption during long-context prefill. view at source ↗
Figure 2
Figure 2: Training–inference consistent segmented execution. A sequence is processed segment by segment with two cross-segment inputs: a carried KV tail C_{i−1} (the only differentiable state that propagates across segments) and an optional retrieved prefix R_{i−1} read from a past-only KV pool. During training, TBPTT with depth K truncates credit assignment along the state chain (red cross), so gradients flow through C f… view at source ↗
Figure 3
Figure 3: Head- and layer-sparse long-range retrieval. (a) In a non-long-range layer ℓ ∉ L_long, local heads attend to within-segment tokens and the carried KV state from the previous segment (green), while long-range heads use within-segment causal attention only (orange). (b) In a long-range-enabled layer ℓ ∈ L_long, local heads remain unchanged, while long-range heads additionally attend to a retrieved prefix from… view at source ↗
Figure 4
Figure 4: Perplexity on PG19 test under varying evaluation context lengths. Results are reported for LLaMA2-32K and LLaMA2-80K. view at source ↗
Figure 5
Figure 5: Prefill latency and peak GPU memory under increasing evaluation context lengths for LLaMA2-7B-32K. (a) Prefill time measures the forward-pass latency (seconds) to process the entire input prompt. (b) Peak memory (GB) reports the maximum GPU allocated memory during the prefill stage. Methods compared: FlashAttention, DuoAttention, StreamingLLM, MInference, Ours. view at source ↗
Figure 6
Figure 6: Prefill latency versus peak GPU memory at 64K context length for LLaMA2-7B-32K. view at source ↗
read the original abstract

Transformer-based large language models face severe scalability challenges in long-context generation due to the computational and memory costs of full-context attention. Under practical computation and memory constraints, many inference-efficient long-context methods improve efficiency by adopting bounded-context or segment-level execution only during inference, while continuing to train models under full-context attention, resulting in a mismatch between training and inference execution and state-transition semantics. Based on this insight, we propose a training-inference consistent segment-level generation framework, in which training and inference follow the same segment-level forward execution semantics. During training, consistency with inference is enforced by restricting gradient propagation to KV states carried over from the immediately preceding segment, while permitting head-specific access to past KV states during the forward pass without involving them in gradient propagation. Across long-context benchmarks, our approach achieves performance comparable to full-context attention, while achieving competitive latency-memory trade-offs against strong inference-efficient baselines, and substantially improving scalability at very long context lengths (e.g., approximately 6x lower peak prefill memory at 128K compared to full-context attention with FlashAttention).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a training-inference consistent segmented execution framework for long-context LLMs. Training and inference both use segment-level forward execution; consistency is enforced by restricting gradient propagation during training exclusively to KV states carried over from the immediately preceding segment, while still permitting head-specific access to past KV states in the forward pass. The central empirical claim is that this yields performance comparable to full-context attention on long-context benchmarks, competitive latency-memory trade-offs versus strong inference-efficient baselines, and approximately 6x lower peak prefill memory at 128K context lengths relative to full-context attention with FlashAttention.

Significance. If the gradient-restriction mechanism preserves the capacity to learn multi-segment dependencies, the work would resolve a fundamental train-inference mismatch in long-context modeling and deliver a practical route to scalable segment-level execution without performance degradation. The reported memory reduction at 128K would constitute a concrete engineering advance if the experimental controls are shown to be sound.

major comments (2)
  1. [§3 (Training Procedure)] The training procedure restricts gradients to KV states from only the immediately preceding segment (one-segment truncated back-propagation). This assumption is load-bearing for the claim of comparability to full-context attention, yet the manuscript provides no ablation that isolates the effect of this truncation window, no analysis of segment length or number of segments used in training, and no demonstration that dependencies spanning multiple segment boundaries remain learnable.
  2. [§4 (Experiments)] The experimental claims (comparable benchmark performance and 6x peak prefill memory reduction at 128K) are presented without reported variance, exact baseline implementations, segment counts during training, or controls that separate the gradient restriction from other implementation choices. These omissions make it impossible to verify that the observed results are attributable to the proposed consistency mechanism rather than other factors.
minor comments (2)
  1. [§3] Clarify in the method description whether the head-specific access to past KV states during the forward pass uses the same attention masking as standard full-context attention or introduces additional implementation details that affect semantics.
  2. [§3] Add explicit definitions or pseudocode for the segment boundary handling and KV carry-over mechanism to make the forward/backward pass semantics reproducible from the text alone.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for clarification and strengthening, particularly around the training procedure and experimental rigor. We address each major comment below and will revise the manuscript accordingly to improve transparency and address the concerns.

read point-by-point responses
  1. Referee: [§3 (Training Procedure)] The training procedure restricts gradients to KV states from only the immediately preceding segment (one-segment truncated back-propagation). This assumption is load-bearing for the claim of comparability to full-context attention, yet the manuscript provides no ablation that isolates the effect of this truncation window, no analysis of segment length or number of segments used in training, and no demonstration that dependencies spanning multiple segment boundaries remain learnable.

    Authors: We acknowledge that the current manuscript lacks explicit ablations isolating the truncation window and detailed analysis of segment configurations. The one-segment gradient restriction is chosen to enforce exact training-inference consistency with the segment-level forward pass used at inference time. In the revised version, we will add an ablation comparing one-segment versus two-segment gradient propagation on a subset of long-context tasks to quantify the impact. We will also specify the training segment length (4K tokens) and number of segments (e.g., 32 for 128K contexts) and include a brief analysis of performance sensitivity to segment size. For multi-segment dependency learnability, the forward pass permits head-specific access to all prior KV states while gradients flow only through the immediate predecessor; comparable benchmark performance to full-context models on tasks with cross-segment dependencies (e.g., LongBench) serves as evidence that such dependencies are captured via the carried KV states. We will add a short qualitative discussion or synthetic example illustrating information propagation across boundaries. revision: partial

  2. Referee: [§4 (Experiments)] The experimental claims (comparable benchmark performance and 6x peak prefill memory reduction at 128K) are presented without reported variance, exact baseline implementations, segment counts during training, or controls that separate the gradient restriction from other implementation choices. These omissions make it impossible to verify that the observed results are attributable to the proposed consistency mechanism rather than other factors.

    Authors: We agree that additional experimental details and controls are needed for verifiability. In the revision, we will report standard deviations over 3-5 runs for all benchmark scores, provide exact baseline implementation details (including FlashAttention version and hyperparameters) in an expanded appendix, and explicitly state the training segment counts and lengths. To isolate the gradient restriction, we will add a control comparing our consistent segmented training against a mismatched setup (full-context training followed by segmented inference) on a representative long-context task. Due to compute limits, the control will be performed at a reduced scale (32K contexts) but will still demonstrate the consistency benefit. These additions will make the attribution to the proposed mechanism clearer. revision: partial

Circularity Check

0 steps flagged

No circularity: method defined directly and validated empirically

full rationale

The paper defines a training-inference consistent segmented execution framework by explicitly restricting gradient flow to KV states from the immediately preceding segment while permitting forward-pass access to earlier states. This is a direct procedural definition rather than a derivation that reduces to its own inputs by construction. No equations are presented that rename fitted parameters as predictions, import uniqueness theorems via self-citation, or smuggle ansatzes. Performance comparability to full-context attention is asserted via benchmark results, not by algebraic equivalence to the training restriction itself. The central assumption about preserving long-range dependency learning is therefore an empirical claim open to falsification on the reported benchmarks, not a self-referential tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that standard transformer attention can be segmented without loss of expressivity when gradients are blocked across segment boundaries except for carried KV states; no new free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption Transformer attention can be executed segment-wise while still allowing head-specific access to past KV states in the forward pass.
    Invoked when describing the forward pass that remains consistent with inference.

pith-pipeline@v0.9.0 · 5493 in / 1307 out tokens · 59599 ms · 2026-05-13T06:43:58.229431+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · 4 internal anchors

  1. [3]

    The Claude 3 Model Family: Opus, Sonnet, Haiku

    Anthropic. The Claude 3 model family: Opus, Sonnet, Haiku. Technical report, 2024

  2. [4]

    Longbench: A bilingual, multitask benchmark for long context understanding

    Bai, Y., Lv, X., Zhang, J., et al. Longbench: A bilingual, multitask benchmark for long context understanding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 3119--3137, 2024

  3. [5]

    Longbench v2: Towards deeper understanding and reasoning on realistic long-context multitasks

    Bai, Y., Tu, S., Zhang, J., et al. Longbench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 3639--3664, 2025

  4. [6]

    Attention is all you need

    Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017

  5. [7]

    On the computational complexity of self-attention

    Keles, F. D., Wijewardena, P. M., and Hegde, C. On the computational complexity of self-attention. In International Conference on Algorithmic Learning Theory, pp. 597--619. PMLR, 2023

  6. [8]

    Efficient streaming language models with attention sinks

    Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. Efficient streaming language models with attention sinks. In The Twelfth International Conference on Learning Representations, ICLR 2024

  7. [10]

    Flashattention-2: Faster attention with better parallelism and work partitioning

    Dao, T. Flashattention-2: Faster attention with better parallelism and work partitioning. In The Twelfth International Conference on Learning Representations, ICLR 2024

  8. [13]

    Retrieval head mechanistically explains long-context factuality

    Wu, W., Wang, Y., Xiao, G., Peng, H., and Fu, Y. Retrieval head mechanistically explains long-context factuality. In The Thirteenth International Conference on Learning Representations, ICLR 2025

  9. [14]

    Recurrent memory transformer

    Bulatov, A., Kuratov, Y., and Burtsev, M. Recurrent memory transformer. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022

  10. [15]

    Flashattention: Fast and memory-efficient exact attention with IO-awareness

    Dao, T., Fu, D. Y., Ermon, S., Rudra, A., and Ré, C. Flashattention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022

  11. [17]

    Flexgen: High-throughput generative inference of large language models with a single GPU

    Sheng, Y., Zheng, L., Yuan, B., Li, Z., Ryabinin, M., Chen, B., Liang, P., Ré, C., Stoica, I., and Zhang, C. Flexgen: High-throughput generative inference of large language models with a single GPU. In International Conference on Machine Learning, ICML 2023

  12. [19]

    MInference 1.0: Accelerating pre-filling for long-context LLMs via dynamic sparse attention

    Jiang, H., Li, Y., Zhang, C., Wu, Q., Luo, X., Ahn, S., Han, Z., Abdi, A. H., Li, D., Lin, C., Yang, Y., and Qiu, L. MInference 1.0: Accelerating pre-filling for long-context LLMs via dynamic sparse attention. In Advances in Neural Information Processing Systems, 2024

  13. [21]

    Core context aware transformers for long context language modeling

    Chen, Y., You, Z., Zhang, S., Li, H., Li, Y., Wang, Y., and Tan, M. Core context aware transformers for long context language modeling. In Forty-second International Conference on Machine Learning, ICML 2025

  14. [22]

    Shiftable context: Addressing training-inference context mismatch in simultaneous speech translation

    Raffel, M., Penney, D., and Chen, L. Shiftable context: Addressing training-inference context mismatch in simultaneous speech translation. In International Conference on Machine Learning, ICML 2023

  15. [25]

    Compressive transformers for long-range sequence modelling

    Rae, J. W., Potapenko, A., Jayakumar, S. M., Hillier, C., and Lillicrap, T. P. Compressive transformers for long-range sequence modelling. In 8th International Conference on Learning Representations, ICLR 2020

  16. [26]

    Memorizing transformers

    Wu, Y., Rabe, M. N., Hutchins, D., and Szegedy, C. Memorizing transformers. In The Tenth International Conference on Learning Representations, ICLR 2022

  17. [27]

    Duoattention: Efficient long-context LLM inference with retrieval and streaming heads

    Xiao, G., Tang, J., Zuo, J., Guo, J., Yang, S., Tang, H., Fu, Y., and Han, S. Duoattention: Efficient long-context LLM inference with retrieval and streaming heads. In The Thirteenth International Conference on Learning Representations, ICLR 2025

  18. [28]

    RULER: What's the real context size of your long-context language models?

    Hsieh, C.-P., Sun, S., Kriman, S., Acharya, S., Rekesh, D., Jia, F., and Ginsburg, B. RULER: What's the real context size of your long-context language models? In First Conference on Language Modeling, 2024

  19. [29]

    Data engineering for scaling language models to 128k context

    Fu, Y., Panda, R., Niu, X., Yue, X., Hajishirzi, H., Kim, Y., and Peng, H. Data engineering for scaling language models to 128k context. In Forty-first International Conference on Machine Learning, ICML 2024

  20. [31]

    Native sparse attention: Hardware-aligned and natively trainable sparse attention

    Yuan, J., Gao, H., Dai, D., Luo, J., Zhao, L., Zhang, Z., Xie, Z., Wei, Y., Wang, L., Xiao, Z., Wang, Y., Ruan, C., Zhang, M., Liang, W., and Zeng, W. Native sparse attention: Hardware-aligned and natively trainable sparse attention. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, 2025

  21. [33]

    GPT-4 Technical Report

    Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  22. [34]

SARATHI: Efficient LLM inference by piggybacking decodes with chunked prefills

    Agrawal, A., Panwar, A., Mohan, J., Kwatra, N., Gulavani, B. S., and Ramjee, R. SARATHI: efficient LLM inference by piggybacking decodes with chunked prefills. CoRR, abs/2308.16369, 2023. doi:10.48550/ARXIV.2308.16369. URL https://doi.org/10.48550/arXiv.2308.16369

  23. [35]

    The claude 3 model family: Opus, sonnet, haiku, 2024

    Anthropic . The claude 3 model family: Opus, sonnet, haiku, 2024. Technical report

  24. [36]

    Longbench: A bilingual, multitask benchmark for long context understanding

    Bai, Y., Lv, X., Zhang, J., Lyu, H., Tang, J., Huang, Z., Du, Z., Liu, X., Zeng, A., Hou, L., et al. Longbench: A bilingual, multitask benchmark for long context understanding. In Proceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers), pp.\ 3119--3137, 2024

  25. [37]

    Longbench v2: Towards deeper understanding and reasoning on realistic long-context multitasks

    Bai, Y., Tu, S., Zhang, J., Peng, H., Wang, X., Lv, X., Cao, S., Xu, J., Hou, L., Dong, Y., et al. Longbench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 3639--3664, 2025

  26. [38]

    Longformer: The Long-Document Transformer

    Beltagy, I., Peters, M. E., and Cohan, A. Longformer: The long-document transformer. CoRR, abs/2004.05150, 2020. URL https://arxiv.org/abs/2004.05150

  27. [39]

    Recurrent memory transformer

    Bulatov, A., Kuratov, Y., and Burtsev, M. Recurrent memory transformer. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022

  28. [40]

    Core context aware transformers for long context language modeling

    Chen, Y., You, Z., Zhang, S., Li, H., Li, Y., Wang, Y., and Tan, M. Core context aware transformers for long context language modeling. In Forty-second International Conference on Machine Learning, ICML 2025, Vancouver, BC, Canada, July 13-19, 2025 . OpenReview.net, 2025. URL https://openreview.net/forum?id=MAHPZNduS4

  29. [41]

Transformer-xl: Attentive language models beyond a fixed-length context

    Dai, Z., Yang, Z., Yang, Y., Carbonell, J. G., Le, Q. V., and Salakhutdinov, R. Transformer-xl: Attentive language models beyond a fixed-length context. In Korhonen, A., Traum, D. R., and Màrquez, L. (eds.), Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume ... doi: 10.18653/v1/P19-1285

  30. [42]

    Flashattention-2: Faster attention with better parallelism and work partitioning

    Dao, T. Flashattention-2: Faster attention with better parallelism and work partitioning. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024 . OpenReview.net, 2024. URL https://openreview.net/forum?id=mZn2Xyh9Ec

  31. [43]

Flashattention: Fast and memory-efficient exact attention with io-awareness

    Dao, T., Fu, D. Y., Ermon, S., Rudra, A., and Ré, C. Flashattention: Fast and memory-efficient exact attention with io-awareness. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Or...

  32. [44]

    Data engineering for scaling language models to 128k context

    Fu, Y., Panda, R., Niu, X., Yue, X., Hajishirzi, H., Kim, Y., and Peng, H. Data engineering for scaling language models to 128k context. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024 . OpenReview.net, 2024. URL https://openreview.net/forum?id=TaAqeo7lUh

  33. [45]

    Sliding window attention training for efficient large language models

    Fu, Z., Song, W., Wang, Y., Wu, X., Zheng, Y., Zhang, Y., Xu, D., Wei, X., Xu, T., and Zhao, X. Sliding window attention training for efficient large language models. CoRR, abs/2502.18845, 2025. doi:10.48550/ARXIV.2502.18845. URL https://doi.org/10.48550/arXiv.2502.18845

  34. [46]

    Lm-infinite: Zero-shot extreme length generalization for large language models

Han, C., Wang, Q., Peng, H., Xiong, W., Chen, Y., Ji, H., and Wang, S. Lm-infinite: Zero-shot extreme length generalization for large language models. In Duh, K., Gómez-Adorno, H., and Bethard, S. (eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (V...

  35. [47]

    What matters in transformers? not all attention is needed

    He, S., Sun, G., Shen, Z., and Li, A. What matters in transformers? not all attention is needed. CoRR, abs/2406.15786, 2024. doi:10.48550/ARXIV.2406.15786. URL https://doi.org/10.48550/arXiv.2406.15786

  36. [48]

RULER: What's the Real Context Size of Your Long-Context Language Models? In First Conference on Language Modeling, 2024

    Hsieh, C.-P., Sun, S., Kriman, S., Acharya, S., Rekesh, D., Jia, F., and Ginsburg, B. RULER: What's the Real Context Size of Your Long-Context Language Models? In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=kIoBbc76Sy

  37. [49]

Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention

    Jiang, H., Li, Y., Zhang, C., Wu, Q., Luo, X., Ahn, S., Han, Z., Abdi, A. H., Li, D., Lin, C., Yang, Y., and Qiu, L. Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention. In Globersons, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J. M., and Zhang, C. (eds.), Advances in Neural Information Processing S...

  38. [50]

On the computational complexity of self-attention

    Keles, F. D., Wijewardena, P. M., and Hegde, C. On the computational complexity of self-attention. In International conference on algorithmic learning theory, pp.\ 597--619. PMLR, 2023

  39. [51]

    Efficient memory management for large language model serving with pagedattention

    Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with pagedattention. In Flinn, J., Seltzer, M. I., Druschel, P., Kaufmann, A., and Mace, J. (eds.), Proceedings of the 29th Symposium on Operating Systems Principles, SOSP 2023, Koblenz, German...

  40. [52]

    Chunkkv: Semantic-preserving KV cache compression for efficient long-context LLM inference

    Liu, X., Tang, Z., Dong, P., Li, Z., Li, B., Hu, X., and Chu, X. Chunkkv: Semantic-preserving KV cache compression for efficient long-context LLM inference. CoRR, abs/2502.00299, 2025. doi:10.48550/ARXIV.2502.00299. URL https://doi.org/10.48550/arXiv.2502.00299

  41. [53]

Compressive transformers for long-range sequence modelling

    Rae, J. W., Potapenko, A., Jayakumar, S. M., Hillier, C., and Lillicrap, T. P. Compressive transformers for long-range sequence modelling. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020 . OpenReview.net, 2020. URL https://openreview.net/forum?id=SylKikSYDH

  42. [54]

    Shiftable context: Addressing training-inference context mismatch in simultaneous speech translation

    Raffel, M., Penney, D., and Chen, L. Shiftable context: Addressing training-inference context mismatch in simultaneous speech translation. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA , volume 202 of Proceedings of...

  43. [55]

    Flexgen: High-throughput generative inference of large language models with a single GPU

Sheng, Y., Zheng, L., Yuan, B., Li, Z., Ryabinin, M., Chen, B., Liang, P., Ré, C., Stoica, I., and Zhang, C. Flexgen: High-throughput generative inference of large language models with a single GPU. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), International Conference on Machine Learning, ICML 2023, 23-...

  44. [56]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Team, G., Georgiev, P., Lei, V. I., Burnell, R., Bai, L., Gulati, A., Tanzer, G., Vincent, D., Pan, Z., Wang, S., et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024

  45. [57]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  46. [58]

Attention is all you need

    Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in neural information processing systems, 30, 2017

  47. [59]

    Retrieval head mechanistically explains long-context factuality

    Wu, W., Wang, Y., Xiao, G., Peng, H., and Fu, Y. Retrieval head mechanistically explains long-context factuality. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025 . OpenReview.net, 2025. URL https://openreview.net/forum?id=EytBpUGB1Z

  48. [60]

Memorizing transformers

    Wu, Y., Rabe, M. N., Hutchins, D., and Szegedy, C. Memorizing transformers. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022 . OpenReview.net, 2022. URL https://openreview.net/forum?id=TrjbxzRcnf-

  49. [61]

    Efficient streaming language models with attention sinks

    Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. Efficient streaming language models with attention sinks. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024 . OpenReview.net, 2024. URL https://openreview.net/forum?id=NG7sS51zVF

  50. [62]

    Duoattention: Efficient long-context LLM inference with retrieval and streaming heads

    Xiao, G., Tang, J., Zuo, J., Guo, J., Yang, S., Tang, H., Fu, Y., and Han, S. Duoattention: Efficient long-context LLM inference with retrieval and streaming heads. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025 . OpenReview.net, 2025. URL https://openreview.net/forum?id=cFu7ze7xUm

  51. [63]

    Native sparse attention: Hardware-aligned and natively trainable sparse attention

    Yuan, J., Gao, H., Dai, D., Luo, J., Zhao, L., Zhang, Z., Xie, Z., Wei, Y., Wang, L., Xiao, Z., Wang, Y., Ruan, C., Zhang, M., Liang, W., and Zeng, W. Native sparse attention: Hardware-aligned and natively trainable sparse attention. In Che, W., Nabende, J., Shutova, E., and Pilehvar, M. T. (eds.), Proceedings of the 63rd Annual Meeting of the Association...

  52. [64]

    Infllm-v2: Dense-sparse switchable attention for seamless short-to-long adaptation

    Zhao, W., Zhou, Z., Su, Z., Xiao, C., Li, Y., Li, Y., Zhang, Y., Zhao, W., Li, Z., Huang, Y., Sun, A., Han, X., and Liu, Z. Infllm-v2: Dense-sparse switchable attention for seamless short-to-long adaptation. CoRR, abs/2509.24663, 2025. doi:10.48550/ARXIV.2509.24663. URL https://doi.org/10.48550/arXiv.2509.24663