Accelerating Prefilling via Decoding-time Contribution Sparsity

Chengruidong Zhang; Huiqiang Jiang; Lili Qiu; Yike Zhang; Yuqing Yang; Zhiyuan He

arxiv: 2507.21526 · v4 · submitted 2025-07-29 · 💻 cs.CL

Zhiyuan He, Yike Zhang, Chengruidong Zhang, Huiqiang Jiang, Yuqing Yang, Lili Qiu

Pith reviewed 2026-05-19 03:01 UTC · model grok-4.3

keywords sparse attention · prefilling acceleration · LLM inference · decoding contribution · static sparsity · long context · TriangleMix · attention pattern

The pith

Many attention blocks that score highly during prefilling contribute negligibly to later decoding, so a static Triangle attention pattern in selected layers computes attention 15.3x faster than dense attention on 128K inputs with near-lossless accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models face quadratic attention costs when prefilling long inputs, which creates a major latency bottleneck before any token is generated. The paper identifies decoding-time contribution sparsity: gradient-based analysis indicates that many attention blocks carry nontrivial attention scores during prefilling yet contribute almost nothing to subsequent decoding. TriangleMix exploits this with a training-free static pattern that keeps dense attention in a subset of layers and switches to Triangle attention in the rest. On 128K inputs, Triangle attention computes attention 15.3x faster than dense attention, while TriangleMix keeps performance nearly lossless relative to dense attention. It can also be combined with existing dynamic sparsity methods for a further 6% to 19% reduction in time-to-first-token (TTFT). The insight matters because it targets a source of waste in the prefilling stage that score-based methods do not exploit, and it does so without retraining.

Core claim

The paper shows that decoding-time contribution sparsity, revealed by gradient-based analysis, permits a training-free static attention pattern called Triangle attention. TriangleMix uses dense attention in a subset of layers and Triangle attention in the rest. On 128K inputs, Triangle attention computes attention 15.3x faster than dense attention. TriangleMix stays nearly lossless relative to dense attention and, when combined with dynamic sparsity, cuts time-to-first-token by a further 6% to 19% over dynamic sparsity alone.

What carries the argument

The argument rests on Triangle attention, a static sparse attention pattern that TriangleMix applies in selected layers. It skips the attention blocks that gradient analysis flags as contributing minimally to decoding, such as the Middle Q-K sections in deeper layers (Figure 2).
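
As a rough illustration of what such a static pattern can look like, the sketch below builds a boolean mask that keeps a few leading "sink" keys, a sliding window along the diagonal, and full rows for the last few queries, leaving out the Middle Q-K region. This page does not give the paper's exact mask definition, so the function name `triangle_mask` and the parameters `sink`, `window`, and `last_q` are assumptions made for illustration, not the authors' implementation.

```python
def triangle_mask(n, sink=1, window=2, last_q=2):
    """Return mask[i][j] == True where query i may attend to key j.

    Hypothetical layout (an assumption, not taken from the paper):
    - the first `sink` keys are visible to every query,
    - each query sees the `window` most recent keys (itself included),
    - the final `last_q` queries see every earlier key.
    """
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):            # causal: only keys at or before i
            is_sink = j < sink
            is_local = i - j < window
            is_last = i >= n - last_q
            mask[i][j] = is_sink or is_local or is_last
    return mask


def render(mask):
    """Draw kept entries as '#' and skipped entries as '.'."""
    return "\n".join("".join("#" if kept else "." for kept in row) for row in mask)


print(render(triangle_mask(8)))
```

Rendering the mask for a short sequence shows where the name plausibly comes from: the kept entries trace the left edge, the diagonal, and the bottom rows of the causal triangle, while the interior is skipped.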

If this is right

For 128K inputs, Triangle attention alone delivers 15.3x faster attention computation than dense attention.
Performance stays nearly lossless relative to dense attention on the tasks tested.
Pairing TriangleMix with existing dynamic sparsity methods produces an extra 6 to 19 percent reduction in time-to-first-token.
The method requires no training or fine-tuning and works on top of current models.
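
The gap between a 15.3x speedup inside Triangle layers and a smaller end-to-end gain can be made concrete by counting scored query-key pairs. The sketch below assumes a Triangle layout of sink keys, a sliding window, and full rows for the last queries. The layout, the parameter values, and the layer split are all hypothetical, and pair counts are only a proxy for measured kernel time, so the result is not expected to reproduce the paper's numbers.

```python
def dense_entries(n):
    """Query-key pairs scored by causal dense attention over n tokens."""
    return n * (n + 1) // 2


def triangle_entries(n, sink, window, last_q):
    """Pairs kept by the assumed Triangle layout, counted row by row."""
    total = 0
    for i in range(n):
        if i >= n - last_q:
            total += i + 1                      # last queries keep the full causal row
        else:
            sink_cnt = min(sink, i + 1)         # leading keys visible to query i
            win_cnt = min(window, i + 1)        # recent keys visible to query i
            overlap = max(0, sink_cnt - max(0, i - window + 1))
            total += sink_cnt + win_cnt - overlap
    return total


def attention_work_ratio(n, num_layers, num_triangle_layers, sink, window, last_q):
    """Dense work divided by mixed dense/Triangle work, in scored pairs per head."""
    dense = dense_entries(n)
    tri = triangle_entries(n, sink, window, last_q)
    mixed = (num_layers - num_triangle_layers) * dense + num_triangle_layers * tri
    return num_layers * dense / mixed


# Hypothetical configuration: 128K tokens, 32 layers, half of them Triangle.
ratio = attention_work_ratio(n=131072, num_layers=32, num_triangle_layers=16,
                             sink=64, window=1024, last_q=128)
print(f"dense / mixed scored pairs: {ratio:.2f}x")
```

Because the dense layers still pay the full quadratic cost, the whole-model ratio stays below `num_layers / (num_layers - num_triangle_layers)` however sparse the Triangle layers are, for example below 2 when half of the layers remain dense.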

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar gradient-driven sparsity analysis could be applied to other transformer components such as feed-forward layers to find additional static savings.
The approach may support practical scaling to context lengths beyond 128K on current hardware by lowering the prefilling cost floor.
If the sparsity pattern proves stable, inference engines could hard-code TriangleMix layers at deployment time for predictable latency gains.

Load-bearing premise

Gradient analysis performed during prefilling correctly flags attention blocks whose later contribution to decoding is negligible, and this sparsity pattern remains consistent across models, tasks, and input distributions.

What would settle it

Applying Triangle attention in the designated layers on a new long-context benchmark and measuring a clear drop in accuracy metrics such as task performance or perplexity compared with dense attention would disprove the near-lossless claim.

Figures

Figures reproduced from arXiv: 2507.21526 by Chengruidong Zhang, Huiqiang Jiang, Lili Qiu, Yike Zhang, Yuqing Yang, Zhiyuan He.

**Figure 1.** TriangleMix on Llama-3.1-8B-Instruct.

… bottleneck in the prefilling stage of LLMs (Jiang et al., 2024a; Lai et al., 2025). To address this bottleneck and accelerate the prefilling stage, researchers have proposed both static and dynamic sparse attention methods. Static sparse attention methods, such as StreamingLLM (Xiao et al., 2023), reduce computational complexity from $O(N^2)$ to $O(N)$ but suffer notable …

**Figure 2.** The average gradient Grad(M, l) of the Middle Q-K sections, measured on Llama-3.1-8B-Instruct, shows a significant decline in deeper layers. This suggests that the Middle Q-K components in deeper layers contribute minimally and might potentially be skipped to improve efficiency.

… sequence lengths, it becomes non-trivial for moderately long contexts ranging from 32K to 128K tokens. We measured the average …

**Figure 4.** First row: average attention score Att(M, l) for the Middle and Last Q-K sections. Second row: average gradient Grad(M, l) for the Middle and Last Q-K sections.

… $M_{i,j} = 0$ become effectively zero after the softmax operation. To accelerate computing, sparse attention aims to find a sparse mask matrix $M'$ to compute the attention output: $A' = \mathrm{Softmax}\left(\frac{1}{\sqrt{d}} QK^\top - c\,(1 - M')\right)$, where $|A - A'|$ is expected to be s…

**Figure 5.** Average RULER score at 64K length for different …

**Figure 6.** Time-to-first-token (TTFT) in seconds mea…
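
The sparse attention formula quoted alongside Figure 4, A′ = Softmax(QKᵀ/√d − c(1 − M′)), can be sketched in a few lines. This is a minimal pure-Python version for a single head. The function name, the list-of-lists layout, and the default value of the constant `c` are assumptions made here, and a real kernel would skip masked blocks rather than compute and then suppress them.

```python
import math


def masked_softmax_attention(q, k, mask, c=1e9):
    """Row-wise Softmax(q k^T / sqrt(d) - c * (1 - mask)).

    q, k: lists of d-dimensional vectors; mask[i][j] is 1 (keep) or 0 (drop).
    Each row is assumed to keep at least one entry.
    """
    d = len(q[0])
    probs = []
    for i, qi in enumerate(q):
        logits = [
            sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) - c * (1 - mask[i][j])
            for j, kj in enumerate(k)
        ]
        top = max(logits)                       # subtract the max for numerical stability
        exps = [math.exp(x - top) for x in logits]
        z = sum(exps)
        probs.append([e / z for e in exps])
    return probs


# Toy inputs: three tokens, two dimensions, and a hand-written sparse causal mask.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = q
mask = [[1, 0, 0],
        [1, 1, 0],
        [1, 0, 1]]
probs = masked_softmax_attention(q, k, mask)
```

Subtracting a large constant `c` from dropped positions drives their weights to zero after the softmax, which is why entries with a zero mask value stop contributing to the output.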

Original abstract

Large Language Models (LLMs) incur quadratic attention complexity with input length, creating a major time bottleneck in the prefilling stage. Existing acceleration methods largely exploit attention score sparsity by estimating blocks with high attention scores and applying dynamic sparse attention. In this work, we identify another untapped form of sparsity in the prefilling stage, namely decoding-time contribution sparsity, where many attention blocks exhibit nontrivial attention scores during prefilling yet contribute negligibly to subsequent decoding, as indicated by gradient-based analysis. Building on this observation, we propose TriangleMix, a training-free static attention pattern that uses dense attention in a subset of layers and switches to Triangle attention in the others. Extensive experiments show that TriangleMix preserves nearly lossless performance relative to dense attention while substantially reducing attention overhead in Triangle layers. For 128K inputs, Triangle attention achieves a 15.3x speedup in attention computation, significantly exceeding the acceleration of typical dynamic sparse methods (1.9x to 3.4x). Furthermore, TriangleMix can be seamlessly combined with dynamic sparsity approaches, delivering an additional 6% to 19% reduction in TTFT over using dynamic sparsity alone. Our code is released at https://aka.ms/TriangleMix.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper identifies decoding-time contribution sparsity via gradients and turns it into a static TriangleMix pattern whose Triangle attention layers compute attention 15.3x faster than dense attention on 128K inputs, with near-lossless results on the tested cases.

Hey, the core takeaway is that many attention blocks during prefilling carry decent scores yet add almost nothing to later decoding steps. The authors use gradient analysis to locate this sparsity, then build TriangleMix: dense attention in a subset of layers and a fixed triangle pattern in the others. It is training-free and stacks on top of dynamic sparse methods for extra gains. For 128K inputs they report 15.3x faster attention computation while performance stays close to dense, plus a further 6% to 19% TTFT reduction when combined with existing dynamic approaches. Code release is a plus for anyone wanting to try it.

The experiments appear thorough on the setups they ran, and the static pattern is a clean departure from purely score-based dynamic sparsity. The soft spot is whether the gradient signal during prefilling truly predicts cumulative decoding impact across new inputs, tasks, or model scales. If the negligible blocks matter more under distribution shift, the lossless claim narrows while the speedup remains. Details on exact gradient targets, layer selection, and controls for confounding factors would clarify how robust the pattern is.

This is aimed at engineers and researchers who need faster long-context inference without retraining. A reader focused on practical speedups for documents or conversations would get direct value from the method and numbers. I would send it for peer review; the idea is straightforward, the reported gains are large, and the open questions are the right kind for referees to probe.

Referee Report

3 major / 2 minor

Summary. The paper identifies a new form of sparsity called decoding-time contribution sparsity in LLM prefilling, where many attention blocks with nontrivial scores during prefilling have negligible impact on subsequent decoding as measured by gradient analysis. It proposes TriangleMix, a training-free static pattern applying dense attention in a subset of layers and Triangle attention (a fixed sparse pattern) in the rest. Experiments claim nearly lossless performance relative to dense attention, with 15.3x attention speedup for 128K inputs, and additional TTFT gains when combined with dynamic sparsity methods.

Significance. If the sparsity observation and resulting static pattern hold, this offers a simple, training-free complement to dynamic sparse attention methods for reducing quadratic prefilling costs in long-context inference. The approach is notable for being parameter-free, empirically derived rather than fitted, and accompanied by released code, which supports reproducibility. The combination results with existing dynamic methods suggest practical composability.

major comments (3)

[§3.2] The gradient-based proxy for decoding-time contribution is computed during prefilling, but the manuscript does not specify whether gradients are taken w.r.t. a single next-token loss, an aggregated decoding trajectory, or a particular output position; this choice directly determines whether the identified 'negligible' blocks truly have negligible cumulative effect across variable-length generation.
[§4.3, Table 5] Generalization experiments are confined to the models and input distributions used for the initial gradient analysis; no results are shown for out-of-distribution tasks (e.g., long-context retrieval vs. open-ended generation) or different model scales, leaving the claim that the fixed TriangleMix mask remains valid under distribution shift untested and load-bearing for the 'nearly lossless' assertion.
[§3.1, Eq. (3)] The Triangle attention pattern is defined with a fixed mask, but the paper does not report sensitivity analysis on the layer-selection threshold or alternative partitions; if the negligible blocks were chosen differently, the reported 15.3x speedup versus the performance delta could change materially.

minor comments (2)

[Figure 2] The visualization of the Triangle pattern would benefit from an explicit legend indicating which layers are dense versus Triangle and the exact sparsity ratio per layer.
[§5] Some baseline dynamic sparsity methods are compared, but their exact hyperparameter settings (e.g., sparsity ratio, block size) are not tabulated, making it difficult to reproduce the 6–19% additional TTFT reduction.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below with clarifications and commitments to revisions where the points identify gaps in the current presentation. Our responses aim to strengthen the paper without overstating the existing results.

Referee: [§3.2] The gradient-based proxy for decoding-time contribution is computed during prefilling, but the manuscript does not specify whether gradients are taken w.r.t. a single next-token loss, an aggregated decoding trajectory, or a particular output position; this choice directly determines whether the identified 'negligible' blocks truly have negligible cumulative effect across variable-length generation.

Authors: We appreciate this observation. In §3.2 the gradient analysis computes gradients with respect to the loss for predicting the single next token immediately after the prefilling sequence. This choice uses the immediate next-token prediction as a proxy for the contribution at the start of decoding. We will revise the section to state this explicitly, include the precise formulation, and add a short discussion of why a single-token proxy is reasonable for identifying blocks whose effect remains small over longer generations.
Revision: yes

Referee: [§4.3, Table 5] Generalization experiments are confined to the models and input distributions used for the initial gradient analysis; no results are shown for out-of-distribution tasks (e.g., long-context retrieval vs. open-ended generation) or different model scales, leaving the claim that the fixed TriangleMix mask remains valid under distribution shift untested and load-bearing for the 'nearly lossless' assertion.

Authors: We agree that the current experiments stay within the model family and input distributions used for the original gradient analysis. The reported results demonstrate that the fixed pattern preserves performance on those distributions and across varying context lengths. While we have not evaluated out-of-distribution tasks such as retrieval or substantially different model scales, the sparsity observation itself is derived from gradient magnitudes rather than task-specific fitting. We will revise §4.3 and the discussion to more clearly state the scope of the generalization claim and to note the assumption that the observed sparsity pattern is largely architecture-driven.
Revision: partial

Referee: [§3.1, Eq. (3)] The Triangle attention pattern is defined with a fixed mask, but the paper does not report sensitivity analysis on the layer-selection threshold or alternative partitions; if the negligible blocks were chosen differently, the reported 15.3x speedup versus the performance delta could change materially.

Authors: We thank the referee for highlighting the need for robustness checks. The layer partition and threshold in TriangleMix were selected empirically from the gradient analysis to balance sparsity and accuracy. We did not include a sensitivity study in the original submission. We will add this analysis to the revised manuscript, reporting performance and attention speedup for a range of thresholds and for alternative layer partitions to show that the chosen configuration is stable.
Revision: yes
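
To make the single-next-token proxy from the first response concrete, here is a toy sketch. It is not the paper's procedure: it uses a two-layer attention-only model with identity projections, treats the last position's output as next-token logits, and estimates by central finite differences how the cross-entropy loss changes when a shared offset is added to the first-layer logits of a chosen section. The helper names, the section layout, and the model itself are assumptions for illustration only.

```python
import math


def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def attend(x, bump):
    """One causal self-attention layer with q = k = v = x (toy assumption).
    bump[i][j] is added to the logit of (query i, key j)."""
    n, d = len(x), len(x[0])
    out = []
    for i in range(n):
        logits = [sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d) + bump[i][j]
                  for j in range(i + 1)]
        p = softmax(logits)
        out.append([sum(p[j] * x[j][t] for j in range(i + 1)) for t in range(d)])
    return out


def next_token_loss(x, section, delta, target):
    """Cross-entropy at the last position after two layers, with `delta`
    added to first-layer logits wherever section[i][j] is True."""
    n = len(x)
    bump = [[delta if section[i][j] else 0.0 for j in range(n)] for i in range(n)]
    zero = [[0.0] * n for _ in range(n)]
    h = attend(attend(x, bump), zero)
    return -math.log(softmax(h[-1])[target])


def section_gradient(x, section, target, eps=1e-4):
    """Central finite difference of the loss w.r.t. the shared logit offset."""
    up = next_token_loss(x, section, eps, target)
    down = next_token_loss(x, section, -eps, target)
    return (up - down) / (2 * eps)


def middle_section(n, sink=1, window=2, last_q=1):
    """Hypothetical Middle region: causal entries outside sink, window, and last rows."""
    return [[j <= i and j >= sink and i - j >= window and i < n - last_q
             for j in range(n)] for i in range(n)]


x = [[0.2, -0.1, 0.4], [0.5, 0.3, -0.2], [-0.3, 0.8, 0.1],
     [0.1, 0.1, 0.9], [0.7, -0.4, 0.2], [0.0, 0.6, -0.5]]
grad = section_gradient(x, middle_section(len(x)), target=0)
print(f"finite-difference gradient for the Middle section: {grad:+.6f}")
```

Because softmax is invariant to adding the same constant to every logit in a row, passing a section that covers the whole causal triangle should give a gradient near zero, which makes a handy sanity check for a sketch like this.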

Circularity Check

0 steps flagged

No significant circularity; derivation relies on empirical gradient analysis

The paper identifies decoding-time contribution sparsity through direct gradient-based analysis performed on prefilling inputs, then uses this observation to construct a static TriangleMix attention pattern. This process does not reduce any claimed result to its own inputs by construction, does not rename fitted quantities as predictions, and contains no load-bearing self-citations or uniqueness theorems imported from prior author work. The central claim remains an independent empirical finding that can be externally validated against dense attention baselines on held-out inputs, models, and tasks, rendering the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim rests on the empirical identification of decoding-time contribution sparsity via gradients and the assumption that a fixed layer-wise pattern suffices to exploit it without retraining.

invented entities (1)

Triangle attention pattern · no independent evidence
purpose: Static sparse attention pattern applied in selected layers to exploit decoding-time contribution sparsity.
Introduced as the core mechanism in TriangleMix; no independent evidence outside the paper's experiments is provided.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "we identify another untapped form of sparsity in the prefilling stage, namely decoding-time contribution sparsity, where many attention blocks exhibit nontrivial attention scores during prefilling yet contribute negligibly to subsequent decoding, as indicated by gradient-based analysis"

IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "TriangleMix applies standard dense attention in shallow layers, and switches to a triangle-shaped sparse attention pattern in deeper layers"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation
cs.LG · 2026-04 · unverdicted · novelty 7.0

The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.

S2O: Early Stopping for Sparse Attention via Online Permutation
cs.LG · 2026-02 · unverdicted · novelty 6.0

S2O uses online permutation and importance-based early stopping to increase effective sparsity in attention, delivering 7.51x attention and 3.81x end-to-end speedups on Llama-3.1-8B at 128K context with preserved accuracy.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · cited by 2 Pith papers · 13 internal anchors

[1] GPT-4 Technical Report. arXiv preprint arXiv:2303.08774.

[2] Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. arXiv preprint arXiv:2305.13245.

[3] Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, and 1 others. LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding. arXiv preprint arXiv:2308.14508.

[4] Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The Long-Document Transformer. arXiv preprint arXiv:2004.05150.

[5] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating Long Sequences with Sparse Transformers. arXiv preprint arXiv:1904.10509.

[6] LongNet: Scaling Transformers to 1,000,000,000 Tokens. arXiv preprint arXiv:2307.02486.

[7] The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783.

[8] Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. RULER: What's the Real Context Size of Your Long-Context Language Models? arXiv preprint arXiv:2404.06654.

[9] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven L... Mistral 7B. Preprint, arXiv:2310.06825.

[10] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference. arXiv preprint arXiv:2502.20766.

[11] YaRN: Efficient Context Window Extension of Large Language Models. arXiv preprint arXiv:2309.00071.

[12] Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han. Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference. arXiv preprint arXiv:2406.10774.

[13] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads. arXiv preprint arXiv:2410.10819.

[14] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient Streaming Language Models with Attention Sinks. arXiv preprint arXiv:2309.17453.

[15] Ruyi Xu, Guangxuan Xiao, Haofeng Huang, Junxian Guo, and Song Han. XAttention: Block Sparse Attention with Antidiagonal Scoring. arXiv preprint arXiv:2503.16428.

[16] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, and 1 others. Qwen2.5 Technical Report. arXiv preprint arXiv:2412.15115.

[17] Siyun Zhao, Yuqing Yang, Zilong Wang, Zhiyuan He, Luna K Qiu, and Lili Qiu. Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make Your LLMs Use External Data More Wisely. arXiv preprint arXiv:2409.14924.

Entries recovered only in part:

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. …

GradientAI. …

Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir Abdi, Dongsheng Li, Chin-Yew Lin, and 1 others. 2024a. MInference 1.0: Accelerating pre-filling for long-context LLMs via dynamic sparse attention. Advances in Neural Information Processing Systems, 37:52481–52515...

Yucheng Li, Huiqiang Jiang, Qianhui Wu, Xufang Luo, Surin Ahn, Chengruidong Zhang, Amir H Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang, and 1 others. 2024a. SCBench: A KV cache-centric analysis of long-context methods. arXiv p...

Philippe Tillet, Hsiang-Tsung Kung, and David Cox. …

Appendix A.1, Task to Probe Attention Importance: We use a key-value retrieval task to analyze the importance of different attention sections. The language model is prompted to retrieve the...
