arXiv: 2502.17421 · v4 · submitted 2025-02-24 · 💻 cs.CL · cs.AI · cs.LG

LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification

Pith reviewed 2026-05-23 01:53 UTC · model grok-4.3

classification 💻 cs.CL cs.AI cs.LG
keywords speculative decoding · long-context inference · KV cache · draft model · tree attention · lossless acceleration · position indices · LLM inference optimization

The pith

LongSpec makes speculative decoding practical for long contexts by keeping the draft model's KV cache constant-sized and fixing position and attention mismatches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LongSpec to extend lossless speculative decoding to the long contexts now common in LLM applications. Existing methods struggle here for three reasons: the draft model's KV cache grows with context length, draft models trained on short texts (typically fewer than 4k tokens) degrade on long inputs, and tree attention becomes inefficient on long sequences. LongSpec counters these with a draft model whose KV cache stays fixed in size, position indices that mitigate the training-inference mismatch, and an attention aggregation method that combines fast prefix computation with standard tree attention. The reported results are speedups of up to 3.26× over Flash Attention baselines on five long-context understanding datasets and a 2.25× reduction in wall-clock time on the AIME24 reasoning task with the QwQ model.
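The lossless property rests on the verification step: the target model checks every drafted token, so the final output matches what the target alone would produce. Below is a minimal sketch of greedy verification, assuming a hypothetical `target_next` callable that returns the target model's next token for a given context. It is a toy illustration and not LongSpec's implementation; Figure 3 reports results at T = 1, where acceptance is probabilistic rather than the exact-match rule used here.

```python
from typing import Callable, List, Sequence


def verify_greedy(
    prefix: Sequence[int],
    draft_tokens: Sequence[int],
    target_next: Callable[[List[int]], int],
) -> List[int]:
    """Return the tokens to append after one draft-and-verify step.

    Drafted tokens are kept while they equal the target's greedy choice.
    On the first mismatch the target's own token is emitted instead, and
    if every drafted token matches, the target contributes one extra token.
    """
    context = list(prefix)
    emitted: List[int] = []
    for token in draft_tokens:
        expected = target_next(context)
        if token != expected:
            emitted.append(expected)  # correction supplied by the target
            return emitted
        emitted.append(token)
        context.append(token)
    emitted.append(target_next(context))  # bonus token after full acceptance
    return emitted


def toy_target(context: List[int]) -> int:
    """Hypothetical stand-in for a target model: a deterministic toy rule."""
    return (context[-1] * 3 + 1) % 10


# The third drafted token (5) is wrong, so the target's token (7) replaces it.
print(verify_greedy([2], [7, 2, 5], toy_target))
```

In a real system the target scores all drafted positions in a single forward pass; the per-token calls here are only for readability.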

Core claim

LongSpec is a speculative decoding framework built from three components: a draft model whose KV cache size remains constant regardless of context length, novel position indices that mitigate the mismatch between short-context training and long-context inference, and an attention aggregation strategy that combines fast prefix computation with standard tree attention. Together, these changes are claimed to allow efficient, lossless acceleration on long inputs.
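Tree attention allows several candidate continuations to be verified in one pass because each drafted token attends only to its own ancestors in the draft tree. Below is a minimal sketch of such a mask, assuming the tree is given as a hypothetical `parents` list (the index of each node's parent, or -1 for nodes attached directly to the prefix). It covers only attention among drafted tokens; every drafted token would also attend to the full prefix.

```python
from typing import List


def tree_attention_mask(parents: List[int]) -> List[List[bool]]:
    """Build a boolean mask where mask[i][j] is True if node i may attend to node j.

    A node may attend to itself and to each of its ancestors in the draft tree.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:  # walk up to the root of the draft tree
            mask[i][j] = True
            j = parents[j]
    return mask


# Toy tree: node 0 is the root, nodes 1 and 2 are its children, node 3 is a child of 1.
for row in tree_attention_mask([-1, 0, 0, 1]):
    print(row)
```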

What carries the argument

The constant-sized KV cache draft model combined with novel position indices and attention aggregation strategy, which together enable memory-efficient drafting and verification for long token sequences.
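The idea of a cache that stops growing can be illustrated with a sliding window. Below is a minimal sketch, assuming a hypothetical class name and window size. It models only the local sliding-window part mentioned in the Figure 2 caption; the cross-attention layer that gathers long-context information is not modelled here.

```python
from collections import deque
from typing import Any, Deque


class SlidingWindowKVCache:
    """Keep only the most recent `window` key/value entries (hypothetical sketch)."""

    def __init__(self, window: int) -> None:
        # deque(maxlen=...) silently drops the oldest entry once it is full.
        self.keys: Deque[Any] = deque(maxlen=window)
        self.values: Deque[Any] = deque(maxlen=window)

    def append(self, key: Any, value: Any) -> None:
        self.keys.append(key)
        self.values.append(value)

    def __len__(self) -> int:
        return len(self.keys)


cache = SlidingWindowKVCache(window=4)
for position in range(10_000):
    cache.append(("k", position), ("v", position))

# The cache holds 4 entries no matter how many tokens were processed.
print(len(cache))
```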

If this is right

  • Speculative decoding becomes usable for long-context tasks without the memory cost of growing KV caches in the draft model.
  • LLM agent applications that rely on extended contexts can run with lower latency while preserving exact output.
  • Draft models trained only on short text can be deployed on long inputs once the position indices are applied.
  • Tree attention overhead on long sequences is reduced by first handling the shared prefix separately.
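Handling the shared prefix separately works because softmax attention over two disjoint key sets can be computed per set and then merged exactly using each part's log-sum-exp. Below is a minimal pure-Python sketch of that merge for a single query. All function names are hypothetical, the usual 1/sqrt(d) scaling is omitted for brevity, and this shows the general merging identity rather than the paper's kernel.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def partial_attention(q: Vector, keys: Sequence[Vector], values: Sequence[Vector]) -> Tuple[List[float], float]:
    """Softmax attention of q over one block; returns (output, log-sum-exp of scores)."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]
    return out, top + math.log(total)


def merge(out_a: Vector, lse_a: float, out_b: Vector, lse_b: float) -> List[float]:
    """Combine two partial outputs into the attention over the union of both blocks."""
    top = max(lse_a, lse_b)
    w_a, w_b = math.exp(lse_a - top), math.exp(lse_b - top)
    return [(w_a * a + w_b * b) / (w_a + w_b) for a, b in zip(out_a, out_b)]


# Toy data: a "prefix" block and a "tree" block of keys/values (made-up numbers).
q = [0.5, -1.0]
prefix_k, prefix_v = [[1, 0], [0, 1], [1, 1]], [[1, 2], [3, 4], [5, 6]]
tree_k, tree_v = [[2, -1], [0.5, 0.5]], [[7, 8], [9, 10]]

merged = merge(*partial_attention(q, prefix_k, prefix_v), *partial_attention(q, tree_k, tree_v))
full, _ = partial_attention(q, prefix_k + tree_k, prefix_v + tree_v)
print(all(abs(m - f) < 1e-9 for m, f in zip(merged, full)))
```

The prefix block needs no tree mask, so it can in principle go through a fast dense kernel, while the masked computation is confined to the much shorter tree block.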

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same constant-cache idea could be tested on other acceleration methods that currently scale memory with context length.
  • If the position indices prove robust, they may reduce the need to retrain draft models on long data.
  • The approach opens the possibility of combining LongSpec with other long-context techniques such as sparse attention.
  • Verification on models beyond the ones tested would show whether the three components transfer across architectures.

Load-bearing premise

The constant-sized KV cache draft model together with the new position indices and attention aggregation fully removes training-inference mismatch and tree attention problems for any long context without lowering draft quality.

What would settle it

A measurable drop in draft acceptance rate or final output quality when context length exceeds the draft model's training length would show that the position indices and aggregation do not fully solve the mismatch.
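One way to run this check is to bucket per-step acceptance counts by context length and compare the buckets. Below is a minimal sketch, assuming a hypothetical log of `(context_length, accepted_tokens)` pairs. The records shown are synthetic placeholders and not measurements from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def mean_accepted_by_length(
    records: Iterable[Tuple[int, int]], bucket_size: int
) -> Dict[int, float]:
    """Average accepted tokens per verification step, grouped by context-length bucket.

    Each bucket is keyed by its lower bound, e.g. 8192 covers [8192, 16384).
    """
    totals = defaultdict(lambda: [0, 0])  # bucket -> [sum of accepted, step count]
    for context_length, accepted in records:
        bucket = (context_length // bucket_size) * bucket_size
        totals[bucket][0] += accepted
        totals[bucket][1] += 1
    return {b: s / n for b, (s, n) in sorted(totals.items())}


# Synthetic example only: two short-context steps and two long-context steps.
synthetic = [(1000, 4), (1500, 3), (9000, 2), (9500, 2)]
print(mean_accepted_by_length(synthetic, bucket_size=8192))
```

A downward trend across buckets beyond the draft model's training length would be the signal described above.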

Figures

Figures reproduced from arXiv: 2502.17421 by Bo An, Chao Du, Cunxiao Du, Fengzhuo Zhang, Haonan Wang, Penghui Yang, Tianyu Pang.

Figure 1: The SoTA SD method, EAGLE, has a training context length of 2048, which is significantly shorter than the context lengths of modern LLMs.
Figure 2: Illustration of the memory-efficient draft model, the Anchor-Offset Indices, and the Hybrid Tree Attention. (a) We use a sliding window self-attention layer to capture the local context information and a cross-attention layer to gather long-context information. (b) The differences between the vanilla indexing and the Anchor-Offset Indices. By introducing a randomly selected offset and some anchor indices, …
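The Figure 2 caption describes the Anchor-Offset Indices only in outline (a randomly selected offset plus some anchor indices) and is truncated here. The sketch below is one plausible reading and not the paper's definition: a few leading tokens keep small anchor positions, and the remaining tokens are shifted by a random offset so that a short training sequence still exercises large position values. The function name and the parameters `num_anchors` and `max_position` are hypothetical.

```python
import random
from typing import List


def anchor_offset_indices(
    seq_len: int, num_anchors: int, max_position: int, rng: random.Random
) -> List[int]:
    """Assign position indices: anchors stay at 0..num_anchors-1, the rest are offset.

    Assumption: the offset is drawn uniformly so the last index stays below max_position.
    """
    if seq_len <= num_anchors:
        return list(range(seq_len))
    body_len = seq_len - num_anchors
    max_offset = max_position - body_len
    if max_offset < num_anchors:
        raise ValueError("sequence does not fit below max_position")
    offset = rng.randint(num_anchors, max_offset)  # inclusive on both ends
    return list(range(num_anchors)) + list(range(offset, offset + body_len))


rng = random.Random(0)
indices = anchor_offset_indices(seq_len=8, num_anchors=4, max_position=32768, rng=rng)
print(indices[:4], indices[4:])
```

Vanilla indexing would give `[0, 1, ..., 7]` for the same sequence, so positions far beyond the training length would never be seen during training.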
Figure 3: Decoding speed (tokens/s) across different models and settings. All results are computed at T = 1. The letters G, Q, M, L, and R on the horizontal axis represent the datasets GovReport, QMSum, Multi-News, LCC, and RepoBench-P, respectively.
Figure 4: Training loss curves on long-context data. Pretrained models with Anchor-Offset Indices exhibit lower initial and final loss, and reach the same loss level 3.93× faster compared to models without Anchor-Offset Indices.
Figure 5: Latency breakdown for a single speculative decoding loop comparing the EAGLE implementation and the proposed Hybrid Tree Attention. Significant latency reduction is observed in the target model’s attention layer (the yellow part) using our approach.
Figure 6: Performance of our method on the QwQ-32B model with the AIME24 dataset, using a maximum output length of 32k tokens. The left plot shows the tokens generated per second, where our approach achieves 2.25× higher speed compared to the baseline. The right plot shows the mean number of accepted tokens, where our method achieves an average of 3.82 mean accepted tokens. Plot values: Tokens/s, Vanilla 18.92 vs. LongSpec 42.63 (2.25×); Mean Accepted Tokens, Vanilla 1.00 vs. LongSpec 3.82 (3.82×).
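The plot values extracted with Figure 6 (18.92 and 42.63 tokens/s, and 1.00 and 3.82 mean accepted tokens) allow a quick consistency check. The sketch below recomputes the speedup and, under the simplifying assumption that throughput equals tokens per step divided by time per step, estimates how much longer one speculative step takes than one vanilla step. This is a back-of-envelope reading, not a figure reported by the paper.

```python
# Values read from the Figure 6 plot residue (Vanilla vs. LongSpec).
vanilla_tokens_per_s = 18.92
longspec_tokens_per_s = 42.63
vanilla_tokens_per_step = 1.00
longspec_tokens_per_step = 3.82

# Throughput ratio, which should match the reported 2.25x.
speedup = longspec_tokens_per_s / vanilla_tokens_per_s

# Assumption: tokens/s = tokens per step / seconds per step, so the
# ratio of step times is the ratio of tokens per step over the speedup.
relative_step_cost = (longspec_tokens_per_step / vanilla_tokens_per_step) / speedup

print(f"speedup: {speedup:.2f}x")
print(f"relative cost of one speculative step: {relative_step_cost:.2f}x")
```

Under that assumption, each speculative step costs roughly 1.7 vanilla steps but yields 3.82 tokens, which accounts for the net gain.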
Figure 7: Throughput comparison of Vanilla, MagicDec, and LONGSPEC on RepoBench-P using Vicuna-7B across different batch sizes. LONGSPEC shows superior throughput and scalability, outperforming both Vanilla and MagicDec in all batch sizes.
Original abstract

As Large Language Models (LLMs) can now process extremely long contexts, efficient inference over these extended inputs has become increasingly important, especially for emerging applications like LLM agents that highly depend on this capability. Speculative decoding (SD) offers a promising lossless acceleration technique compared to lossy alternatives such as quantization and model cascades. However, most state-of-the-art SD methods are trained on short texts (typically fewer than 4k tokens), making them unsuitable for long-context scenarios. Specifically, adapting these methods to long contexts presents three key challenges: (1) the excessive memory demands posed by draft models due to large Key-Value (KV) cache; (2) performance degradation resulting from the mismatch between short-context training and long-context inference; and (3) inefficiencies in tree attention mechanisms when managing long token sequences. This work introduces LongSpec, a framework that addresses these challenges through three core innovations: a memory-efficient draft model with a constant-sized KV cache; novel position indices that mitigate the training-inference mismatch; and an attention aggregation strategy that combines fast prefix computation with standard tree attention to enable efficient decoding. Experimental results confirm the effectiveness of LongSpec, achieving up to a 3.26x speedup over strong Flash Attention baselines across five long-context understanding datasets, as well as a 2.25x reduction in wall-clock time on the AIME24 long reasoning task with the QwQ model, demonstrating significant latency improvements for long-context applications. The code is available at https://github.com/sail-sg/LongSpec.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces LongSpec, a speculative decoding framework for long-context LLMs that uses a constant-sized KV cache draft model, novel position indices to mitigate training-inference mismatch, and an attention aggregation strategy for efficient tree attention. It reports up to 3.26x speedup over Flash Attention baselines on five long-context understanding datasets and 2.25x wall-clock reduction on the AIME24 task with the QwQ model, with code released at https://github.com/sail-sg/LongSpec.

Significance. If the central claims on lossless preservation and speedup hold, the work would offer a practical approach to accelerating long-context inference without quality loss, addressing a timely need for LLM agents and similar applications. The open-sourced code is a clear strength supporting reproducibility.

major comments (1)
  1. [§3] §3: the description of the constant-sized KV cache combined with position index remapping and attention aggregation asserts resolution of the training-inference mismatch and preservation of draft quality, but supplies no derivation, bound on distribution shift, or ablation of acceptance rate versus context length; this assumption is load-bearing for the lossless property and the reported speedups over arbitrary long contexts.
minor comments (1)
  1. [Abstract] Abstract: reports speedups but omits experimental details such as error bars, dataset statistics, model sizes, or explicit verification steps for the lossless property.

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for the constructive comment on Section 3. We address the concern point-by-point below.

  1. Referee: §3: the description of the constant-sized KV cache combined with position index remapping and attention aggregation asserts resolution of the training-inference mismatch and preservation of draft quality, but supplies no derivation, bound on distribution shift, or ablation of acceptance rate versus context length; this assumption is load-bearing for the lossless property and the reported speedups over arbitrary long contexts.

    Authors: We agree that Section 3 does not include a formal derivation or bound on distribution shift. The current manuscript supports the claims via empirical measurements of acceptance rates and end-to-end speedups on contexts up to the lengths tested in the five long-context datasets. We will add an explicit ablation of acceptance rate versus context length (up to the maximum evaluated) in the revised version. A tight theoretical bound on the shift induced by remapping is not straightforward to derive given the attention aggregation, but we will expand the design rationale for the position indices to clarify how they reduce the mismatch. revision: partial

standing simulated objections not resolved
  • A formal derivation or bound on the distribution shift

Circularity Check

0 steps flagged

No circularity: empirical framework with independent experimental validation

The paper introduces LongSpec as an empirical framework addressing three engineering challenges in long-context speculative decoding via a constant-sized KV cache, novel position indices, and attention aggregation. No equations, derivations, or 'predictions' are presented that reduce to fitted inputs or self-definitions by construction. Reported outcomes are measured speedups on external datasets and tasks, with no load-bearing self-citations or uniqueness theorems invoked. The method is self-contained against external benchmarks (Flash Attention baselines, AIME24), satisfying the criteria for a non-circular finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no information on free parameters, axioms, or invented entities; full text required for ledger construction.

pith-pipeline@v0.9.0 · 5835 in / 970 out tokens · 51594 ms · 2026-05-23T01:53:35.671126+00:00 · methodology

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Test-Time Speculation

    cs.CL 2026-05 unverdicted novelty 7.0

    Test-Time Speculation adapts draft models online via target-model verifications to sustain high acceptance lengths during long LLM generations.

  2. See the Forest for the Trees: Loosely Speculative Decoding via Visual-Semantic Guidance for Efficient Inference of Video LLMs

    cs.CL 2026-04 unverdicted novelty 7.0

    LVSpec introduces the first training-free loosely speculative decoding framework for Video-LLMs that identifies sparse visual-relevant tokens for strict verification while tolerating position shifts for semantic fille...

  3. Test-Time Speculation

    cs.CL 2026-05 unverdicted novelty 6.0

    TTS adapts speculator models online via target model verifications to improve acceptance lengths by up to 72% over prior methods, with gains increasing for longer generations.

  4. When Hidden States Drift: Can KV Caches Rescue Long-Range Speculative Decoding?

    cs.CL 2026-04 unverdicted novelty 6.0

    KV cache reuse improves long-range draft acceptance rates in speculative decoding but delivers only marginal end-to-end speedups because shallow drafters cannot accurately estimate target queries and receive sparse gr...

  5. When Hidden States Drift: Can KV Caches Rescue Long-Range Speculative Decoding?

    cs.CL 2026-04 unverdicted novelty 6.0

    KV cache reuse improves long-range draft acceptance in speculative decoding but delivers only marginal end-to-end speedups due to drafter limitations.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · cited by 3 Pith papers · 6 internal anchors

  1. [1]

    GPT-4 Technical Report

    Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774,

  2. [2]

    Longformer: The Long-Document Transformer

    Beltagy, I., Peters, M. E., and Cohan, A. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150,

  3. [3]

    DeepSeek

    URL https://crfm.stanford.edu/2023/10/12/flashdecoding.html. DeepSeek. DeepSeek’s API context caching on disk technology,

  4. [4]

    The Llama 3 Herd of Models

    Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The LLaMA 3 herd of models. arXiv preprint arXiv:2407.21783,

  5. [5]

    How to train long-context language models (effectively)

    Gao, T., Wettig, A., Yen, H., and Chen, D. How to train long-context language models (effectively). arXiv preprint arXiv:2410.02660,

  6. [6]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. Deepseek-r1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948,

  7. [7]

    Liger Kernel: Efficient Triton Kernels for

    Hsu, P.-L., Dai, Y., Kothapalli, V., Song, Q., Tang, S., Zhu, S., Shimizu, S., Sahni, S., Ning, H., and Chen, Y. Liger kernel: Efficient triton kernels for LLM training. arXiv preprint arXiv:2410.10989,

  8. [8]

    Li, Y., Huang, Y., Yang, B., Venkitesh, B., Locatelli, A., Ye, H., Cai, T., Lewis, P., and Chen, D

    URL https://lmsys.org/blog/2023-06-29-longchat. Li, Y., Huang, Y., Yang, B., Venkitesh, B., Locatelli, A., Ye, H., Cai, T., Lewis, P., and Chen, D. SnapKV: LLM knows what you are looking for before generation. In Advances in Neural Information Processing Systems, 2024a. Li, Y., Wei, F., Zhang, C., and Zhang, H. EAGLE: Speculative sampling requires ...

  9. [9]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Team, G., Georgiev, P., Lei, V. I., Burnell, R., Bai, L., Gulati, A., Tanzer, G., Vincent, D., Pan, Z., Wang, S., et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530,

  10. [10]

    Qwen2.5 Technical Report

    Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115,

  11. [11]

    What and how does in-context learning learn? Bayesian model averaging, parameterization, and generalization

    Zhang, Y., Zhang, F., Yang, Z., and Wang, Z. What and how does in-context learning learn? Bayesian model averaging, parameterization, and generalization. arXiv preprint arXiv:2305.19420,

  12. [12]

    introduce speculative cascading, a method that integrates cascade-style deferral rules with speculative execution to yield better cost-quality trade-offs than either approach alone. Another approach, MTAD (Qin et al., 2025), uses a smaller auxiliary model to approximate the multi-token joint distribution of a larger model, enhancing both inference speed a...

  13. [13]

    Theoretical Analysis In this section, we provide the theoretical analysis of our methods

    D.2. Theoretical Analysis In this section, we provide the theoretical analysis of our methods. We begin with the definition of our Glide network. It consists of three modules: the self-attention module, the cross-attention module, and the Feed-Forward (FF) module. Here the self-attention and the cross-attention are both based on the attention module. For ...

  14. [14]

    Given any two conjugate numbers $u, v \in [1, \infty]$, i.e., $\frac{1}{u} + \frac{1}{v} = 1$, and $1 \le p \le \infty$, for any $A \in \mathbb{R}^{r \times c}$ and $x \in \mathbb{R}^{c}$, we have $\|Ax\|_p \le \|A^\top\|_{p,u}\|x\|_v$ and $\|Ax\|_p \le \|A\|_{u,p}\|x\|_v$

    Given any two conjugate numbers $u, v \in [1, \infty]$, i.e., $\frac{1}{u} + \frac{1}{v} = 1$, and $1 \le p \le \infty$, for any $A \in \mathbb{R}^{r \times c}$ and $x \in \mathbb{R}^{c}$, we have $\|Ax\|_p \le \|A^\top\|_{p,u}\|x\|_v$ and $\|Ax\|_p \le \|A\|_{u,p}\|x\|_v$. Lemma D.8. For a query vector $q \in \mathbb{R}^{d}$, and two sets of key-value pairs $K_1 \in \mathbb{R}^{N_1 \times d}$, $K_2 \in \mathbb{R}^{N_2 \times d}$,...

  15. [15]

    The RRB’s programs are designed to provide comprehensive benefits to railroad workers and their families

    In April 2015, railroad employment peaked at 253,000 workers, the highest level since November 1999, and then declined through FY2017, falling to 221,000 workers. The RRB’s programs are designed to provide comprehensive benefits to railroad workers and their families. The RRA and RUIA are important components of the railroad industry’s retirement and bene...