
arxiv: 2606.06256 · v2 · pith:SG6DEQXHnew · submitted 2026-06-04 · 💻 cs.AI

RedKnot: Efficient Long-Context LLM Serving with Head-Aware KV Reuse and SegPagedAttention

Pith reviewed 2026-06-29 05:05 UTC · model grok-4.3

classification 💻 cs.AI
keywords KV cache management · LLM serving · long-context inference · attention heads · cache reuse · memory efficiency · distributed serving · PagedAttention

The pith

RedKnot decomposes the KV cache along individual attention heads to support position-independent reuse, prefix compression, hot/cold separation, and distributed placement without retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that attention heads in large language models differ markedly in their functional roles, attention distances, and runtime importance during serving. Treating the KV cache as a uniform block therefore wastes resources and limits flexibility across long-context workloads. By breaking the cache into head-specific structures, a single management layer can apply tailored policies for position-independent reuse, prefix compression, hot versus cold separation, and distributed placement. These policies improve memory efficiency and concurrency while leaving model outputs unchanged. The result reframes the KV cache from a passive storage object into an active, model-aware substrate for scalable serving.

Core claim

RedKnot breaks the conventional monolithic KV cache abstraction by decomposing the KV cache along KV heads, whose importance and effective attention ranges vary significantly across serving scenarios. This head-level decomposition turns the KV cache from a monolithic tensor abstraction into a structured memory object, enabling uniform support for position-independent KV reuse, prefix KV compression, hot/cold KV separation, and distributed KV placement while preserving output fidelity and improving resource efficiency, without requiring model retraining or fine-tuning.

What carries the argument

Head-aware decomposition of the KV cache into per-head structured memory objects, paired with SegPagedAttention for segmented paged attention management.
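The idea of a per-head structured memory object can be illustrated with a minimal sketch. It assumes the global/local head classes named in the Figure 1 caption, and it models a local head as a sliding window over recent tokens, which is an assumption made for illustration. The names `HeadPolicy`, `HeadKV`, and `build_cache` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class HeadPolicy:
    """Hypothetical per-head policy (illustrative names, not the paper's API)."""
    head_id: int
    kind: str        # "global": keeps the full context; "local": keeps a recent window
    window: int = 0  # number of tokens kept by a local head; unused for global heads


@dataclass
class HeadKV:
    """KV entries for a single head, managed under that head's own policy."""
    policy: HeadPolicy
    positions: list = field(default_factory=list)
    keys: list = field(default_factory=list)
    values: list = field(default_factory=list)

    def append(self, pos, k, v):
        self.positions.append(pos)
        self.keys.append(k)
        self.values.append(v)
        # Assumption: a local head only needs its most recent `window` tokens.
        if self.policy.kind == "local" and len(self.positions) > self.policy.window:
            self.positions.pop(0)
            self.keys.pop(0)
            self.values.pop(0)


def build_cache(policies):
    """One independent KV store per head instead of one monolithic tensor."""
    return {p.head_id: HeadKV(p) for p in policies}


cache = build_cache([HeadPolicy(0, "global"), HeadPolicy(1, "local", window=4)])
for pos in range(10):
    for head in cache.values():
        head.append(pos, f"k{pos}", f"v{pos}")

print(len(cache[0].positions))  # global head keeps all 10 tokens
print(cache[1].positions)       # local head keeps only the last 4 positions
```

Because each head owns its entries, a reuse, compression, eviction, or placement decision can be made for one head without touching the others.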

If this is right

  • Position-independent KV reuse becomes possible without custom per-scenario code.
  • Prefix KV compression can be applied selectively to low-importance heads.
  • Hot and cold KV data can be separated at head granularity for better eviction.
  • Distributed placement decisions can route individual heads to different nodes.
  • Effective GPU memory capacity and serving concurrency increase while output fidelity stays the same.
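The memory and concurrency consequence in the last bullet can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes a half-global, half-local head split and a fixed local window. The layer count, head count, head dimension, and window size are illustrative placeholders, not figures reported by the paper.

```python
def kv_bytes(num_layers, num_kv_heads, head_dim, tokens, dtype_bytes=2):
    """Bytes for keys plus values: 2 * layers * heads * head_dim * tokens * bytes."""
    return 2 * num_layers * num_kv_heads * head_dim * tokens * dtype_bytes


# Illustrative shapes only (assumptions, not measured configurations).
LAYERS, KV_HEADS, HEAD_DIM = 64, 8, 128
CONTEXT = 32 * 1024   # tokens per session
LOCAL_WINDOW = 1024   # tokens kept by a local head (assumed)

dense = kv_bytes(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT)

global_heads = KV_HEADS // 2
local_heads = KV_HEADS - global_heads
head_aware = (
    kv_bytes(LAYERS, global_heads, HEAD_DIM, CONTEXT)
    + kv_bytes(LAYERS, local_heads, HEAD_DIM, LOCAL_WINDOW)
)

print(dense / 2**30)       # GiB per session with a uniform cache
print(head_aware / 2**30)  # GiB per session with head-specific retention
print(dense / head_aware)  # how many more sessions fit in the same KV budget
```

Under these assumptions the saving is bounded by the share of heads that must keep the full context, so the split between head classes dominates the achievable concurrency gain.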

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Serving frameworks could adopt dynamic per-head eviction thresholds that adapt at runtime based on observed attention patterns.
  • Model architectures that explicitly differentiate head roles during pretraining might amplify the efficiency gains shown here.
  • Combining head decomposition with existing sparse or local attention mechanisms could further reduce the active cache footprint.
  • The same decomposition principle might extend to other per-head structures such as activation caches or optimizer states.

Load-bearing premise

KV cache utility differs enough across heads that separate management policies preserve exact model outputs in every serving scenario.

What would settle it

A controlled run on a production long-context workload in which applying identical management rules to every head produces the same latency, memory footprint, and output quality as head-specific rules.

Figures

Figures reproduced from arXiv: 2606.06256 by Boyu Wang, Guanjie Chen, HuaYi Jin, Junhao Hu, RuoZhou He, Yang Liu, ZhaoKai Luo, Zhiyong Wang.

Figure 1. RedKnot decouples the KV cache along the head dimension, classifies heads into global and local classes, and co-optimizes sparse attention, sparse FFN execution with selected tokens, and SegPagedAttention. The combined design yields 1.6–3.5× lower TTFT, 4.7–7.8× higher concurrency, and 67–79% fewer FLOPs compared with dense attention.
… generation (RAG) [11,21] routinely concatenates tens of thousands of r…
Figure 2. For short contexts, prefill TTFT is dominated by FFN computation rather than KV-cache construction.
… heads of a selected token together, the effective recomputation set becomes the union of head-specific important tokens. This union can cover a large portion of the chunk, forcing the system to recompute many tokens to recover accuracy. This creates a fundamental limitation for token-level PIC recovery. Eve…
Figure 3. For short contexts, prefill TTFT is dominated by FFN computation rather than KV-cache construction.
… unfavorable trade-off: selecting fewer tokens may leave some head-specific errors uncorrected and hurt output quality, while selecting more tokens improves fidelity but quickly reduces the TTFT benefit of KV reuse. Beyond the attention-level bottleneck, existing PIC systems also overlook a complementary cha…
Figure 4. Overview of RedKnot.
4.1 Overview of RedKnot. RedKnot consists of two core components: (i) an Elastic Sparsity module, and (ii) a module that stores data at the granularity of KV-cache heads. We next describe the end-to-end workflow of RedKnot, highlighting how these modules interact during inference. As shown in …
Figure 5. Workflow of Elastic Sparsity.
… recovery. RoPE-based positional alignment. When a reusable chunk is cached offline and later placed after a different prefix, its token positions change. Since modern LLMs commonly use RoPE, the cached keys contain position-dependent rotations. Before applying recovery, Elastic Sparsity first aligns the cached keys to their online positions using the rotational structure of Ro…
Figure 5. Overview of RedKnot.
… HBM bandwidth utilization and throughput. 4.1 Overview of RedKnot. RedKnot consists of two core components: (i) an Elastic Sparsity module, and (ii) a module that stores data at the granularity of KV-cache heads. We next describe the end-to-end workflow of RedKnot, highlighting how these modules interact during inference. As shown in …
Figure 6. Overview of SegPagedAttention.
… Lines 18–19 merge the recovered heads and form the intermediate hidden states. Lines 20–23 then apply partial sparse FFN recovery. Elastic Sparsity selects important tokens according to the recovered attention signal, executes the dense FFN only on these selected tokens, and sets the FFN update of unselected tokens to zero so that they follow the residual identity path. Fi…
Figure 6. Workflow of Elastic Sparsity.
… positions using the rotational structure of RoPE. For a cached key originally encoded at offline position $p_{\mathrm{off}}$ and reused at online position $p_{\mathrm{on}}$, Elastic Sparsity applies the relative rotation $K(p_{\mathrm{on}}) = R(p_{\mathrm{on}})\,R(p_{\mathrm{off}})^{-1}\,K(p_{\mathrm{off}})$, where $R(\cdot)$ denotes the RoPE rotation matrix. This step removes the deterministic position mismatch caused by moving the chunk to a new location. The r…
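The relative rotation described in this excerpt can be checked numerically with a minimal sketch. Because RoPE rotations about the same axis compose additively, applying R(p_on) R(p_off)^-1 is equivalent to one further rotation by p_on - p_off. The interleaved pairing of dimensions and the base of 10000 are assumptions about the RoPE variant, and the function names are hypothetical.

```python
import math


def rope_rotate(vec, pos, base=10000.0):
    """Apply the RoPE rotation for position `pos` to an even-length vector.

    Assumes the interleaved convention: dimensions (2j, 2j+1) form one 2-D
    pair rotated by angle pos * base**(-2j/d).
    """
    d = len(vec)
    out = [0.0] * d
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out


def realign_key(cached_key, p_off, p_on):
    """Move a key cached at p_off to p_on: R(p_on) R(p_off)^-1 = R(p_on - p_off)."""
    return rope_rotate(cached_key, p_on - p_off)


raw_key = [0.3, -1.2, 0.7, 0.05, 1.5, -0.4, 0.0, 2.0]
p_off, p_on = 17, 5000

cached = rope_rotate(raw_key, p_off)          # key as stored offline
realigned = realign_key(cached, p_off, p_on)  # key after positional alignment
reference = rope_rotate(raw_key, p_on)        # key computed directly at p_on

max_err = max(abs(a - b) for a, b in zip(realigned, reference))
print(max_err < 1e-9)
```

The check confirms only that the positional mismatch is removed. It says nothing about the missing cross-chunk attention context, which is what the subsequent recovery step addresses.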
Figure 7. End-to-end comparison across latency, answer quality, and KV matching metrics. RedKnot consistently improves TTFT while preserving higher accuracy and stronger Top-K KV matching than position-independent cache baselines.
… scaling is better than that of CacheBlend and ProphetKV. Why RedKnot achieves a better quality–latency trade-off. The key difference is that RedKnot aligns its recovery granularity with th…
Figure 7. Overview of SegPagedAttention.
… Sparsity uses the recovered attention signal to estimate token importance. Tokens with high importance execute the dense FFN, while other tokens follow the residual identity path. In this way, Elastic Sparsity spends FFN computation only where correction is likely to affect the final hidden states. The two sparsity dimensions are complementary. Head-aware attention recovery r…
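The token-selective FFN step described in this excerpt can be sketched as follows: tokens ranked highest by an importance score run the FFN, and the remaining tokens receive a zero FFN update, so their hidden state passes through the residual path unchanged. The top-k selection rule, the `keep_ratio` parameter, and the toy FFN are assumptions for illustration. The excerpt does not specify how the importance threshold is chosen.

```python
import math


def sparse_ffn_update(hidden, importance, ffn, keep_ratio):
    """Apply `ffn` only to the most important tokens.

    hidden:     list of per-token hidden vectors
    importance: one score per token (assumed to be derived from attention)
    ffn:        function mapping a hidden vector to its FFN update
    keep_ratio: fraction of tokens that execute the FFN (hypothetical knob)
    """
    n = len(hidden)
    k = max(1, math.ceil(keep_ratio * n))
    ranked = sorted(range(n), key=lambda t: importance[t], reverse=True)
    selected = set(ranked[:k])

    out = []
    for t, h in enumerate(hidden):
        if t in selected:
            delta = ffn(h)                        # dense FFN on a selected token
            out.append([a + b for a, b in zip(h, delta)])
        else:
            out.append(list(h))                   # zero update: residual identity
    return out, selected


hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
importance = [0.1, 0.9, 0.2, 0.8]
toy_ffn = lambda h: [2.0 * x for x in h]  # stand-in for a real FFN

updated, selected = sparse_ffn_update(hidden, importance, toy_ffn, keep_ratio=0.5)
print(sorted(selected))
print(updated)
```

FFN cost then scales with the number of selected tokens rather than with the chunk length.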
Figure 8. Prefill compute comparison across six workload settings. RedKnot substantially reduces prefill FLOPs compared with dense recomputation, CacheBlend, and ProphetKV. Panel titles encode model.dataset.context length: M = Mistral-7B, Q = Qwen3-32B, and L70 = Llama-3.3-70B; TQA = TriviaQA, MFQA = MultiFieldQA, and HQA = HotpotQA; 16K/24K/32K/64K denote the total prompt context length in tokens. Each panel uses a…
Figure 8. End-to-end accuracy and TTFT comparison across three model families. RedKnot achieves a per-panel TTFT speedup ranging from 3.51× to 5.16×. Overall, RedKnot preserves accuracy close to full recompute (typically ≥ 95% of the dense F1) while delivering 1.4–5.2× TTFT speedup, and consistently dominates the token-level PIC baselines on the quality–latency trade-off.
… tokens. Datasets. We draw RAG prompts from …
Figure 9. Single-layer decode kernel latency under two SDPA backends on Qwen3-32B shapes (Hq = 32, D = 128, bf16, qlen = 1). FlashAttention requires a null attn_mask.
… explain why RedKnot is not merely faster than dense prefill, but also substantially cheaper in total prefill compute than token-level PIC baselines. 5.4 SegPagedAttention Micro-Benchmarks. We evaluate SegPagedAttention with kernel-isolated micro-benchm…
Figure 9. Throughput and attention-kernel efficiency of RedKnot. Top row (a)–(d): serving throughput (QPS/GPU, log scale) vs. context length on (a) Qwen3-32B (TP=2), (b) Llama-3.3-70B (TP=4), (c) Qwen3.5-397B (TP=8), and (d) DeepSeek-V4-Flash (PP=8), comparing RedKnot with dense recompute, CacheBlend (r = 15%), and ProphetKV (r = 20%). Bottom row (e)–(h): kernel-isolated latency of SegPagedAttention vs. masked/dense…
Figure 12. 64-layer prefill throughput (tok/s) at batch 1. SDPA+mask retains only 25% of 8K throughput at 32K, while SegPagedAttention retains 46% and maintains much higher absolute throughput.
… 32K, and 128K, respectively. This gap is not an algorithmic property of head sparsity; it is a dispatch artifact. PyTorch SDPA can use the FlashAttention backend only when the mask is null. Once RedKnot materializes local/gl…
Figure 10. Prefix multi-head KV compression on Qwen3-32B under PD disaggregation. (a) first-decode-step logit cosine vs. the full-KV baseline (left axis) and KV-transfer saving (right axis) vs. prefix length; the dashed line marks the 0.99 pass threshold. (b) aggregate decode throughput (QPS/GPU) under a fixed KV-memory budget, full-KV baseline vs. trim<32, with the per-point speedup annotated. (c) per-dataset logit…
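The fidelity check named in panel (a) of this caption, logit cosine similarity against a full-KV baseline with a 0.99 pass threshold, can be sketched as follows. Only the metric and the threshold come from the caption. The function names and the sample logit vectors are made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def passes_fidelity(full_kv_logits, compressed_logits, threshold=0.99):
    """True when compressed-KV logits stay within the cosine pass threshold."""
    return cosine(full_kv_logits, compressed_logits) >= threshold


baseline = [2.0, 1.0, 0.5, -1.0]   # first-decode-step logits with the full KV cache
close = [2.0, 1.0, 0.5, -0.9]      # small perturbation after compression
far = [1.0, 2.0, 0.5, -1.0]        # top-2 logits swapped

print(passes_fidelity(baseline, close))  # True
print(passes_fidelity(baseline, far))    # False
```

Cosine similarity is insensitive to a uniform rescaling of the logits, so a check like this is usually paired with a task-level metric such as F1 or exact match.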
Figure 11. 64-layer decode and prefill latency with fused varlen SegPagedAttention. All paths are numerically equivalent (cos > 0.99998). Labels above the fused bars indicate the speedup over SDPA+mask.
… and GQA-4. We sweep context lengths of 8K, 32K, and 128K tokens. The sparsity pattern follows the head-class layout used by RedKnot: half of the KV heads are retrieval/global heads that read the full context, and hal…
Figure 11. Chunk-level KV reuse on the MuSiQue stream (2417 questions, 48,315 chunk accesses, 17,629 unique passages). (a) reuse-count distribution per chunk (log–log). (b) fraction of each chunk’s reuse that comes from non-prefix positions, with mean 0.95. (c) reuse count vs. residency (the request span over which a chunk stays live), colored by log value density. (d) recompute saved vs. the KV memory needed to cac…
Figure 13. Additional system metrics beyond quality, TTFT, and compute. (a) RedKnot reduces KV transfer volume by 4.3–6.3× and transfer time by up to 4.1× under prefill–decode disaggregation. (b) Burst-mode throughput improves by 15–43% on the current dense+mask backend. (c) When SegPagedAttention materializes KV savings as physical memory savings, concurrent session capacity per GPU increases by 4.7–7.8×, which pro…
Figure 12. System-level effects of head-class KV sparsity. (a) KV-cache transfer saving over dense PD disaggregation, separated into transferred bytes and wall-clock transfer time, across Llama-3.3-70B and Qwen3-32B at 8K–24K. (b) burst-mode throughput (req/s) of dense vs. RedKnot for bursts of N concurrent requests, annotated with the relative gain. (c) concurrent sessions per GPU under dense vLLM-style KV storage …
Figure 13. Sparse denoising becomes more useful as context grows. (a) On DeepSeek-V4-Flash, the fraction of tokens needed to cover 99% attention mass drops with context length across HotpotQA, 2WikiMQA, MultiFieldQA, and GovReport. (b) On Qwen3.5-397B and DeepSeek-V4-Flash, dense accuracy degrades under long-context noise, while RedKnot stays stable and overtakes dense at longer contexts.
… become the sparsest because…
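The sparsity statistic in panel (a) of this caption, the fraction of tokens needed to cover a given share of attention mass, can be computed with a short sketch. The greedy largest-first accumulation is an assumption about how the statistic is defined, and the weights below are invented for illustration.

```python
def fraction_for_mass(weights, mass=0.99):
    """Smallest fraction of tokens whose attention weights sum to `mass` of the total.

    Tokens are taken in descending weight order (assumed definition).
    """
    total = sum(weights)
    target = mass * total
    running = 0.0
    for count, w in enumerate(sorted(weights, reverse=True), start=1):
        running += w
        if running >= target:
            return count / len(weights)
    return 1.0


# Unnormalized attention weights for one query over five tokens (made up).
weights = [50, 30, 15, 3, 2]

print(fraction_for_mass(weights, mass=0.90))  # 0.6: three of five tokens suffice
print(fraction_for_mass(weights, mass=0.99))  # 1.0: every token is needed
```

A fraction that falls as context grows means the attention distribution is becoming more concentrated, which is the trend the caption reports.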
Original abstract

As the input length of large language model (LLM) serving continues to grow, the KV cache has become a dominant bottleneck in AI infrastructure. It limits GPU memory capacity, serving concurrency, cache reuse, and distributed scalability. Multiple important problems, including position-independent KV cache, prefix KV cache compression, hot/cold KV cache separation, and distributed KV cache management, all depend on how the KV cache is represented and managed. However, existing serving systems largely rely on a monolithic KV cache abstraction, where the KV cache is treated as a homogeneous sequence of token-level memory blocks and managed with similar policies across attention heads and serving scenarios. We observe that KV cache utility is highly structured across KV heads: different heads exhibit different functional roles, attention distances, and runtime importance. Therefore, a full KV cache is not always necessary for every head, token range, or serving scenario. We present RedKnot, a head-aware KV cache management system for LLM serving. RedKnot breaks the conventional monolithic KV cache abstraction by decomposing the KV cache along KV heads, whose importance and effective attention ranges vary significantly across serving scenarios. This head-level decomposition turns the KV cache from a monolithic tensor abstraction into a structured memory object, enabling RedKnot to uniformly support position-independent KV reuse, prefix KV compression, hot/cold KV separation, and distributed KV placement while preserving output fidelity and improving resource efficiency, without requiring model retraining or fine-tuning. RedKnot establishes a new foundation for AI infrastructure by transforming the KV cache from a monolithic, passive runtime artifact into a dynamic, model-aware runtime substrate for scalable LLM serving.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes RedKnot, a head-aware KV cache management system for long-context LLM serving. It observes that KV utility is structured across heads (differing functional roles, attention distances, runtime importance) and decomposes the cache along heads to enable position-independent reuse, prefix compression, hot/cold separation, and distributed placement while claiming to preserve output fidelity without retraining or fine-tuning.

Significance. If the per-head decomposition is shown to be stable and fidelity-preserving across models and workloads, the work would meaningfully advance LLM serving infrastructure by converting the KV cache from a monolithic tensor into a dynamic, model-aware substrate, uniformly supporting multiple long-standing systems problems.

major comments (1)
  1. [Abstract] The central claim that head-level decomposition 'preserves output fidelity' while enabling the listed optimizations rests on the unquantified premise that per-head differences in functional roles and attention ranges are both large and stable enough that independent management never alters attention outputs. No fidelity deltas, perplexity measurements, exact-match rates, or head-wise variance statistics are reported to substantiate this.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. The single major comment is addressed below.

Point-by-point responses
  1. Referee: [Abstract] The central claim that head-level decomposition 'preserves output fidelity' while enabling the listed optimizations rests on the unquantified premise that per-head differences in functional roles and attention ranges are both large and stable enough that independent management never alters attention outputs. No fidelity deltas, perplexity measurements, exact-match rates, or head-wise variance statistics are reported to substantiate this.

    Authors: We agree that the abstract would be strengthened by explicit quantitative support for the fidelity claim. The manuscript's experimental sections evaluate output fidelity via perplexity, downstream task accuracy, and head-wise variance across models and workloads, showing that head-aware management preserves outputs. We will revise the abstract to include a concise summary of these metrics (e.g., perplexity deltas and exact-match rates) to directly address the concern. revision: yes

Circularity Check

0 steps flagged

No circularity; systems architecture rests on stated observation without equations or self-referential reductions

full rationale

The paper is a systems description of RedKnot that decomposes KV cache along heads based on an empirical observation ('We observe that KV cache utility is highly structured across KV heads'). No equations, fitted parameters, predictions, or derivations appear in the provided text. The central claim is an engineering design that follows from the observation; it does not reduce to a self-definition, a renamed fit, or a self-citation chain. Per rules, absence of quantitative validation is a correctness concern, not circularity. This is a standard non-circular systems paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review reveals no free parameters, mathematical axioms, or invented entities; the approach rests on an empirical observation about head variation rather than new postulates.

pith-pipeline@v0.9.1-grok · 5853 in / 1074 out tokens · 24261 ms · 2026-06-29T05:05:13.850884+00:00 · methodology


Reference graph

Works this paper leans on

73 extracted references · 12 canonical work pages · 8 internal anchors

  1. [1]

    Taming throughput-latency tradeoff in LLM inference with Sarathi-Serve

    Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, and Ramachandran Ramjee. Taming throughput-latency tradeoff in LLM inference with Sarathi-Serve. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 117–134, 2024

  2. [2]

    GQA: Training generalized multi-query transformer models from multi-head checkpoints

    Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4895–4901. Association for Computational Linguistics, 2023

  3. [3]

    Claude Code

    Anthropic. Claude Code. https://www.anthropic.com/claude-code, 2025. Accessed: 2026-06-04

  4. [4]

    LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

    Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. LongBench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508, 2024

  5. [5]

    LongBench v2: Towards deeper understanding and reasoning on realistic long-context multitasks

    Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. LongBench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ...

  6. [6]

    Extending context window of large language models via positional interpolation, 2023

    Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation, 2023

  7. [7]

    NVIDIA Hopper H100 GPU: Scaling performance

    Jack Choquette. NVIDIA Hopper H100 GPU: Scaling performance. IEEE Micro, 43(3):9–17, 2023

  8. [8]

    Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu, Zhenda Xie, Y. K. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, and Wenfeng Liang. DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models, 2024

  9. [9]

    FlashAttention-2: Faster attention with better parallelism and work partitioning

    Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations (ICLR), 2024

  10. [10]

    A dataset of information-seeking questions and answers anchored in research papers

    Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A. Smith, and Matt Gardner. A dataset of information-seeking questions and answers anchored in research papers. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 4599–4610, 2021

  11. [11]

    DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model, 2024

    DeepSeek-AI. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model, 2024

  12. [12]

    DeepSeek-V3 technical report, 2024

    DeepSeek-AI. DeepSeek-V3 technical report, 2024

  13. [13]

    FlashMLA: Efficient multi-head latent attention kernels

    DeepSeek-AI. FlashMLA: Efficient multi-head latent attention kernels. https://github.com/deepseek-ai/FlashMLA, 2025. Accessed: 2026-06-14

  14. [14]

    DeepSeek-V4-Flash model card

    DeepSeek-AI. DeepSeek-V4-Flash model card. https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash,

  15. [15]

    Accessed: 2026-06-14

  16. [16]

    DeepSeek-V4-Pro model card

    DeepSeek-AI. DeepSeek-V4-Pro model card. https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro, 2026. Accessed: 2026-06-14

  17. [17]

    Context length alone hurts LLM performance despite perfect retrieval

    Yufeng Du, Minyang Tian, Srikanth Ronanki, Subendhu Rongali, Sravan Babu Bodapati, Aram Galstyan, Azton Wells, Roy Schwartz, Eliu A. Huerta, and Hao Peng. Context length alone hurts LLM performance despite perfect retrieval. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 23281–23298, Suzhou, China, 2025. Association for C...

  18. [19]

    Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity, 2022

    William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity, 2022

  19. [20]

    MoA: Mixture of sparse attention for automatic large language model compression

    Tianyu Fu, Haofeng Huang, Xuefei Ning, Genghan Zhang, Boju Chen, Tianqi Wu, Hongyi Wang, Zixiao Huang, Shiyao Li, Shengen Yan, Guohao Dai, Huazhong Yang, and Yu Wang. MoA: Mixture of sparse attention for automatic large language model compression. arXiv preprint arXiv:2406.14909, 2024

  20. [21]

    Not all heads matter: A head-level KV cache compression method with integrated retrieval and reasoning

    Yu Fu, Zefan Cai, Abedelkadir Asi, Wayne Xiong, Yue Dong, and Wen Xiao. Not all heads matter: A head-level KV cache compression method with integrated retrieval and reasoning. In International Conference on Learning Representations, 2025

  21. [22]

    Retrieval-Augmented Generation for Large Language Models: A Survey

    Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2024

  22. [23]

    Prompt Cache: Modular attention reuse for low-latency inference

    In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, and Lin Zhong. Prompt Cache: Modular attention reuse for low-latency inference. In Proceedings of Machine Learning and Systems (MLSys), 2024

  23. [24]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  24. [25]

    Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps

    Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics (COLING), pages 6609–6625, 2020

  25. [26]

    RULER: What’s the real context size of your long-context language models?, 2024

    Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. RULER: What’s the real context size of your long-context language models?, 2024. COLM 2024

  26. [27]

    EPIC: Efficient position-independent caching for serving large language models

    Junhao Hu, Wenrui Huang, Weidong Wang, Haoyi Wang, Tiancheng Hu, Zhang Qin, Hao Feng, Xusheng Chen, Yizhou Shan, and Tao Xie. EPIC: Efficient position-independent caching for serving large language models. In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research, pages 24391–24402. PMLR, 2025

  27. [28]

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B. arXiv preprint arXi...

  28. [29]

    MInference 1.0: Accelerating pre-filling for long-context LLMs via dynamic sparse attention

    Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. MInference 1.0: Accelerating pre-filling for long-context LLMs via dynamic sparse attention. In Advances in Neural Information Processing Systems, 2024

  29. [30]

    TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension

    Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1601–1611, 2017

  30. [31]

    Efficient memory management for large language model serving with PagedAttention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles (SOSP), pages 611–626, 2023

  31. [32]

    CATS: Contextually-aware thresholding for sparsity in large language models

    Donghyun Lee, Je-Yong Lee, Genghan Zhang, Mo Tiwari, and Azalia Mirhoseini. CATS: Contextually-aware thresholding for sparsity in large language models. In Conference on Language Modeling, 2024

  32. [33]

    GShard: Scaling giant models with conditional computation and automatic sharding

    Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. GShard: Scaling giant models with conditional computation and automatic sharding. In International Conference on Learning Representations, 2021

  33. [34]

    Retrieval-augmented generation for knowledge-intensive NLP tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems (NeurIPS), volume 33, pages 9459–9474, 2020

  34. [35]

    NeedleBench: Evaluating LLM retrieval and reasoning across varying information densities, 2024

    Mo Li, Songyang Zhang, Taolin Zhang, Haodong Duan, Yunxin Liu, and Kai Chen. NeedleBench: Evaluating LLM retrieval and reasoning across varying information densities, 2024

  35. [36]

    CompressKV: Semantic retrieval heads know what tokens are not important before generation

    Xiaolin Lin, Jingcun Wang, Olga Kondrateva, Yiyu Shi, Bing Li, and Grace Li Zhang. CompressKV: Semantic retrieval heads know what tokens are not important before generation. arXiv preprint arXiv:2508.02401, 2025

  36. [37]

    Rethinking RoPE: A mathematical blueprint for n-dimensional positional encoding, 2025

    Haiping Liu and Hongpeng Zhou. Rethinking RoPE: A mathematical blueprint for n-dimensional positional encoding, 2025

  37. [38]

    TEAL: Training-free activation sparsity in large language models

    James Liu, Pragaash Ponnusamy, Tianle Cai, Han Guo, Yoon Kim, and Ben Athiwaratkun. TEAL: Training-free activation sparsity in large language models. In International Conference on Learning Representations, 2025

  38. [39]

    Lost in the middle: How language models use long contexts

    Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024

  39. [40]

    CacheSlide: Unlocking cross position-aware KV cache reuse for accelerating LLM serving

    Yang Liu, Yunfei Gu, Liqiang Zhang, Chentao Wu, Guangtao Xue, Jie Li, Minyi Guo, Junhao Hu, and Jie Meng. CacheSlide: Unlocking cross position-aware KV cache reuse for accelerating LLM serving. In Proceedings of the 24th USENIX Conference on File and Storage Technologies, FAST ’26. USENIX Association, 2026

  40. [41]

    Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time

    Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  41. [42]

    Deja Vu: Contextual sparsity for efficient LLMs at inference time

    Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Ré, and Beidi Chen. Deja Vu: Contextual sparsity for efficient LLMs at inference time. In Proceedings of the 40th International Conference on Machine Learning (ICML), pages 22137–22176, 2023

  42. [43]

    NoLiMa: Long-context evaluation beyond literal matching

    Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Trung Bui, Ryan A. Rossi, Seunghyun Yoon, and Hinrich Schütze. NoLiMa: Long-context evaluation beyond literal matching, 2025. ICML 2025

  43. [44]

    NVIDIA H100 Tensor Core GPU architecture

    NVIDIA. NVIDIA H100 Tensor Core GPU architecture. Whitepaper, NVIDIA Corporation, 2023

  44. [45]

    Introducing Codex: A cloud-based software engineering agent

    OpenAI. Introducing Codex: A cloud-based software engineering agent. https://openai.com/index/introducing-codex/, 2025. Accessed: 2025-05-20

  45. [46]

    OpenClaw, 2025

    OpenClaw Contributors. OpenClaw, 2025

  46. [47]

    Efficiently scaling transformer inference

    Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. Efficiently scaling transformer inference. In Proceedings of Machine Learning and Systems (MLSys), 2023

  47. [48]

    Qwen3.5-35B-A3B model card

    Qwen Team. Qwen3.5-35B-A3B model card. https://huggingface.co/Qwen/Qwen3.5-35B-A3B, 2026. Accessed: 2026-06-14

  48. [49]

    Qwen3.5-397B-A17B model card

    Qwen Team. Qwen3.5-397B-A17B model card. https://huggingface.co/Qwen/Qwen3.5-397B-A17B, 2026. Accessed: 2026-06-14

  49. [50]

    Qwen3.5 model collection

    Qwen Team. Qwen3.5 model collection. https://huggingface.co/collections/Qwen/qwen35, 2026. Accessed: 2026-06-14

  50. [51]

    Qwen3.5: Towards native multimodal agents

    Qwen Team. Qwen3.5: Towards native multimodal agents. https://qwen.ai/blog?id=qwen3.5, 2026. Accessed: 2026-06-14

  51. [52]

    HiCache system design and optimization

    SGLang Team. HiCache system design and optimization. https://docs.sglang.ai/advanced_features/hicache_design.html, 2025. Accessed: 2026-05-31

  52. [54]

    FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

    Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. FlashAttention-3: Fast and accurate attention with asynchrony and low-precision. arXiv preprint arXiv:2407.08608, 2024

  53. [55]

    Fast Transformer Decoding: One Write-Head is All You Need

    Noam Shazeer. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150, 2019

  54. [56]

    Outrageously large neural networks: The sparsely-gated mixture-of-experts layer

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations, 2017

  55. [57]

    ProSparse: Introducing and enhancing intrinsic activation sparsity within large language models

    Chenyang Song, Xu Han, Zhengyan Zhang, Shengding Hu, Xiyu Shi, Kuai Li, Chen Chen, Zhiyuan Liu, Guangli Li, Tao Yang, and Maosong Sun. ProSparse: Introducing and enhancing intrinsic activation sparsity within large language models. In Proceedings of the 31st International Conference on Computational Linguistics, 2025

  56. [58]

    PowerInfer: Fast large language model serving with a consumer-grade GPU

    Yixin Song, Zeyu Mi, Haotong Xie, and Haibo Chen. PowerInfer: Fast large language model serving with a consumer-grade GPU. In Proceedings of the 30th Symposium on Operating Systems Principles (SOSP), 2024

  57. [59]

    RoFormer: Enhanced transformer with rotary position embedding

    Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

  58. [60]

    RazorAttention: Efficient KV cache compression through retrieval heads

    Hanlin Tang, Yang Lin, Jing Lin, Qingsen Han, Danning Ke, Shikuan Hong, Yiwu Yao, and Gongyi Wang. RazorAttention: Efficient KV cache compression through retrieval heads. In International Conference on Learning Representations, 2025

  59. [61]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  60. [62]

    MuSiQue: Multihop questions via single hop question composition

    Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. MuSiQue: Multihop questions via single hop question composition. Transactions of the Association for Computational Linguistics, 10:539–554, 2022

  61. [63]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017

  62. [64]

    Prophetkv: User-query-driven selective recomputation for efficient kv cache reuse in retrieval-augmented generation

    Shihao Wang, Jiahao Chen, Yanqi Pan, Hao Huang, Yichen Hao, Xiangyu Zou, Wen Xia, Wentao Zhang, Chongyang Qiu, and Pengfei Wang. Prophetkv: User-query-driven selective recomputation for efficient kv cache reuse in retrieval-augmented generation. arXiv preprint arXiv:2602.02579, 2026

  63. [65]

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art...

  64. [66]

    LongGenBench: Benchmarking long-form generation in long context LLMs

    Yuhao Wu, Ming Shan Hee, Zhiqing Hu, and Roy Ka-Wei Lee. LongGenBench: Benchmarking long-form generation in long context LLMs. In International Conference on Learning Representations, 2025. arXiv:2409.02076

  65. [67]

    DuoAttention: Efficient long-context LLM inference with retrieval and streaming heads

    Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, and Song Han. DuoAttention: Efficient long-context LLM inference with retrieval and streaming heads. In International Conference on Learning Representations, 2025

  66. [68]

    Efficient streaming language models with attention sinks

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In International Conference on Learning Representations, 2024

  67. [69]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Ya...

  68. [70]

    Gated delta networks: Improving Mamba2 with delta rule

    Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving Mamba2 with delta rule,

  69. [71]

    HotpotQA: A dataset for diverse, explainable multi-hop question answering

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2369–2380, 2018

  70. [72]

    CacheBlend: Fast large language model serving for RAG with cached knowledge fusion

    Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Du, Xie Han, Shan Cao, and Junchen Jiang. CacheBlend: Fast large language model serving for RAG with cached knowledge fusion. In Proceedings of the 19th European Conference on Computer Systems (EuroSys), 2025

  71. [73]

    H2O: Heavy-hitter oracle for efficient generative inference of large language models

    Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, and Beidi Chen. H2O: Heavy-hitter oracle for efficient generative inference of large language models. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  72. [74]

    SGLang: Efficient execution of structured language model programs

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Jeff Huang, Chuyue Sun, Cody Hao Yu, Shiyi Cao, Christos Kober, Cedric Shi, Kefan Xiao, Ion Stoica, Hao Zhang, Joseph E. Gonzalez, and Ying Sheng. SGLang: Efficient execution of structured language model programs. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  73. [75]

    DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving

    Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 193–210, 2024