SparseX: Efficient Segment-Level KV Cache Sharing for Interleaved LLM Serving

Bo Tang; Feiyu Xiong; Kai Chen; Ning Liao; Quqing Zhang; Xiaoxing Wang; Zehao Lin; Zhiyu Li

arxiv: 2606.01751 · v2 · pith:CNPMIFD4new · submitted 2026-06-01 · 💻 cs.PF

Quqing Zhang, Kai Chen, Ning Liao, Zehao Lin, Bo Tang, Feiyu Xiong, Zhiyu Li, Xiaoxing Wang

Pith reviewed 2026-06-28 11:43 UTC · model grok-4.3

keywords: KV cache sharing · segment-level reuse · interleaved LLM serving · sparse recomputation · Sparse-Q indices · prefix cache extension · RAG and agent workflows

The pith

SparseX reuses non-prefix KV cache segments in LLM serving by using Sparse-Q indices to estimate which key tokens need correction and recomputing them within a single forward pass.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that conventional prefix caches fall short for real workloads where repeated content appears as interleaved segments across requests, turns, or agents. SparseX treats contiguous token segments as the reuse unit and leverages Sparse-Q indices that already appear during cache reuse to identify which tokens need correction. It then runs Sparse-KV Recomputation inside a single forward pass to restore cross-segment attention interactions. A hybrid full-plus-sparse attention schedule keeps early layers dense for stable importance signals and switches later layers to sparse mode for efficiency. The approach stays model-agnostic, training-free, and compatible with existing Prefix Cache mechanisms while supporting chat, RAG, and agent workflows.
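The idea of treating a contiguous token segment as the reuse unit can be sketched as a position-independent lookup keyed on token content. This is a minimal illustration, not the SparseX-vLLM implementation: `SegmentCache`, `segment_key`, `SEGMENT_LEN`, and the fixed, aligned segmentation are all assumptions made for the example, and the stored handles are placeholder strings rather than real KV blocks.

```python
import hashlib
from typing import Dict, List, Optional, Tuple

SEGMENT_LEN = 4  # hypothetical segment size in tokens, kept tiny for the example


def segment_key(tokens: List[int]) -> str:
    """Content hash of a token segment, independent of where it sits in the prompt."""
    raw = ",".join(str(t) for t in tokens).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


class SegmentCache:
    """Toy store mapping a segment hash to an opaque (placeholder) KV handle."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def insert(self, tokens: List[int]) -> None:
        # Register every aligned, full-length segment of an already processed prompt.
        for start in range(0, len(tokens) - SEGMENT_LEN + 1, SEGMENT_LEN):
            key = segment_key(tokens[start:start + SEGMENT_LEN])
            self._store[key] = f"kv@{key[:8]}"

    def lookup(self, tokens: List[int]) -> List[Tuple[int, int, Optional[str]]]:
        """Return (start, end, handle) per aligned segment; handle is None on a miss."""
        plan = []
        for start in range(0, len(tokens), SEGMENT_LEN):
            seg = tokens[start:start + SEGMENT_LEN]
            handle = self._store.get(segment_key(seg)) if len(seg) == SEGMENT_LEN else None
            plan.append((start, start + len(seg), handle))
        return plan


cache = SegmentCache()
cache.insert([1, 2, 3, 4, 5, 6, 7, 8])         # earlier request
plan = cache.lookup([9, 9, 9, 9, 5, 6, 7, 8])  # different prefix, shared later segment
hits = [(s, e) for s, e, h in plan if h is not None]
print(hits)
```

In this toy run the second request misses on its first segment but hits on tokens 4 to 8, which is the non-prefix case a prefix-only cache would not reuse.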

Core claim

SparseX performs Sparse-KV Recomputation within a single forward pass, using Sparse-Q indices that naturally arise in KV Cache reuse workloads to estimate and correct the key tokens, thereby restoring cross-segment contextual interactions under complex interleaved reuse patterns without additional models or separate preprocessing stages for token selection.

What carries the argument

Sparse-KV Recomputation driven by Sparse-Q token selection, executed inside one forward pass on segment-level cache units.
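As a rough sketch of what selective recomputation means here, the snippet below reuses cached entries everywhere except at the selected positions, which are recomputed. It assumes a hypothetical `recompute` callback and uses placeholder strings in place of real key/value tensors; how SparseX fuses this into its forward pass is not specified in the text above.

```python
from typing import Callable, List, Sequence


def sparse_recompute(
    cached_kv: Sequence[str],
    selected: Sequence[int],
    recompute: Callable[[int], str],
) -> List[str]:
    """Reuse cached entries, recomputing only those at the selected positions."""
    patched = list(cached_kv)
    for pos in selected:
        patched[pos] = recompute(pos)  # fresh entry computed against the new context
    return patched


calls: List[int] = []


def fake_recompute(pos: int) -> str:
    # Stand-in for the real per-token KV computation; records which positions were hit.
    calls.append(pos)
    return f"fresh[{pos}]"


cached = [f"stale[{i}]" for i in range(6)]
patched = sparse_recompute(cached, selected=[1, 4], recompute=fake_recompute)
print(patched)
```

Only two of the six positions trigger recomputation; the remaining four are served from the cache unchanged.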

If this is right

Non-prefix, cross-request, and cross-agent segments become reusable without full recomputation.
A layer-wise threshold switches early layers to full attention and later layers to sparse recomputation.
The system integrates segment lookup, PagedAttention, RoPE alignment, and FlashAttention into one execution path.
No extra models or offline token-selection stages are required.
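The layer-wise switch in the second point can be written down as a simple schedule. The function name `attention_modes` and the parameter `full_layers` are hypothetical, and the threshold value below is chosen only for illustration; the source does not state which threshold the authors use.

```python
from typing import List


def attention_modes(num_layers: int, full_layers: int) -> List[str]:
    """Layers below the threshold run full attention; the rest run sparse recomputation."""
    if not 0 <= full_layers <= num_layers:
        raise ValueError("full_layers must lie in [0, num_layers]")
    return ["full" if layer < full_layers else "sparse" for layer in range(num_layers)]


modes = attention_modes(num_layers=8, full_layers=3)
print(modes)
```

Setting `full_layers` to `num_layers` recovers full recomputation, and setting it to 0 makes every layer sparse, so the two extremes bracket any ablation over the threshold.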

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same segment cache could be extended to share across different models if the Sparse-Q signal remains stable across architectures.
Hybrid attention thresholds might be tuned per task rather than fixed per layer to further reduce recomputation cost on shorter contexts.

Load-bearing premise

Sparse-Q indices already present in reuse workloads can accurately identify the tokens whose keys must be recomputed without introducing large errors in the restored context.
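One way to make this premise concrete is to rank key positions by the attention mass they receive from a small set of query rows and keep the top k. The sketch below is an assumed reading, not the paper's selection rule: `select_key_tokens` is a hypothetical helper, the query rows stand in for whatever the Sparse-Q indices pick out, and causal masking and multi-head structure are omitted.

```python
import math
from typing import List, Sequence


def softmax(xs: Sequence[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def select_key_tokens(
    queries: Sequence[Sequence[float]],
    keys: Sequence[Sequence[float]],
    k: int,
) -> List[int]:
    """Return the k key positions with the largest summed attention weight, in index order."""
    dim = len(keys[0])
    mass = [0.0] * len(keys)
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(dim) for key in keys]
        for j, w in enumerate(softmax(scores)):
            mass[j] += w
    ranked = sorted(range(len(keys)), key=lambda j: mass[j], reverse=True)
    return sorted(ranked[:k])


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
queries = [[2.0, 0.0], [3.0, 0.0]]  # only a sparse subset of query rows is evaluated
selected = select_key_tokens(queries, keys, k=2)
print(selected)
```

The premise holds to the extent that the positions chosen this way match those a full-attention pass would flag, which is exactly what the referee asks the authors to quantify.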

What would settle it

Measure the drop in downstream task accuracy or the increase in effective context error when SparseX is applied to workloads with known ground-truth attention patterns; if the error exceeds the baseline full-attention case by more than a few percent, the selection method fails.

Original abstract

In long-context LLM serving, the prefill stage often dominates time-to-first-token and computational cost. Although Prefix Cache in vLLM/PagedAttention has been widely used to reuse identical prompt prefixes, repeated content in practical applications frequently appears as non-prefix, cross-request, cross-turn, and cross-agent segments, which makes conventional cache mechanisms insufficient. This paper presents SparseX, a segment-level KV Cache sharing method for common serving scenarios. SparseX uses contiguous token segments as reuse units and exploits Sparse-Q indices that naturally arise in KV Cache reuse workloads to estimate the key tokens that require correction. Based on this estimate, SparseX performs Sparse-KV Recomputation within a single forward pass, thereby restoring cross-segment contextual interactions under complex interleaved reuse patterns while avoiding additional models or separate preprocessing stages for token selection. SparseX further implements a full+sparse hybrid attention mode based on a layer-specific threshold: early layers retain full attention to obtain a more stable token-importance signal, and later layers switch to sparse recomputation to improve reuse quality on complex long-context tasks. We implement SparseX-vLLM on top of vLLM, integrating segment-level cache lookup, PagedAttention management, RoPE alignment, Sparse-Q token selection, and FlashAttention backends into a unified execution path. SparseX is model-agnostic, training-free, and compatible with Prefix Cache, and it provides unified support for common online serving scenarios including multi-round chat, retrieval-augmented generation (RAG), and agent workflows.
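The abstract lists RoPE alignment among the integrated components without describing it. A common way to reuse a key cached at one position in another position is to rotate it by the position offset, since rotary rotations compose additively. The sketch below illustrates that general property under stated assumptions: the `rope` helper, the interleaved pair layout, and the base of 10000 are choices made for the example and may differ from what SparseX-vLLM does.

```python
import math
from typing import List


def rope(vec: List[float], position: int, base: float = 10000.0) -> List[float]:
    """Rotate consecutive pairs (vec[i], vec[i+1]) by angle position * base**(-i/d)."""
    d = len(vec)
    out: List[float] = []
    for i in range(0, d, 2):
        angle = position * base ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


key = [0.3, -1.2, 0.7, 2.0]
old_pos, new_pos = 5, 42

cached = rope(key, old_pos)                 # key as stored by an earlier request
realigned = rope(cached, new_pos - old_pos) # shift the cached key by the offset only
direct = rope(key, new_pos)                 # what a fresh computation would give

max_err = max(abs(a - b) for a, b in zip(realigned, direct))
print(max_err)
```

The realigned key agrees with the freshly encoded one up to floating-point error, so position can be corrected without recomputing the key projection itself.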

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SparseX extends KV cache to segments via Sparse-Q recomputation for interleaved serving, but the accuracy of its token selection lacks quantitative support.

The main thing to know is that SparseX brings segment-level KV cache sharing to interleaved LLM serving by using Sparse-Q indices for targeted recomputation in one pass.

The paper does a solid job on the system side. They built it into vLLM, handling segment lookup, paged attention, RoPE, and FlashAttention in one path. The hybrid full-plus-sparse attention per layer is a reasonable way to balance stability and speed. It's training-free and works alongside prefix caching, which lowers the barrier for adoption in multi-turn, RAG, and agent setups.

The soft spot is the reliance on Sparse-Q indices to pick the right tokens for correction. The description covers how they do the recomputation, but there are no numbers showing how accurate the selection is compared to full attention or an oracle choice. Without that, it's hard to know if the cross-segment interactions are restored well enough in practice, especially when reuse patterns get complex.

This work is for engineers running production LLM servers who need better cache reuse beyond simple prefixes. A practitioner tuning inference for long-context agents would pick up useful implementation tricks.

It deserves a serious referee. The problem is grounded in real serving pain points and the solution is implemented end-to-end.

I recommend sending it for review, focusing on adding error bounds or ablations for the token selection step.

Referee Report

2 major / 2 minor

Summary. The paper presents SparseX, a segment-level KV cache sharing system for LLM serving that targets non-prefix reuse patterns in multi-turn chat, RAG, and agent workflows. It exploits Sparse-Q indices that arise naturally during cache reuse to select tokens for Sparse-KV Recomputation performed inside a single forward pass, thereby restoring cross-segment attention interactions. A layer-specific full+sparse hybrid attention mode is used (full attention in early layers for stable importance signals, sparse recomputation in later layers), with the whole mechanism integrated into vLLM including segment lookup, PagedAttention, RoPE alignment, and FlashAttention. The approach is claimed to be model-agnostic, training-free, and compatible with Prefix Cache.

Significance. If the Sparse-Q selection error remains low under interleaved reuse, the technique would meaningfully extend KV-cache reuse beyond prefix matching and reduce prefill cost in realistic serving workloads without requiring auxiliary models or offline preprocessing. The single-pass recomputation and hybrid attention design are pragmatic engineering contributions that could be adopted in production systems.

major comments (2)

[§3.2] The central claim that Sparse-Q indices suffice to select the tokens whose recomputation restores cross-segment interactions rests on an unquantified assumption. No error bound, oracle comparison, or L2/attention-score deviation metric versus full recomputation is provided for the selected indices under the described reuse patterns.
[§4.1] The layer-wise full+sparse switch and RoPE-aligned recomputation are described, yet no ablation quantifies how selection error in early layers propagates to later-layer attention outputs or final generation quality in multi-turn/agent traces. This directly affects the claim that the restored context is sufficiently close to the non-cached baseline.

minor comments (2)

[§1] The abstract and §1 repeatedly use “naturally arise” for Sparse-Q indices; a brief characterization of the workloads or attention patterns that produce these indices would improve clarity.
Implementation details on how segment-level cache lookup interacts with PagedAttention block management are mentioned but not accompanied by a diagram or pseudocode; adding one would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. The comments highlight important aspects of quantitative validation for the Sparse-Q mechanism and its layer-wise behavior. We address each point below and commit to revisions that strengthen the manuscript without altering its core claims.

Referee: [§3.2] The central claim that Sparse-Q indices suffice to select the tokens whose recomputation restores cross-segment interactions rests on an unquantified assumption. No error bound, oracle comparison, or L2/attention-score deviation metric versus full recomputation is provided for the selected indices under the described reuse patterns.

Authors: We agree that a direct quantitative characterization of Sparse-Q selection error would strengthen the central claim. The current manuscript evaluates the approach through end-to-end metrics (TTFT reduction and output quality) on interleaved workloads rather than intermediate token-selection error. In the revision we will add an oracle comparison subsection that reports L2 deviation of attention scores and top-k overlap between Sparse-Q selected tokens and full recomputation for representative multi-turn, RAG, and agent traces. This addition will be placed in §3.2 alongside the existing description. Revision: yes.

Referee: [§4.1] The layer-wise full+sparse switch and RoPE-aligned recomputation are described, yet no ablation quantifies how selection error in early layers propagates to later-layer attention outputs or final generation quality in multi-turn/agent traces. This directly affects the claim that the restored context is sufficiently close to the non-cached baseline.

Authors: We acknowledge the absence of a dedicated propagation ablation. The design rationale for full attention in early layers is precisely to obtain stable importance signals before switching to sparse recomputation; however, we did not quantify how residual selection error at the switch point affects downstream layers or final perplexity/quality. In the revised manuscript we will include an ablation that varies the layer threshold, measures attention-output L2 deviation relative to a full-recomputation baseline, and reports generation quality on the same multi-turn and agent traces used in §4. This will directly address the propagation concern. revision: yes
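The two quantities named in these responses, top-k overlap against an oracle selection and L2 deviation against a full-recomputation baseline, are simple to state in code. The helper names `topk_overlap` and `l2_deviation` and the toy inputs are hypothetical; the numbers below are not results from the paper.

```python
import math
from typing import Iterable, Sequence


def topk_overlap(selected: Iterable[int], oracle: Iterable[int]) -> float:
    """Fraction of oracle-selected token positions that the method also selected."""
    sel, ora = set(selected), set(oracle)
    return len(sel & ora) / len(ora) if ora else 1.0


def l2_deviation(approx: Sequence[float], reference: Sequence[float]) -> float:
    """Euclidean distance between an approximate output and the full-recomputation one."""
    if len(approx) != len(reference):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - r) ** 2 for a, r in zip(approx, reference)))


overlap = topk_overlap(selected=[0, 2, 5], oracle=[0, 2, 7])
deviation = l2_deviation([1.0, 2.0], [1.0, 0.0])
print(overlap, deviation)
```

Reporting both per layer, for several values of the layer threshold, would show whether selection error at the switch point grows or stays bounded in later layers.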

Circularity Check

0 steps flagged

No significant circularity detected

The paper describes an engineering method for segment-level KV cache sharing that exploits Sparse-Q indices arising naturally from reuse workloads, performs single-pass Sparse-KV Recomputation, and uses a layer-wise full+sparse hybrid attention mode. No equations, parameter fits, or derivations are shown that reduce by construction to the inputs; no self-citation chains or uniqueness theorems imported from prior author work appear in the provided text. The approach is presented as model-agnostic and training-free without renaming known results or smuggling ansatzes via citation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no information on free parameters, axioms, or invented entities; all fields left empty.

[1] [1]

Gonzalez, Hao Zhang, and Ion Stoica

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the 29th ACM Symposium on Operating Systems Principles (SOSP), 2023

2023

[2] [2]

RoFormer: Enhanced Transformer with Rotary Position Embedding.Neurocomputing, 568:127063, 2024

Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced Transformer with Rotary Position Embedding.Neurocomputing, 568:127063, 2024. 21

2024

[3] [3]

Fu, Stefano Ermon, Atri Rudra, and Christopher Re

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Re. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. InAdvancesin Neural Information Processing Systems 35 (NeurIPS), 2022

2022

[4] [4]

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Kuttler, Mike Lewis, Wen-tau Yih, Tim Rocktaschel, Sebastian Riedel, and Douwe Kiela. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems 33 (NeurIPS), 2020

2020

[5] [5]

Gomez, Lukasz Kaiser, and Illia Polosukhin

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. InAdvancesin Neural Information Processing Systems 30 (NeurIPS), 2017

2017

[6] [6]

CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion

Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang. CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion. In Proceedings of the 20th European Conference on Computer Systems (EuroSys), 2025

2025

[7] [7]

EPIC: Efficient Position-Independent Caching for Serving Large Language Models

Junhao Hu, Wenrui Huang, Weidong Wang, Haoyi Wang, Tiancheng Hu, Zhang Qin, Hao Feng, Xusheng Chen, Yizhou Shan, and Tao Xie. EPIC: Efficient Position-Independent Caching for Serving Large Language Models. In Proceedings of the 42nd International Conference on Machine Learning (ICML), 2025

2025

[8] [8]

KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems.arXiv preprint arXiv:2510.12872, 2025

Hancheng Ye, Zhengqi Gao, Mingyuan Ma, Qinsi Wang, Yuzhe Fu, Ming-Yu Chung, Yueqian Lin, Zhijian Liu, Jianyi Zhang, Danyang Zhuo, and Yiran Chen. KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems.arXiv preprint arXiv:2510.12872, 2025

arXiv 2025

[9] [9]

Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu

Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention. InAdvances in Neural Information Processing Systems 37 (NeurIPS), 2024

2024

[10] [10]

FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving

Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, and Luis Ceze. FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving. InProceedings of Machine Learning and Systems (MLSys), 2025

2025

[11] [11]

Gonzalez, Clark Barrett, and Ying Sheng

Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. SGLang: Efficient Execution of Structured Language Model Programs. InAdvances in Neural Information Processing Systems 37 (NeurIPS), 2024

2024

[12] [12]

CacheClip: Accelerating RAG with Effective KV Cache Reuse

Bin Yang, Qiuyu Leng, Jun Zeng, and Zhenhua Wu. CacheClip: Accelerating RAG with Effective KV Cache Reuse. arXiv preprint arXiv:2510.10129, 2025

Pith/arXiv arXiv 2025

[13] [13]

DroidSpeak: KV Cache Sharing Across Fine-tuned Model Variants

Yuhan Liu, Yuyang Huang, Jiayi Yao, Shaoting Feng, Zhuohan Gu, Kuntai Du, Hanchen Li, Yihua Cheng, Junchen Jiang, Shan Lu, Madan Musuvathi, and Esha Choukse. DroidSpeak: KV Cache Sharing Across Fine-tuned Model Variants. In Proceedings of the 23rd USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2026

2026

[14] [14]

KVShare: An LLM Service System with Efficient and Effective Multi-Tenant KV Cache Reuse.arXiv preprint arXiv:2503.16525, 2025

Huan Yang, Renji Zhang, Mingzhe Huang, Weijun Wang, Yin Tang, Yuanchun Li, Yunxin Liu, and Deyu Zhang. KVShare: An LLM Service System with Efficient and Effective Multi-Tenant KV Cache Reuse.arXiv preprint arXiv:2503.16525, 2025

arXiv 2025

[15] [15]

KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse.arXiv preprint arXiv:2502.16002, 2025

Jingbo Yang, Bairu Hou, Wei Wei, Yujia Bao, and Shiyu Chang. KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse.arXiv preprint arXiv:2502.16002, 2025

arXiv 2025

[16] [16]

SnapKV: LLM Knows What You are Looking for Before Generation

Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. SnapKV: LLM Knows What You are Looking for Before Generation. InAdvances in Neural Information Processing Systems 37 (NeurIPS), 2024

2024

[17] [17]

H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Re, Clark Barrett, Zhangyang Wang, and Beidi Chen. H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. In Advances in Neural Information Processing Systems 36 (NeurIPS), 2023

2023

[18] [18]

Efficient Streaming Language Models with Attention Sinks

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient Streaming Language Models with Attention Sinks. InProceedings of the International Conference on Learning Representations (ICLR), 2024. 22

2024

[19] [19]

Qwen2.5-1M Technical Report. arXiv preprint arXiv:2501.15383, 2025

An Yang, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoyan Huang, Jiandong Jiang, Jianhong Tu, Jianwei Zhang, Jingren Zhou, Junyang Lin, Kai Dang, Kexin Yang, Le Yu, Mei Li, Minmin Sun, Qin Zhu, Rui Men, Tao He, Weijia Xu, Wenbiao Yin, Wenyuan Yu, et al. Qwen2.5-1M Technical Report. arXiv preprint arXiv:2501.15383, 2025

Pith/arXiv arXiv 2025

[20] [20]

Needle In A Haystack: Pressure Testing LLMs

Greg Kamradt. Needle In A Haystack: Pressure Testing LLMs. GitHub repository, 2023. Available at https://github.com/gkamradt/LLMTest_NeedleInAHaystack

2023

[21] [21]

Qwen3 Technical Report. arXiv preprint arXiv:2505.09388, 2025

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, et al. Qwen3 Technical Report. arXiv preprint arXiv:2505.09388, 2025

Pith/arXiv arXiv 2025

[22] [22]

Evaluating Very Long-Term Conversational Memory of LLM Agents

Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating Very Long-Term Conversational Memory of LLM Agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), 2024

2024

[23] [23]

LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory. In Proceedings of the International Conference on Learning Representations (ICLR), 2025

2025

[24] [24]

RULER: What’s the Real Context Size of Your Long-Context Language Models? In Proceedings of the Conference on Language Modeling (COLM), 2024

Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, and Boris Ginsburg. RULER: What’s the Real Context Size of Your Long-Context Language Models? In Proceedings of the Conference on Language Modeling (COLM), 2024

2024

[25] [25]

Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring Mathematical Problem Solving With the MATH Dataset. In Advances in Neural Information Processing Systems Datasets and Benchmarks Track (NeurIPS), 2021

2021

[26] [26]

PAL: Program-Aided Language Models. arXiv preprint arXiv:2211.10435, 2022

Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. PAL: Program-Aided Language Models. arXiv preprint arXiv:2211.10435, 2022

Pith/arXiv arXiv 2022

[27] [27]

Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems

Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), 2017

2017

[28] [28]

SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models

Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R. Loomba, Shichang Zhang, Yizhou Sun, and Wei Wang. SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models. In Proceedings of the 41st International Conference on Machine Learning (ICML), volume 235 of Proceedings of Machine Learni...

2024

[29] [29]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A Graduate-Level Google-Proof Q&A Benchmark. In Proceedings of the Conference on Language Modeling (COLM), 2024

2024

[30] [30]

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark. arXiv preprint arXiv:2406.01574, 2024

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark. arXiv preprint arXiv:2406.01574, 2024

Pith/arXiv arXiv 2024

[31] [31]

What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams. Applied Sciences, 11(14):6421, 2021

Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams. Applied Sciences, 11(14):6421, 2021

2021

[32] [32]

MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical Domain Question Answering

Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical Domain Question Answering. In Proceedings of the Conference on Health, Inference, and Learning (CHIL), volume 174 of Proceedings of Machine Learning Research, pages 248–260. PMLR, 2022

2022

[33] [33]

MASLab: A Unified and Comprehensive Codebase for LLM-based Multi-Agent Systems

Rui Ye, Keduan Huang, Qimin Wu, Yuzhu Cai, Tian Jin, Xianghe Pang, Xiangrui Liu, Jiaqi Su, Chen Qian, Bohan Tang, Kaiqu Liang, Jiaao Chen, Yue Hu, Zhenfei Yin, Rongye Shi, Bo An, Yang Gao, Wenjun Wu, Lei Bai, and Siheng Chen. MASLab: A Unified and Comprehensive Codebase for LLM-based Multi-Agent Systems. arXiv preprint arXiv:2505.16988, 2025

arXiv 2025

[34] [34]

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed H. Awadallah, Ryen W. White, Doug Burger, and Chi Wang. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. In Proceedings of the Conference on Language Modeling (COLM), 2024

2024

[35] [35]

Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate

Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Shuming Shi, and Zhaopeng Tu. Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 17889–17904, 2024

2024