Multi-Segment Attention: Enabling Efficient KV-Cache Management for Faster Large Language Model Serving
Pith reviewed 2026-06-28 11:36 UTC · model grok-4.3
The pith
AsymCache reduces time-to-first-token in LLM inference by up to 2x by aligning KV cache decisions with attention kernel costs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AsymCache is a computation-latency-aware KV cache management system for LLM inference that explicitly aligns cache residency decisions with GPU attention kernel performance, including three key components: Multi-Segment Attention (MSA) for efficient non-contiguous KV context processing, a cache eviction policy that jointly optimizes hit rate and position-aware recomputation cost, and an adaptive chunking scheduler for high hardware utilization, which together reduce TTFT by up to 1.90-2.03x and TPOT by 1.62-1.71x over latest baselines while preserving exact outputs and enabling integration into agent serving systems.
What carries the argument
Multi-Segment Attention (MSA), which enables efficient processing of non-contiguous KV cache blocks to support position-aware eviction without prohibitive recomputation overhead.
If this is right
- KV cache eviction can be improved by jointly optimizing hit rate and position-dependent recomputation costs rather than using frequency or recency alone.
- Adaptive chunking during attention computation raises hardware utilization when cache blocks are non-contiguous.
- The design integrates directly into existing agent serving frameworks and yields further average job latency reductions of up to 18.1 percent.
Where Pith is reading between the lines
- Similar kernel-aware eviction logic could be applied to other memory-intensive operations beyond attention if their recomputation costs also vary with data layout.
- The position-aware cost model may need recalibration when moving to new GPU architectures whose attention kernels exhibit different scaling with segment count.
- Future cache systems might benefit from exposing low-level kernel timing models to the eviction policy instead of treating attention as a black-box cost.
Load-bearing premise
The reported speedups arise from the proposed MSA, joint eviction policy, and adaptive scheduler rather than from differences in workloads, model sizes, or baseline implementations.
What would settle it
Re-run the experiments on identical hardware and workloads using the baseline systems equipped with the same low-level optimizations as AsymCache but without MSA and the position-aware recomputation term, then check whether the 1.9-2x TTFT and 1.6-1.7x TPOT gains disappear.
Figures
read the original abstract
Large Language Model (LLM) inference relies on key-value (KV) caches to avoid redundant attention computation. While approximate KV cache retention techniques reduce memory usage by sacrificing model accuracy, lossless approaches instead evict KV cache blocks from GPU memory and reconstruct them on demand to preserve exact outputs. Existing lossless KV cache management systems primarily base eviction decisions on access frequency or positional heuristics, without considering how different KV cache blocks affect the execution efficiency of GPU attention kernels. In this paper, we propose AsymCache, a computation-latency-aware KV cache management system for LLM inference that explicitly aligns cache residency decisions with GPU attention kernel performance, including three key components: Multi-Segment Attention (MSA) for efficient non-contiguous KV context processing, a cache eviction policy that jointly optimizes hit rate and position-aware recomputation cost, and an adaptive chunking scheduler for high hardware utilization. Experiments show that AsymCache reduces TTFT by up to 1.90-2.03x and time-per-output-token (TPOT) by 1.62-1.71x over latest baselines, confirming the effectiveness of the method in common workloads and validating its design goal of balancing computational efficiency with cache hit rate. Moreover, the low-level design of AsymCache allows seamless integration into agent serving systems such as Continuum, where it further reduces average job latency by up to 18.1%.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes AsymCache, a KV-cache management system for LLM inference consisting of Multi-Segment Attention (MSA) for non-contiguous KV processing, a joint hit-rate and position-aware recomputation eviction policy, and an adaptive chunking scheduler. It claims these components yield TTFT speedups of 1.90-2.03x and TPOT speedups of 1.62-1.71x versus latest baselines while preserving exact outputs, with further latency gains when integrated into systems such as Continuum.
Significance. If the reported speedups are shown to arise from the algorithmic components rather than implementation artifacts, the work would offer a practical advance in lossless KV-cache management that directly ties eviction decisions to GPU kernel efficiency, potentially improving serving throughput for long-context workloads.
major comments (1)
- [Abstract] Abstract: the headline performance claims (TTFT 1.90-2.03x, TPOT 1.62-1.71x) are presented without any description of models, workloads, hardware, baseline re-implementations, or measurement methodology. This directly prevents assessment of whether the gains are produced by MSA, the joint eviction policy, and the scheduler, or by unstated differences in memory layout, kernel choice, or chunking strategy between AsymCache and the baselines.
minor comments (1)
- The title emphasizes Multi-Segment Attention while the abstract centers the system name AsymCache; a brief clarification of how MSA relates to the overall AsymCache design would improve readability.
Simulated Author's Rebuttal
We thank the referee for the careful review and constructive comment. We agree that the abstract lacks sufficient detail on the experimental setup, which is necessary to properly contextualize the reported speedups and allow assessment of whether they stem from the proposed algorithmic components.
read point-by-point responses
-
Referee: [Abstract] Abstract: the headline performance claims (TTFT 1.90-2.03x, TPOT 1.62-1.71x) are presented without any description of models, workloads, hardware, baseline re-implementations, or measurement methodology. This directly prevents assessment of whether the gains are produced by MSA, the joint eviction policy, and the scheduler, or by unstated differences in memory layout, kernel choice, or chunking strategy between AsymCache and the baselines.
Authors: We agree that the current abstract does not provide the necessary context on models, workloads, hardware, baselines, or methodology. In the revised version we will expand the abstract (within length constraints) to include brief but explicit information on the evaluated models (Llama-2-7B/13B and Mistral-7B), workloads (long-context generation and chat), hardware (A100/H100 GPUs), baseline re-implementations (vLLM, FlexGen, and recent KV-cache eviction methods), and measurement methodology (end-to-end TTFT/TPOT with exact output verification). The full experimental details will remain in Section 5, but the abstract will now allow readers to immediately assess the source of the gains. We will also add a short sentence clarifying that all comparisons use identical memory layouts and kernel backends where possible. revision: yes
Circularity Check
No circularity; empirical performance claims with no derivations or self-referential predictions
full rationale
The paper is a systems/empirical contribution focused on measured speedups (TTFT/TPOT) from AsymCache components. No equations, first-principles derivations, fitted parameters presented as predictions, or load-bearing self-citations appear in the provided text. Claims rest on experimental comparisons against baselines rather than any chain that reduces to its own inputs by construction. This is the standard case of a self-contained empirical result.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[3]
Reyna Abhyankar, Zijian He, Vikranth Srivatsa, Hao Zhang, and Yiying Zhang
- [4]
-
[5]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Floren- cia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Multi-Segment Attention: Enabling Efficient KV-Cache Management for Faster Large Language Model Serving Anadkat, et al. 2023. Gpt-4 technical report.arXiv preprint arXiv:2303.08774 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[6]
Shubham Agarwal, Sai Sundaresan, Subrata Mitra, Debabrata Mahapatra, Archit Gupta, Rounak Sharma, Nirmal Joshua Kapu, Tong Yu, and Shiv Saini. 2025. Cache-craft: Managing chunk-caches for efficient retrieval-augmented genera- tion.Proceedings of the ACM on Management of Data3, 3 (2025), 1–28
2025
-
[7]
Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024. Taming {Throughput-Latency} tradeoff in {LLM} inference with {Sarathi-Serve}. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 117–134
2024
-
[8]
Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. 2023. Gqa: Training generalized multi-query trans- former models from multi-head checkpoints.arXiv preprint arXiv:2305.13245 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[9]
Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al . 2024. Longbench: A bilingual, multitask benchmark for long context understanding. InProceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers). 3119–3137
2024
-
[10]
Zhuohang Bian, Feiyang Wu, Teng Ma, and Youwei Zhuo. 2025. Tokencake: A KV-Cache-centric Serving Framework for LLM-based Multi-Agent Applications. arXiv preprint arXiv:2510.18586(2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[11]
Edward Y. Chang and Longling Geng. 2025. SagaLLM: Context Management, Validation, and Transaction Guarantees for Multi-Agent LLM Planning.Proc. VLDB Endow.18, 12 (2025), 4874–4886. https://doi.org/10.14778/3750601.3750611
-
[12]
Yukang Chen, Weihao Cui, Han Zhao, Ziyi Xu, Xiaoze Fan, Xusheng Chen, Yangjie Zhou, Shixuan Sun, Bingsheng He, and Quan Chen. 2026. Towards High-Goodput LLM Serving with Prefill-decode Multiplexing. InProceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 2030–2047
2026
-
[13]
Yaoqi Chen, Jinkai Zhang, Baotong Lu, Qianxi Zhang, Chengruidong Zhang, Jing Liu, Jingjia Luo, Di Liu, Huiqiang Jiang, Qi Chen, Bailu Ding, Xiao Yan, Jiawei Jiang, Chen Chen, Mingxing Zhang, Cheng Li, Yuqing Yang, Fan Yang, and Mao Yang. 2026. RetroInfer: A Vector Storage Engine for Scalable Long- Context LLM Inference.Proc. VLDB Endow.19, 5 (2026), 1016–...
2026
- [14]
-
[15]
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models.arXiv e-prints(2024), arXiv–2407
2024
-
[16]
Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, and Pengfei Zuo. 2024. Attentionstore: Cost- effective attention reuse across multi-turn conversations in large language model serving.arXiv preprint arXiv:2403.1970852 (2024), 20–38
-
[17]
Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, and Pengfei Zuo. 2024. {Cost-Efficient} large language model serving for multi-turn conversations with {CachedAttention}. In2024 USENIX Annual Technical Conference (USENIX ATC 24). 111–126
2024
-
[18]
Shiwei Gao, Youmin Chen, and Jiwu Shu. 2025. Fast state restoration in LLM serving with HCache. InProceedings of the Twentieth European Conference on Computer Systems. 128–143
2025
-
[19]
Shihong Gao, Xin Zhang, Yanyan Shen, and Lei Chen. 2025. Apt-Serve: Adap- tive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serving. Proceedings of the ACM on Management of Data3, 3 (2025), 1–28
2025
-
[20]
Victor Giannakouris and Immanuel Trummer. 2025. 𝜆-tune: Harnessing large language models for automated database system tuning.Proceedings of the ACM on Management of Data3, 1 (2025), 1–26
2025
-
[21]
In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, and Lin Zhong. 2024. Prompt cache: Modular attention reuse for low-latency inference. Proceedings of Machine Learning and Systems6 (2024), 325–338
2024
-
[22]
Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, et al . 2024. Deepspeed-fastgen: High-throughput text generation for llms via mii and deepspeed-inference.arXiv preprint arXiv:2401.08671(2024)
-
[23]
Chuxuan Hu, Austin Peters, and Daniel Kang. 2024. LEAP: LLM-powered End- to-end Automatic Library for Processing Social Science Queries on Unstructured Data.Proc. VLDB Endow.18, 2 (2024), 253–264. https://doi.org/10.14778/3705829. 3705843
-
[24]
Wei Huang, Anda Cheng, Yinggui Wang, Lei Wang, and Tao Wei. 2026. LLM- AutoDP: Automatic Data Processing via LLM Agents for Model Fine-tuning.Proc. VLDB Endow.19, 5 (2026), 794–807. https://www.vldb.org/pvldb/vol19/p794- cheng.pdf
2026
-
[25]
Xinmei Huang, Haoyang Li, Jing Zhang, Xinxin Zhao, Zhiming Yao, Yiyan Li, Tieying Zhang, Jianjun Chen, Hong Chen, and Cuiping Li. 2025. E2ETune: End-to- End Knob Tuning via Fine-tuned Generative Language Model.Proc. VLDB Endow. 18, 13 (2025), 5540–5554. https://www.vldb.org/pvldb/vol18/p5540-huang.pdf
2025
-
[26]
Wenqi Jiang, Marco Zeller, Roger Waleffe, Torsten Hoefler, and Gustavo Alonso
-
[27]
Chameleon: A Heterogeneous and Disaggregated Accelerator System for Retrieval-Augmented Language Models.Proceedings of the VLDB Endowment18, 1 (2024), 42–52
2024
-
[28]
Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Shufan Liu, Xuanzhe Liu, and Xin Jin. 2025. Ragcache: Efficient knowledge caching for retrieval-augmented generation.ACM Transactions on Computer Systems44, 1 (2025), 1–27
2025
- [29]
-
[30]
Aditya K Kamath, Ramya Prabhu, Jayashree Mohan, Simon Peter, Ramachandran Ramjee, and Ashish Panwar. 2025. Pod-attention: Unlocking full prefill-decode overlap for faster llm inference. InProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 897–912
2025
-
[31]
Hao Kang, Ziyang Li, Xinyu Yang, Weili Xu, Yinfang Chen, Junxiong Wang, Beidi Chen, Tushar Krishna, Chenfeng Xu, and Simran Arora. 2026. Thunderagent: A simple, fast and program-aware agentic inference system.arXiv preprint arXiv:2602.13692(2026)
work page internal anchor Pith review arXiv 2026
-
[32]
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles. 611–626
2023
-
[33]
Hanchen Li, Qiuyang Mang, Runyuan He, Qizheng Zhang, Huanzhi Mao, Xi- aokun Chen, Alvin Cheung, Joseph Gonzalez, and Ion Stoica. 2025. Continuum: Efficient and Robust Multi-Turn LLM Agent Scheduling with KV Cache Time-to- Live.arXiv preprint arXiv:2511.02230(2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [34]
-
[35]
Jiaqi Li, Mengmeng Wang, Zilong Zheng, and Muhan Zhang. 2024. Loogle: Can long-context language models understand long contexts?. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 16304–16333
2024
-
[36]
Yuhang Li, Rong Gu, Chengying Huan, Zhibin Wang, Renjie Yao, Chen Tian, and Guihai Chen. 2025. HotPrefix: Hotness-Aware KV Cache Scheduling for Efficient Prefix Sharing in LLM Inference Systems.Proceedings of the ACM on Management of Data3, 4 (2025), 1–27
2025
-
[37]
Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, and Lili Qiu. 2024. Parrot: Efficient serving of {LLM-based} applications with semantic variable. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 929–945
2024
-
[38]
Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et al . 2024. Deepseek- v2: A strong, economical, and efficient mixture-of-experts language model.arXiv preprint arXiv:2405.04434(2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[39]
Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, et al. 2024. Cachegen: Kv cache compression and streaming for fast large language model serving. InProceedings of the ACM SIGCOMM 2024 Conference. 38–56
2024
-
[40]
Kuan Lu, Zhihui Yang, Sai Wu, Ruichen Xia, Dongxiang Zhang, and Gang Chen
-
[41]
Adda: Towards Efficient in-Database Feature Generation via LLM-based Agents.Proceedings of the ACM on Management of Data3, 3 (2025), 1–27
2025
- [42]
-
[43]
Xinyue Ma, Heelim Hong, Taegeon Um, Jongseop Lee, Seoyeong Choy, Woo- Yeon Lee, and Myeongjae Jeon. 2026. ORBITFLOW: SLO-Aware Long-Context LLM Serving with Fine-Grained KV Cache Reconfiguration.Proc. VLDB Endow. 19, 5 (2026), 1046–1059. https://www.vldb.org/pvldb/vol19/p1046-ma.pdf
2026
-
[44]
2023.TensorRT-LLM
NVIDIA. 2023.TensorRT-LLM. https://github.com/NVIDIA/TensorRT-LLM High-Performance Deep Learning Inference
2023
-
[45]
NVIDIA Corporation. 2024. NVIDIA CUDA Toolkit, Version 12.8. https:// developer.nvidia.com/cuda-toolkit. https://developer.nvidia.com/cuda-toolkit
2024
-
[46]
NVIDIA Corporation and CUTLASS Contributors. 2024. CUTLASS: CUDA Tem- plates for Linear Algebra Subroutines, Version 3.4.0. https://github.com/NVIDIA/ cutlass. https://github.com/NVIDIA/cutlass GitHub repository. Accessed: 2026- 01-17
2024
-
[47]
Zaifeng Pan, AJJKUMAR DAHYALAL PATEL, Yipeng Shen, Zhengding Hu, Yue Guan, Wan-Lu Li, Lianhui Qin, Yida Wang, and Yufei Ding. 2026. KVFlow: Effi- cient prefix caching for accelerating LLM-based multi-agent workflows.Advances in Neural Information Processing Systems38 (2026), 126246–126265. Chunan Shi, Yilei Chen, Yilin Chen, Xupeng Miao, and Bin Cui
2026
-
[48]
Shishir G Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E Gonzalez. 2025. The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models. InForty- second International Conference on Machine Learning
2025
-
[49]
Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Heyi Tang, Feng Ren, Teng Ma, Shangming Cai, Yineng Zhang, Mingxing Zhang, et al . 2024. Mooncake: A kvcache-centric disaggregated architecture for llm serving.ACM Transactions on Storage(2024)
2024
-
[50]
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-lm: Training multi-billion parameter language models using model parallelism.arXiv preprint arXiv:1909.08053(2019)
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[51]
Jie Tan, Kangfei Zhao, Rui Li, Jeffrey Xu Yu, Chengzhi Piao, Hong Cheng, Helen Meng, Deli Zhao, and Yu Rong. 2025. Can large language models be query optimizer for relational databases?Proceedings of the ACM on Management of Data3, 6 (2025), 1–28
2025
-
[52]
Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. 2023. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805(2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[53]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need.Advances in neural information processing systems30 (2017)
2017
-
[54]
Jiahao Wang, Jinbo Han, Xingda Wei, Sijie Shen, Dingyan Zhang, Chenguang Fang, Rong Chen, Wenyuan Yu, and Haibo Chen. 2025. KVCache cache in the wild: characterizing and optimizing KVCache cache at a large cloud provider. In Proceedings of the 2025 USENIX Conference on Usenix Annual Technical Conference (Boston, MA, USA)(USENIX ATC ’25). USENIX Associatio...
2025
-
[55]
Haojun Xia, Zhen Zheng, Yuchao Li, Donglin Zhuang, Zhongzhu Zhou, Xiafei Qiu, Yong Li, Wei Lin, and Shuaiwen Leon Song. 2023. Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Un- structured Sparsity.Proceedings of the VLDB Endowment17, 2 (2023), 211–224
2023
-
[56]
Jianxin Yan, Wangze Ni, Lei Chen, Xuemin Lin, Peng Cheng, Zhan Qin, and Kui Ren. 2025. ContextCache: Context-Aware Semantic Cache for Multi-Turn Queries in Large Language Models.Proceedings of the VLDB Endowment18, 12 (2025), 5391–5394
2025
-
[57]
Lu Ye, Ze Tao, Yong Huang, and Yang Li. 2024. ChunkAttention: Efficient Self- Attention with Prefix-Aware KV Cache and Two-Phase Partition. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics, Bangk...
-
[58]
Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung- Gon Chun. 2022. Orca: A distributed serving system for {Transformer-Based} generative models. In16th USENIX symposium on operating systems design and implementation (OSDI 22). 521–538
2022
-
[59]
Lingfan Yu, Jinkun Lin, and Jinyang Li. 2025. Stateful large language model serving with pensieve. InProceedings of the Twentieth European Conference on Computer Systems. 144–158
2025
-
[60]
Hao Yuan, Xin Ai, Qiange Wang, Peizheng Li, Jiayang Yu, Chaoyi Chen, Xinbo Yang, Yanfeng Zhang, Zhenbo Fu, Yingyou Wen, et al. 2025. DepCache: A KV Cache Management Framework for GraphRAG with Dependency Attention. Proceedings of the ACM on Management of Data3, 6 (2025), 1–29
2025
-
[61]
Enhao Zhang, Nicole Sullivan, Brandon Haynes, Ranjay Krishna, and Magdalena Balazinska. 2025. Self-Enhancing Video Data Management System for Com- positional Events with Large Language Models.Proceedings of the ACM on Management of Data3, 3 (2025), 1–29
2025
-
[62]
Hailin Zhang, Xiaodong Ji, Yilin Chen, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, Weipeng Chen, and Bin Cui. 2025. Pqcache: Product quantization-based kvcache for long context llm inference.Proceedings of the ACM on Management of Data3, 3 (2025), 1–30
2025
-
[63]
Qizheng Zhang, Michael Wornow, and Kunle Olukotun. 2025. Cost-efficient serving of llm agents via test-time plan caching. InES-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models
2025
-
[64]
Xinyang Zhao, Xuanhe Zhou, and Guoliang Li. 2024. Chat2data: An interactive data analysis system with rag, vector databases and llms.Proceedings of the VLDB Endowment17, 12 (2024), 4481–4484
2024
-
[65]
Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Livia Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. 2024. Sglang: Efficient execution of structured language model programs. Advances in neural information processing systems37 (2024), 62557–62583
2024
-
[66]
Xuanhe Zhou, Guoliang Li, Zhaoyan Sun, Zhiyuan Liu, Weize Chen, Jianming Wu, Jiesi Liu, Ruohang Feng, and Guoyang Zeng. 2024. D-Bot: Database Diagnosis System using Large Language Models.Proc. VLDB Endow.17, 10 (2024), 2514–
2024
-
[67]
Lemma 1.Let 𝑓 : R→R be a continuous, non-negative, and non-constant function
https://doi.org/10.14778/3675034.3675043 Multi-Segment Attention: Enabling Efficient KV-Cache Management for Faster Large Language Model Serving A APPENDIX A.1 Properties of Order-Preserving Rule We now proof that only exponential function can satisfy the order- preserving rule proposed in Section 4.4. Lemma 1.Let 𝑓 : R→R be a continuous, non-negative, an...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.