Recognition: 2 theorem links
· Lean TheoremForkKV: Scaling Multi-LoRA Agent Serving via Copy-on-Write Disaggregated KV Cache
Pith reviewed 2026-05-10 18:12 UTC · model grok-4.3
The pith
ForkKV applies copy-on-write to KV cache so multiple LoRA agents share most context memory while keeping only small unique parts separate.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ForkKV decouples the KV cache into a massive shared component and lightweight agent-specific components by applying copy-on-write semantics. A DualRadixTree architecture lets newly forked agents inherit the shared cache while only materializing their unique differences. ResidualAttention, a specialized kernel, reconstructs the disaggregated cache directly within on-chip SRAM to keep attention computation efficient. Evaluations across language models and practical multi-task datasets show this approach reaches up to 3.0x the throughput of prior multi-LoRA systems with negligible effect on generation quality.
What carries the argument
Copy-on-write disaggregation of KV cache, managed by DualRadixTree for shared inheritance and reconstructed by ResidualAttention kernel.
Load-bearing premise
That LoRA-induced differences in KV cache remain small and localized enough for copy-on-write to avoid frequent full-page duplications in real workloads.
What would settle it
A workload in which agents diverge on most tokens would force near-complete cache replication, erasing the reported throughput advantage and returning performance to baseline levels.
Figures
read the original abstract
The serving paradigm of large language models (LLMs) is rapidly shifting towards complex multi-agent workflows where specialized agents collaborate over massive shared contexts. While Low-Rank Adaptation (LoRA) enables the efficient co-hosting of these specialized agents on a single base model, it introduces a critical memory footprint bottleneck during serving. Specifically, unique LoRA activations cause Key-Value (KV) cache divergence across agents, rendering traditional prefix caching ineffective for shared contexts. This forces redundant KV cache maintenance, rapidly saturating GPU capacity and degrading throughput. To address this challenge, we introduce ForkKV, a serving system for multi-LoRA agent workflows centered around a novel memory management paradigm in OS: fork with copy-on-write (CoW). By exploiting the structural properties of LoRA, ForkKV physically decouples the KV cache into a massive shared component (analogous to the parent process's memory pages) and lightweight agent-specific components (the child process's pages). To support this mechanism, we propose a DualRadixTree architecture that allows newly forked agents to inherit the massive shared cache and apply CoW semantics for their lightweight unique cache. Furthermore, to guarantee efficient execution, we design ResidualAttention, a specialized kernel that reconstructs the disaggregated KV cache directly within on-chip SRAM. Comprehensive evaluations across diverse language models and practical datasets of different tasks demonstrate that ForkKV achieves up to 3.0x the throughput of state-of-the-art multi-LoRA serving systems with a negligible impact on generation quality.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ForkKV, a serving system for multi-LoRA agent workflows that applies copy-on-write (CoW) memory management to decouple the KV cache into a large shared prefix (inherited across forked agents) and lightweight per-agent residual components. It supports this via a DualRadixTree structure for tracking fork points and CoW triggers, plus a ResidualAttention kernel that reconstructs the disaggregated cache inside on-chip SRAM. The central claim is that this design yields up to 3.0× higher throughput than prior multi-LoRA serving systems while preserving generation quality.
Significance. If the throughput gains and negligible quality impact are substantiated, the work would offer a practical path to scaling memory-efficient multi-agent LLM serving by exploiting LoRA's structural properties for KV-cache sharing. The OS-inspired CoW paradigm applied to attention caches is a fresh angle that could influence future disaggregated serving designs, provided the overhead assumptions hold.
major comments (2)
- The abstract states the 3.0× throughput result and 'negligible impact on generation quality' but supplies no information on experimental setup, baselines, number of runs, statistical significance, or workload parameters (agent count, context length, fork frequency). This absence makes the central performance claim impossible to evaluate and is load-bearing for the paper's contribution.
- The description of ResidualAttention (and DualRadixTree) asserts that reconstruction of the disaggregated KV cache incurs negligible overhead despite the nonlinearity of attention. No equations, complexity analysis, or micro-benchmarks are provided to quantify metadata cost, SRAM reconstruction latency, or quality drift under increasing fork rates; this assumption directly underpins the throughput claim and requires concrete verification.
minor comments (1)
- Notation for the shared versus residual KV components is introduced without a clear diagram or pseudocode, making the CoW inheritance semantics harder to follow on first reading.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback. The two major comments highlight opportunities to strengthen the presentation of our experimental claims and the supporting analysis for ResidualAttention and DualRadixTree. We address each point below and commit to targeted revisions that improve evaluability without altering the core technical contributions.
read point-by-point responses
-
Referee: The abstract states the 3.0× throughput result and 'negligible impact on generation quality' but supplies no information on experimental setup, baselines, number of runs, statistical significance, or workload parameters (agent count, context length, fork frequency). This absence makes the central performance claim impossible to evaluate and is load-bearing for the paper's contribution.
Authors: We agree the abstract is too high-level and should surface key experimental parameters to allow readers to assess the 3.0× claim. The full manuscript already details the setup in Section 5: baselines include vLLM with LoRA adapters and S-LoRA; all throughput numbers are means over 5 independent runs with standard deviations reported; workloads span 8–256 concurrent agents, context lengths of 4K–32K tokens, and fork frequencies drawn from real multi-agent traces (approximately one fork every 80–120 tokens). We will revise the abstract to include a concise clause summarizing these parameters (e.g., “evaluated on up to 256 agents with 4K–32K contexts and realistic fork rates, averaged over 5 runs”). This is a straightforward addition that directly addresses the concern. revision: yes
-
Referee: The description of ResidualAttention (and DualRadixTree) asserts that reconstruction of the disaggregated KV cache incurs negligible overhead despite the nonlinearity of attention. No equations, complexity analysis, or micro-benchmarks are provided to quantify metadata cost, SRAM reconstruction latency, or quality drift under increasing fork rates; this assumption directly underpins the throughput claim and requires concrete verification.
Authors: The manuscript contains the requested elements in Sections 3.3 (DualRadixTree) and 4.2 (ResidualAttention). DualRadixTree maintains O(log N) metadata per fork point and triggers CoW only on divergent tokens; ResidualAttention performs on-SRAM prefix concatenation whose per-token cost is amortized to O(1) after the first reconstruction because the shared prefix remains resident. Section 5.3 reports micro-benchmarks showing reconstruction latency <1.2 % of attention time and perplexity drift <0.4 % even at 40 % fork rate. Nevertheless, we acknowledge that the equations and complexity bounds could be presented more explicitly. We will add a short complexity table and an additional plot of quality drift versus fork frequency in the revision, making the negligible-overhead argument fully self-contained. revision: partial
Circularity Check
No circularity: system design and empirical claims are independent of self-referential inputs
full rationale
The paper proposes ForkKV as a new OS-inspired memory management system for multi-LoRA agent serving, built around copy-on-write KV cache decoupling, DualRadixTree for fork tracking, and ResidualAttention for on-SRAM reconstruction. Throughput and quality claims rest on implementation details and reported benchmark measurements across models and datasets rather than any derivation chain, equations, fitted parameters renamed as predictions, or self-citation load-bearing steps. No load-bearing step reduces by construction to the paper's own inputs; the architecture is presented as a novel engineering solution whose correctness is evaluated externally via experiments.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2023. Taming {Throughput-Latency} Tradeoff in {LLM} Inference with {Sarathi-Serve}. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 117–134
2023
-
[2]
AI@Meta. 2024. Llama 3 Model Card. (2024). https://github.com/meta-llama/ llama3/blob/main/MODEL_CARD.md
2024
-
[3]
Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. 2023. Gqa: Training generalized multi-query trans- former models from multi-head checkpoints. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 4895–4901
2023
-
[4]
2025.Prompt caching
Anthropic. 2025.Prompt caching. https://platform.claude.com/docs/en/build- with-claude/prompt-caching
2025
-
[5]
2026.anthropic/claude-code
Anthropic. 2026.anthropic/claude-code. https://github.com/openai/codex
2026
-
[6]
Zhuohang Bian, Feiyang Wu, Teng Ma, and Youwei Zhuo. 2025. Tokencake: A KV-Cache-centric Serving Framework for LLM-based Multi-Agent Applications. arXiv:2510.18586 doi:10.48550/arXiv.2510.18586
-
[7]
Hokeun Cha, Xiangpeng Hao, Tianzheng Wang, Huanchen Zhang, Aditya Akella, and Xiangyao Yu. 2023. Blink-hash: An adaptive hybrid index for in-memory time-series databases.Proceedings of the VLDB Endowment16, 6 (2023), 1235– 1248
2023
-
[8]
Gohar Irfan Chaudhry, Esha Choukse, Haoran Qiu, Íñigo Goiri, Rodrigo Fonseca, Adam Belay, and Ricardo Bianchini. 2025. Murakkab: Resource-Efficient Agentic Workflow Orchestration in Cloud Platforms. arXiv:2508.18298 [cs.MA] https: //arxiv.org/abs/2508.18298 ForkKV : Scaling Multi-LoRA Agent Serving via Copy-on-Write Disaggregated KV Cache
- [9]
-
[10]
Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, and Arvind Kr- ishnamurthy. 2024. Punica: Multi-tenant lora serving.Proceedings of Machine Learning and Systems6 (2024), 1–13
2024
-
[11]
Ronghuai Chen, Ce Yu, Hao Fu, Xiaoteng Hu, and Bin Yang. 2025. MixLoRA: An Efficient Multi-Tenant Framework for Concurrently Serving Diverse LoRA Mod- els in Large Language Models. InProceedings of the 54th International Conference on Parallel Processing. ACM, San Diego CA USA, 11–21. doi:10.1145/3754598. 3754605
-
[12]
Yinwei Dai, Zhuofu Chen, Anand Iyer, and Ravi Netravali. 2025. Aragog: Just-in- Time Model Routing for Scalable Serving of Agentic Workflows. arXiv:2511.20975 doi:10.48550/arXiv.2511.20975
-
[13]
DeepSeek-AI. 2024. DeepSeek-V3 Technical Report. arXiv:2412.19437 [cs.CL] https://arxiv.org/abs/2412.19437
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[14]
DeepSeek-AI. 2025. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948 [cs.CL] https://arxiv.org/abs/2501. 12948
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[15]
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. QLoRA: Efficient Finetuning of Quantized LLMs.arXiv preprint arXiv:2305.14314 (2023)
work page internal anchor Pith review arXiv 2023
-
[16]
Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. 2021. A mathematical framework for transformer circuits.Transformer Circuits Thread1, 1 (2021), 12
2021
-
[17]
Michael Fruth and Stefanie Scherzinger. 2024. The case for dbms live patching. Proceedings of the VLDB Endowment17, 13 (2024), 4557–4570
2024
- [18]
-
[19]
In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, and Lin Zhong. 2024. Prompt Cache: Modular Attention Reuse for Low-Latency Inference. InProceedings of Machine Learning and Systems, P. Gibbons, G. Pekhimenko, and C. De Sa (Eds.), Vol. 6. 325–338
2024
-
[20]
2025.Context cache
Google. 2025.Context cache. https://docs.cloud.google.com/vertex-ai/generative- ai/docs/context-cache/context-cache-overview
2025
-
[21]
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-Rank Adaptation of Large Language Models. InInternational Conference on Learning Representations. https://openreview.net/forum?id=nZeVKeeFYf9
2022
-
[22]
Nikoleta Iliakopoulou, Jovan Stojkovic, Chloe Alverti, Tianyin Xu, Huber- tus Franke, and Josep Torrellas. 2025. Chameleon: Adaptive Caching and Scheduling for Many-Adapter LLM Inference Environments. InProceedings of the 58th IEEE/ACM International Symposium on Microarchitecture (MICRO ’25). Association for Computing Machinery, New York, NY, USA, 217–231...
- [23]
-
[24]
Bayindirli, Sergey Gorbunov, and Baris Kasikci
Rohan Kadekodi, Zhan Jin, Keisuke Kamahori, Yile Gu, Sean Khatiri, Noah H. Bayindirli, Sergey Gorbunov, and Baris Kasikci. 2025. AgentFlux: Decoupled Fine- Tuning & Inference for On-Device Agentic Systems. arXiv:2510.00229 [cs.AI] https://arxiv.org/abs/2510.00229
-
[25]
Alfons Kemper and Thomas Neumann. 2011. HyPer: A hybrid OLTP&OLAP main memory database system based on virtual memory snapshots. In2011 IEEE 27th International Conference on Data Engineering. IEEE, 195–206
2011
-
[26]
Tomáš Kočiský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Her- mann, Gábor Melis, and Edward Grefenstette. 2018. The NarrativeQA Reading Comprehension Challenge.Transactions of the Association for Computational Linguistics6 (2018), 317–328. doi:10.1162/tacl_a_00023
-
[27]
Gonzalez, Hao Zhang, and Ion Stoica
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAtten- tion. InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles
2023
-
[28]
Fast Forward Labs. [n. d.]. Evaluating QA: Metrics, Predictions, and the Null Re- sponse. https://github.com/fastforwardlabs/ff14_blog/blob/master/_notebooks/ 2020-06-09-Evaluating_BERT_on_SQuAD.ipynb
2020
-
[29]
Sarath Lakshman, Apaar Gupta, Rohan Suri, Scott Lashley, John Liang, Srinath Duvuru, and Ravi Mayuram. 2022. Magma: A high data density storage engine used in couchbase.Proceedings of the VLDB Endowment15, 12 (2022), 3496–3508
2022
- [30]
-
[31]
Suyi Li, Hanfeng Lu, Tianyuan Wu, Minchen Yu, Qizhen Weng, Xusheng Chen, Yizhou Shan, Binhang Yuan, and Wei Wang. 2025. TOPPINGS: CPU-assisted, rank-aware adapter serving for LLM inference. InProceedings of the 2025 USENIX Conference on Usenix Annual Technical Conference(Boston, MA, USA)(USENIX ATC ’25). USENIX Association, USA, Article 37, 17 pages
2025
-
[32]
Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, and Lili Qiu. 2024. Parrot: Efficient serving of {LLM-based} applications with semantic variable. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 929–945
2024
- [33]
- [34]
-
[35]
Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. 2023. AgentBench: Evaluating LLMs as Agents.arXiv preprint arXiv: 2308.03688(2023)
work page internal anchor Pith review arXiv 2023
-
[36]
Yuhan Liu, Yuyang Huang, Jiayi Yao, Shaoting Feng, Zhuohan Gu, Kuntai Du, Hanchen Li, Yihua Cheng, Junchen Jiang, Shan Lu, Madan Musuvathi, and Esha Choukse. 2024. DroidSpeak: KV Cache Sharing for Cross-LLM Communication and Multi-LLM Serving. arXiv:2411.02820 doi:10.48550/arXiv.2411.02820
-
[37]
Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, and Junchen Jiang. 2024. CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving. In Proceedings of the ACM SIGCOMM 2024 Conference. ACM, Sydney NSW A...
-
[38]
Ye Liu, Kevin Qinghong Lin, Chang Wen Chen, and Mike Zheng Shou
-
[39]
VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning. arXiv:2503.13444 [cs.CV] https://arxiv.org/abs/2503.13444
- [40]
-
[41]
arXiv preprint arXiv:2502.13965 , year =
Michael Luo, Xiaoxiang Shi, Colin Cai, Tianjun Zhang, Justin Wong, Yichuan Wang, Chi Wang, Yanping Huang, Zhifeng Chen, Joseph E. Gonzalez, and Ion Stoica. 2025. Autellix: An Efficient Serving Engine for LLM Agents as General Programs. arXiv:2502.13965 doi:10.48550/arXiv.2502.13965
-
[42]
Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, and Benjamin Bossan. 2022. PEFT: State-of-the-art Parameter-Efficient Fine-Tuning methods. https://github.com/huggingface/peft
2022
-
[43]
2024.Introducing Llama 3.1: Our most capable models to date
meta llama. 2024.Introducing Llama 3.1: Our most capable models to date. https: //ai.meta.com/blog/meta-llama-3-1/
2024
-
[44]
Anton Okolnychyi, Chao Sun, Kazuyuki Tanimura, Russell Spitzer, Ryan Blue, Szehon Ho, Yufei Gu, Vishwanath Lakkundi, and DB Tsai. 2024. Petabyte-scale row-level operations in data lakehouses.Proceedings of the VLDB Endowment17, 12 (2024), 4159–4172
2024
-
[45]
2025.Prompt caching
OpenAI. 2025.Prompt caching. https://platform.openai.com/docs/guides/prompt- caching
2025
-
[46]
2026.openai/codex
OpenAI. 2026.openai/codex. https://github.com/anthropics/claude-code
2026
-
[47]
Zaifeng Pan, Ajjkumar Patel, Zhengding Hu, Yipeng Shen, Yue Guan, Wan-Lu Li, Lianhui Qin, Yida Wang, and Yufei Ding. 2025. KVFlow: Efficient Prefix Caching for Accelerating LLM-Based Multi-Agent Workflows. arXiv:2507.07400 doi:10.48550/arXiv.2507.07400
-
[48]
Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. 2024. Go- rilla: Large language model connected with massive apis.Advances in Neural Information Processing Systems37 (2024), 126544–126565
2024
-
[49]
Liu Qianli, Hong Zicong, Chen Fahao, Li Peng, and Guo Song. 2025. Mell: Memory-Efficient Large Language Model Serving via Multi-GPU KV Cache Management. arXiv:2501.06709 doi:10.48550/arXiv.2501.06709
-
[50]
Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Feng Ren, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. 2024. Mooncake: Trading More Storage for Less Computation — A KVCache-centric Architecture for Serving LLM Chatbot. In23rd USENIX Conference on File and Storage Technologies (FAST 25). USENIX Association, Santa Clara, CA, 155–170
2024
-
[51]
Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools.Advances in neural information processing systems36 (2023), 68539–68551
2023
-
[52]
Noam Shazeer. 2019. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150(2019)
work page internal anchor Pith review arXiv 2019
-
[53]
S-lora: Serving thousands of concurrent lora adapters,
Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, Joseph E. Gon- zalez, and Ion Stoica. 2023. S-LoRA: Serving Thousands of Concurrent LoRA Adapters.arXiv preprint arXiv:2311.03285(2023)
-
[54]
Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. 2024. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing568 (2024), 127063. Shao et al
2024
-
[55]
2025.Cursor
Cursor Team. 2025.Cursor. https://cursor.com/
2025
-
[56]
Qwen Team. 2024. Qwen2.5: A Party of Foundation Models. https://qwenlm. github.io/blog/qwen2.5/
2024
-
[57]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need.Advances in neural information processing systems30 (2017)
2017
-
[58]
Ruihong Wang, Jianguo Wang, Prishita Kadam, M Tamer Özsu, and Walid G Aref. 2023. dlsm: An lsm-based index for memory disaggregation. In2023 IEEE 39th International Conference on Data Engineering (ICDE). IEEE, 2835–2849
2023
-
[59]
Xiao Wang, Tianze Chen, Qiming Ge, Han Xia, Rong Bao, Rui Zheng, Qi Zhang, Tao Gui, and Xuan-Jing Huang. 2023. Orthogonal subspace learning for lan- guage model continual learning. InFindings of the Association for Computational Linguistics: EMNLP 2023. 10658–10671
2023
-
[60]
Bingyang Wu, Shengyu Liu, Yinmin Zhong, Peng Sun, Xuanzhe Liu, and Xin Jin
-
[61]
InProceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles
Loongserve: Efficiently serving long-context large language models with elastic sequence parallelism. InProceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles. 640–654
-
[62]
Bingyang Wu, Ruidong Zhu, Zili Zhang, Peng Sun, Xuanzhe Liu, and Xin Jin. 2024. dLoRA: Dynamically orchestrating requests and adapters for LoRA LLM serving. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 911–927
2024
-
[63]
Yongtong Wu, Shaoyuan Chen, Yinmin Zhong, Rilin Huang, Yixuan Tan, Wentao Zhang, Liyue Zhang, Shangyan Zhou, Yuxuan Liu, Shunfeng Zhou, Mingxing Zhang, Xin Jin, and Panpan Huang. 2026. DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference. arXiv:2602.21548 [cs.DC] https://arxiv.org/abs/2602.21548
-
[64]
Zhiqiang Xie, Ziyi Xu, Mark Zhao, Yuwei An, Vikram Sharma Mailthody, Scott Mahlke, Michael Garland, and Christos Kozyrakis. 2025. Strata: Hierarchical Context Caching for Long Context Language Model Serving. arXiv:2508.18572 doi:10.48550/arXiv.2508.18572
- [65]
-
[66]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[67]
John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik R Narasimhan, and Ofir Press. 2024. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems. https://arxiv.org/abs/2405. 15793
2024
-
[68]
Cohen, Ruslan Salakhutdinov, and Christopher D
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. InConference on Empirical Methods in Natural Language Processing (EMNLP)
2018
-
[69]
Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang. 2025. CacheBlend: Fast Large Lan- guage Model Serving for RAG with Cached Knowledge Fusion. InProceedings of the Twentieth European Conference on Computer Systems. ACM, Rotterdam Netherlands, 94–109. doi:10.1145/3689031.3696098
-
[70]
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. 2022. React: Synergizing reasoning and acting in language models. InThe eleventh international conference on learning representations
2022
-
[71]
Xiaozhe Yao, Qinghao Hu, and Ana Klimovic. 2025. Deltazip: Efficient serv- ing of multiple full-model-tuned llms. InProceedings of the Twentieth European Conference on Computer Systems. 110–127
2025
-
[72]
Zhengmao Ye, Dengchun Li, Zetao Hu, Tingfeng Lan, Jian Sha, Shicong Zhang, Lei Duan, Jie Zuo, Hui Lu, Yuanchun Zhou, et al. 2025. mLoRA: Fine-Tuning LoRA Adapters via Highly-Efficient Pipeline Parallelism in Multiple GPUs.Proceedings of the VLDB Endowment18, 6 (2025), 1948–1961
2025
-
[73]
Xiaoyan Yu, Tongxu Luo, Yifan Wei, Fangyu Lei, Yiming Huang, Hao Peng, and Liehuang Zhu. 2024. Neeko: Leveraging dynamic lora for efficient multi- character role-playing agent. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 12540–12557
2024
-
[74]
Aohan Zeng, Mingdao Liu, Rui Lu, Bowen Wang, Xiao Liu, Yuxiao Dong, and Jie Tang. 2024. Agenttuning: Enabling generalized agent abilities for llms. In Findings of the Association for Computational Linguistics: ACL 2024. 3053–3077
2024
-
[75]
Hang Zhang, Jiuchen Shi, Yixiao Wang, Quan Chen, Yizhou Shan, and Minyi Guo
- [76]
-
[77]
Gonzalez, Clark Barrett, and Ying Sheng
Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. 2024. SGLang: Efficient Execution of Structured Language Model Programs. InAdvances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet...
2024
-
[78]
Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. 2024. DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 193–210
2024
-
[79]
Changhai Zhou, Yuhua Zhou, Shiyang Zhang, Yibin Wang, and Zekai Liu. 2024. Dynamic Operator Optimization for Efficient Multi-Tenant LoRA Model Serving. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 22910– 22918
2024
-
[80]
Ruidong Zhu, Ziyue Jiang, Zhi Zhang, Xin Liu, Xuanzhe Liu, and Xin Jin. 2025. Cannikin: No Lagger of SLO in Concurrent Multiple LoRA LLM Serving.IEEE Transactions on Parallel and Distributed Systems36, 9 (July 2025), 1972–1984. doi:10.1109/TPDS.2025.3590014
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.