pith. machine review for the scientific record.

arxiv: 2605.09735 · v1 · submitted 2026-05-10 · 💻 cs.AR · cs.AI · cs.DC · cs.OS

Recognition: 3 theorem links · Lean Theorem

KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 03:50 UTC · model grok-4.3

classification 💻 cs.AR · cs.AI · cs.DC · cs.OS
keywords KV-cache management · static-graph LLM · LLM serving · memory regularization · runtime flexibility · latency optimization · throughput

The pith

Regularizing KV-cache movement lets static-graph LLM decoders absorb variable request lengths without over-reserving memory.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Static-graph LLM decoders deliver predictable launches and fixed shapes but encounter irregular KV-cache behavior during online decoding because request lengths vary, EOS events arrive asynchronously, and histories fragment. KV-RM addresses this by decoupling logical KV histories from physical storage, tracking state with a block pager, and coalescing non-contiguous mappings into a few large transfers before each fixed-shape attention step. If the approach works, static-graph executors gain much of the flexibility of dynamic runtimes at the movement layer alone, lowering reserved memory and eliminating burst-time latency spikes while preserving low submission overhead. A reader would care because production LLM serving routinely mixes short and long requests, making memory over-reservation and tail-latency outliers costly.

Core claim

KV-RM decouples logical KV histories from physical storage, tracks active KV state through a block pager, and materializes each decode step through a single committed descriptor after a merge-staged transport path coalesces non-contiguous mappings into a small number of large transfer groups. The design absorbs variability from differing lengths and asynchronous events below the fixed decode interface; optional bounded far-history summaries can be added but are not required. On a 2-GPU NVIDIA A100 node this yields higher mixed-length throughput, lower tail latency, reduced reserved KV memory across workloads, and removal of severe burst-time spikes under production-trace replay.
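
To make the mechanism concrete, a minimal sketch of what a block pager feeding a single per-step descriptor could look like is given below. This is an editorial illustration, not the paper's interface: the names KVBlockPager and StepDescriptor and the block granularity are assumptions.

```python
# Editorial sketch, not the paper's implementation: a block pager that
# decouples logical KV histories from physical block storage and seals each
# decode step behind a single committed descriptor. Names (KVBlockPager,
# StepDescriptor) and the block granularity are assumed for illustration.
from dataclasses import dataclass, field

BLOCK_TOKENS = 16  # assumed number of tokens held per physical KV block


@dataclass
class StepDescriptor:
    """Everything the fixed-shape attention kernel reads for one decode step."""
    physical_blocks: dict[int, list[int]]  # request id -> ordered physical block ids
    lengths: dict[int, int]                # request id -> logical history length


@dataclass
class KVBlockPager:
    free_blocks: list[int]
    logical_map: dict[int, list[int]] = field(default_factory=dict)
    lengths: dict[int, int] = field(default_factory=dict)

    def reserve(self, req: int, new_len: int) -> None:
        """Grow a request's logical history, allocating physical blocks lazily."""
        blocks = self.logical_map.setdefault(req, [])
        while len(blocks) * BLOCK_TOKENS < new_len:
            blocks.append(self.free_blocks.pop())  # physical placement is arbitrary
        self.lengths[req] = new_len

    def release(self, req: int) -> None:
        """An EOS arrived asynchronously: return the request's blocks to the pool."""
        self.free_blocks.extend(self.logical_map.pop(req, []))
        self.lengths.pop(req, None)

    def commit_step(self) -> StepDescriptor:
        """Seal the current mappings into the single descriptor for this step."""
        return StepDescriptor(dict(self.logical_map), dict(self.lengths))
```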

What carries the argument

The merge-staged transport path that coalesces non-contiguous KV mappings into a small number of large transfer groups before a fixed-shape attention kernel consumes them under a single committed descriptor per decode step.
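
A rough sketch of the coalescing idea this rests on, assuming fixed-size KV blocks that can merge whenever logically consecutive blocks land physically adjacent; the function is illustrative and omits the staging and DMA-ordering work the paper's transport path also performs.

```python
# Illustrative coalescing pass (assumed block size and grouping policy, not
# the paper's merge-staged algorithm): logically consecutive blocks that are
# also physically adjacent merge into one large transfer group.
def coalesce(block_ids: list[int], block_bytes: int) -> list[tuple[int, int]]:
    """block_ids: physical block indices backing one KV history, in logical
    order. Returns (byte_offset, byte_length) transfer groups, one per run of
    physically contiguous blocks."""
    groups: list[tuple[int, int]] = []
    for b in block_ids:
        offset = b * block_bytes
        if groups and groups[-1][0] + groups[-1][1] == offset:
            start, size = groups[-1]
            groups[-1] = (start, size + block_bytes)  # extend the current run
        else:
            groups.append((offset, block_bytes))       # start a new transfer
    return groups


# Example: blocks [2, 3, 4, 7, 9] collapse into three transfers for block
# size B: [(2*B, 3*B), (7*B, B), (9*B, B)].
```

On this toy picture, the failure mode named under "What would settle it" below is a mapping so fragmented that almost no runs merge, leaving roughly one small transfer per block.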

If this is right

  • Mixed-length decoding throughput rises on 2-GPU A100 nodes relative to an unmodified static-graph baseline.
  • Tail latency falls for the same mixed workloads.
  • Reserved KV memory drops across families of production workloads.
  • Severe burst-time latency spikes disappear when the system replays real production traces.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Movement regularization can serve as an alternative interface for runtime flexibility when dynamic kernel shapes are unavailable.
  • Static-graph serving may scale to more heterogeneous request mixes if the coalescing strategy generalizes beyond the evaluated traces.
  • The optional far-history summaries could be combined with the core mechanism to further reduce memory pressure in long-context settings.

Load-bearing premise

Variability from differing request lengths, asynchronous EOS events, and fragmented histories can be absorbed below a fixed decode interface primarily through KV-cache movement regularization.
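
Figure 3 in the Figures section below describes one decode step as mapping edits (Alias, Trim, Reserve) sealed by a single Frame commit through a shadow-to-active descriptor swap. A schematic of how such a double-buffered commit could be structured follows; every class and method name here is hypothetical and the edit semantics are simplified.

```python
# Schematic shadow-to-active descriptor swap; all names and edit semantics
# here are hypothetical simplifications of the contract sketched in Figure 3.
import copy


class FrameDescriptor:
    def __init__(self) -> None:
        self.mapping: dict[int, list[int]] = {}  # request id -> physical blocks


class DescriptorFrame:
    def __init__(self) -> None:
        self.active = FrameDescriptor()  # the only view the attention kernel reads
        self.shadow = FrameDescriptor()  # where this step's mapping edits accumulate

    def reserve(self, req: int, blocks: list[int]) -> None:
        self.shadow.mapping.setdefault(req, []).extend(blocks)    # history grows

    def trim(self, req: int) -> None:
        self.shadow.mapping.pop(req, None)                        # e.g. EOS arrived

    def alias(self, dst: int, src: int) -> None:
        self.shadow.mapping[dst] = list(self.shadow.mapping[src])  # shared prefix

    def commit(self) -> FrameDescriptor:
        """One commit per decode step: the shadow becomes active in a single swap,
        and the next step's edits start from a copy of the committed state."""
        self.active, self.shadow = self.shadow, copy.deepcopy(self.shadow)
        return self.active
```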

What would settle it

A production-trace workload in which the merge-staged transport still produces many small transfers, leaving throughput, tail latency, and reserved memory unchanged or worse than the static-graph baseline.

Figures

Figures reproduced from arXiv: 2605.09735 by Bolun Sun, Jian Zhang, Weijian Zheng, Xiaodong Yu, Zhijing Ye, Zhiqing Zhong.

Figure 1. GPU-side structural limits of static-graph decoding. (a) Under identical dense-attention semantics, static-graph execution retains a large idle memory floor compared with a paged runtime. In panel (a), the after-idle process-resident footprint is reported as the aggregate across the two active GPUs of the 2× A100-40GB PCIe node used by the run. (b) A separate internal sweep shows the O(T) bandwidth wall unde… view at source ↗

Figure 2. KV-RM architecture and invariants. The control plane (left) shapes a logical KV view using a fixed-shape near-window decoder, a KV pager, an optional far-view policy, and lookahead placement/prefetch. The merge-staged transport pipeline (right) bridges the control plane and the DMA engine, turning these mappings into coalesced DMA trains that feed the static attention kernel. Across all mechanisms, the ke… view at source ↗

Figure 3. One decode step under the KV-RM contract. Runtime variability is expressed as mapping edits (Alias, Trim, Reserve) and sealed by a single Frame commit via shadow-to-active descriptor swap. Descriptor merging then converts fragmented page descriptors into a small constant number of trains, typically a near-window train and, when needed, one far-view train in the main operating regime, that feed the same fi… view at source ↗

Figure 4. GPU main end-to-end behavior on a 2× A100-40GB PCIe node. (a,b) Under a 60-second high-load Azure replay window, KV-RM approaches the dynamic-runtime baselines while tightening replay-window p99/p99.9 latency relative to the static-graph baseline. (c,d) Under controlled mixed-length serving, KV-RM improves throughput and p99 relative to the static-graph baseline and remains close to the dynamic-runtime ba… view at source ↗

Figure 6. Mechanism audit and bounded-budget qual… view at source ↗

Figure 7. Boundary stress at the main 2× A100-40GB PCIe node operating point. (a–c) Under a concurrency sweep, KV-RM preserves a single committed descriptor per step, bounded control-plane cost, and competitive throughput/tail behavior as concurrency rises. In panel (c), submit share denotes host submit plus frame commit divided by per-step wall time, and commit cost is measured per committed step. (d–f) Under har… view at source ↗
read the original abstract

Static-graph LLM decoders provide predictable launches, fixed tensor shapes, and low submission overhead, but online decoding exposes highly irregular KV-cache behavior: request lengths differ, EOS events arrive asynchronously, and logical histories fragment over time. Dynamic runtimes recover flexibility through paged KV management and step-level scheduling, while static-graph executors often over-reserve memory and suffer burst-time latency outliers. This paper studies whether much of this variability can be absorbed below a fixed decode interface. We present KV-RM, a runtime design that regularizes KV-cache movement beneath a static-graph LLM decoder. KV-RM decouples logical KV histories from physical storage, tracks active KV state through a block pager, and materializes each decode step through a single committed descriptor. A merge-staged transport path coalesces non-contiguous KV mappings into a small number of large transfer groups before a fixed-shape attention kernel consumes them. Optional bounded far-history summaries can be enabled under the same interface, but the core design does not depend on them. On a 2-GPU NVIDIA A100 node, KV-RM improves mixed-length decoding throughput and tail latency relative to a static-graph baseline, reduces reserved KV memory across workload families, and removes severe burst-time latency spikes under production-trace replay. These results suggest that KV-cache movement, rather than kernel shape, can be an effective boundary for recovering runtime flexibility in static-graph LLM serving.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces KV-RM, a runtime design to regularize KV-cache movement beneath a static-graph LLM decoder. It decouples logical KV histories from physical storage via a block pager, materializes each decode step with a single committed descriptor, and uses merge-staged transport to coalesce non-contiguous KV mappings into large transfers for a fixed-shape attention kernel. Optional bounded far-history summaries are supported but stated to be non-essential. On a 2-GPU NVIDIA A100 node, the design is reported to improve mixed-length decoding throughput and tail latency versus a static-graph baseline, reduce reserved KV memory across workloads, and eliminate severe burst-time latency spikes under production-trace replay.

Significance. If the central claims hold after addressing the experimental gaps, KV-RM would demonstrate that KV-cache movement regularization can recover substantial runtime flexibility for static-graph LLM serving without altering kernel shapes or adopting fully dynamic runtimes. This targets a practical production challenge and could influence the design of efficient, predictable LLM inference systems.

major comments (1)
  1. The abstract states that 'the core design does not depend on' the optional bounded far-history summaries and that variability from differing request lengths, asynchronous EOS events, and fragmented histories 'can be absorbed below a fixed decode interface' primarily via KV-cache movement regularization (block pager, single committed descriptor, merge-staged coalescing). However, the evaluation provides no ablation that disables the summaries while keeping the rest of the KV-RM path fixed. This omission is load-bearing because the reported gains in throughput, tail latency, memory reduction, and spike elimination could still rely on the summaries to bound fragmentation rather than the regularization mechanism alone.
minor comments (2)
  1. The abstract describes empirical improvements on A100 hardware with mixed workloads and production traces but supplies no quantitative numbers, error bars, baseline details, or workload exclusion criteria. Adding at least one key metric (e.g., throughput delta or p99 latency reduction) would make the summary more informative.
  2. The design entities ('block pager', 'merge-staged transport path', 'single committed descriptor') are introduced without accompanying pseudocode, diagram, or interface specification in the provided description; explicit definitions and an illustration of the coalescing step would improve clarity.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address the single major comment below.

read point-by-point responses
  1. Referee: The abstract states that 'the core design does not depend on' the optional bounded far-history summaries and that variability from differing request lengths, asynchronous EOS events, and fragmented histories 'can be absorbed below a fixed decode interface' primarily via KV-cache movement regularization (block pager, single committed descriptor, merge-staged coalescing). However, the evaluation provides no ablation that disables the summaries while keeping the rest of the KV-RM path fixed. This omission is load-bearing because the reported gains in throughput, tail latency, memory reduction, and spike elimination could still rely on the summaries to bound fragmentation rather than the regularization mechanism alone.

    Authors: We agree that an explicit ablation disabling the optional bounded far-history summaries (while keeping the block pager, single committed descriptor, and merge-staged transport fixed) would strengthen the claim that the gains derive primarily from KV-cache movement regularization. The manuscript already states that the summaries are optional and non-essential, serving only to compress far-history tokens under the same fixed interface; the core mechanisms are designed to absorb length variability, asynchronous EOS, and fragmentation independently. In the revised manuscript we will add this ablation study to quantify the incremental contribution of the summaries versus the regularization path alone. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical engineering design with runtime measurements

full rationale

The paper describes an engineering runtime design (KV-RM) for regularizing KV-cache movement under a static-graph LLM decoder, evaluated via direct throughput, latency, and memory measurements on a 2-GPU A100 node against a baseline. No derivation chain, equations, first-principles predictions, or fitted parameters exist that could reduce to self-defined inputs. Claims rest on implementation choices (block pager, single descriptor, merge-staged coalescing) and observed outcomes, with the optional far-history summaries explicitly stated to be non-essential. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing; the work is self-contained and evaluated against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The design rests on standard systems assumptions about memory management and kernel interfaces rather than new mathematical axioms; it introduces software components whose correctness is asserted through empirical results rather than formal proof.

axioms (1)
  • domain assumption Static-graph LLM decoders can maintain a fixed decode interface while absorbing request variability through lower-level KV-cache regularization.
    This is the core premise stated in the abstract that enables the entire approach.
invented entities (2)
  • block pager · no independent evidence
    purpose: Tracks active KV state by decoupling logical histories from physical storage.
    New runtime component introduced to manage fragmentation without altering the static graph.
  • merge-staged transport path · no independent evidence
    purpose: Coalesces non-contiguous KV mappings into large transfer groups for fixed-shape attention kernels.
    New mechanism to regularize movement and enable single-descriptor materialization per decode step.

pith-pipeline@v0.9.0 · 5572 in / 1473 out tokens · 69674 ms · 2026-05-12T03:50:47.463862+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
