pith. machine review for the scientific record.

arxiv: 2605.06046 · v1 · submitted 2026-05-07 · 💻 cs.LG


Requests of a Feather Must Flock Together: Batch Size vs. Prefix Homogeneity in LLM Inference


Pith reviewed 2026-05-08 14:02 UTC · model grok-4.3

classification 💻 cs.LG
keywords LLM inference · batch scheduling · prefix sharing · KV cache locality · throughput optimization · reinforcement learning · request scheduling · memory-bound workloads

The pith

Smaller prefix-homogeneous batches can deliver higher decode throughput than larger heterogeneous batches in LLM inference when requests share prefixes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper observes that auto-regressive LLM token generation is memory-bound by KV cache accesses, and that batches in which all requests share a common prefix achieve better spatial and temporal locality than larger mixed batches, leading to faster processing. Existing schedulers maximize prefix reuse mainly to reduce memory footprint and continue forming larger batches even when smaller homogeneous ones would run faster. Feather addresses this by training a reinforcement learning policy to select batches that balance size against prefix homogeneity and by introducing a Chunked Hash Tree for rapid prefix detection that avoids expensive tree traversals. On workloads with prefix sharing this produces 2–10× higher end-to-end throughput while matching the status quo on workloads without enough sharing, with the gains coming from fewer total KV cache accesses.
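To make the locality arithmetic concrete, here is a back-of-envelope sketch that is not taken from the paper: the batch sizes, prefix lengths, and model dimensions are invented, and it counts only KV bytes touched per decode step, ignoring the compute-efficiency advantage of large batches that forms the other side of the tradeoff.

```python
# Illustrative only: KV bytes read per decode step when a batch's shared
# prefix is read once per batch vs. once per request. Assumed dimensions:
# 32 KV heads, head_dim 128, fp16.
BYTES_PER_TOKEN_KV = 2 * 32 * 128 * 2  # (K+V) * n_heads * head_dim * 2 bytes

def kv_bytes_per_step(n_batches, batch_size, shared_prefix_len, unique_len):
    """Total KV bytes one decode step reads across n_batches batches,
    assuming a prefix-aware kernel reads each batch's shared prefix once."""
    per_batch = (shared_prefix_len + batch_size * unique_len) * BYTES_PER_TOKEN_KV
    return n_batches * per_batch

# 32 requests in 4 prefix groups: each group shares a 1024-token prefix,
# and each request carries a 512-token unique suffix.
# Mixed batch: cross-group reuse is lost, so each request re-reads its prefix.
mixed = kv_bytes_per_step(n_batches=1, batch_size=32,
                          shared_prefix_len=0, unique_len=1024 + 512)
# Four prefix-homogeneous batches of 8: each prefix is read once per batch.
homogeneous = kv_bytes_per_step(n_batches=4, batch_size=8,
                                shared_prefix_len=1024, unique_len=512)
print(f"mixed 1x32:      {mixed / 1e9:.2f} GB/step")        # ~0.81 GB
print(f"homogeneous 4x8: {homogeneous / 1e9:.2f} GB/step")  # ~0.34 GB
```

Under these assumptions the four homogeneous batches touch well under half the KV bytes of the single mixed batch; whether that wins end to end depends on decode length, which is the regime dependence Figure 9 shows.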

Core claim

Feather shows that an RL-driven scheduler can learn to form smaller, prefix-homogeneous batches that outperform larger heterogeneous batches on decode throughput. The scheduler is enabled by a Chunked Hash Tree that performs fast prefix detection and request selection without the CPU overhead of the radix-tree traversals used in prior systems. When integrated into existing engines, the approach reduces overall KV cache accesses and exceeds the gains available from prefix-aware attention kernels alone.

What carries the argument

The reinforcement learning policy that selects batch composition by trading off size against prefix homogeneity, supported by the Chunked Hash Tree for low-overhead prefix detection and request grouping.
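The review names the policy but gives no state, action, or reward definitions, so the following is a minimal sketch of one plausible formulation only: a UCB bandit over a few candidate batch-formation rules, in the spirit of the paper's citations of Auer et al. [5] and Q-learning [37], with measured decode throughput as the reward. The arm set and the helpers `form_batch` and `run_decode_step` are hypothetical, not Feather's actual design.

```python
import math

# Hypothetical arm set: each arm is a batch-formation rule trading size
# against prefix homogeneity. Not Feather's actual action space.
ARMS = [
    {"max_batch": 64, "require_homogeneous": False},  # large, mixed
    {"max_batch": 32, "require_homogeneous": True},
    {"max_batch": 8,  "require_homogeneous": True},   # small, prefix-pure
]
counts = [0] * len(ARMS)
values = [0.0] * len(ARMS)  # running mean of observed tokens/sec per arm

def choose_arm(t):
    """UCB1: try each arm once, then maximize mean reward plus an
    exploration bonus that shrinks as an arm accumulates plays."""
    for i, c in enumerate(counts):
        if c == 0:
            return i
    return max(range(len(ARMS)),
               key=lambda i: values[i] + math.sqrt(2 * math.log(t) / counts[i]))

def update(arm, reward):
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean

# Scheduling loop; the engine hooks are placeholders:
# for t in itertools.count(1):
#     arm = choose_arm(t)
#     batch = form_batch(wait_queue, **ARMS[arm])  # e.g., via CHT lookups
#     reward = run_decode_step(batch)              # measured tokens/sec
#     update(arm, reward)
```

A fuller RL treatment would condition the choice on observable workload state (queue depth, sharing fraction) rather than treating it as context-free; pinning down that state and reward is exactly what the referee's second minor comment asks for.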

If this is right

  • End-to-end throughput rises 2–10× over existing schedulers on prefix-sharing workloads.
  • Performance stays comparable to current schedulers when workloads lack sufficient prefix sharing.
  • Total KV cache accesses drop enough to beat the speedups from prefix-aware attention kernels alone.
  • The same scheduler integrates into vLLM and SGLang without requiring new hardware or model changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same locality principle could guide batch formation in other memory-bound serving systems that handle repeated data structures.
  • Future hardware might add direct support for homogeneous batch execution to amplify the locality gains.
  • Extending the RL state to include runtime metrics such as current memory pressure could further improve the policy without changing its core logic.
  • Efficient prefix detection appears more critical to scheduler performance than the absolute size of the batch.

Load-bearing premise

The reinforcement learning policy learns a tradeoff between batch size and prefix homogeneity that generalizes across workloads, hardware, and models, while the Chunked Hash Tree detects shared prefixes accurately with negligible overhead.

What would settle it

A direct measurement on a held-out workload or hardware platform showing that batches chosen by the RL policy produce lower throughput than standard schedulers or that Chunked Hash Tree detection time exceeds the cost of existing radix-tree methods.

Figures

Figures reproduced from arXiv: 2605.06046 by Mythili Vutukuru, Preeti, Saksham Rathi.

Figure 1: LLM Inference Primitives
Figure 2: Batch Size vs. Prefix Homogeneity
Figure 3: An Example of a vLLM Merkle Tree hashing scheme
Figure 5: Fraction of Prefix Shared
Figure 9: Throughput across batching strategies as decode length varies (large heterogeneous batches win while prefill dominates; the trend reverses as decode length grows)
Figure 10: Feather Pipeline
Figure 11: Chunked Hash Tree Operations
Figure 12: Poisson Workload (a) Throughput (b) Average Batch Size
Figure 21: CHT vs. Radix Tree
Figure 20: Sensitivity to Chunk Size
Figure 24: Experiment for takeaway 2 of §3 (100 requests of 2K tokens; shared prefix fraction f varied from 0 to 1)
Figure 25: Experiment for takeaway 4 of §3 (four radix-tree sharing patterns, sequence length fixed at 4L)
Figure 26: vLLM FCFS throughput and prefix KV cache memory footprint across prefix lengths and numbers of prefix groups (Qwen 0.5B)
Figure 27: Key Observations on Prefix Homogeneity
Figure 23: Two Large Prefixes
Figure 28: Hash Computation in Chunked Hash Tree (chunks hashed in parallel, then chained as h_c = Hash(h_{c-1} ∥ Hash(S_c)) with a fixed h_0)
Figure 29: FindBest - Alternative Heuristic Example
Figure 30: FindBest cases under the shared-prefix and working-set metrics
Figure 31: Different Models (a) Throughput (b) Average Batch Size (c) Time between tokens (TBT)
Figure 32: Varying number of prefix groups for LongChat 13B
Figure 33: Workload with No Prefix Sharing
Figure 35: Performance of Bandit across Varying Workload (a) Tensor Core Utilization (b) Average Batch Size
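The hash construction named in Figure 28 is compact enough to sketch: each request is partitioned into chunks of size K, and per-chunk hashes are chained as h_c = Hash(h_{c-1} ∥ Hash(S_c)) from a fixed h_0, so any two requests sharing their first c·K tokens produce identical h_c. A minimal sketch under assumed details (SHA-256, 4-byte token encoding, an arbitrary chunk size; the paper's own chunk-size sensitivity study is Figure 20):

```python
import hashlib

K = 16  # chunk size in tokens; an assumption, not the paper's setting

def chunk_hashes(tokens, k=K):
    """Return one cumulative hash per complete chunk of `tokens`.

    Requests sharing their first c*k tokens produce identical h_c, so
    shared-prefix depth reduces to comparing flat hash sequences.
    """
    h = b"\x00" * 32  # fixed initialization vector h_0
    out = []
    for start in range(0, len(tokens) - len(tokens) % k, k):
        chunk = tokens[start:start + k]
        # Hash the chunk independently, then chain with the running hash.
        chunk_digest = hashlib.sha256(
            b"".join(t.to_bytes(4, "little") for t in chunk)).digest()
        h = hashlib.sha256(h + chunk_digest).digest()
        out.append(h)
    return out

def shared_chunks(a, b):
    """Number of leading chunks two requests share (their prefix depth)."""
    n = 0
    for ha, hb in zip(a, b):
        if ha != hb:
            break
        n += 1
    return n

r1 = chunk_hashes(list(range(64)))            # 64 tokens -> 4 chunks
r2 = chunk_hashes(list(range(48)) + [9] * 16) # diverges in the 4th chunk
print(shared_chunks(r1, r2))                  # -> 3: first 48 tokens match
```

Prefix-depth queries then become comparisons over flat hash lists rather than per-token tree walks, which matches the review's claim that CHT sidesteps the CPU cost of radix-tree traversals.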
Original abstract

Auto-regressive token generation in large language models is memory-bound because it requires "attending to" key and value tensors (KV cache) of all previous tokens. Prior work aims to improve the efficiency of this decode process by batching multiple requests together, and maximizing batch size subject to GPU memory constraints. The key observation of our work is that with prefix-sharing workloads, smaller, prefix-homogeneous batches -- where all requests share a common prefix -- can achieve higher decode throughput than larger, heterogeneous batches, due to better spatial and temporal locality during KV cache accesses. However, prefix-aware schedulers in state-of-the-art inference engines maximize prefix reuse within a batch only to reduce KV cache memory footprint, but do not stop batch formation at smaller homogeneous batches that could have performed better. Further, we show that shared prefix detection in existing schedulers relies on radix-tree traversals, incurring substantial CPU overhead that is often comparable to GPU execution time. This paper presents Feather, a prefix-aware scheduler that uses reinforcement learning (RL) to learn the optimal tradeoff between batch size and prefix homogeneity. We also introduce Chunked Hash Tree (CHT), a lightweight data structure that enables fast prefix detection and efficient request selection for the RL scheduler, avoiding expensive tree traversals. We integrate Feather into vLLM and SGLang, and our evaluation shows that Feather achieves 2–10× higher end-to-end throughput as compared to existing schedulers, while doing no worse than the status quo when the workload does not have enough prefix sharing. Feather achieves these gains by reducing the total number of KV cache accesses, surpassing the performance of prefix-aware attention kernels that have the same goal.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that current LLM inference schedulers in systems like vLLM and SGLang fail to exploit the observation that smaller prefix-homogeneous batches can outperform larger heterogeneous ones due to better KV cache locality. It introduces Feather, which uses reinforcement learning to learn the optimal batch-size vs. prefix-homogeneity tradeoff, paired with a new Chunked Hash Tree (CHT) data structure for low-overhead prefix detection that avoids expensive radix-tree traversals. When integrated into vLLM and SGLang, Feather is reported to deliver 2–10× higher end-to-end throughput on prefix-sharing workloads while matching baseline performance when sharing is low, with the gains attributed to fewer KV cache accesses.

Significance. If the throughput gains and 'no worse than baseline' guarantee hold under broader conditions, the work would meaningfully advance practical LLM serving efficiency for common prefix-sharing scenarios such as multi-turn chat or RAG. The RL-driven scheduler and CHT represent concrete engineering contributions that directly target an overlooked locality tradeoff; the open integration into two production engines is a positive for reproducibility.

major comments (2)
  1. [Evaluation (implied by abstract claims and § on RL scheduler)] The headline 2–10× throughput claim and the 'no worse than status quo' guarantee both depend on the RL policy learning a transferable mapping between batch size and prefix homogeneity. The evaluation provides no held-out workload tests and no cross-model or cross-hardware transfer experiments, leaving open the possibility that the learned policy overfits the training traces and reverts to baseline behavior on new prefix-sharing statistics.
  2. [Abstract and Evaluation section] The abstract states that Feather surpasses prefix-aware attention kernels by reducing total KV cache accesses, yet supplies no workload details, baseline scheduler configurations, statistical tests, or ablation results isolating the contribution of the RL policy versus CHT. Without these, the central performance numbers cannot be independently verified.
minor comments (2)
  1. [CHT design subsection] The description of CHT overhead as 'negligible' relative to radix trees would benefit from a direct CPU-cycle or latency comparison table against the radix-tree baseline used in vLLM/SGLang.
  2. [RL scheduler description] Notation for the RL state (batch size, prefix homogeneity metrics) and reward function is introduced without an explicit equation or pseudocode listing, making the policy learning process harder to reproduce.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the practical contributions of Feather. We address the major comments point by point below. Where the evaluation can be strengthened, we will revise the manuscript accordingly.

point-by-point responses
  1. Referee: [Evaluation (implied by abstract claims and § on RL scheduler)] The headline 2–10× throughput claim and the 'no worse than status quo' guarantee both depend on the RL policy learning a transferable mapping between batch size and prefix homogeneity. The evaluation provides no held-out workload tests and no cross-model or cross-hardware transfer experiments, leaving open the possibility that the learned policy overfits the training traces and reverts to baseline behavior on new prefix-sharing statistics.

    Authors: We agree that explicit transfer experiments would strengthen the claims. The current evaluation already covers workloads with varying prefix-sharing statistics (including low-sharing cases where Feather matches baseline), and the RL state/reward formulation uses only observable, hardware-agnostic metrics (batch size and prefix homogeneity). Nevertheless, to directly address the concern, we will add held-out workload tests and limited cross-model results in the revised manuscript, along with a discussion of why the learned policy is expected to generalize. revision: yes

  2. Referee: [Abstract and Evaluation section] The abstract states that Feather surpasses prefix-aware attention kernels by reducing total KV cache accesses, yet supplies no workload details, baseline scheduler configurations, statistical tests, or ablation results isolating the contribution of the RL policy versus CHT. Without these, the central performance numbers cannot be independently verified.

    Authors: We acknowledge that additional detail is needed for independent verification. In the revised manuscript we will expand the evaluation section with: (1) precise workload descriptions including prefix-sharing ratios and request arrival patterns, (2) exact vLLM/SGLang baseline configurations, (3) results with statistical tests and error bars from repeated runs, and (4) ablations that isolate the RL scheduler contribution from the CHT data structure. These additions will clarify how the reported 2–10× gains and KV-cache-access reductions are obtained. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical throughput claims rest on direct measurements

full rationale

The paper introduces Feather, an RL-based prefix-aware scheduler, and the Chunked Hash Tree data structure. All central claims (2–10× end-to-end throughput gains, reduced KV cache accesses, and 'no worse than status quo' behavior) are supported solely by experimental integration into vLLM and SGLang plus benchmarking on request traces. No mathematical derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text; the RL policy is trained on finite traces and its performance is reported from held-out runs rather than forced by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim rests on the unstated premise that prefix homogeneity produces sufficient locality gains to outweigh larger batch sizes, and that RL can discover this tradeoff reliably. No explicit free parameters or axioms are listed in the abstract.

invented entities (1)
  • Chunked Hash Tree (CHT) · no independent evidence
    purpose: Lightweight structure for fast prefix detection and request selection avoiding radix-tree traversals
    New data structure introduced to reduce CPU overhead during scheduling.

pith-pipeline@v0.9.0 · 5616 in / 1264 out tokens · 54881 ms · 2026-05-08T14:02:31.313124+00:00 · methodology


Reference graph

Works this paper leans on

50 extracted references · 26 canonical work pages · 5 internal anchors

[1] Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024. Taming throughput-latency tradeoff in LLM inference with Sarathi-Serve. In Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation. USENIX Association.

[2] AI@Meta. 2024. Llama 3 Model Card. https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md

[3] Chenxin An, Shansan Gong, Ming Zhong, Xingjian Zhao, Mukai Li, Jun Zhang, Lingpeng Kong, and Xipeng Qiu. 2024. L-Eval: Instituting Standardized Evaluation for Long Context Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics. doi:10....

[4] Anthropic. 2024. Claude. https://claude.ai

[5] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. 2002. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning (2002). doi:10.1023/A:1013689704352

[6] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng X...

[7] Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. 2025. LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers...

[8] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...

[9] Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, and Christopher De Sa. 2023. QuIP: 2-bit quantization of large language models with guarantees. In Proceedings of the 37th International Conference on Neural Information Processing Systems. Curran Associates Inc.

[11] Sumit Kumar Dam, Choong Seon Hong, Yu Qiao, and Chaoning Zhang. 2024. A Complete Survey on LLM-based AI Chatbots. arXiv:2406.16937 [cs.CL]. https://arxiv.org/abs/2406.16937

[12] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: fast and memory-efficient exact attention with IO-awareness. In Proceedings of the 36th International Conference on Neural Information Processing Systems. Curran Associates Inc.

[13] DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, Huaj... DeepSeek-V3 Technical Report.

[14] Xin Luna Dong, Seungwhan Moon, Yifan Ethan Xu, Kshitiz Malik, and Zhou Yu. 2023. Towards Next-Generation Intelligent Assistants Leveraging LLM Techniques. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery. doi:10.1145/3580305.3599572

[15] Google. 2024. Gemini. https://gemini.google.com

[16] Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, Kai Dang, Yang Fan, Yichang Zhang, An Yang, Rui Men, Fei Huang, Bo Zheng, Yibo Miao, Shanghaoran Quan, Yunlong Feng, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, and Junyang Lin. 2024. Qwen2.5-Coder Technical Report. arXiv:2409.12186 [cs.CL...

[17] Aditya K. Kamath, Ramya Prabhu, Jayashree Mohan, Simon Peter, Ramachandran Ramjee, and Ashish Panwar. 2025. POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. Association for Computing...

[18] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles. Association for Computing Machinery. doi:10.1145/3600006.3613165

[20] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems. Curran Associates, Inc. https://proceedi...

[21] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proceedings of the 34th International Conference on Neural Information Processing Systems. C...

[22] Dacheng Li, Rulin Shao, Anze Xie, Ying Sheng, Lianmin Zheng, Joseph E. Gonzalez, Ion Stoica, Xuezhe Ma, and Hao Zhang. 2023. How Long Can Open-Source LLMs Truly Promise on Context Length? https://lmsys.org/blog/2023-06-29-longchat

[23] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Guangxuan Xiao, and Song Han. 2025. AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration. (2025). doi:10.1145/3714983.3714987

[24] Zejia Lin, Hongxin Xu, Guanyi Chen, Zhiguang Chen, Yutong Lu, and Xianwei Zhang. 2026. Bullet: Boosting GPU Utilization for LLM Serving via Dynamic Spatial-Temporal Orchestration. In Proceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. Association for Computing Machinery...

[25] NVIDIA Corporation. 2024. NVIDIA RTX 6000 Ada Generation Datasheet. https://resources.nvidia.com/en-us-briefcase-for-datasheets/proviz-print-rtx6000-1?ncid=no-ncid. Accessed: January 25, 2026.

[26] NVIDIA Corporation. 2026. NVIDIA Data Center GPU Manager (DCGM) Documentation. https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/

[27] OpenAI. 2024. ChatGPT. https://chatgpt.com

[28] OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Bern...

[29] Zaifeng Pan, Yitong Ding, Yue Guan, Zheng Wang, Zhongkai Yu, Xulong Tang, Yida Wang, and Yufei Ding. 2025. FastTree: Optimizing Attention Kernel and Runtime for Tree-Structured LLM Inference. In Proceedings of Machine Learning and Systems. MLSys. https://proceedings.mlsys.org/paper_files/paper/2025/file/96894468eb44631a32d7ebd56f9892c7-Paper-Conference.pdf

[30] Bowen Pang, Kai Li, and Feifan Wang. 2025. Optimizing LLM Inference Throughput via Memory-aware and SLA-constrained Dynamic Batching. arXiv preprint arXiv:2503.05248 (2025).

[31] Ivy Bo Peng, Roberto Gioiosa, Gokcen Kestor, Pietro Cicotti, Erwin Laure, and Stefano Markidis. 2017. Exploring the Performance Benefit of Hybrid Memory System on HPC Environments. In 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE. doi:10.1109/ipdpsw.2017.115

[32] Ethan Perez, Douwe Kiela, and Kyunghyun Cho. 2021. True Few-Shot Learning with Language Models. In Advances in Neural Information Processing Systems. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2021/file/5c04925674920eb58467fb52ce4ef728-Paper.pdf

[33] Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, and Ashish Panwar. 2025. vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1. Association for Computing Machinery. doi:10.1...

[34] Qidong Su, Wei Zhao, Xin Li, Muralidhar Andoorveedu, Chenhao Jiang, Zhanda Zhu, Kevin Song, Christina Giannoula, and Gennady Pekhimenko. 2025. Seesaw: High-throughput LLM Inference via Model Re-sharding. arXiv:2503.06433 [cs.DC]. https://arxiv.org/abs/2503.06433

[35] The Big Prompt Library Contributors. 2024. The Big Prompt Library: A Collection of Prompts, System Prompts and LLM Instructions. https://github.com/0xeb/TheBigPromptLibrary. GitHub repository.

[36] vLLM Project Contributors. 2024. Automatic Prefix Caching. https://docs.vllm.ai/en/stable/design/prefix_caching/. vLLM Documentation. Accessed: 2026-03-10.

[37] Christopher J. C. H. Watkins and Peter Dayan. 1992. Q-Learning. Machine Learning (1992). doi:10.1007/BF00992698

[38] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. 2022. Emergent Abilities of Large Language Models. Transactions on Machine Learning Research (2022). https://openreview.ne...

[39] Bingyang Wu, Shengyu Liu, Yinmin Zhong, Peng Sun, Xuanzhe Liu, and Xin Jin. 2024. LoongServe: Efficiently Serving Long-Context Large Language Models with Elastic Sequence Parallelism. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles. Association for Computing Machinery. doi:10.1145/3694715.3695948

[40] Jinwei Yao, Kaiqi Chen, Kexun Zhang, Jiaxuan You, Binhang Yuan, Zeke Wang, and Tao Lin. 2024. DeFT: Decoding with Flash Tree-attention for Efficient Tree-structured LLM Inference. In International Conference on Learning Representations. https://api.semanticscholar.org/CorpusID:268819748

[41] Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang. 2025. CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion. In Proceedings of the Twentieth European Conference on Computer Systems. Association for Computing Machinery. doi:10.1145/3689031.3696098

[42] Lu Ye, Ze Tao, Yong Huang, and Yang Li. 2024. ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics. doi:10.18653/v1/2024.acl-long.623

[43] Jinjun Yi, Zhixin Zhao, Yitao Hu, Ke Yan, Weiwei Sun, Hao Wang, Laiping Zhao, Yuhao Zhang, Wenxin Li, and Keqiu Li. 2026. PAT: Accelerating LLM Decoding via Prefix-Aware Attention with Resource Efficient Multi-Tile Kernel. In Proceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Vo...

[44] Chen Zhang, Kuntai Du, Shu Liu, Woosuk Kwon, Xiangxi Mo, Yufeng Wang, Xiaoxuan Liu, Kaichao You, Zhuohan Li, Mingsheng Long, Jidong Zhai, Joseph Gonzalez, and Ion Stoica. 2025. Jenga: Effective Memory Management for Serving LLM with Heterogeneity. In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles. Association for Computing...

[45] Yilong Zhao, Shuo Yang, Kan Zhu, Lianmin Zheng, Baris Kasikci, Yifan Qiao, Yang Zhou, Jiarong Xing, and Ion Stoica. 2026. BlendServe: Optimizing Offline Inference with Resource-Aware Batching. In Proceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. Association for Comp...

[46] Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. 2024. SGLang: efficient execution of structured language model programs. In Proceedings of the 38th International Conference on Neural Information Processing Systems. Curran Assoc...

[47] Wanyi Zheng, Minxian Xu, Shengye Song, and Kejiang Ye. 2025. BucketServe: Bucket-Based Dynamic Batching for Smart and Efficient LLM Inference Serving. arXiv:2507.17120 [cs.DC]. https://arxiv.org/abs/2507.17120

[48] Zhen Zheng, Xin Ji, Taosong Fang, Fanghao Zhou, Chuanjie Liu, and Gang Peng. 2025. BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching. arXiv:2412.03594 [cs.CL]. https://arxiv.org/abs/2412.03594

[49] Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. 2024. DistServe: disaggregating prefill and decoding for goodput-optimized large language model serving. In Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation. USENIX Association.