pith. sign in

arxiv: 2605.23389 · v1 · pith:VECEKTK3new · submitted 2026-05-22 · 💻 cs.DC

AlignedServe: Orchestrating Prefix-aware Batching to Build a High-throughput and Computing-efficient LLM Serving System

Pith reviewed 2026-05-25 03:09 UTC · model grok-4.3

classification 💻 cs.DC
keywords LLM servinginference batchingKV cachethroughput optimizationdecode iterationprefix awareness
0
0 comments X

The pith

AlignedServe groups LLM requests by similar KV-cache lengths to cut iteration bubbles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes AlignedServe to address bubbles that occur within each decode iteration in LLM serving. Tokens in the same batch can have different KV-cache lengths, so longer ones slow the entire iteration. The system groups requests with similar cache lengths into batches to align their costs. It keeps a large pool of requests in CPU memory to enable good batch formation and uses a GPU-to-GPU prefetch design to hide transfer costs. Experiments report up to 1.98 times higher decoding throughput and 7.4 times lower latency compared with prior systems.

Core claim

By grouping requests with similar KV-cache lengths into the same batch, AlignedServe reduces iteration-level bubbles caused by varying per-token costs; it supports this with a large CPU-resident request pool, batch-level scheduling, and a GPU-Prefetch-For-GPU architecture that moves KV caches between GPUs.

What carries the argument

Prefix-aware batching that groups requests by KV-cache length similarity to align computation times within each decode iteration.

Load-bearing premise

Grouping requests by similar KV-cache lengths will produce large reductions in iteration-level bubbles without being offset by the overhead of maintaining a large CPU-resident request pool or by changes in cache hit rates.

What would settle it

Run the same workloads on a system where all requests already have nearly identical KV-cache lengths and measure whether throughput gains disappear.

Figures

Figures reproduced from arXiv: 2605.23389 by Fengyao Bai, Hongbin Zhang, Jiangsu Du, Yutong Lu, Zhiguang Chen, Zhitao Chen.

Figure 1
Figure 1. Figure 1: The negative impact introduced by the tokens with long prefix in each iteration. [PITH_FULL_IMAGE:figures/full_fig_p006_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: CDF of the lengths of prefix. tokens. For the GPT4 and summarization, the ratio of prefix longer than 4000 tokens can be as much as 40%. However, the results presented in [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Comparison of batching policies with and without considering the lengths of prefix. [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: The overall architecture. KVCache between CPU and GPU via PCIe directly. However, as NVLink has been widely adopted by high-end GPUs, and our work mostly focuses on the high-performance computing clusters, the novel GPU-Prefetch-For-GPU architecture works well in this kind of commonplace systems. The three components described above are orchestrated by the batching and scheduling policies, as the data and … view at source ↗
Figure 5
Figure 5. Figure 5: Illustration of Density First Search, the two numbers associated with each internal node are the [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Illustration of (a) Scheduling from Candidate Requests Buffer, and (b) Scheduling from Candidate [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Decoding throughput (tokens/s) on synthetic workloads. [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Decoding Throughput on application workloads. [PITH_FULL_IMAGE:figures/full_fig_p018_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: P99 TPOT on synthetic workloads. LongBench ShareGPT AzurePublicDataset 0 200 400 600 800 AlignedServe FastGen vLLM DistServe (a) OPT-6.7B. LongBench ShareGPT AzurePublicDataset 0 200 400 600 (b) OPT-13B [PITH_FULL_IMAGE:figures/full_fig_p019_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: P99 TPOT on application workloads. 4.4 Ablation Study Subsection 4.3 demonstrates that our framework achieves much lower latency compared with the others. In this subsection, we further give an ablation study to the latency of each iteration. Generally, an iteration in the decoding can be divided into two stages, i.e., iteration preparation and forward computing. In the period of iteration preparation, th… view at source ↗
Figure 11
Figure 11. Figure 11: The overhead of iteration scheduling. 2000 4000 6000 8000 10000 40 60 80 100 Length of Long Requests (a) OPT-2.7B. 2000 4000 6000 8000 10000 40 60 80 Length of Long Requests (b) OPT-6.7B. 2000 4000 6000 8000 10000 40 60 80 Length of Long Requests (c) OPT-13B. 2000 4000 6000 8000 10000 40 60 80 Length of Long Requests (d) OPT-30B [PITH_FULL_IMAGE:figures/full_fig_p020_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Comparison about the latency of forward computing in each iteration on synthetic workloads. [PITH_FULL_IMAGE:figures/full_fig_p020_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Comparison between our prefix-aware batching policy and FCFS in terms of latency involved in [PITH_FULL_IMAGE:figures/full_fig_p021_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Ablation of prefetching and prefix-aware batching. [PITH_FULL_IMAGE:figures/full_fig_p021_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: CDF of TTFT. many iterations that the batch switch is occurring under the OPT-6.7B model. Experimental results demonstrate that the fraction of iterations that contain requests from different batches is no more than 8.61% and 12.37% on ShareGPT and LongBench, respectively. Overhead of KV pool. AlignedServe offloads large volume of KVCache into CPU memory. In our experiments, the KV pool is set to be 800GB… view at source ↗
read the original abstract

High-throughput inference serving is essential for applications built on large language models (LLMs). Existing serving frameworks reduce request-level and batch-level bubbles through batching and scheduling, but often overlook bubbles within each decode iteration. Tokens generated in the same iteration may incur different costs because they depend on KV caches of different lengths; tokens with long KV caches can become bottlenecks and delay the next iteration. We propose AlignedServe, an LLM serving framework built around prefix-aware batching. It groups requests with similar KV-cache lengths into the same batch to reduce iteration-level bubbles. To support this policy efficiently, AlignedServe uses large CPU memory to maintain sufficient in-flight requests for batching and applies a batch-level scheduling policy to reduce batch-level bubbles. It also introduces a GPU-Prefetch-For-GPU architecture, where one GPU prefetches KV cache for another to reduce CPU-to-GPU transfer latency. Experiments on synthetic and application workloads show that AlignedServe improves decoding throughput by up to 1.98 times and reduces latency by up to 7.4 times over state-of-the-art systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents AlignedServe, an LLM serving framework that introduces prefix-aware batching to group requests with similar KV-cache lengths into the same batch, thereby reducing iteration-level bubbles caused by varying per-token costs. It supports this via a large CPU-resident pool of in-flight requests, batch-level scheduling to reduce batch bubbles, and a GPU-Prefetch-For-GPU architecture to hide CPU-to-GPU KV-cache transfer latency. Experiments on synthetic and application workloads are reported to yield up to 1.98× higher decoding throughput and 7.4× lower latency versus state-of-the-art systems.

Significance. If the experimental claims hold after detailed validation, the work would represent a practical advance in LLM inference serving by targeting an intra-iteration source of inefficiency that prior batching and scheduling techniques have largely ignored. The combination of CPU-side request pooling with cross-GPU prefetching offers a concrete engineering path to higher utilization; explicit credit is due for the focus on measurable iteration bubbles, though no machine-checked proofs or fully reproducible artifacts are referenced.

major comments (2)
  1. [Abstract / Evaluation] Abstract and Evaluation section: The headline claims (1.98× throughput, 7.4× latency) are stated without any description of workloads, baseline configurations, hardware, measurement methodology, or error bars. This absence is load-bearing because the central contribution is an empirical performance improvement whose magnitude cannot be assessed or reproduced from the given information.
  2. [System Design] System overview / request-pool description: The design relies on maintaining a large CPU-resident request pool to enable prefix-aware grouping, yet no quantitative breakdown is supplied of CPU-side scheduling overhead, memory pressure, or possible degradation in cache hit rates versus the claimed reduction in per-iteration bubbles. Without this accounting, it is impossible to confirm that the GPU-side savings are not offset by the mechanisms introduced to support the policy.
minor comments (2)
  1. [Introduction] The term 'iteration-level bubbles' is used repeatedly but never given a concise operational definition (e.g., variance in per-token decode time within one forward pass); adding one sentence in the introduction would improve accessibility.
  2. Figure captions and axis labels should explicitly state whether throughput numbers are normalized to a particular baseline or reported in absolute tokens/s.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to provide the requested details and analysis.

read point-by-point responses
  1. Referee: [Abstract / Evaluation] Abstract and Evaluation section: The headline claims (1.98× throughput, 7.4× latency) are stated without any description of workloads, baseline configurations, hardware, measurement methodology, or error bars. This absence is load-bearing because the central contribution is an empirical performance improvement whose magnitude cannot be assessed or reproduced from the given information.

    Authors: We agree that the abstract omits these specifics due to length constraints. The Evaluation section describes synthetic and application workloads but does not explicitly enumerate all configurations, hardware, methodology, or error bars. We will revise the Evaluation section to add a dedicated experimental setup subsection listing workloads, baselines and configurations, hardware, measurement methodology, and error bars from repeated runs. A brief reference to key setup elements will also be added to the abstract if space permits. revision: yes

  2. Referee: [System Design] System overview / request-pool description: The design relies on maintaining a large CPU-resident request pool to enable prefix-aware grouping, yet no quantitative breakdown is supplied of CPU-side scheduling overhead, memory pressure, or possible degradation in cache hit rates versus the claimed reduction in per-iteration bubbles. Without this accounting, it is impossible to confirm that the GPU-side savings are not offset by the mechanisms introduced to support the policy.

    Authors: We acknowledge that the current manuscript does not provide quantitative measurements of CPU-side overheads. We will add an analysis section in the revision that reports CPU scheduling overhead, memory usage of the request pool, and any impact on cache hit rates, then compare these costs against the measured reduction in iteration bubbles to confirm net gains. revision: yes

Circularity Check

0 steps flagged

No circularity: experimental validation of system design

full rationale

The paper describes an engineering system (prefix-aware batching, CPU-resident request pool, GPU-Prefetch-For-GPU) and reports measured throughput/latency gains on synthetic and application workloads. No equations, fitted parameters, or derivation chain appear in the provided text; the central claims are presented as direct experimental outcomes rather than predictions derived from a model. No self-citation load-bearing steps, self-definitional constructs, or fitted-input-as-prediction patterns are present. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no equations, parameters, or modeling assumptions; therefore no free parameters, axioms, or invented entities can be identified.

pith-pipeline@v0.9.0 · 5745 in / 1122 out tokens · 36656 ms · 2026-05-25T03:09:53.952716+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

43 extracted references · 40 canonical work pages · 5 internal anchors

  1. [1]

    Shubham Agarwal, Sai Sundaresan, Subrata Mitra, Debabrata Mahapatra, Archit Gupta, Rounak Sharma, Nirmal Joshua Kapu, Tong Yu, and Shiv Saini. 2025. Cache-Craft: Managing Chunk-Caches for Efficient Retrieval-Augmented Generation.Proc. ACM Manag. Data3, 3, Article 136 (June 2025), 28 pages. doi:10.1145/3725273

  2. [2]

    Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024. Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). USENIX Association, Santa Clara, CA, 117–134. https://www.usenix....

  3. [3]

    Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S Gulavani, and Ramachandran Ramjee

  4. [4]

    Sarathi: Efficient llm inference by piggybacking decodes with chunked prefills.arXiv preprint arXiv:2308.16369 (2023)

  5. [5]

    Brown, Benjamin Mann, Nick Ryder, et al

    Tom B. Brown, Benjamin Mann, Nick Ryder, et al. 2020. Language models are few-shot learners. InAdvances in Neural Information Processing Systems (NeurIPS), Vol. 33. 1877–1901

  6. [6]

    Weijian Chen, Shuibing He, Haoyang Qu, Ruidong Zhang, Siling Yang, Ping Chen, Yi Zheng, Baoxing Huai, and Gang Chen. 2025. IMPRESS: An Importance-Informed Multi-Tier Prefix KV Storage System for Large Language Model Inference. In23rd USENIX Conference on File and Storage Technologies (FAST 25). USENIX Association, Santa Clara, CA, 187–201. https://www.use...

  7. [7]

    Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, and Nazli Goharian

  8. [8]

    InProceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

    A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents. InProceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). Association for Computational Linguistics, New Orleans, Louisiana, 615–621. doi:10.18653/v1/N18-2097

  9. [9]

    Yangyang Feng, Minhui Xie, Zijie Tian, Shuo Wang, Youyou Lu, and Jiwu Shu. 2023. Mobius: Fine tuning large-scale models on commodity gpu servers. InProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 489–501. Proc. ACM Manag. Data, Vol. 4, No. 3 (SIGMOD), Article 132. Pub...

  10. [10]

    Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, and Pengfei Zuo. 2024. Cost-efficient large language model serving for multi-turn conversations with CachedAttention. In Proceedings of the 2024 USENIX Conference on Usenix Annual Technical Conference(Santa Clara, CA, USA)(USENIX ATC’24). USENIX Associatio...

  11. [11]

    Shihong Gao, Xin Zhang, Yanyan Shen, and Lei Chen. 2025. Apt-Serve: Adaptive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serving.Proc. ACM Manag. Data3, 3, Article 130 (June 2025), 28 pages. doi:10.1145/3725394

  12. [12]

    GitHub. 2021. GitHub Copilot. https://github.com/features/copilot

  13. [13]

    Google. 2024. Our next-generation model: Gemini 1.5. https://blog.google/technology/ai/google-gemini-next- generationmodel-february-2024/

  14. [14]

    Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yaz- dani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, et al. 2024. Deepspeed-fastgen: High-throughput text generation for llms via mii and deepspeed-inference.arXiv preprint arXiv:2401.08671(2024)

  15. [15]

    Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, Tao Xie, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, et al. 2024. Memserve: Context caching for disaggregated llm serving with elastic memory pool.arXiv preprint arXiv:2406.17565(2024)

  16. [16]

    Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang Xu, Shuang Chen, Hao Feng, Chenxi Wang, Sa Wang, Yungang Bao, et al. 2024. Inference without interference: Disaggregate llm inference for mixed downstream workloads.arXiv preprint arXiv:2401.11181(2024)

  17. [17]

    Yitao Hu, Xiulong Liu, Guotao Yang, Linxuan Li, Kai Zeng, Zhixin Zhao, Sheng Chen, Laiping Zhao, Wenxin Li, and Keqiu Li. 2025. TightLLM: Maximizing Throughput for LLM Inference via Adaptive Offloading Policy.IEEE Trans. Comput.74, 7 (2025), 2195–2209. doi:10.1109/TC.2025.3558009

  18. [18]

    Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. 2020. Efficient Attention: A Fast and Memory-Efficient Method for Transformers. InAdvances in Neural Information Processing Systems, Vol. 33. 17902–17914

  19. [19]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles. 611–626

  20. [20]

    Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. 2024. InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). USENIX Association, Santa Clara, CA, 155–172. https://www.usenix.org/conference/ osdi24/presentation/lee

  21. [21]

    Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Dan Jurafsky, ...

  22. [22]

    Yuhang Li, Rong Gu, Chengying Huan, Zhibin Wang, Renjie Yao, Chen Tian, and Guihai Chen. 2025. HotPrefix: Hotness-Aware KV Cache Scheduling for Efficient Prefix Sharing in LLM Inference Systems.Proc. ACM Manag. Data3, 4, Article 250 (Sept. 2025), 27 pages. doi:10.1145/3749168

  23. [23]

    Meta AI. 2023. Code Llama: An Open Foundation Model for Code. https://ai.meta.com/research/code-llama/

  24. [24]

    Moonshot AI. 2024. Kimi: Your AI Assistant. https://kimi.moonshot.cn/

  25. [25]

    Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and Matei Zaharia. 2019. PipeDream: Generalized pipeline parallelism for DNN training. InProceedings of the 27th ACM symposium on operating systems principles. 1–15

  26. [26]

    OpenAI. 2022. ChatGPT. https://openai.com/blog/chatgpt/

  27. [27]

    Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini

  28. [28]

    In2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA)

    Splitwise: Efficient generative llm inference using phase splitting. In2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA). IEEE, 118–132

  29. [29]

    Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Feng Ren, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. 2025. Mooncake: Trading More Storage for Less Computation — A KVCache-centric Architecture for Serving LLM Chatbot. In23rd USENIX Conference on File and Storage Technologies (FAST 25). USENIX Association, Santa Clara, CA, 155–170. https://www.u...

  30. [30]

    Schmid, O

    P. Schmid, O. Sanseviero, P. Cuenca, and L. Tunstall. 2023. Llama 2 is here - Get it on Hugging Face. https://huggingface. co/blog/llama2. [Online; accessed May 25, 2026]

  31. [31]

    Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. 2023. FlexGen: high-throughput generative inference of large language models with a single GPU. InProceedings of the 40th International Conference on Machine Learning(Honolulu, Hawaii, USA)(ICML’23). JMLR.org, Article 1288,...

  32. [32]

    Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, and Yiying Zhang. 2025. Preble: Efficient Distributed Prompt Scheduling for LLM Serving. InThe Thirteenth International Conference on Learning Representations. https: //openreview.net/forum?id=meKEKDhdnx

  33. [33]

    Foteini Strati, Sara McAllister, Amar Phanishayee, Jakub Tarnawski, and Ana Klimovic. [n. d.]. DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving. InForty-first International Conference on Machine Learning

  34. [34]

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and Efficient Foundation Language Models.CoRRabs/2302.13971 (2023). arXiv:2302.13971 doi:10.48550/ARXIV.2302.13971

  35. [35]

    A Vaswani. 2017. Attention is all you need.Advances in Neural Information Processing Systems(2017)

  36. [36]

    vllm-project. 2024. vllm: Easy, fast, and cheap LLM serving for everyone. https://github.com/vllm-project/vllm

  37. [37]

    Guan Wang, Sijie Cheng, Xianyuan Zhan, Xiangang Li, Sen Song, and Yang Liu. 2023. OpenChat: Advancing Open- Source Language Models with Mixed-Quality Data

  38. [38]

    Bingyang Wu, Shengyu Liu, Yinmin Zhong, Peng Sun, Xuanzhe Liu, and Xin Jin. 2024. LoongServe: Efficiently Serving Long-context Large Language Models with Elastic Sequence Parallelism.arXiv preprint arXiv:2404.09526(2024)

  39. [39]

    Bingyang Wu, Yinmin Zhong, Zili Zhang, Gang Huang, Xuanzhe Liu, and Xin Jin. 2023. Fast distributed inference serving for large language models.arXiv preprint arXiv:2305.05920(2023)

  40. [40]

    Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. Orca: A Distributed Serving System for Transformer-Based Generative Models. In16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). USENIX Association, Carlsbad, CA, 521–538. https://www.usenix.org/conference/ osdi22/presentation/yu

  41. [41]

    OPT: Open Pre-trained Transformer Language Models

    Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona T. Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. 2022. OPT: Open Pre-trained Transformer Language Models. CoRRabs/2205.01068 (2...

  42. [42]

    Zhen Zheng, Xin Ji, Taosong Fang, Fanghao Zhou, Chuanjie Liu, and Gang Peng. 2025. BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching. arXiv:2412.03594 [cs.CL] https://arxiv.org/abs/2412.03594

  43. [43]

    Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. 2024. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. arXiv:2401.09670 [cs.DC] Received October 2025; revised January 2026; accepted February 2026 Proc. ACM Manag. Data, Vol. 4, No. 3 (SIGMOD), Article 132. Pu...