RTP-LLM: High-Performance Alibaba LLM Inference Engine

Bo Cai; Boyu Tan; Chi Zhang; Guiyang Huang; Guoding Li; Hanbo Sun; Jianning Zhang; Jiarui Guo; Juncheng Yin; Kan Liu

arxiv: 2605.29639 · v1 · pith:HQ5TUCISnew · submitted 2026-05-28 · 💻 cs.OS

Boyu Tan, Jiarui Guo, Zongwei Lv, Hanbo Sun, Tong Yang, Kan Liu, Xinfei Shi, Zetao Hu, Yaxin Yu, Chi Zhang, Jianning Zhang, Xi Yang, Wei Zhang, Bo Cai, Silu Zhou, Xiyu Wang, Na He, Yinghao Yu, Wending Bao, Guiyang Huang, Yuxing Yuan, Juncheng Yin, Nan Wang, Lin Yang, Zechao Zhang, Lu Chen, Guoding Li, Tao Lan, Lin Qu

Pith reviewed 2026-06-28 23:50 UTC · model grok-4.3

classification 💻 cs.OS

keywords: LLM inference engine · prefill decode disaggregation · KV cache management · speculative decoding · model serving · performance optimization · industrial deployment · multimodal inference

The pith

RTP-LLM combines file-order-driven model loading, prefill-decode disaggregation, and hierarchical KV cache management, and reports 4.7x-6.3x faster model loading and 35-37% lower P95 TTFT than vLLM and SGLang.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RTP-LLM as a production inference engine that integrates file-order I/O optimization, parallel overlapping, prefill-decode disaggregation, and multi-tiered KV cache reuse to handle industrial-scale LLM serving. It reports concrete gains across model loading, scheduling, speculative decoding, multimodal workloads, and quantized inference on models ranging from 8B to 235B parameters. These results are measured both in controlled benchmarks and in real traffic serving over 100 million users. A sympathetic reader would care because the claimed speedups directly affect cost, responsiveness, and cache efficiency when running large models at scale.
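The file-order I/O and overlapping ideas mentioned above can be illustrated with a small sketch. This is not RTP-LLM's loader: the index layout (`name -> (offset, length)`), the `consume` callback, and the queue depth are all hypothetical. It only shows the general pattern of reading tensors in on-disk order (so the reads are sequential) while a worker thread processes each tensor that has already been read.

```python
import os
import queue
import tempfile
import threading


def load_in_file_order(path, index, consume):
    """Read blobs sorted by file offset and hand each one to `consume`.

    index: dict of name -> (offset, length); a hypothetical layout.
    consume: callable(name, data), run on a worker thread so that the next
    read can overlap with processing of the previous blob.
    """
    pending = queue.Queue(maxsize=4)  # bounded so reads cannot run far ahead

    def worker():
        while True:
            item = pending.get()
            if item is None:  # sentinel: no more blobs
                break
            consume(*item)

    thread = threading.Thread(target=worker)
    thread.start()
    # Sort by offset so the file is scanned front to back.
    order = sorted(index.items(), key=lambda kv: kv[1][0])
    with open(path, "rb") as f:
        for name, (offset, length) in order:
            f.seek(offset)
            pending.put((name, f.read(length)))
    pending.put(None)
    thread.join()
    return [name for name, _ in order]


def demo():
    """Write three fake tensors, then load them via a shuffled index."""
    blobs = {"layer0": b"a" * 8, "layer1": b"b" * 4, "layer2": b"c" * 6}
    with tempfile.NamedTemporaryFile(delete=False) as f:
        index, offset = {}, 0
        for name, data in blobs.items():
            f.write(data)
            index[name] = (offset, len(data))
            offset += len(data)
        path = f.name
    shuffled = {k: index[k] for k in ("layer2", "layer0", "layer1")}
    received = {}
    try:
        order = load_in_file_order(path, shuffled, received.__setitem__)
    finally:
        os.remove(path)
    return order, received == blobs
```

In a real engine the `consume` step would stand in for host-to-device copies or collective communication; here it just stores bytes in a dictionary.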

Core claim

RTP-LLM addresses fundamental bottlenecks through integrated design: file-order-driven I/O and parallel I/O-communication overlapping for model loading; a Prefill-Decode Disaggregation architecture that decouples compute-intensive prefill from memory-bound decode phases, combined with hierarchical multi-tiered KV cache management for efficient cache reuse; modular speculative decoding supporting multiple algorithms; adaptive KV cache quantization; and decoupled multimodal processing with multi-level parallelism. Evaluations against vLLM and SGLang show 4.7x-6.3x model loading speedup, 35-37% TTFT P95 latency reduction with 215% cache reuse improvement, 1.12x-2.48x and 1.86x-2.52x throughput improvements in speculative decoding and multimodal inference, respectively, and 35-40% batch latency reduction with 1.9x-3.0x TTFT improvement in quantized inference.

What carries the argument

The Prefill-Decode Disaggregation architecture paired with hierarchical multi-tiered KV cache management, which separates prefill and decode phases while enabling efficient cache reuse across tiers.
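To make the cache-reuse idea concrete, here is a minimal sketch of prefix matching against a tiered cache. It assumes block-level chained hashing, a common approach in serving systems; the block size, the tier names (`gpu`, `host`), and both function names are hypothetical and are not taken from the RTP-LLM code base.

```python
import hashlib

BLOCK = 4  # tokens per cache block (illustrative value)


def prefix_hash_keys(tokens, block=BLOCK):
    """Return one chained hash per full block of tokens.

    Key i depends on every token up to the end of block i, so two requests
    share key i only if their first (i + 1) * block tokens are identical.
    """
    keys, prev = [], b""
    full = len(tokens) - len(tokens) % block  # ignore a trailing partial block
    for i in range(0, full, block):
        chunk = ",".join(map(str, tokens[i:i + block])).encode()
        prev = hashlib.sha256(prev + chunk).digest()
        keys.append(prev.hex())
    return keys


def match_prefix(keys, tiers):
    """Return the tier holding each matched block, stopping at the first miss.

    tiers: list of (name, set_of_keys), ordered from fastest to slowest.
    """
    hits = []
    for key in keys:
        tier = next((name for name, store in tiers if key in store), None)
        if tier is None:
            break  # later blocks cannot be reused without this one
        hits.append(tier)
    return hits


# A previous request cached two blocks: one still on GPU, one spilled to host.
cached = prefix_hash_keys([1, 2, 3, 4, 5, 6, 7, 8])
tiers = [("gpu", {cached[0]}), ("host", {cached[1]})]

# A new request shares the first 8 tokens and then diverges.
hits = match_prefix(prefix_hash_keys(list(range(1, 14))), tiers)
reused_tokens = len(hits) * BLOCK
```

With these inputs `hits` is `["gpu", "host"]` and `reused_tokens` is 8, so prefill would only need to compute the remaining tokens.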

If this is right

Model loading becomes 4.7x-6.3x faster via file-order-driven I/O and overlapping.
Production traffic scheduling achieves 35-37% TTFT P95 reduction alongside 215% cache reuse improvement.
Speculative decoding delivers 1.12x-2.48x throughput improvement.
Multimodal inference reaches 1.86x-2.52x throughput gains.
Quantized inference reduces batch latency 35-40% and improves TTFT by 1.9x-3.0x.
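The list above mixes two kinds of figures: percentage reductions and multiplicative speedups. A short conversion helps compare them. The sketch below is plain arithmetic; the reading of "215% improvement" as a relative increase (new = old × 3.15) is an assumption, since the source does not define it.

```python
def reduction_to_speedup(reduction_pct):
    """A latency cut of r% means new = old * (1 - r/100), so speedup = old / new."""
    return 1.0 / (1.0 - reduction_pct / 100.0)


def improvement_to_ratio(improvement_pct):
    """A relative improvement of p% means new = old * (1 + p/100)."""
    return 1.0 + improvement_pct / 100.0


print(round(reduction_to_speedup(35), 2))   # 1.54
print(round(reduction_to_speedup(37), 2))   # 1.59
print(round(reduction_to_speedup(40), 2))   # 1.67
print(round(improvement_to_ratio(215), 2))  # 3.15
```

Under these definitions, the 35-37% TTFT reduction corresponds to roughly a 1.54x-1.59x speedup, which is a smaller multiplier than the loading and throughput figures quoted alongside it.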

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The disaggregation technique could extend to other phases where compute and memory demands mismatch in distributed AI systems.
Open release of the engine may encourage similar I/O and cache layering patterns in other serving frameworks.
Further gains might appear if the multi-level parallelism is tuned against specific interconnect topologies not detailed in the evaluations.
The hierarchical cache approach suggests potential benefits for energy efficiency in data-center LLM fleets if reuse rates hold under varied traffic.

Load-bearing premise

The production workloads and benchmark setups used for evaluation are representative of typical industrial traffic and the measured gains arise primarily from the described architectural choices rather than unstated hardware configurations or tuning.

What would settle it

Running the same benchmarks and production traces on identical hardware, and finding that vLLM and SGLang match or beat RTP-LLM on loading time, TTFT, throughput, and cache reuse, would falsify the performance claims.

Figures

Figures reproduced from arXiv: 2605.29639 by Bo Cai, Boyu Tan, Chi Zhang, Guiyang Huang, Guoding Li, Hanbo Sun, Jianning Zhang, Jiarui Guo, Juncheng Yin, Kan Liu, Lin Qu, Lin Yang, Lu Chen, Na He, Nan Wang, Silu Zhou, Tao Lan, Tong Yang, Wei Zhang, Wending Bao, Xinfei Shi, Xi Yang, Xiyu Wang, Yaxin Yu, Yinghao Yu, Yuxing Yuan, Zechao Zhang, Zetao Hu, Zongwei Lv.

**Figure 1.** RTP-LLM System Architecture

**Figure 2.** Model Load Optimizations

**Figure 3.** EPD Disaggregation

**Figure 4.** Model loading time comparison for medium-scale models (8B-32B parameters) across different TP configurations.

**Figure 5.** Batch latency and precision loss comparison for Qwen3-32B across different quantization configurations.

**Figure 6.** TTFT and Tokens/s comparison for Qwen3-32B

**Figure 7.** Performance and GPU memory utilization comparison for Qwen/Qwen2.5-VL-7B-Instruct on GQA dataset across

Original abstract

Large Language Models (LLMs) have revolutionized AI applications, but deploying them at scale presents significant challenges. We present RTP-LLM, a high-performance inference engine for industrial-scale LLM deployment, successfully deployed across Alibaba Group serving over 100 million users. RTP-LLM addresses fundamental bottlenecks through integrated design. It optimizes model loading via file-order-driven I/O and parallel I/O-communication overlapping. The Prefill-Decode Disaggregation architecture decouples compute-intensive prefill from memory-bound decode phases, combined with hierarchical multi-tiered KV cache management enabling efficient cache reuse. In addition, RTP-LLM incorporates modular speculative decoding supporting multiple algorithms, adaptive KV cache quantization, and decoupled multimodal processing, with support for multi-level parallelism. Comprehensive evaluations across diverse model architectures (8B-235B parameters) have been conducted, where both controlled benchmarks and real production workloads are used. The results demonstrate RTP-LLM's superior performance against vLLM and SGLang: 4.7x-6.3x model loading speedup, 35-37% TTFT P95 latency reduction with 215% cache reuse improvement in production traffic scheduling, 1.12x-2.48x and 1.86x-2.52x throughput improvements in speculative decoding and multimodal inference, respectively, and 35-40% batch latency reduction with 1.9x-3.0x TTFT improvement in quantized inference. RTP-LLM's production-proven architecture and open-source availability make it a comprehensive solution for industrial LLM deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

RTP-LLM bundles known serving techniques into a live Alibaba deployment and reports production gains over vLLM and SGLang, but the numbers need tighter hardware and tuning controls to attribute cleanly to the architecture.

Look, the core of this paper is RTP-LLM, their inference engine that's live at Alibaba for over 100 million users. It stitches together prefill-decode disaggregation, a hierarchical KV cache, file-order I/O for loading, and modular speculative decoding.

What's actually new is the way they integrated the hierarchical cache for reuse and the production measurements that come with it. The results show solid improvements over vLLM and SGLang across loading speed, TTFT, throughput in speculative and multimodal cases, and quantized inference. The fact that some of this comes from real traffic is the useful bit.

They do a decent job describing the architecture for scale, and releasing it open source helps.

The soft spot is the comparison setup. The stress test note is right to flag that we don't know if the baselines had the same hardware or tuning level. If the paper doesn't pin that down in the full text, the deltas are harder to trust as purely from their choices. No mention of error bars either, which is a minor but real gap for a systems paper.

This one is for people who run LLM services at big scale or want to see how the pieces work together in practice. It has enough real data to be worth reading.

I'd put it through peer review. The deployment experience adds value even with the evaluation questions.

Referee Report

3 major / 1 minor

Summary. The paper presents RTP-LLM, an industrial LLM inference engine deployed at Alibaba serving over 100 million users. It describes optimizations including file-order-driven I/O with parallel overlapping, Prefill-Decode Disaggregation, hierarchical multi-tiered KV cache for reuse, modular speculative decoding, adaptive KV cache quantization, and decoupled multimodal processing. Evaluations on 8B-235B models against vLLM and SGLang report 4.7x-6.3x model loading speedup, 35-37% TTFT P95 reduction with 215% cache reuse improvement in production, 1.12x-2.48x and 1.86x-2.52x throughput gains in speculative and multimodal cases, and 35-40% batch latency reduction with 1.9x-3.0x TTFT improvement in quantized inference.

Significance. If the performance deltas can be isolated to the described architectural choices under controlled conditions, the work would offer a practically significant contribution to production LLM serving systems by demonstrating scalable disaggregation and cache management techniques in real traffic. The open-source release and multi-level parallelism support add value for the community.

major comments (3)

[Evaluation] Evaluation section: The reported speedups (4.7x-6.3x loading, 35-37% TTFT P95 reduction, etc.) are presented without explicit confirmation that vLLM and SGLang baselines used identical cluster hardware, interconnects, CUDA versions, or equivalent per-system tuning. This prevents isolating gains to Prefill-Decode Disaggregation and hierarchical KV cache as claimed in the abstract.
[Abstract] Abstract and Evaluation: All quantitative claims (e.g., 215% cache reuse improvement, 1.12x-2.48x throughput) are given as point estimates without error bars, workload characterization details, or statistical analysis, making it impossible to assess variability or reproducibility of the production traffic results.
[Evaluation] Evaluation section: No description of how production workloads were selected or whether they are representative; the 35-37% TTFT reduction and cache reuse claims rest on the unverified assumption that measured gains arise primarily from the listed features rather than unstated configuration differences.
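One way to attach an uncertainty range to a tail-latency figure such as P95 TTFT, as the comments above request, is a percentile bootstrap over the raw samples. The sketch below is illustrative only: it uses synthetic latencies, and it treats samples as independent, which live production traffic generally is not, so the interval would be a rough guide rather than a rigorous bound.

```python
import math
import random


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * q / 100))
    return ordered[rank - 1]


def bootstrap_ci(values, q=95, rounds=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for the q-th percentile of `values`."""
    rng = random.Random(seed)
    stats = sorted(
        percentile(rng.choices(values, k=len(values)), q)
        for _ in range(rounds)
    )
    low = stats[int(rounds * alpha / 2)]
    high = stats[int(rounds * (1 - alpha / 2)) - 1]
    return low, high


# Synthetic TTFT samples in milliseconds (log-normal, purely illustrative).
_rng = random.Random(42)
ttft_ms = [_rng.lognormvariate(5.0, 0.4) for _ in range(200)]

p95 = percentile(ttft_ms, 95)
low, high = bootstrap_ci(ttft_ms)
print(f"P95 TTFT = {p95:.1f} ms, 95% bootstrap interval [{low:.1f}, {high:.1f}] ms")
```

Reporting the interval next to each point estimate, for both the engine under test and each baseline, would show whether a gap such as 35-37% exceeds run-to-run variation.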

minor comments (1)

[Abstract] The abstract states results across 'diverse model architectures (8B-235B parameters)' but provides no table or section listing the exact models, batch sizes, or sequence lengths used in each comparison.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on the evaluation methodology and reproducibility aspects. We address each major comment below and will revise the manuscript to improve clarity on experimental setups and workload details where feasible.

Referee: [Evaluation] Evaluation section: The reported speedups (4.7x-6.3x loading, 35-37% TTFT P95 reduction, etc.) are presented without explicit confirmation that vLLM and SGLang baselines used identical cluster hardware, interconnects, CUDA versions, or equivalent per-system tuning. This prevents isolating gains to Prefill-Decode Disaggregation and hierarchical KV cache as claimed in the abstract.

Authors: We acknowledge that the current manuscript does not explicitly detail the hardware equivalence for baselines. All reported comparisons were performed on the same production-grade cluster with identical hardware, interconnects (e.g., NVLink and InfiniBand), CUDA versions, and driver configurations; baseline systems received equivalent tuning efforts to the best of our ability. In the revised version, we will add a dedicated 'Experimental Setup' subsection in the Evaluation section that explicitly confirms these controls and describes how gains are isolated to the architectural features. revision: yes
Referee: [Abstract] Abstract and Evaluation: All quantitative claims (e.g., 215% cache reuse improvement, 1.12x-2.48x throughput) are given as point estimates without error bars, workload characterization details, or statistical analysis, making it impossible to assess variability or reproducibility of the production traffic results.

Authors: We agree that the absence of error bars and statistical details limits assessment of variability. Production metrics reflect aggregated observations over multi-day periods of live traffic rather than repeated controlled trials, which inherently limits the applicability of traditional error bars. In the revision, we will add a note on measurement methodology, include any available variability ranges for key metrics, and clarify that point estimates represent observed typical improvements under the described conditions. revision: partial
Referee: [Evaluation] Evaluation section: No description of how production workloads were selected or whether they are representative; the 35-37% TTFT reduction and cache reuse claims rest on the unverified assumption that measured gains arise primarily from the listed features rather than unstated configuration differences.

Authors: Production workloads were drawn from actual Alibaba user traffic spanning multiple model sizes and request patterns to ensure representativeness of real deployment scenarios. Configuration differences between systems were minimized by using the same cluster and equivalent tuning. In the revised manuscript, we will expand the Evaluation section with a high-level workload characterization (e.g., request rate distributions and model mix) while respecting confidentiality constraints, and reaffirm that comparisons control for non-architectural factors. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark results

The paper is a systems description of an LLM inference engine. All performance claims (speedups, latency reductions, throughput gains) are presented as direct empirical measurements from controlled benchmarks and production traffic, with no mathematical derivations, first-principles predictions, fitted parameters renamed as outputs, or load-bearing self-citations of uniqueness theorems. No equations or ansatzes are invoked that could reduce to inputs by construction. This matches the expected non-finding for engineering papers whose central claims rest on external falsifiable measurements rather than internal derivation chains.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a systems engineering paper the abstract introduces no mathematical free parameters, axioms, or new postulated entities; all contributions are implementation and measurement details.

pith-pipeline@v0.9.1-grok · 5907 in / 1249 out tokens · 32819 ms · 2026-06-28T23:50:32.267679+00:00 · methodology

Reference graph

Works this paper leans on

66 extracted references · 23 canonical work pages · 15 internal anchors

[1]

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

[2]

Amey Agrawal, Nitin Kedia, Anmol Agarwal, Jayashree Mohan, Nipun Kwatra, Souvik Kundu, Ramachandran Ramjee, and Alexey Tumanov. 2025. On Evaluating Performance of LLM Inference Serving Systems. arXiv preprint arXiv:2507.09019 (2025)

[3]

Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024. Taming Throughput-Latency tradeoff in LLM inference with Sarathi-Serve. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 117–134

[4]

Alibaba. 2025. Source code of RTP-LLM. https://github.com/alibaba/rtp-llm

[5]

Aone. 2025. Aone Copilot. Visual Studio Code Extension. https://marketplace.visualstudio.com/items?itemName=Aone.aone-copilot Accessed: 2025-11-15

[6]

Hicham Badri and Appu Shaji. 2023. Half-Quadratic Quantization of Large Machine Learning Models. https://mobiusml.github.io/hqq_blog/

[7]

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609 (2023)

[9]

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 2023. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. arXiv preprint arXiv:2308.12966 (2023)

[10]

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901
[11]

NVIDIA Corporation. 2025. NVIDIA Collective Communications Library (NCCL) User Guide. Online Documentation. https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/overview.html Accessed: 2025-11-15

[12]

Tri Dao. 2023. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691 (2023)

[13]

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in neural information processing systems 35 (2022), 16344–16359

[15]

Yangshen Deng, Zhengxin You, Long Xiang, Qilong Li, Peiqi Yuan, Zhaoyang Hong, Yitao Zheng, Wanting Li, Runzhong Li, Haotian Liu, et al. 2025. AlayaDB: The Data Foundation for Efficient and Effective Long-context LLM Inference. In Companion of the 2025 International Conference on Management of Data. 364–377

[16]

Alexey Dosovitskiy. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)

[17]

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models. arXiv e-prints (2024), arXiv–2407

[18]

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2022. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323 (2022)

[19]

Shihong Gao, Xin Zhang, Yanyan Shen, and Lei Chen. 2025. Apt-Serve: Adaptive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serving. Proceedings of the ACM on Management of Data 3, 3 (2025), 1–28

[20]

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)

[21]

Kostas Hatalis, Despina Christou, Joshua Myers, Steven Jones, Keith Lambert, Adam Amos-Binks, Zohreh Dannenhauer, and Dustin Dannenhauer. 2023. Memory matters: The need to improve long-term memory in llm-agents. In Proceedings of the AAAI Symposium Series, Vol. 2. 277–280

[22]

Yongjun He, Yao Lu, and Gustavo Alonso. 2024. Deferred continuous batching in resource-efficient large language model serving. In Proceedings of the 4th Workshop on Machine Learning and Systems. 98–106
[23]

Kalle Hilsenbek. 2024. Breaking the Attention Bottleneck. arXiv preprint arXiv:2406.10906 (2024)

[24]

Drew A. Hudson and Christopher D. Manning. 2019. GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering. arXiv:1902.09506 [cs.CL] https://arxiv.org/abs/1902.09506

[25]

Hugging Face Inc. 2024. Text Generation Inference (TGI): High-Performance Inference Engine for Large Language Models. https://huggingface.co/text-generation-inference

[26]

Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Shufan Liu, Xuanzhe Liu, and Xin Jin. 2024. Ragcache: Efficient knowledge caching for retrieval-augmented generation. ACM Transactions on Computer Systems (2024)

[27]

Saehan Jo and Immanuel Trummer. 2025. SpareLLM: Automatically Selecting Task-Specific Minimum-Cost Large Language Models under Equivalence Constraint. Proceedings of the ACM on Management of Data 3, 3 (2025), 1–26

[28]

Uday Kamath, Kevin Keenan, Garrett Somers, and Sarah Sorenson. 2024. LLMs in Production. In Large Language Models: A Deep Dive: Bridging Theory and Practice. Springer, 315–373

[29]

Mikhail V Koroteev. 2021. BERT: a review of applications in natural language processing and understanding. arXiv preprint arXiv:2103.11943 (2021)

[30]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles

[31]

Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. 2024. InfiniGen: Efficient generative inference of large language models with dynamic KV cache management. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 155–172

[32]

Baolin Li, Yankai Jiang, Vijay Gadepally, and Devesh Tiwari. 2024. Llm inference serving: Survey of recent advances and opportunities. In 2024 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 1–8

[33]

Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. 2024. Eagle: Speculative sampling requires rethinking feature uncertainty. arXiv preprint arXiv:2401.15077 (2024)

[34]

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. 2024. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of machine learning and systems 6 (2024), 87–100
[35]

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. 2024. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437 (2024)

[36]

Guangda Liu, Chengwei Li, Jieru Zhao, Chenqi Zhang, and Minyi Guo. 2025. Clusterkv: Manipulating llm kv cache in semantic space for recallable compression. In 2025 62nd ACM/IEEE Design Automation Conference (DAC). IEEE, 1–7

[37]

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual instruction tuning. Advances in neural information processing systems 36 (2023), 34892–34916

[38]

Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, Yanru Chen, Huabin Zheng, Yibo Liu, Shaowei Liu, Bohong Yin, Weiran He, Han Zhu, Yuzhi Wang, Jianzhou Wang, Mengnan Dong, Zheng Zhang, Yongsheng Kang, Hao Zhang, Xinran Xu, Yutao Zhang, Yuxin Wu, Xinyu Zhou, and Zhilin Yang. 2025. Muon ...

[39]

Mikasenghaas and Hugging Face Dataset Authors. 2024. Wikitext-2 Dataset Mirror. Hugging Face Dataset Card. https://hf-mirror.com/datasets/mikasenghaas/wikitext-2 Accessed: 2024-11

[40]

NVIDIA. 2023. TensorRT LLM. https://github.com/NVIDIA/TensorRT-LLM

[41]

Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. 2024. Splitwise: Efficient generative llm inference using phase splitting. In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA). IEEE, 118–132

[42]

Satya Naga Mallika Pothukuchi. 2025. LLMOps: A Comprehensive Guide to Deploying Large Language Models in Production. IJSAT-International Journal on Science and Technology 16, 1 (2025)

[43]

Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Feng Ren, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. 2025. Mooncake: Trading more storage for less computation—a KVCache-centric architecture for serving LLM chatbot. In 23rd USENIX Conference on File and Storage Technologies (FAST 25). 155–170

[44]

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9

[45]

Gursimran Singh, Xinglu Wang, Yifan Hu, Timothy Yu, Linzi Xing, Wei Jiang, Zhefeng Wang, Xiaolong Bai, Yi Li, Ying Xiong, Yong Zhang, and Zhenan Fan. Efficiently Serving Large Multimodal Models Using EPD Disaggregation. arXiv:2501.05460 [cs.DC] https://arxiv.org/abs/2501.05460

[47]

Benjamin Spector and Chris Re. 2023. Accelerating llm inference with staged speculative decoding. arXiv preprint arXiv:2308.04623 (2023)

[48]

Qwen Team. 2024. Qwen2.5: A Party of Foundation Models. https://qwenlm.github.io/blog/qwen2.5/

[49]

Qwen Team. 2025. Qwen2.5-VL. https://qwenlm.github.io/blog/qwen2.5-vl/
[50]

Qwen Team. 2025. Qwen3 Technical Report. arXiv:2505.09388 [cs.CL] https: //arxiv.org/abs/2505.09388 Boyu Tan et al

work page internal anchor Pith review Pith/arXiv arXiv 2025
[51]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need.Advances in neural information processing systems30 (2017)

2017
[52]

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. 2024. Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution.arXiv preprint arXiv:2409.12191(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[53]

Yiding Wang, Kai Chen, Haisheng Tan, and Kun Guo. 2023. Tabi: An efficient multi-level inference system for large language models. InProceedings of the Eighteenth European Conference on Computer Systems. 233–248

2023
[54]

Yuxin Wang, Yuhan Chen, Zeyu Li, Xueze Kang, Yuchu Fang, Yeju Zhou, Yang Zheng, Zhenheng Tang, Xin He, Rui Guo, et al . 2025. Burstgpt: A real-world workload dataset to optimize llm serving systems. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2. 5831–5841

2025
[55]

Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, and Zhifang Sui. 2024. Unlocking efficiency in large language model inference: A comprehensive survey of speculative decoding.arXiv preprint arXiv:2401.07851(2024)

work page arXiv 2024
[56]

An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Cheng- peng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfen...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[57]

Yuqing Yang, Lei Jiao, and Yuedong Xu. 2024. A queueing theoretic perspective on low-latency llm inference with variable token length. In2024 22nd Interna- tional Symposium on Modeling and Optimization in Mobile, Ad Hoc, and Wireless Networks (WiOpt). IEEE, 273–280

2024
[58]

Takeshi Yoshimura, Tatsuhiro Chiba, Manish Sethi, Daniel Waddington, and Swaminathan Sundararaman. 2025. Speeding up Model Loading with fastsafeten- sors. arXiv:2505.23072 [cs.DC] https://arxiv.org/abs/2505.23072

work page arXiv 2025
[59]

Haoran You, Yichao Fu, Zheng Wang, Amir Yazdanbakhsh, and Yingyan Ce- line Lin. 2024. When linear attention meets autoregressive decoding: Towards more effective and efficient linearized large language models.arXiv preprint arXiv:2406.07368(2024)

work page arXiv 2024
[60]

Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung- Gon Chun. 2022. Orca: A distributed serving system for {Transformer-Based} generative models. In16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 521–538

2022
[61]

Chen Zhang, Kuntai Du, Shu Liu, Woosuk Kwon, Xiangxi Mo, Yufeng Wang, Xiaoxuan Liu, Kaichao You, Zhuohan Li, Mingsheng Long, et al. 2025. JENGA: Ef- fective memory management for serving LLM with heterogeneity. InProceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles. 446–461

2025
[62]

Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. 2023. H2o: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems36 (2023), 34661–34710

2023
[63]

Chenggang Zhao, Shangyan Zhou, Liyue Zhang, Chengqi Deng, Zhean Xu, Yuxuan Liu, Kuai Yu, Jiashi Li, and Liang Zhao. 2025. DeepEP: an efficient expert-parallel communication library. https://github.com/deepseek-ai/DeepEP

[64]

Pinxue Zhao, Hailin Zhang, Fangcheng Fu, Xiaonan Nie, Qibin Liu, Fang Yang, Yuanbo Peng, Dian Jiao, Shuaipeng Li, Jinbao Xue, et al. 2025. MEMO: Fine-grained Tensor Management For Ultra-long Context LLM Training. Proceedings of the ACM on Management of Data 3, 1 (2025), 1–28.

[65]

Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Livia Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. 2024. Sglang: Efficient execution of structured language model programs. Advances in neural information processing systems 37 (2024), 62557–62583.

[66]

Zhen Zheng, Xin Ji, Taosong Fang, Fanghao Zhou, Chuanjie Liu, and Gang Peng. 2024. Batchllm: Optimizing large batched llm inference with global prefix sharing and throughput-oriented token batching. arXiv preprint arXiv:2412.03594 (2024).

[68]

Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. 2024. DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 193–210.

