pith. machine review for the scientific record.

arxiv: 2309.06180 · v1 · submitted 2023-09-12 · 💻 cs.LG · cs.DC

Recognition: 3 theorem links

Efficient Memory Management for Large Language Model Serving with PagedAttention

Cody Hao Yu, Hao Zhang, Ion Stoica, Joseph E. Gonzalez, Lianmin Zheng, Siyuan Zhuang, Woosuk Kwon, Ying Sheng, Zhuohan Li


Pith reviewed 2026-05-12 14:58 UTC · model grok-4.3

classification 💻 cs.LG cs.DC
keywords large language models · LLM serving · key-value cache · paged attention · memory management · throughput

The pith

PagedAttention manages LLM key-value caches like operating-system virtual memory to eliminate fragmentation and allow sharing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the dynamic size of the key-value cache limits batch sizes in LLM serving because existing systems waste memory through fragmentation and duplication. By borrowing paging from virtual memory systems, PagedAttention allocates and shares cache blocks flexibly, achieving near-zero waste while preserving correctness. This enables the vLLM system to pack more requests into each batch. Experiments show the approach raises throughput 2-4× at unchanged latency versus prior systems, with larger gains on long sequences and complex decoding. The result matters because higher effective batching directly increases hardware utilization for large-model inference.
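
To size the bottleneck concretely, here is a back-of-envelope calculation (ours, not a table from the paper, using standard OPT-13B dimensions; it matches the roughly 800 KB per token the paper quotes for a 13B model):

```python
# KV-cache bytes per generated token in FP16, OPT-13B-like dimensions.
num_layers = 40      # OPT-13B transformer layers
hidden_size = 5120   # num_heads * head_dim
kv_factor = 2        # one key vector and one value vector per layer
bytes_fp16 = 2

per_token = kv_factor * num_layers * hidden_size * bytes_fp16
print(f"{per_token / 1024:.0f} KB per token")                        # 800 KB
print(f"{per_token * 2048 / 2**30:.2f} GiB per 2048-token request")  # ~1.56 GiB
```

At that rate a handful of maximum-length requests exhausts a 40 GB GPU once the weights are resident, so every block lost to fragmentation is a batch slot lost.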

Core claim

PagedAttention is an attention algorithm that stores the key-value cache in non-contiguous blocks managed like virtual-memory pages; on top of it, vLLM achieves near-zero KV-cache waste and flexible intra- and inter-request sharing, delivering 2-4× higher throughput than FasterTransformer or Orca at the same latency.

What carries the argument

PagedAttention, the algorithm that divides the key-value cache into fixed-size blocks (pages) that can be allocated, swapped, and shared independently of contiguous memory layout.
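
A minimal sketch of that mechanism (our Python illustration; the names, the 16-token block size, and the dict-based table are placeholders, not vLLM's actual data structures):

```python
# Toy paged KV-cache manager: logical token positions map to fixed-size
# physical blocks through a per-sequence block table, the way an OS page
# table maps virtual pages to physical frames.
BLOCK_SIZE = 16  # tokens per block (illustrative; configurable in practice)

class PagedKVCache:
    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> physical block ids

    def append_token(self, seq_id: int, pos: int) -> tuple[int, int]:
        """Return (physical_block, offset) where token pos's K/V vectors live."""
        table = self.block_tables.setdefault(seq_id, [])
        if pos // BLOCK_SIZE == len(table):
            # Crossed into a new logical block: allocate on demand, anywhere.
            # An empty free list here is the cue to preempt or swap a sequence.
            table.append(self.free_blocks.pop())
        return table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def free(self, seq_id: int) -> None:
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
```

Internal fragmentation is bounded by at most one partially filled block per sequence, which is where the near-zero-waste claim comes from.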

If this is right

  • Batch sizes can grow without proportional memory increase, directly raising tokens processed per second.
  • Long-context and beam-search workloads become practical on the same hardware because memory is no longer the dominant limit.
  • Sharing of cache blocks across requests reduces total memory footprint when prompts overlap (see the copy-on-write sketch after this list).
  • Memory usage becomes more predictable, simplifying capacity planning for production serving clusters.
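
On the sharing bullet above, a hedged sketch of reference-counted blocks with copy-on-write, our reconstruction of the sharing mechanism the paper describes rather than vLLM's code:

```python
# Reference-counted block sharing with copy-on-write (CoW).
# Sequences forked from a shared prompt reuse its blocks read-only; the
# first write to a still-shared block allocates a private copy instead.
class SharedBlockPool:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.refcount = [0] * num_blocks

    def alloc(self) -> int:
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def fork(self, parent_table: list[int]) -> list[int]:
        """New sequence shares the parent's prompt blocks without copying."""
        for block in parent_table:
            self.refcount[block] += 1
        return list(parent_table)

    def write(self, table: list[int], i: int) -> int:
        """Privatize block table[i] before mutating it (copy-on-write)."""
        if self.refcount[table[i]] > 1:   # still shared by another sequence
            self.refcount[table[i]] -= 1
            table[i] = self.alloc()       # a real system copies K/V contents here
        return table[i]
```

With n beams over a p-block prompt, the prompt is stored once rather than n times; only diverging tail blocks become private.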

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The paging abstraction could be reused for other dynamic tensor structures that grow during inference.
  • Hardware accelerators might add native support for paged attention to remove the remaining software mapping cost.
  • Because the code is open, other serving frameworks can adopt the same block layout without reimplementing the attention kernel.

Load-bearing premise

That translating the key-value cache into paged blocks adds negligible cost to attention arithmetic and produces identical model outputs on every workload.

What would settle it

Measure KV-cache memory utilization and exact token outputs on a benchmark with highly variable sequence lengths; if measured waste stays far from zero, or any output token differs from a non-paged baseline, the central claim does not hold.
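
One concrete shape for the output half of that test (a NumPy sketch of ours, not an artifact from the paper): scatter the cache into shuffled blocks, read it back through a block table, and require bitwise-identical attention output, which should hold whenever the reduction order is unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
d, seq, B = 64, 40, 16                  # head dim, cached tokens, block size
n_blocks = -(-seq // B)                 # ceil(seq / B)

q = rng.standard_normal(d).astype(np.float32)
K = rng.standard_normal((seq, d)).astype(np.float32)
V = rng.standard_normal((seq, d)).astype(np.float32)

def attend(K, V, q):
    s = K @ q / np.sqrt(d)
    w = np.exp(s - s.max()); w /= w.sum()
    return w @ V

# Scatter the cache into shuffled "physical" blocks plus a block table.
perm = rng.permutation(n_blocks)        # logical block -> physical block
Kp = np.zeros((n_blocks, B, d), np.float32); Vp = np.zeros_like(Kp)
for logical, physical in enumerate(perm):
    lo, hi = logical * B, min(logical * B + B, seq)
    Kp[physical, :hi - lo] = K[lo:hi]; Vp[physical, :hi - lo] = V[lo:hi]

# Gather back through the table; the result must match bit for bit.
Kg = np.concatenate([Kp[p] for p in perm])[:seq]
Vg = np.concatenate([Vp[p] for p in perm])[:seq]
assert np.array_equal(attend(K, V, q), attend(Kg, Vg, q))
```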

read the original abstract

High throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks dynamically. When managed inefficiently, this memory can be significantly wasted by fragmentation and redundant duplication, limiting the batch size. To address this problem, we propose PagedAttention, an attention algorithm inspired by the classical virtual memory and paging techniques in operating systems. On top of it, we build vLLM, an LLM serving system that achieves (1) near-zero waste in KV cache memory and (2) flexible sharing of KV cache within and across requests to further reduce memory usage. Our evaluations show that vLLM improves the throughput of popular LLMs by 2-4× with the same level of latency compared to the state-of-the-art systems, such as FasterTransformer and Orca. The improvement is more pronounced with longer sequences, larger models, and more complex decoding algorithms. vLLM's source code is publicly available at https://github.com/vllm-project/vllm

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes PagedAttention, an attention algorithm modeled on OS paging to manage the dynamic, per-request key-value cache during LLM inference. By organizing the KV cache into fixed-size blocks with a page table, vLLM achieves near-zero fragmentation and enables KV cache sharing within and across requests. Empirical results on standard models claim 2-4× higher throughput than FasterTransformer and Orca at equivalent latency, with larger gains for long sequences, bigger models, and complex decoding.

Significance. If the throughput claims hold under the reported conditions, the work is significant for production LLM serving: it directly attacks the memory-fragmentation bottleneck that limits batch size, potentially lowering inference cost and latency for long-context workloads. Public code release aids reproducibility and adoption.

major comments (1)
  1. [§5, Evaluation] The 2-4× throughput numbers are obtained by comparing against external systems; the manuscript does not isolate the incremental latency cost of PagedAttention’s page-table indirection and non-contiguous loads inside the fused attention kernels (e.g., via a same-batch-size contiguous-cache baseline). Without this measurement it remains possible that the reported gains are partly offset by reduced GPU utilization at the larger batch sizes enabled by reduced fragmentation.
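
For what it is worth, the core of the requested ablation can be prototyped outside the fused kernels; a hedged PyTorch sketch (ours, CPU-runnable; only the memory indirection differs between the two paths, and the tensor sizes are arbitrary):

```python
import time
import torch

B, H, seq, d, blk = 8, 32, 1024, 128, 16   # batch, heads, tokens, head dim, block size
n_blk = seq // blk
K_contig = torch.randn(B, H, seq, d)
q = torch.randn(B, H, 1, d)

# Same data laid out as shuffled blocks plus a block table.
table = torch.randperm(n_blk)                                # physical slot -> logical block
K_blocks = K_contig.view(B, H, n_blk, blk, d)[:, :, table]   # "paged" layout
inv = torch.argsort(table)                                   # logical block -> physical slot

def score_contig():
    return q @ K_contig.transpose(-1, -2)

def score_paged():
    Kg = K_blocks[:, :, inv].reshape(B, H, seq, d)           # gather through the table
    return q @ Kg.transpose(-1, -2)

for fn in (score_contig, score_paged):
    t0 = time.perf_counter()
    for _ in range(50):
        fn()
    print(fn.__name__, f"{(time.perf_counter() - t0) / 50 * 1e3:.2f} ms/iter")

assert torch.equal(score_contig(), score_paged())            # identical scores
```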
minor comments (2)
  1. [Abstract, §5] Benchmark configurations (sequence lengths, batch sizes, hardware, exact model variants) are summarized but not tabulated; adding a concise table would improve clarity.
  2. [§3] The description of block allocation and page-table lookup is clear at a high level but does not specify the exact data structures or cache-line effects inside the CUDA kernels; a short pseudocode listing would help readers replicate the implementation.
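
In the spirit of the pseudocode this comment requests, a single-head NumPy rendering of the lookup path (our reconstruction from the paper's prose, not the fused CUDA kernel, and silent on cache-line behavior):

```python
import numpy as np

def paged_attention(q, k_cache, v_cache, block_table, context_len, block_size):
    """One decode step of paged attention for a single head (illustrative).

    k_cache, v_cache: [num_physical_blocks, block_size, head_dim] pools.
    block_table: logical block index -> physical block id for this sequence.
    """
    # Page-table lookup per token position: logical -> (physical block, offset).
    phys = [block_table[p // block_size] for p in range(context_len)]
    offs = [p % block_size for p in range(context_len)]
    keys = np.stack([k_cache[b, o] for b, o in zip(phys, offs)])  # non-contiguous reads
    vals = np.stack([v_cache[b, o] for b, o in zip(phys, offs)])
    s = keys @ q / np.sqrt(q.shape[-1])
    w = np.exp(s - s.max()); w /= w.sum()
    return w @ vals
```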

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive summary and recommendation for minor revision. We address the single major comment below.

read point-by-point responses
  1. Referee: [§5, Evaluation] The 2-4× throughput numbers are obtained by comparing against external systems; the manuscript does not isolate the incremental latency cost of PagedAttention’s page-table indirection and non-contiguous loads inside the fused attention kernels (e.g., via a same-batch-size contiguous-cache baseline). Without this measurement it remains possible that the reported gains are partly offset by reduced GPU utilization at the larger batch sizes enabled by reduced fragmentation.

    Authors: We agree that an intra-system ablation isolating the page-table and non-contiguous access overhead would strengthen the evaluation. Our end-to-end results compare against external baselines because that is the relevant metric for practitioners; the reported throughput gains arise primarily from the larger batch sizes made possible by near-zero fragmentation. Nevertheless, the concern is valid: any kernel-level slowdown could partially offset those gains at scale. We will revise §5 to include a same-batch-size contiguous-cache baseline inside vLLM (by temporarily disabling the page table and forcing contiguous allocation) and report the resulting latency difference for the attention kernels. This addition will quantify the incremental cost directly. revision: yes

Circularity Check

0 steps flagged

No significant circularity in PagedAttention derivation or claims

full rationale

The paper proposes PagedAttention as an OS-paging-inspired algorithm for KV-cache management, implements it in vLLM, and supports its 2-4× throughput claims via direct empirical benchmarks against independent external systems (FasterTransformer, Orca). No mathematical derivations, fitted parameters presented as predictions, self-definitional equations, or load-bearing self-citations appear in the abstract or described chain. All performance results are externally falsifiable measurements rather than reductions to internal inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that fine-grained paging of KV cache memory is feasible on GPU hardware without correctness or performance penalties.

axioms (1)
  • domain assumption: GPU memory allocation can be performed at fine granularity with low overhead for attention operations.
    This enables the paging technique to avoid fragmentation while preserving model behavior.
invented entities (1)
  • PagedAttention · no independent evidence
    purpose: To manage dynamic KV cache memory using paging for efficient LLM serving
    New algorithm proposed to solve the fragmentation problem.

pith-pipeline@v0.9.0 · 5524 in / 1159 out tokens · 49404 ms · 2026-05-12T14:58:39.794128+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Forward citations

Cited by 33 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. NCCLZ: Compression-Enabled GPU Collectives with Decoupled Quantization and Entropy Coding

    cs.DC 2026-05 unverdicted novelty 7.0

    NCCLZ decouples quantization and entropy coding across NCCL stack layers to enable overlapped compression, delivering up to 9.65x speedup over plain NCCL on scientific and training workloads.

  2. The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures

    cs.DC 2026-05 unverdicted novelty 7.0

    Power capping is illusory in LLM decode as memory-bound operation leaves power headroom untouched on 700 W GPUs, while SM clock locking saves up to 32% energy and three DVFS classes appear across attention types.

  3. Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference

    cs.DC 2026-05 unverdicted novelty 7.0

    EEP makes wide expert-parallel MoE serving survive single-rank failures with an 11s recovery pause, 8s reintegration pause, and throughput restored to 95% of pre-fault level within 52s while staying within 4.4% of a f...

  4. Non-Monotonic Latency in Apple MPS Decoding: KV Cache Interactions and Execution Regimes

    cs.LG 2026-05 accept novelty 7.0

    Apple MPS decoding exhibits non-monotonic latency with spikes up to 21x due to KV cache interactions and execution regimes, unlike monotonic behavior on CPU and CUDA.

  5. DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards

    cs.LG 2026-05 unverdicted novelty 7.0

    DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.

  6. KL for a KL: On-Policy Distillation with Control Variate Baseline

    cs.LG 2026-05 unverdicted novelty 7.0

    vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensiv...

  7. CacheFlow: Efficient LLM Serving with 3D-Parallel KV Cache Restoration

    cs.DC 2026-04 unverdicted novelty 7.0

    CacheFlow cuts TTFT by 10-62% in batched LLM serving via 3D-parallel KV cache restoration and a two-pointer scheduler that overlaps recompute and I/O.

  8. PermaFrost-Attack: Stealth Pretraining Seeding (SPS) for planting Logic Landmines During LLM Training

    cs.LG 2026-04 unverdicted novelty 7.0

    Stealth Pretraining Seeding plants persistent unsafe behaviors in LLMs via diffuse poisoned web content that activates on precise triggers and evades standard evaluation.

  9. Neural Garbage Collection: Learning to Forget while Learning to Reason

    cs.LG 2026-04 conditional novelty 7.0

    Language models learn to evict KV cache entries end-to-end via reinforcement learning from outcome reward alone, achieving 2-3x cache compression while maintaining accuracy on Countdown, AMC, and AIME tasks.

  10. Sparse Prefix Caching for Hybrid and Recurrent LLM Serving

    cs.LG 2026-04 unverdicted novelty 7.0

    Sparse prefix caching via dynamic programming for optimal checkpoint placement under overlap distributions improves the Pareto frontier for recurrent and hybrid LLM serving on shared-prefix data.

  11. LoSA: Locality Aware Sparse Attention for Block-Wise Diffusion Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    LoSA caches prefix attention for stable tokens in block-wise DLMs and applies sparse attention only to active tokens, preserving near-dense accuracy while achieving 1.54x lower attention density and up to 4.14x speedup.

  12. Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning

    cs.LG 2026-05 unverdicted novelty 6.0

    METIS internalizes curriculum judgment in LLM reinforcement fine-tuning by predicting within-prompt reward variance via in-context learning and jointly optimizing with a self-judgment reward, yielding superior perform...

  13. Slipstream: Trajectory-Grounded Compaction Validation for Long-Horizon Agents

    cs.MA 2026-05 unverdicted novelty 6.0

    Slipstream uses asynchronous compaction with trajectory-grounded judge validation to improve long-horizon agent accuracy by up to 8.8 percentage points and reduce latency by up to 39.7%.

  14. Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling

    cs.CL 2026-04 unverdicted novelty 6.0

    HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.

  15. Open-TQ-Metal: Fused Compressed-Domain Attention for Long-Context LLM Inference on Apple Silicon

    cs.LG 2026-04 unverdicted novelty 6.0

    Fused compressed-domain int4 attention on Apple Silicon delivers 48x speedup and 3.2x KV cache compression for 128K-context 70B models while matching FP16 token predictions.

  16. Record-Remix-Replay: Hierarchical GPU Kernel Optimization using Evolutionary Search

    cs.DC 2026-04 unverdicted novelty 6.0

    R^3 optimizes full scientific applications on GPUs better than tuning kernel parameters or compiler flags alone while running nearly an order of magnitude faster than modern evolutionary search methods.

  17. Reduced-Mass Orbital AI Inference via Integrated Solar, Compute, and Radiator Panels

    cs.DC 2026-04 unverdicted novelty 6.0

    Integrated solar-compute-radiator panels enable orbital satellites to achieve over 100 kW of AI inference compute per metric ton launched, supporting thousands of simultaneous large language model sessions.

  18. MemFactory: Unified Inference & Training Framework for Agent Memory

    cs.CL 2026-03 unverdicted novelty 6.0

    MemFactory is a new unified modular framework for memory-augmented LLM agent inference and training that integrates GRPO and reports up to 14.8% relative gains on MemAgent evaluations.

  19. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection

    cs.CL 2023-10 unverdicted novelty 6.0

    Self-RAG trains LLMs to adaptively retrieve passages on demand and self-critique using reflection tokens, outperforming ChatGPT and retrieval-augmented Llama2 on QA, reasoning, and fact verification.

  20. Position: LLM Inference Should Be Evaluated as Energy-to-Token Production

    cs.CE 2026-05 unverdicted novelty 5.0

    LLM inference should be reframed and evaluated as energy-to-token production with a Token Production Function that accounts for power, cooling, and efficiency ceilings.

  21. An Executable Benchmarking Suite for Tool-Using Agents

    cs.SE 2026-05 unverdicted novelty 5.0

    The paper delivers a unified executable benchmarking suite for tool-using agents that enforces a shared evidence-admission contract across web, code, and micro-task environments.

  22. How Does Chunking Affect Retrieval-Augmented Code Completion? A Controlled Empirical Study

    cs.SE 2026-05 conditional novelty 5.0

    Function-based chunking underperforms other strategies in RAG code completion by 3.57-5.64 points, with context length as the dominant factor.

  23. VLMaxxing through FrameMogging Training-Free Anti-Recomputation for Video Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 5.0

    Training-free adaptive reuse of stable visual state in video VLMs reduces follow-up latency by 15-36x on Qwen2.5-VL while preserving correctness on VideoMME, with smaller first-query speedups via pruning.

  24. StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k

    cs.LG 2026-05 accept novelty 5.0

    Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.

  25. EdgeFM: Efficient Edge Inference for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 5.0

    EdgeFM is an agent-driven framework that strips non-essential features from VLMs and packages reusable optimized kernels, achieving up to 1.49x speedup over TensorRT-Edge-LLM on NVIDIA Orin while enabling first end-to...

  26. Measurement of Generative AI Workload Power Profiles for Whole-Facility Data Center Infrastructure Planning

    eess.SY 2026-04 unverdicted novelty 5.0

    High-resolution power profiles for AI workloads on H100 GPUs are measured and scaled to whole-facility energy demand using a bottom-up model, with the dataset made public.

  27. Silicon Showdown: Performance, Efficiency, and Ecosystem Barriers in Consumer-Grade LLM Inference

    cs.PF 2026-05 unverdicted novelty 4.0

    Nvidia achieves 1.6x throughput with NVFP4 but hits a VRAM wall for 70B+ models, while Apple UMA enables linear scaling to 80B at 4-bit with up to 23x better energy efficiency.

  28. EasyVideoR1: Easier RL for Video Understanding

    cs.CV 2026-04 unverdicted novelty 4.0

    EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.

  29. Hierarchical vs. Flat Iteration in Shared-Weight Transformers

    cs.CL 2026-04 unverdicted novelty 4.0

    Hierarchical two-speed shared-weight recurrence in Transformers shows a sharp performance gap compared to independent layer stacking in empirical language modeling tests.

  30. Secure On-Premise Deployment of Open-Weights Large Language Models in Radiology: An Isolation-First Architecture with Prospective Pilot Evaluation

    cs.CY 2026-03 conditional novelty 4.0

    An isolation-first on-premise architecture for open-weights LLMs in radiology achieved regulatory approval for processing PHI and showed good utility for text-anchored tasks in a one-week pilot with 22 users.

  31. Seed1.5-VL Technical Report

    cs.CV 2025-05 unverdicted novelty 4.0

    Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.

  32. Yi: Open Foundation Models by 01.AI

    cs.CL 2024-03 unverdicted novelty 4.0

    Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.

  33. SLM Finetuning for Natural Language to Domain Specific Code Generation in Production

    cs.LG 2026-04 unverdicted novelty 3.0

    Fine-tuned small language models outperform larger models in natural language to domain-specific code generation with improved performance, latency, and the ability to adapt to customer-specific scenarios without losi...

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · cited by 33 Pith papers · 12 internal anchors

  1. [1]

    Reza Yazdani Aminabadi, Samyam Rajbhandari, Minjia Zhang, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Jeff Rasley, Shaden Smith, Olatunji Ruwase, et al. 2022. DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale. arXiv preprint arXiv:2207.00032 (2022)

  2. [2]

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450 (2016)

  3. [3]

    Yoshua Bengio, Réjean Ducharme, and Pascal Vincent. 2000. A neural probabilistic language model. Advances in neural information processing systems 13 (2000)

  4. [4]

    Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Aurelie Neveol, Mariana Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Specia, Marco Turchi, Karin Verspoor, and Marcos Zampieri. 2016. Findings ...

  5. [5]

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901

  6. [6]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021)

  7. [7]

    Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174 (2016)

  8. [8]

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. https://lmsys.org/blog/2023-03-30-vicuna/

  9. [9]

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311 (2022)

  10. [10]

    Daniel Crankshaw, Gur-Eyal Sela, Xiangxi Mo, Corey Zumar, Ion Stoica, Joseph Gonzalez, and Alexey Tumanov. 2020. InferLine: latency-aware provisioning and scaling for prediction serving pipelines. In Proceedings of the 11th ACM Symposium on Cloud Computing. 477–491

  11. [11]

    Daniel Crankshaw, Xin Wang, Giulio Zhou, Michael J Franklin, Joseph E Gonzalez, and Ion Stoica. 2017. Clipper: A Low-Latency Online Prediction Serving System. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17). 613–627

  12. [12]

    Weihao Cui, Han Zhao, Quan Chen, Hao Wei, Zirui Li, Deze Zeng, Chao Li, and Minyi Guo. 2022. DVABatch: Diversity-aware Multi-Entry Multi-Exit Batching for Efficient Processing of DNN Services on GPUs. In 2022 USENIX Annual Technical Conference (USENIX ATC 22). 183–198

  13. [13–14]

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Advances in Neural Information Processing Systems 35 (2022), 16344–16359

  15. [15]

    Jiarui Fang, Yang Yu, Chengduo Zhao, and Jie Zhou. 2021. TurboTransformers: an efficient GPU serving system for transformer models. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 389–402

  16. [16]

    FastAPI. 2023. FastAPI. https://github.com/tiangolo/fastapi

  17. [17]

    Pin Gao, Lingfan Yu, Yongwei Wu, and Jinyang Li. 2018. Low latency rnn inference with cellular batching. In Proceedings of the Thirteenth EuroSys Conference. 1–15

  18. [18]

    Amir Gholami, Zhewei Yao, Sehoon Kim, Michael W Mahoney, and Kurt Keutzer. 2021. AI and memory wall. RiseLab Medium Post 1 (2021), 6

  19. [19]

    GitHub. 2022. https://github.com/features/copilot

  20. [20]

    Google. 2023. https://bard.google.com/

  21. [21]

    Arpan Gujarati, Reza Karimi, Safya Alzayat, Wei Hao, Antoine Kaufmann, Ymir Vigfusson, and Jonathan Mace. 2020. Serving DNNs like Clockwork: Performance Predictability from the Bottom Up. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). 443–462

  22. [22–23]

    Mingcong Han, Hanze Zhang, Rong Chen, and Haibo Chen. 2022. Microsecond-scale Preemption for Concurrent GPU-accelerated DNN Inferences. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 539–558

  24. [24]

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778

  25. [25]

    Chien-Chin Huang, Gu Jin, and Jinyang Li. 2020. SwapAdvisor: Pushing deep learning beyond the GPU memory limit via smart swapping. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems. 1341–1355

  26. [26]

    Paras Jain, Ajay Jain, Aniruddha Nrusimha, Amir Gholami, Pieter Abbeel, Joseph Gonzalez, Kurt Keutzer, and Ion Stoica. 2020. Checkmate: Breaking the memory wall with optimal tensor rematerialization. Proceedings of Machine Learning and Systems 2 (2020), 497–511

  27. [27]

    Tom Kilburn, David BG Edwards, Michael J Lanigan, and Frank H Sumner. 1962. One-level storage system. IRE Transactions on Electronic Computers 2 (1962), 223–235

  28. [28]

    Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691 (2021)

  29. [29]

    Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190 (2021)

  30. [30]

    Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E Gonzalez, et al. 2023. AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving. arXiv preprint arXiv:2302.11665 (2023)

  31. [31]

    Lingxiao Ma, Zhiqiang Xie, Zhi Yang, Jilong Xue, Youshan Miao, Wei Cui, Wenxiang Hu, Fan Yang, Lintao Zhang, and Lidong Zhou. 2020. Rammer: Enabling holistic deep learning compiler optimizations with rTasks. In Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation. 881–897

  32. [32]

    NVIDIA. [n. d.]. Triton Inference Server. https://developer.nvidia.com/nvidia-triton-inference-server

  33. [33]

    NVIDIA. 2023. FasterTransformer. https://github.com/NVIDIA/FasterTransformer

  34. [34]

    NVIDIA. 2023. NCCL: The NVIDIA Collective Communication Library. https://developer.nvidia.com/nccl

  35. [35–36]

    Christopher Olston, Noah Fiedel, Kiril Gorovoy, Jeremiah Harmsen, Li Lao, Fangwei Li, Vinu Rajashekhar, Sukriti Ramesh, and Jordan Soyke. 2017. TensorFlow-Serving: Flexible, High-Performance ML Serving. arXiv preprint arXiv:1712.06139 (2017)

  37. [37]

    OpenAI. 2020. https://openai.com/blog/openai-api

  38. [38]

    OpenAI. 2022. https://openai.com/blog/chatgpt

  39. [39]

    OpenAI. 2023. https://openai.com/blog/custom-instructions-for- chatgpt

  40. [40]

    OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL]

  41. [41]

    LMSYS ORG. 2023. Chatbot Arena Leaderboard Week 8: Introducing MT-Bench and Vicuna-33B. https://lmsys.org/blog/2023-06-22-leaderboard/

  42. [42]

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 (2019)

  43. [43]

    Shishir G Patil, Paras Jain, Prabal Dutta, Ion Stoica, and Joseph Gonzalez. 2022. POET: Training Neural Networks on Tiny Devices with Integrated Rematerialization and Paging. In International Conference on Machine Learning. PMLR, 17573–17583

  44. [44]

    Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Anselm Levskaya, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. 2022. Efficiently Scaling Transformer Inference. arXiv preprint arXiv:2211.05102 (2022)

  45. [45–46]

    Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. 2021. ZeRO-Offload: Democratizing Billion-Scale Model Training. In USENIX Annual Technical Conference. 551–564

  47. [47]

    Reuters. 2023. https://www.reuters.com/technology/tech-giants-ai-like-bing-bard-poses-billion-dollar-search-problem-2023-02-22/

  48. [48]

    Amazon Web Services. 2023. https://aws.amazon.com/bedrock/

  49. [49]

    Haichen Shen, Lequn Chen, Yuchen Jin, Liangyu Zhao, Bingyu Kong, Matthai Philipose, Arvind Krishnamurthy, and Ravi Sundaram. 2019. Nexus: A GPU cluster engine for accelerating DNN-based video analysis. In Proceedings of the 27th ACM Symposium on Operating Systems Principles. 322–337

  50. [50]

    Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E Gonzalez, et al. 2023. High-throughput Generative Inference of Large Language Models with a Single GPU. arXiv preprint arXiv:2303.06865 (2023)

  51. [51]

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053 (2019)

  52. [52]

    Benoit Steiner, Mostafa Elhoushi, Jacob Kahn, and James Hegarty. 2022. OLLA: Optimizing the Lifetime and Location of Arrays to Reduce the Memory Usage of Neural Networks. (2022). https://doi.org/10.48550/arXiv.2210.12924

  53. [53]

    Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. Advances in neural information processing systems 27 (2014)

  54. [54]

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An Instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca

  55. [55]

    ShareGPT Team. 2023. https://sharegpt.com/

  56. [56]

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)

  57. [57]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017)

  58. [58]

    Jing Wang, Youyou Lu, Qing Wang, Minhui Xie, Keji Huang, and Jiwu Shu. 2022. Pacman: An Efficient Compaction Approach for Log-Structured Key-Value Store on Persistent Memory. In 2022 USENIX Annual Technical Conference (USENIX ATC 22). 773–788

  59. [59]

    Linnan Wang, Jinmian Ye, Yiyang Zhao, Wei Wu, Ang Li, Shuaiwen Leon Song, Zenglin Xu, and Tim Kraska. 2018. SuperNeurons: Dynamic GPU memory management for training deep neural networks. In Proceedings of the 23rd ACM SIGPLAN symposium on principles and practice of parallel programming. 41–53

  60. [60–61]

    Xiaohui Wang, Ying Xiong, Yang Wei, Mingxuan Wang, and Lei Li. 2021. LightSeq: A High Performance Inference Library for Transformers. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Papers. 113–120

  62. [62]

    Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2022. Self-Instruct: Aligning Language Model with Self Generated Instructions. arXiv preprint arXiv:2212.10560 (2022)

  63. [63]

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations. 38–45

  64. [64]

    Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016)

  65. [65]

    Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. Orca: A Distributed Serving System for Transformer-Based Generative Models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 521–538

  66. [66]

    Hong Zhang, Yupeng Tang, Anurag Khandelwal, and Ion Stoica. 2023. SHEPHERD: Serving DNNs in the Wild. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). USENIX Association, Boston, MA, 787–808. https://www.usenix.org/conference/nsdi23/presentation/zhang-hong

  67. [67]

    Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068 (2022)

  68. [68]

    Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P Xing, et al. 2022. Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 559–578

  69. [69]

    Zhe Zhou, Xuechao Wei, Jiejing Zhang, and Guangyu Sun. 2022. PetS: A Unified Framework for Parameter-Efficient Transformers Serving. In 2022 USENIX Annual Technical Conference (USENIX ATC 22). 489–504