pith. machine review for the scientific record.

arxiv: 2309.06180 · v1 · submitted 2023-09-12 · 💻 cs.LG · cs.DC

Recognition: 3 theorem links

Efficient Memory Management for Large Language Model Serving with PagedAttention

Cody Hao Yu, Hao Zhang, Ion Stoica, Joseph E. Gonzalez, Lianmin Zheng, Siyuan Zhuang, Woosuk Kwon, Ying Sheng, Zhuohan Li


Pith reviewed 2026-05-12 14:58 UTC · model grok-4.3

classification 💻 cs.LG cs.DC
keywords large language models · LLM serving · key-value cache · paged attention · memory management · throughput

The pith

PagedAttention manages LLM key-value caches like operating-system virtual memory to eliminate fragmentation and allow sharing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the dynamic size of the key-value cache limits batch sizes in LLM serving because existing systems waste memory through fragmentation and duplication. By borrowing paging from virtual memory systems, PagedAttention allocates and shares cache blocks flexibly, achieving near-zero waste while preserving correctness. This enables the vLLM system to pack more requests into each batch. Experiments show the approach raises throughput 2-4× at unchanged latency versus prior systems, with larger gains on long sequences and complex decoding. The result matters because higher effective batching directly increases hardware utilization for large-model inference.
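
To size the bottleneck concretely, here is a back-of-envelope calculation (ours, not a table from the paper, using standard OPT-13B dimensions; it matches the roughly 800 KB per token the paper quotes for a 13B model):

```python
# KV-cache bytes per generated token in FP16, OPT-13B-like dimensions.
num_layers = 40      # OPT-13B transformer layers
hidden_size = 5120   # num_heads * head_dim
kv_factor = 2        # one key vector and one value vector per layer
bytes_fp16 = 2

per_token = kv_factor * num_layers * hidden_size * bytes_fp16
print(f"{per_token / 1024:.0f} KB per token")                        # 800 KB
print(f"{per_token * 2048 / 2**30:.2f} GiB per 2048-token request")  # ~1.56 GiB
```

At that rate a handful of maximum-length requests exhausts a 40 GB GPU once the weights are resident, so every block lost to fragmentation is a batch slot lost.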

Core claim

PagedAttention is an attention algorithm that stores the key-value cache in non-contiguous blocks managed like virtual-memory pages; on top of it, vLLM achieves near-zero KV-cache waste and flexible intra- and inter-request sharing, delivering 2-4× higher throughput than FasterTransformer or Orca at the same latency.

What carries the argument

PagedAttention, the algorithm that divides the key-value cache into fixed-size blocks (pages) that can be allocated, swapped, and shared independently of contiguous memory layout.
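
A minimal sketch of that mechanism (our Python illustration; the names, the 16-token block size, and the dict-based table are placeholders, not vLLM's actual data structures):

```python
# Toy paged KV-cache manager: logical token positions map to fixed-size
# physical blocks through a per-sequence block table, the way an OS page
# table maps virtual pages to physical frames.
BLOCK_SIZE = 16  # tokens per block (illustrative; configurable in practice)

class PagedKVCache:
    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> physical block ids

    def append_token(self, seq_id: int, pos: int) -> tuple[int, int]:
        """Return (physical_block, offset) where token pos's K/V vectors live."""
        table = self.block_tables.setdefault(seq_id, [])
        if pos // BLOCK_SIZE == len(table):
            # Crossed into a new logical block: allocate on demand, anywhere.
            # An empty free list here is the cue to preempt or swap a sequence.
            table.append(self.free_blocks.pop())
        return table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def free(self, seq_id: int) -> None:
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
```

Internal fragmentation is bounded by at most one partially filled block per sequence, which is where the near-zero-waste claim comes from.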

If this is right

  • Batch sizes can grow without proportional memory increase, directly raising tokens processed per second.
  • Long-context and beam-search workloads become practical on the same hardware because memory is no longer the dominant limit.
  • Sharing of cache blocks across requests reduces total memory footprint when prompts overlap (see the copy-on-write sketch after this list).
  • Memory usage becomes more predictable, simplifying capacity planning for production serving clusters.
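
On the sharing bullet above, a hedged sketch of reference-counted blocks with copy-on-write, our reconstruction of the sharing mechanism the paper describes rather than vLLM's code:

```python
# Reference-counted block sharing with copy-on-write (CoW).
# Sequences forked from a shared prompt reuse its blocks read-only; the
# first write to a still-shared block allocates a private copy instead.
class SharedBlockPool:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.refcount = [0] * num_blocks

    def alloc(self) -> int:
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def fork(self, parent_table: list[int]) -> list[int]:
        """New sequence shares the parent's prompt blocks without copying."""
        for block in parent_table:
            self.refcount[block] += 1
        return list(parent_table)

    def write(self, table: list[int], i: int) -> int:
        """Privatize block table[i] before mutating it (copy-on-write)."""
        if self.refcount[table[i]] > 1:   # still shared by another sequence
            self.refcount[table[i]] -= 1
            table[i] = self.alloc()       # a real system copies K/V contents here
        return table[i]
```

With n beams over a p-block prompt, the prompt is stored once rather than n times; only diverging tail blocks become private.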

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The paging abstraction could be reused for other dynamic tensor structures that grow during inference.
  • Hardware accelerators might add native support for paged attention to remove the remaining software mapping cost.
  • Because the code is open, other serving frameworks can adopt the same block layout without reimplementing the attention kernel.

Load-bearing premise

That translating the key-value cache into paged blocks adds negligible cost to attention arithmetic and produces identical model outputs on every workload.

What would settle it

Measure KV-cache memory utilization and exact token outputs on a benchmark with highly variable sequence lengths; if measured waste stays far from zero, or any output token differs from a non-paged baseline, the central claim does not hold.
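
One concrete shape for the output half of that test (a NumPy sketch of ours, not an artifact from the paper): scatter the cache into shuffled blocks, read it back through a block table, and require bitwise-identical attention output, which should hold whenever the reduction order is unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
d, seq, B = 64, 40, 16                  # head dim, cached tokens, block size
n_blocks = -(-seq // B)                 # ceil(seq / B)

q = rng.standard_normal(d).astype(np.float32)
K = rng.standard_normal((seq, d)).astype(np.float32)
V = rng.standard_normal((seq, d)).astype(np.float32)

def attend(K, V, q):
    s = K @ q / np.sqrt(d)
    w = np.exp(s - s.max()); w /= w.sum()
    return w @ V

# Scatter the cache into shuffled "physical" blocks plus a block table.
perm = rng.permutation(n_blocks)        # logical block -> physical block
Kp = np.zeros((n_blocks, B, d), np.float32); Vp = np.zeros_like(Kp)
for logical, physical in enumerate(perm):
    lo, hi = logical * B, min(logical * B + B, seq)
    Kp[physical, :hi - lo] = K[lo:hi]; Vp[physical, :hi - lo] = V[lo:hi]

# Gather back through the table; the result must match bit for bit.
Kg = np.concatenate([Kp[p] for p in perm])[:seq]
Vg = np.concatenate([Vp[p] for p in perm])[:seq]
assert np.array_equal(attend(K, V, q), attend(Kg, Vg, q))
```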

read the original abstract

High throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks dynamically. When managed inefficiently, this memory can be significantly wasted by fragmentation and redundant duplication, limiting the batch size. To address this problem, we propose PagedAttention, an attention algorithm inspired by the classical virtual memory and paging techniques in operating systems. On top of it, we build vLLM, an LLM serving system that achieves (1) near-zero waste in KV cache memory and (2) flexible sharing of KV cache within and across requests to further reduce memory usage. Our evaluations show that vLLM improves the throughput of popular LLMs by 2-4× with the same level of latency compared to the state-of-the-art systems, such as FasterTransformer and Orca. The improvement is more pronounced with longer sequences, larger models, and more complex decoding algorithms. vLLM's source code is publicly available at https://github.com/vllm-project/vllm

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes PagedAttention, an attention algorithm modeled on OS paging to manage the dynamic, per-request key-value cache during LLM inference. By organizing the KV cache into fixed-size blocks with a page table, vLLM achieves near-zero fragmentation and enables KV cache sharing within and across requests. Empirical results on standard models claim 2-4× higher throughput than FasterTransformer and Orca at equivalent latency, with larger gains for long sequences, bigger models, and complex decoding.

Significance. If the throughput claims hold under the reported conditions, the work is significant for production LLM serving: it directly attacks the memory-fragmentation bottleneck that limits batch size, potentially lowering inference cost and latency for long-context workloads. Public code release aids reproducibility and adoption.

major comments (1)
  1. [§5, Evaluation] The 2-4× throughput numbers are obtained by comparing against external systems; the manuscript does not isolate the incremental latency cost of PagedAttention’s page-table indirection and non-contiguous loads inside the fused attention kernels (e.g., via a same-batch-size contiguous-cache baseline). Without this measurement it remains possible that the reported gains are partly offset by reduced GPU utilization at the larger batch sizes enabled by reduced fragmentation.
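
For what it is worth, the core of the requested ablation can be prototyped outside the fused kernels; a hedged PyTorch sketch (ours, CPU-runnable; only the memory indirection differs between the two paths, and the tensor sizes are arbitrary):

```python
import time
import torch

B, H, seq, d, blk = 8, 32, 1024, 128, 16   # batch, heads, tokens, head dim, block size
n_blk = seq // blk
K_contig = torch.randn(B, H, seq, d)
q = torch.randn(B, H, 1, d)

# Same data laid out as shuffled blocks plus a block table.
table = torch.randperm(n_blk)                                # physical slot -> logical block
K_blocks = K_contig.view(B, H, n_blk, blk, d)[:, :, table]   # "paged" layout
inv = torch.argsort(table)                                   # logical block -> physical slot

def score_contig():
    return q @ K_contig.transpose(-1, -2)

def score_paged():
    Kg = K_blocks[:, :, inv].reshape(B, H, seq, d)           # gather through the table
    return q @ Kg.transpose(-1, -2)

for fn in (score_contig, score_paged):
    t0 = time.perf_counter()
    for _ in range(50):
        fn()
    print(fn.__name__, f"{(time.perf_counter() - t0) / 50 * 1e3:.2f} ms/iter")

assert torch.equal(score_contig(), score_paged())            # identical scores
```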
minor comments (2)
  1. [Abstract, §5] Benchmark configurations (sequence lengths, batch sizes, hardware, exact model variants) are summarized but not tabulated; adding a concise table would improve clarity.
  2. [§3] The description of block allocation and page-table lookup is clear at a high level but does not specify the exact data structures or cache-line effects inside the CUDA kernels; a short pseudocode listing would help readers replicate the implementation.
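
In the spirit of the pseudocode this comment requests, a single-head NumPy rendering of the lookup path (our reconstruction from the paper's prose, not the fused CUDA kernel, and silent on cache-line behavior):

```python
import numpy as np

def paged_attention(q, k_cache, v_cache, block_table, context_len, block_size):
    """One decode step of paged attention for a single head (illustrative).

    k_cache, v_cache: [num_physical_blocks, block_size, head_dim] pools.
    block_table: logical block index -> physical block id for this sequence.
    """
    # Page-table lookup per token position: logical -> (physical block, offset).
    phys = [block_table[p // block_size] for p in range(context_len)]
    offs = [p % block_size for p in range(context_len)]
    keys = np.stack([k_cache[b, o] for b, o in zip(phys, offs)])  # non-contiguous reads
    vals = np.stack([v_cache[b, o] for b, o in zip(phys, offs)])
    s = keys @ q / np.sqrt(q.shape[-1])
    w = np.exp(s - s.max()); w /= w.sum()
    return w @ vals
```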

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive summary and recommendation for minor revision. We address the single major comment below.

read point-by-point responses
  1. Referee: [§5, Evaluation] The 2-4× throughput numbers are obtained by comparing against external systems; the manuscript does not isolate the incremental latency cost of PagedAttention’s page-table indirection and non-contiguous loads inside the fused attention kernels (e.g., via a same-batch-size contiguous-cache baseline). Without this measurement it remains possible that the reported gains are partly offset by reduced GPU utilization at the larger batch sizes enabled by reduced fragmentation.

    Authors: We agree that an intra-system ablation isolating the page-table and non-contiguous access overhead would strengthen the evaluation. Our end-to-end results compare against external baselines because that is the relevant metric for practitioners; the reported throughput gains arise primarily from the larger batch sizes made possible by near-zero fragmentation. Nevertheless, the concern is valid: any kernel-level slowdown could partially offset those gains at scale. We will revise §5 to include a same-batch-size contiguous-cache baseline inside vLLM (by temporarily disabling the page table and forcing contiguous allocation) and report the resulting latency difference for the attention kernels. This addition will quantify the incremental cost directly. revision: yes

Circularity Check

0 steps flagged

No significant circularity in PagedAttention derivation or claims

full rationale

The paper proposes PagedAttention as an OS-paging-inspired algorithm for KV-cache management, implements it in vLLM, and supports its 2-4× throughput claims via direct empirical benchmarks against independent external systems (FasterTransformer, Orca). No mathematical derivations, fitted parameters presented as predictions, self-definitional equations, or load-bearing self-citations appear in the abstract or described chain. All performance results are externally falsifiable measurements rather than reductions to internal inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that fine-grained paging of KV cache memory is feasible on GPU hardware without correctness or performance penalties.

axioms (1)
  • domain assumption: GPU memory allocation can be performed at fine granularity with low overhead for attention operations.
    This enables the paging technique to avoid fragmentation while preserving model behavior.
invented entities (1)
  • PagedAttention · no independent evidence
    purpose: To manage dynamic KV cache memory using paging for efficient LLM serving
    New algorithm proposed to solve the fragmentation problem.

pith-pipeline@v0.9.0 · 5524 in / 1159 out tokens · 49404 ms · 2026-05-12T14:58:39.794128+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Forward citations

Cited by 33 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. NCCLZ: Compression-Enabled GPU Collectives with Decoupled Quantization and Entropy Coding

    cs.DC 2026-05 unverdicted novelty 7.0

    NCCLZ decouples quantization and entropy coding across NCCL stack layers to enable overlapped compression, delivering up to 9.65x speedup over plain NCCL on scientific and training workloads.

  2. The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures

    cs.DC 2026-05 unverdicted novelty 7.0

    Power capping is illusory in LLM decode as memory-bound operation leaves power headroom untouched on 700 W GPUs, while SM clock locking saves up to 32% energy and three DVFS classes appear across attention types.

  3. Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference

    cs.DC 2026-05 unverdicted novelty 7.0

    EEP makes wide expert-parallel MoE serving survive single-rank failures with an 11s recovery pause, 8s reintegration pause, and throughput restored to 95% of pre-fault level within 52s while staying within 4.4% of a f...

  4. Non-Monotonic Latency in Apple MPS Decoding: KV Cache Interactions and Execution Regimes

    cs.LG 2026-05 accept novelty 7.0

    Apple MPS decoding exhibits non-monotonic latency with spikes up to 21x due to KV cache interactions and execution regimes, unlike monotonic behavior on CPU and CUDA.

  5. DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards

    cs.LG 2026-05 unverdicted novelty 7.0

    DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.

  6. KL for a KL: On-Policy Distillation with Control Variate Baseline

    cs.LG 2026-05 unverdicted novelty 7.0

    vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensiv...

  7. CacheFlow: Efficient LLM Serving with 3D-Parallel KV Cache Restoration

    cs.DC 2026-04 unverdicted novelty 7.0

    CacheFlow cuts TTFT by 10-62% in batched LLM serving via 3D-parallel KV cache restoration and a two-pointer scheduler that overlaps recompute and I/O.

  8. PermaFrost-Attack: Stealth Pretraining Seeding (SPS) for planting Logic Landmines During LLM Training

    cs.LG 2026-04 unverdicted novelty 7.0

    Stealth Pretraining Seeding plants persistent unsafe behaviors in LLMs via diffuse poisoned web content that activates on precise triggers and evades standard evaluation.

  9. Neural Garbage Collection: Learning to Forget while Learning to Reason

    cs.LG 2026-04 conditional novelty 7.0

    Language models learn to evict KV cache entries end-to-end via reinforcement learning from outcome reward alone, achieving 2-3x cache compression while maintaining accuracy on Countdown, AMC, and AIME tasks.

  10. Sparse Prefix Caching for Hybrid and Recurrent LLM Serving

    cs.LG 2026-04 unverdicted novelty 7.0

    Sparse prefix caching via dynamic programming for optimal checkpoint placement under overlap distributions improves the Pareto frontier for recurrent and hybrid LLM serving on shared-prefix data.

  11. LoSA: Locality Aware Sparse Attention for Block-Wise Diffusion Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    LoSA caches prefix attention for stable tokens in block-wise DLMs and applies sparse attention only to active tokens, preserving near-dense accuracy while achieving 1.54x lower attention density and up to 4.14x speedup.

  12. Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning

    cs.LG 2026-05 unverdicted novelty 6.0

    METIS internalizes curriculum judgment in LLM reinforcement fine-tuning by predicting within-prompt reward variance via in-context learning and jointly optimizing with a self-judgment reward, yielding superior perform...

  13. Slipstream: Trajectory-Grounded Compaction Validation for Long-Horizon Agents

    cs.MA 2026-05 unverdicted novelty 6.0

    Slipstream uses asynchronous compaction with trajectory-grounded judge validation to improve long-horizon agent accuracy by up to 8.8 percentage points and reduce latency by up to 39.7%.

  14. Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling

    cs.CL 2026-04 unverdicted novelty 6.0

    HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.

  15. Open-TQ-Metal: Fused Compressed-Domain Attention for Long-Context LLM Inference on Apple Silicon

    cs.LG 2026-04 unverdicted novelty 6.0

    Fused compressed-domain int4 attention on Apple Silicon delivers 48x speedup and 3.2x KV cache compression for 128K-context 70B models while matching FP16 token predictions.

  16. Record-Remix-Replay: Hierarchical GPU Kernel Optimization using Evolutionary Search

    cs.DC 2026-04 unverdicted novelty 6.0

    R^3 optimizes full scientific applications on GPUs better than tuning kernel parameters or compiler flags alone while running nearly an order of magnitude faster than modern evolutionary search methods.

  17. Reduced-Mass Orbital AI Inference via Integrated Solar, Compute, and Radiator Panels

    cs.DC 2026-04 unverdicted novelty 6.0

    Integrated solar-compute-radiator panels enable orbital satellites to achieve over 100 kW of AI inference compute per metric ton launched, supporting thousands of simultaneous large language model sessions.

  18. MemFactory: Unified Inference & Training Framework for Agent Memory

    cs.CL 2026-03 unverdicted novelty 6.0

    MemFactory is a new unified modular framework for memory-augmented LLM agent inference and training that integrates GRPO and reports up to 14.8% relative gains on MemAgent evaluations.

  19. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection

    cs.CL 2023-10 unverdicted novelty 6.0

    Self-RAG trains LLMs to adaptively retrieve passages on demand and self-critique using reflection tokens, outperforming ChatGPT and retrieval-augmented Llama2 on QA, reasoning, and fact verification.

  20. Position: LLM Inference Should Be Evaluated as Energy-to-Token Production

    cs.CE 2026-05 unverdicted novelty 5.0

    LLM inference should be reframed and evaluated as energy-to-token production with a Token Production Function that accounts for power, cooling, and efficiency ceilings.

  21. An Executable Benchmarking Suite for Tool-Using Agents

    cs.SE 2026-05 unverdicted novelty 5.0

    The paper delivers a unified executable benchmarking suite for tool-using agents that enforces a shared evidence-admission contract across web, code, and micro-task environments.

  22. How Does Chunking Affect Retrieval-Augmented Code Completion? A Controlled Empirical Study

    cs.SE 2026-05 conditional novelty 5.0

    Function-based chunking underperforms other strategies in RAG code completion by 3.57-5.64 points, with context length as the dominant factor.

  23. VLMaxxing through FrameMogging Training-Free Anti-Recomputation for Video Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 5.0

    Training-free adaptive reuse of stable visual state in video VLMs reduces follow-up latency by 15-36x on Qwen2.5-VL while preserving correctness on VideoMME, with smaller first-query speedups via pruning.

  24. StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k

    cs.LG 2026-05 accept novelty 5.0

    Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.

  25. EdgeFM: Efficient Edge Inference for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 5.0

    EdgeFM is an agent-driven framework that strips non-essential features from VLMs and packages reusable optimized kernels, achieving up to 1.49x speedup over TensorRT-Edge-LLM on NVIDIA Orin while enabling first end-to...

  26. Measurement of Generative AI Workload Power Profiles for Whole-Facility Data Center Infrastructure Planning

    eess.SY 2026-04 unverdicted novelty 5.0

    High-resolution power profiles for AI workloads on H100 GPUs are measured and scaled to whole-facility energy demand using a bottom-up model, with the dataset made public.

  27. Silicon Showdown: Performance, Efficiency, and Ecosystem Barriers in Consumer-Grade LLM Inference

    cs.PF 2026-05 unverdicted novelty 4.0

    Nvidia achieves 1.6x throughput with NVFP4 but hits a VRAM wall for 70B+ models, while Apple UMA enables linear scaling to 80B at 4-bit with up to 23x better energy efficiency.

  28. EasyVideoR1: Easier RL for Video Understanding

    cs.CV 2026-04 unverdicted novelty 4.0

    EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.

  29. Hierarchical vs. Flat Iteration in Shared-Weight Transformers

    cs.CL 2026-04 unverdicted novelty 4.0

    Hierarchical two-speed shared-weight recurrence in Transformers shows a sharp performance gap compared to independent layer stacking in empirical language modeling tests.

  30. Secure On-Premise Deployment of Open-Weights Large Language Models in Radiology: An Isolation-First Architecture with Prospective Pilot Evaluation

    cs.CY 2026-03 conditional novelty 4.0

    An isolation-first on-premise architecture for open-weights LLMs in radiology achieved regulatory approval for processing PHI and showed good utility for text-anchored tasks in a one-week pilot with 22 users.

  31. Seed1.5-VL Technical Report

    cs.CV 2025-05 unverdicted novelty 4.0

    Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.

  32. Yi: Open Foundation Models by 01.AI

    cs.CL 2024-03 unverdicted novelty 4.0

    Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.

  33. SLM Finetuning for Natural Language to Domain Specific Code Generation in Production

    cs.LG 2026-04 unverdicted novelty 3.0

    Fine-tuned small language models outperform larger models in natural language to domain-specific code generation with improved performance, latency, and the ability to adapt to customer-specific scenarios without losi...

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · cited by 33 Pith papers · 12 internal anchors

  1. [1]

    Reza Yazdani Aminabadi, Samyam Rajbhandari, Minjia Zhang, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Jeff Rasley, Shaden Smith, Olatunji Ruwase, et al. 2022. DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale. arXiv preprint arXiv:2207.00032 (2022)

  2. [2]

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450 (2016)

  3. [3]

    Yoshua Bengio, Réjean Ducharme, and Pascal Vincent. 2000. A neural probabilistic language model. Advances in neural information processing systems 13 (2000)

  4. [4]

    Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Aurelie Neveol, Mariana Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Specia, Marco Turchi, Karin Verspoor, and Marcos Zampieri. 2016. Findings ...

  5. [5]

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901

  6. [6]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021)

  7. [7]

    Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174 (2016)

  8. [8]

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. https://lmsys.org/blog/2023-03-30-vicuna/

  9. [9]

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311 (2022)

  10. [10]

    Daniel Crankshaw, Gur-Eyal Sela, Xiangxi Mo, Corey Zumar, Ion Stoica, Joseph Gonzalez, and Alexey Tumanov. 2020. InferLine: latency-aware provisioning and scaling for prediction serving pipelines. In Proceedings of the 11th ACM Symposium on Cloud Computing. 477–491

  11. [11]

    Daniel Crankshaw, Xin Wang, Giulio Zhou, Michael J Franklin, Joseph E Gonzalez, and Ion Stoica. 2017. Clipper: A Low-Latency Online Prediction Serving System. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17). 613–627

  12. [12]

    Weihao Cui, Han Zhao, Quan Chen, Hao Wei, Zirui Li, Deze Zeng, Chao Li, and Minyi Guo. 2022. DVABatch: Diversity-aware Multi-Entry Multi-Exit Batching for Efficient Processing of DNN Services on GPUs. In 2022 USENIX Annual Technical Conference (USENIX ATC 22). 183–198

  13. [13–14]

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Advances in Neural Information Processing Systems 35 (2022), 16344–16359

  15. [15]

    Jiarui Fang, Yang Yu, Chengduo Zhao, and Jie Zhou. 2021. TurboTransformers: an efficient GPU serving system for transformer models. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 389–402

  16. [16]

    FastAPI. 2023. FastAPI. https://github.com/tiangolo/fastapi

  17. [17]

    Pin Gao, Lingfan Yu, Yongwei Wu, and Jinyang Li. 2018. Low latency rnn inference with cellular batching. In Proceedings of the Thirteenth EuroSys Conference. 1–15

  18. [18]

    Amir Gholami, Zhewei Yao, Sehoon Kim, Michael W Mahoney, and Kurt Keutzer. 2021. AI and memory wall. RiseLab Medium Post 1 (2021), 6

  19. [19]

    GitHub. 2022. https://github.com/features/copilot

  20. [20]

    Google. 2023. https://bard.google.com/

  21. [21]

    Arpan Gujarati, Reza Karimi, Safya Alzayat, Wei Hao, Antoine Kaufmann, Ymir Vigfusson, and Jonathan Mace. 2020. Serving DNNs like Clockwork: Performance Predictability from the Bottom Up. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). 443–462

  22. [22–23]

    Mingcong Han, Hanze Zhang, Rong Chen, and Haibo Chen. 2022. Microsecond-scale Preemption for Concurrent GPU-accelerated DNN Inferences. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 539–558

  24. [24]

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778

  25. [25]

    Chien-Chin Huang, Gu Jin, and Jinyang Li. 2020. SwapAdvisor: Pushing deep learning beyond the GPU memory limit via smart swapping. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems. 1341–1355

  26. [26]

    Paras Jain, Ajay Jain, Aniruddha Nrusimha, Amir Gholami, Pieter Abbeel, Joseph Gonzalez, Kurt Keutzer, and Ion Stoica. 2020. Checkmate: Breaking the memory wall with optimal tensor rematerialization. Proceedings of Machine Learning and Systems 2 (2020), 497–511

  27. [27]

    Tom Kilburn, David BG Edwards, Michael J Lanigan, and Frank H Sumner. 1962. One-level storage system. IRE Transactions on Electronic Computers 2 (1962), 223–235

  28. [28]

    Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691 (2021)

  29. [29]

    Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190 (2021)

  30. [30]

    Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E Gonzalez, et al. 2023. AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving. arXiv preprint arXiv:2302.11665 (2023)

  31. [31]

    Lingxiao Ma, Zhiqiang Xie, Zhi Yang, Jilong Xue, Youshan Miao, Wei Cui, Wenxiang Hu, Fan Yang, Lintao Zhang, and Lidong Zhou. 2020. Rammer: Enabling holistic deep learning compiler optimizations with rTasks. In Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation. 881–897

  32. [32]

    NVIDIA. [n. d.]. Triton Inference Server. https://developer.nvidia.com/nvidia-triton-inference-server

  33. [33]

    NVIDIA. 2023. FasterTransformer. https://github.com/NVIDIA/FasterTransformer

  34. [34]

    NVIDIA. 2023. NCCL: The NVIDIA Collective Communication Library. https://developer.nvidia.com/nccl

  35. [35–36]

    Christopher Olston, Noah Fiedel, Kiril Gorovoy, Jeremiah Harmsen, Li Lao, Fangwei Li, Vinu Rajashekhar, Sukriti Ramesh, and Jordan Soyke. 2017. TensorFlow-Serving: Flexible, High-Performance ML Serving. arXiv preprint arXiv:1712.06139 (2017)

  37. [37]

    OpenAI. 2020. https://openai.com/blog/openai-api

  38. [38]

    OpenAI. 2022. https://openai.com/blog/chatgpt

  39. [39]

    OpenAI. 2023. https://openai.com/blog/custom-instructions-for- chatgpt

  40. [40]

    OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL]

  41. [41]

    LMSYS ORG. 2023. Chatbot Arena Leaderboard Week 8: Introducing MT-Bench and Vicuna-33B. https://lmsys.org/blog/2023-06-22-leaderboard/

  42. [42]

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 (2019)

  43. [43]

    Shishir G Patil, Paras Jain, Prabal Dutta, Ion Stoica, and Joseph Gonzalez. 2022. POET: Training Neural Networks on Tiny Devices with Integrated Rematerialization and Paging. In International Conference on Machine Learning. PMLR, 17573–17583

  44. [44]

    Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Anselm Levskaya, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. 2022. Efficiently Scaling Transformer Inference. arXiv preprint arXiv:2211.05102 (2022)

  45. [45–46]

    Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. 2021. ZeRO-Offload: Democratizing Billion-Scale Model Training. In USENIX Annual Technical Conference. 551–564

  47. [47]

    Reuters. 2023. https://www.reuters.com/technology/tech-giants-ai-like-bing-bard-poses-billion-dollar-search-problem-2023-02-22/

  48. [48]

    Amazon Web Services. 2023. https://aws.amazon.com/bedrock/

  49. [49]

    Haichen Shen, Lequn Chen, Yuchen Jin, Liangyu Zhao, Bingyu Kong, Matthai Philipose, Arvind Krishnamurthy, and Ravi Sundaram. 2019. Nexus: A GPU cluster engine for accelerating DNN-based video analysis. In Proceedings of the 27th ACM Symposium on Operating Systems Principles. 322–337

  50. [50]

    Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E Gonzalez, et al. 2023. High-throughput Generative Inference of Large Language Models with a Single GPU. arXiv preprint arXiv:2303.06865 (2023)

  51. [51]

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053 (2019)

  52. [52]

    Benoit Steiner, Mostafa Elhoushi, Jacob Kahn, and James Hegarty. 2022. OLLA: Optimizing the Lifetime and Location of Arrays to Reduce the Memory Usage of Neural Networks. (2022). https://doi.org/10.48550/arXiv.2210.12924

  53. [53]

    Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. Advances in neural information processing systems 27 (2014)

  54. [54]

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An Instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca

  55. [55]

    ShareGPT Team. 2023. https://sharegpt.com/

  56. [56]

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)

  57. [57]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017)

  58. [58]

    Jing Wang, Youyou Lu, Qing Wang, Minhui Xie, Keji Huang, and Jiwu Shu. 2022. Pacman: An Efficient Compaction Approach for Log-Structured Key-Value Store on Persistent Memory. In 2022 USENIX Annual Technical Conference (USENIX ATC 22). 773–788

  59. [59]

    Linnan Wang, Jinmian Ye, Yiyang Zhao, Wei Wu, Ang Li, Shuaiwen Leon Song, Zenglin Xu, and Tim Kraska. 2018. SuperNeurons: Dynamic GPU memory management for training deep neural networks. In Proceedings of the 23rd ACM SIGPLAN symposium on principles and practice of parallel programming. 41–53

  60. [60–61]

    Xiaohui Wang, Ying Xiong, Yang Wei, Mingxuan Wang, and Lei Li. 2021. LightSeq: A High Performance Inference Library for Transformers. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Papers. 113–120

  62. [62]

    Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2022. Self-Instruct: Aligning Language Model with Self Generated Instructions. arXiv preprint arXiv:2212.10560 (2022)

  63. [63]

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations. 38–45

  64. [64]

    Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016)

  65. [65]

    Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. Orca: A Distributed Serving System for Transformer-Based Generative Models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 521–538

  66. [66]

    Hong Zhang, Yupeng Tang, Anurag Khandelwal, and Ion Stoica. 2023. SHEPHERD: Serving DNNs in the Wild. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). USENIX Association, Boston, MA, 787–808. https://www.usenix.org/conference/nsdi23/presentation/zhang-hong

  67. [67]

    Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068 (2022)

  68. [68]

    Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P Xing, et al. 2022. Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 559–578

  69. [69]

    Zhe Zhou, Xuechao Wei, Jiejing Zhang, and Guangyu Sun. 2022. PetS: A Unified Framework for Parameter-Efficient Transformers Serving. In 2022 USENIX Annual Technical Conference (USENIX ATC 22). 489–504