pith. machine review for the scientific record.

arxiv: 2308.16369 · v1 · submitted 2023-08-31 · 💻 cs.LG · cs.DC

Recognition: 2 Lean theorem links

SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 06:27 UTC · model grok-4.3

classification 💻 cs.LG · cs.DC
keywords LLM inference · chunked prefills · decode-maximal batching · pipeline bubbles · GPU utilization · throughput optimization · autoregressive generation

The pith

SARATHI splits each prefill into equal chunks and fills the rest of every batch with decode requests so the chunks saturate GPU compute while decodes piggyback at far lower cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM inference wastes GPU cycles because the prefill phase saturates compute while the decode phase generates one token per request and leaves most of the hardware idle. SARATHI breaks long prefills into fixed-size chunks and builds decode-maximal batches that contain one such chunk plus as many decode requests as possible. The chunk keeps the GPU busy, the decodes ride along at up to an order of magnitude lower cost than in a decode-only batch, and multiple batches can be formed from a single original prefill. The same uniform batch shape also shrinks the bubbles that appear under pipeline parallelism. The net result is measured speed-ups of up to 10x on decode throughput and up to 1.91x on end-to-end throughput for large models.
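As a concrete illustration, here is a minimal, framework-free sketch of that batching scheme. The slot-based framing and names such as `build_decode_maximal_batches` are editorial simplifications, not the paper's API.

```python
from dataclasses import dataclass

@dataclass
class PrefillChunk:
    request_id: int
    start: int  # absolute index of the chunk's first prompt token
    end: int    # one past the chunk's last prompt token

def build_decode_maximal_batches(request_id, prompt_len, chunk_size,
                                 decode_queue, batch_slots):
    """Split one prefill into equal-sized chunks, then pair each chunk with
    as many queued decode requests as the remaining slots allow."""
    batches = []
    for start in range(0, prompt_len, chunk_size):
        chunk = PrefillChunk(request_id, start,
                             min(start + chunk_size, prompt_len))
        # One slot carries the chunk; the rest are filled with decodes.
        decodes = [decode_queue.pop(0)
                   for _ in range(min(batch_slots - 1, len(decode_queue)))]
        batches.append((chunk, decodes))
    return batches

# A 4096-token prompt with 512-token chunks yields eight decode-maximal
# batches, each carrying up to seven piggybacking decode requests.
batches = build_decode_maximal_batches(0, 4096, 512, list(range(64)), 8)
assert len(batches) == 8
```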

Core claim

By dividing every prefill request into equal-sized chunks and constructing decode-maximal batches that pair one chunk with many decode requests, SARATHI keeps GPU utilization high throughout inference and removes most pipeline bubbles, delivering up to 10x higher decode throughput and 1.91x higher end-to-end throughput.

What carries the argument

Chunked-prefills combined with decode-maximal batching, in which each batch holds one prefill chunk plus the maximum number of decode requests that fit, allowing the chunk to saturate compute while decodes piggyback at low incremental cost.
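Why decodes are so cheap to piggyback follows from a standard roofline argument. The sketch below is editorial, with d = 5120 assumed as a LLaMA-13B-scale hidden size rather than a figure taken from the paper.

```python
# Arithmetic intensity of a d x d linear layer over T tokens in fp16:
# FLOPs = 2*T*d^2, weight traffic = 2*d^2 bytes, so intensity ~= T flop/byte.
def arithmetic_intensity(num_tokens, d=5120, bytes_per_param=2):
    flops = 2 * num_tokens * d * d
    weight_bytes = bytes_per_param * d * d
    return flops / weight_bytes

print(arithmetic_intensity(512))  # ~512 flop/byte: a prefill chunk is compute-bound
print(arithmetic_intensity(1))    # ~1 flop/byte: a lone decode token is memory-bound
```

Once the chunk has pulled the weights through compute, each extra decode token in the same batch reuses that weight traffic, which is the piggybacking effect the paper exploits.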

If this is right

  • Decode throughput rises by up to 10x for LLaMA-13B on A6000 GPUs.
  • End-to-end throughput improves by up to 1.33x for the same model and hardware.
  • For LLaMA-33B on A100 the method yields 1.25x end-to-end and 4.25x decode throughput gains.
  • When pipeline parallelism is applied to GPT-3, pipeline bubbles drop by 6.29x and overall throughput rises 1.91x.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The uniform batch sizes may make it easier to schedule inference across heterogeneous GPUs or clusters.
  • The technique could be combined with existing KV-cache compression or quantization methods without redesigning the scheduler.
  • Smaller chunk sizes trade a modest increase in prefill overhead for even denser packing of decode requests (see the back-of-envelope sketch after this list).
  • The same chunking idea might reduce idle time in other autoregressive workloads such as image or audio generation.
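To make the chunk-size bullet concrete, a back-of-envelope sketch. The assumptions are editorial: each chunk's attention re-reads the KV entries of all preceding chunks, and each chunk-carrying batch hosts a fixed number of decodes.

```python
import math

def chunk_tradeoff(prompt_len, chunk_size, decodes_per_batch):
    n = math.ceil(prompt_len / chunk_size)
    piggyback_capacity = n * decodes_per_batch
    # KV entries read across all chunks: chunk i sees i*chunk_size prior
    # context plus itself, so smaller chunks add repeated KV reads.
    kv_reads = sum(min((i + 1) * chunk_size, prompt_len) for i in range(n))
    return n, piggyback_capacity, kv_reads

for c in (4096, 2048, 1024, 512, 256):
    print(c, chunk_tradeoff(4096, c, 7))
# Halving the chunk size doubles decode coverage while KV re-reads grow
# more slowly at first, approaching a doubling only for very small chunks.
```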

Load-bearing premise

That a prefill chunk can be batched together with decode requests without changing the numerical results or correctness of the autoregressive generation.

What would settle it

Running the same prompts with and without SARATHI batching and observing any divergence in the generated token sequences or final perplexity.
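A minimal sketch of that settling experiment, assuming greedy decoding and deterministic kernels so the two schedules should be token-identical. `generate_baseline` and `generate_chunked` are hypothetical stand-ins for the two schedulers.

```python
def schedules_agree(prompts, generate_baseline, generate_chunked,
                    max_new_tokens=128):
    """Return the first prompt on which the two schedules diverge, or None.

    Both callables map (prompt, max_new_tokens) -> list of token ids and
    are assumed to run with greedy decoding for determinism.
    """
    for prompt in prompts:
        base = generate_baseline(prompt, max_new_tokens)
        mixed = generate_chunked(prompt, max_new_tokens)
        if base != mixed:
            return prompt, base, mixed
    return None
```

With sampling enabled or nondeterministic kernels, exact token equality is too strict; comparing per-token logits or final perplexity within a numerical tolerance is the fallback the phrasing above allows for.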

Original abstract

Large Language Model (LLM) inference consists of two distinct phases - prefill phase which processes the input prompt and decode phase which generates output tokens autoregressively. While the prefill phase effectively saturates GPU compute at small batch sizes, the decode phase results in low compute utilization as it generates one token at a time per request. The varying prefill and decode times also lead to imbalance across micro-batches when using pipeline parallelism, resulting in further inefficiency due to bubbles. We present SARATHI to address these challenges. SARATHI employs chunked-prefills, which splits a prefill request into equal sized chunks, and decode-maximal batching, which constructs a batch using a single prefill chunk and populates the remaining slots with decodes. During inference, the prefill chunk saturates GPU compute, while the decode requests 'piggyback' and cost up to an order of magnitude less compared to a decode-only batch. Chunked-prefills allows constructing multiple decode-maximal batches from a single prefill request, maximizing coverage of decodes that can piggyback. Furthermore, the uniform compute design of these batches ameliorates the imbalance between micro-batches, significantly reducing pipeline bubbles. Our techniques yield significant improvements in inference performance across models and hardware. For the LLaMA-13B model on A6000 GPU, SARATHI improves decode throughput by up to 10x, and accelerates end-to-end throughput by up to 1.33x. For LLaMa-33B on A100 GPU, we achieve 1.25x higher end-to-end-throughput and up to 4.25x higher decode throughput. When used with pipeline parallelism on GPT-3, SARATHI reduces bubbles by 6.29x, resulting in an end-to-end throughput improvement of 1.91x.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents SARATHI, a technique for LLM inference that splits prefill requests into equal-sized chunks and constructs decode-maximal batches consisting of one prefill chunk plus as many decode requests as possible. This allows decode requests to piggyback on the high-utilization prefill chunk, improving decode throughput while also reducing pipeline bubbles under pipeline parallelism. Concrete claims include up to 10x decode throughput and 1.33x end-to-end throughput for LLaMA-13B on A6000, 1.25x end-to-end and 4.25x decode for LLaMA-33B on A100, and 6.29x bubble reduction yielding 1.91x end-to-end for GPT-3 under pipeline parallelism.

Significance. If the mixed-batch correctness invariants hold, the approach offers a practical way to raise GPU utilization during the memory-bound decode phase and to balance micro-batch compute in pipeline-parallel serving without changing model architecture or requiring extra memory, which would be a useful systems contribution for production LLM inference.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (decode-maximal batching): the claim that a single forward pass containing one prefill chunk plus multiple decode requests produces exactly the same autoregressive token sequences as running the full prefill first requires that (a) KV-cache writes occur only for the prefill chunk's tokens, (b) rotary position embeddings and attention masks are computed correctly for heterogeneous sequence lengths, and (c) no cross-sequence attention occurs. No equations, pseudocode, or mask-construction details are supplied, leaving the central performance claims unverifiable.
  2. [§4] §4 (experimental evaluation): reported speedups (10x decode, 1.33–1.91x end-to-end) are presented without workload descriptions (prompt/output length distributions), number of runs, error bars, or ablations that isolate the contribution of chunk size versus batch composition; this makes it impossible to assess whether the gains are robust or sensitive to the unstated assumptions about mixed-batch correctness.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'up to' is used for all quantitative claims without stating the corresponding input conditions (e.g., batch size, sequence length) that achieve the maximum.
  2. [Throughout] Throughout: notation for batch composition (e.g., how many decode slots are filled per prefill chunk) is introduced informally and would benefit from a compact table or diagram.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback. We address each major comment below with clarifications and commit to revisions that strengthen the manuscript's verifiability without altering its core technical claims.

Point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (decode-maximal batching): the claim that a single forward pass containing one prefill chunk plus multiple decode requests produces exactly the same autoregressive token sequences as running the full prefill first requires that (a) KV-cache writes occur only for the prefill chunk's tokens, (b) rotary position embeddings and attention masks are computed correctly for heterogeneous sequence lengths, and (c) no cross-sequence attention occurs. No equations, pseudocode, or mask-construction details are supplied, leaving the central performance claims unverifiable.

    Authors: We agree that the absence of explicit implementation details leaves the mixed-batch correctness claims difficult to verify from the current text. The approach relies on per-sequence attention masks that prevent cross-sequence attention, KV-cache updates restricted exclusively to tokens in the current prefill chunk, and rotary embeddings applied using each token's position within its own sequence. In the revised manuscript we will add a dedicated subsection with pseudocode for batch construction, mask generation, and KV-cache handling, plus a short proof sketch showing that the resulting token sequences are identical to those from a standard prefill-then-decode schedule (an editorial sketch of such a mask construction follows these responses). This addition directly addresses the verifiability concern. revision: yes

  2. Referee: [§4] §4 (experimental evaluation): reported speedups (10x decode, 1.33–1.91x end-to-end) are presented without workload descriptions (prompt/output length distributions), number of runs, error bars, or ablations that isolate the contribution of chunk size versus batch composition; this makes it impossible to assess whether the gains are robust or sensitive to the unstated assumptions about mixed-batch correctness.

    Authors: The referee correctly notes that the experimental section lacks sufficient methodological detail. We will revise §4 to include: (i) explicit descriptions of the prompt and output length distributions used in each workload, (ii) the number of runs and any statistical measures (including error bars), and (iii) additional ablations that separately quantify the contributions of chunk size and decode-maximal batch composition. These changes will allow readers to evaluate robustness and sensitivity to the mixed-batch assumptions. revision: yes
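To make the mask construction promised in response 1 concrete, here is an editorial sketch, not the authors' code, of per-sequence masks and position ids for one mixed batch holding a prefill chunk plus several decodes.

```python
import torch

def mixed_batch_mask_and_positions(chunk_start, chunk_len, decode_ctx_lens):
    """Boolean attention mask (True = may attend) and position ids for a
    batch of one prefill chunk plus len(decode_ctx_lens) decode requests.

    Keys are laid out as [the chunk's sequence context incl. the chunk]
    followed by each decode's full context incl. its new token.
    """
    q_len = chunk_len + len(decode_ctx_lens)
    kv_lens = [chunk_start + chunk_len] + [c + 1 for c in decode_ctx_lens]
    mask = torch.zeros(q_len, sum(kv_lens), dtype=torch.bool)

    # Chunk queries: causal within their own sequence, nothing else.
    for i in range(chunk_len):
        mask[i, : chunk_start + i + 1] = True

    # Decode queries: each attends only to its own context plus itself.
    offset = kv_lens[0]
    for j, ctx in enumerate(decode_ctx_lens):
        mask[chunk_len + j, offset : offset + ctx + 1] = True
        offset += ctx + 1

    # Positions follow each token's index within its OWN sequence, so
    # rotary embeddings are unaffected by the batch layout.
    positions = torch.tensor(
        [chunk_start + i for i in range(chunk_len)] + list(decode_ctx_lens))
    return mask, positions
```

And for response 2, a sketch of the reporting discipline the revision promises: repeated runs with mean and spread rather than a single "up to" figure. `run_workload` is a hypothetical callable returning the number of tokens generated.

```python
import statistics
import time

def measure_throughput(run_workload, runs=5):
    """Tokens/second over several runs, reported as (mean, stdev)."""
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        tokens = run_workload()
        samples.append(tokens / (time.perf_counter() - t0))
    return statistics.mean(samples), statistics.stdev(samples)
```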

Circularity Check

0 steps flagged

No circularity; performance claims are direct empirical measurements

Full rationale

The paper introduces SARATHI via chunked-prefills and decode-maximal batching, reporting speedups (10x decode throughput, 1.33x end-to-end) as outcomes of direct GPU experiments on LLaMA-13B and GPT-3. No equations, fitted parameters, self-citations, or derivations appear in the provided text; the central claims rest on measured throughput and bubble reduction rather than on quantities the paper itself constructs. The evidence is grounded in hardware-specific timing results rather than external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; all claims rest on unstated implementation assumptions about batch construction and GPU scheduling.

pith-pipeline@v0.9.0 · 5673 in / 1038 out tokens · 34320 ms · 2026-05-16T06:27:19.762524+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith.Cost.JcostCore Jcost_nonneg · tag: unclear

    Linked paper passage:

    SARATHI employs chunked-prefills, which splits a prefill request into equal sized chunks, and decode-maximal batching, which constructs a batch using a single prefill chunk and populates the remaining slots with decodes.

  • IndisputableMonolith.Foundation.DimensionForcing eight_tick_forces_D3 · tag: unclear

    Linked paper passage:

    Chunked-prefills allows constructing multiple decode-maximal batches from a single prefill request, maximizing coverage of decodes that can piggyback.

What do these tags mean?

  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures

    cs.DC 2026-05 unverdicted novelty 7.0

    Power capping is illusory in LLM decode as memory-bound operation leaves power headroom untouched on 700 W GPUs, while SM clock locking saves up to 32% energy and three DVFS classes appear across attention types.

  2. Taming Request Imbalance: SLO-Aware Scheduling for Disaggregated LLM Inference

    cs.DC 2026-05 unverdicted novelty 7.0

    Kairos improves SLO attainment and throughput in LLM serving by adapting to request length imbalance with priority scheduling and adaptive batching.

  3. GhostServe: A Lightweight Checkpointing System in the Shadow for Fault-Tolerant LLM Serving

    cs.DC 2026-03 unverdicted novelty 7.0

    GhostServe applies erasure coding to KV cache in host memory for fast recovery from failures in LLM serving, cutting checkpointing latency up to 2.7x and recovery latency 2.1x versus prior methods.

  4. Training-Inference Consistent Segmented Execution for Long-Context LLMs

    cs.CL 2026-05 conditional novelty 6.0

    A training-inference consistent segmented execution framework for long-context LLMs matches full-context performance with substantially lower peak memory at very long lengths.

  5. SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference

    cs.DC 2026-05 conditional novelty 6.0

    SPECTRE delivers up to 2.28x speedup on large-model LLM inference by turning idle tail-model services into remote speculative drafters using hybrid parallel decoding and priority scheduling.

  6. SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference

    cs.DC 2026-05 unverdicted novelty 6.0

    SPECTRE achieves up to 2.28x speedup for large-model LLM serving by running speculative draft generation and target verification in parallel using idle tail-model services.

  7. ZeRO-Prefill: Zero Redundancy Overheads in MoE Prefill Serving

    cs.LG 2026-05 unverdicted novelty 6.0

    ZeRO-Prefill achieves 1.35-1.59x higher throughput for MoE prefill serving by replacing per-layer activation AllToAll with overlapped asynchronous weight AllGather and prefix-aware routing.

  8. Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving

    cs.LG 2026-04 unverdicted novelty 6.0

    SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.

  9. AMMA: A Multi-Chiplet Memory-Centric Architecture for Low-Latency 1M Context Attention Serving

    cs.AR 2026-04 unverdicted novelty 6.0

    AMMA is a memory-centric multi-chiplet architecture using HBM-PNM cubes, custom logic dies, hybrid parallelism, and reordered collectives that delivers 15.5X lower attention latency and 6.9X lower energy than NVIDIA H...

  10. Salca: A Sparsity-Aware Hardware Accelerator for Efficient Long-Context Attention Decoding

    cs.AR 2026-04 unverdicted novelty 6.0

    Salca is a new ASIC accelerator that achieves 3.82× speedup and 74.19× energy efficiency over A100 for long-context attention via dual-compression dynamic sparse attention and pipelined hardware.

  11. Efficient Mixture-of-Experts LLM Inference with Apple Silicon NPUs

    cs.LG 2026-04 unverdicted novelty 6.0

    NPUMoE accelerates MoE LLM inference on Apple Silicon NPUs via offline-calibrated static expert tiers, grouped execution, and load-aware graph residency, delivering 1.32x-5.55x lower latency and 1.81x-7.37x better ene...

  12. MARS: Efficient, Adaptive Co-Scheduling for Heterogeneous Agentic Systems

    cs.OS 2026-04 conditional novelty 6.0

    MARS coordinates heterogeneous GPU-CPU resources for agentic LLM workloads via decoupled admission control and agent-centric KV cache management, delivering up to 5.94x lower latency and 1.87x faster task completion.

  13. Flow-Controlled Scheduling for LLM Inference with Provable Stability Guarantees

    cs.LG 2026-04 unverdicted novelty 6.0

    A flow-control framework for LLM inference derives necessary and sufficient stability conditions and experimentally improves throughput, latency, and KV cache stability over common baselines.

  14. Valve: Production Online-Offline Inference Colocation with Jointly-Bounded Preemption Latency and Rate

    cs.OS 2026-04 unverdicted novelty 6.0

    Valve jointly bounds preemption latency and rate for online-offline LLM colocation on GPUs, delivering 34.6% higher cluster utilization and a 2,170-GPU saving in a production deployment of 8,054 GPUs with under 5% TTF...

  15. HybridFlow: A Flexible and Efficient RLHF Framework

    cs.LG 2024-09 unverdicted novelty 6.0

    HybridFlow combines single- and multi-controller paradigms with a 3D-HybridEngine to deliver 1.53x to 20.57x higher throughput for various RLHF algorithms compared to prior systems.

  16. PipeMax: Enhancing Offline LLM Inference on Commodity GPU Servers

    cs.DC 2026-05 unverdicted novelty 5.0

    PipeMax integrates pipeline parallelism with offloading to achieve up to 2.51x higher throughput than vLLM for offline LLM inference on commodity 8-GPU servers.

  17. EdgeFM: Efficient Edge Inference for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 5.0

    EdgeFM is an agent-driven framework that strips non-essential features from VLMs and packages reusable optimized kernels, achieving up to 1.49x speedup over TensorRT-Edge-LLM on NVIDIA Orin while enabling first end-to...

  18. A Survey on Efficient Inference for Large Language Models

    cs.CL 2024-04 accept novelty 3.0

    The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.

  19. RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference

    cs.LG 2025-05

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · cited by 18 Pith papers · 4 internal anchors
