pith. machine review for the scientific record.

arxiv: 2308.16369 · v1 · submitted 2023-08-31 · 💻 cs.LG · cs.DC

Recognition: 2 Lean theorem links

SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 06:27 UTC · model grok-4.3

classification 💻 cs.LG · cs.DC
keywords LLM inference · chunked prefills · decode-maximal batching · pipeline bubbles · GPU utilization · throughput optimization · autoregressive generation

The pith

SARATHI splits each prefill into equal chunks and fills the rest of every batch with decode requests so the chunks saturate GPU compute while decodes piggyback at far lower cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM inference wastes GPU cycles because the prefill phase saturates compute while the decode phase generates one token per request and leaves most of the hardware idle. SARATHI breaks long prefills into fixed-size chunks and builds decode-maximal batches that contain one such chunk plus as many decode requests as possible. The chunk keeps the GPU busy, the decodes ride along at up to an order of magnitude lower cost than in a decode-only batch, and multiple batches can be formed from a single original prefill. The same uniform batch shape also shrinks the bubbles that appear under pipeline parallelism. The net result is measured speed-ups of up to 10x on decode throughput and up to 1.91x on end-to-end throughput for large models.
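As a concrete illustration, here is a minimal, framework-free sketch of that batching scheme. The slot-based framing and names such as `build_decode_maximal_batches` are editorial simplifications, not the paper's API.

```python
from dataclasses import dataclass

@dataclass
class PrefillChunk:
    request_id: int
    start: int  # absolute index of the chunk's first prompt token
    end: int    # one past the chunk's last prompt token

def build_decode_maximal_batches(request_id, prompt_len, chunk_size,
                                 decode_queue, batch_slots):
    """Split one prefill into equal-sized chunks, then pair each chunk with
    as many queued decode requests as the remaining slots allow."""
    batches = []
    for start in range(0, prompt_len, chunk_size):
        chunk = PrefillChunk(request_id, start,
                             min(start + chunk_size, prompt_len))
        # One slot carries the chunk; the rest are filled with decodes.
        decodes = [decode_queue.pop(0)
                   for _ in range(min(batch_slots - 1, len(decode_queue)))]
        batches.append((chunk, decodes))
    return batches

# A 4096-token prompt with 512-token chunks yields eight decode-maximal
# batches, each carrying up to seven piggybacking decode requests.
batches = build_decode_maximal_batches(0, 4096, 512, list(range(64)), 8)
assert len(batches) == 8
```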

Core claim

By dividing every prefill request into equal-sized chunks and constructing decode-maximal batches that pair one chunk with many decode requests, SARATHI keeps GPU utilization high throughout inference and removes most pipeline bubbles, delivering up to 10x higher decode throughput and 1.91x higher end-to-end throughput.

What carries the argument

Chunked-prefills combined with decode-maximal batching, in which each batch holds one prefill chunk plus the maximum number of decode requests that fit, allowing the chunk to saturate compute while decodes piggyback at low incremental cost.
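Why decodes are so cheap to piggyback follows from a standard roofline argument. The sketch below is editorial, with d = 5120 assumed as a LLaMA-13B-scale hidden size rather than a figure taken from the paper.

```python
# Arithmetic intensity of a d x d linear layer over T tokens in fp16:
# FLOPs = 2*T*d^2, weight traffic = 2*d^2 bytes, so intensity ~= T flop/byte.
def arithmetic_intensity(num_tokens, d=5120, bytes_per_param=2):
    flops = 2 * num_tokens * d * d
    weight_bytes = bytes_per_param * d * d
    return flops / weight_bytes

print(arithmetic_intensity(512))  # ~512 flop/byte: a prefill chunk is compute-bound
print(arithmetic_intensity(1))    # ~1 flop/byte: a lone decode token is memory-bound
```

Once the chunk has pulled the weights through compute, each extra decode token in the same batch reuses that weight traffic, which is the piggybacking effect the paper exploits.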

If this is right

  • Decode throughput rises by up to 10x for LLaMA-13B on A6000 GPUs.
  • End-to-end throughput improves by up to 1.33x for the same model and hardware.
  • For LLaMA-33B on A100 the method yields 1.25x end-to-end and 4.25x decode throughput gains.
  • When pipeline parallelism is applied to GPT-3, pipeline bubbles drop by 6.29x and overall throughput rises 1.91x.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The uniform batch sizes may make it easier to schedule inference across heterogeneous GPUs or clusters.
  • The technique could be combined with existing KV-cache compression or quantization methods without redesigning the scheduler.
  • Smaller chunk sizes trade a modest increase in prefill overhead for even denser packing of decode requests (see the back-of-envelope sketch after this list).
  • The same chunking idea might reduce idle time in other autoregressive workloads such as image or audio generation.
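To make the chunk-size bullet concrete, a back-of-envelope sketch. The assumptions are editorial: each chunk's attention re-reads the KV entries of all preceding chunks, and each chunk-carrying batch hosts a fixed number of decodes.

```python
import math

def chunk_tradeoff(prompt_len, chunk_size, decodes_per_batch):
    n = math.ceil(prompt_len / chunk_size)
    piggyback_capacity = n * decodes_per_batch
    # KV entries read across all chunks: chunk i sees i*chunk_size prior
    # context plus itself, so smaller chunks add repeated KV reads.
    kv_reads = sum(min((i + 1) * chunk_size, prompt_len) for i in range(n))
    return n, piggyback_capacity, kv_reads

for c in (4096, 2048, 1024, 512, 256):
    print(c, chunk_tradeoff(4096, c, 7))
# Halving the chunk size doubles decode coverage while KV re-reads grow
# more slowly at first, approaching a doubling only for very small chunks.
```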

Load-bearing premise

That a prefill chunk can be batched together with decode requests without changing the numerical results or correctness of the autoregressive generation.

What would settle it

Running the same prompts with and without SARATHI batching and observing any divergence in the generated token sequences or final perplexity.
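A minimal sketch of that settling experiment, assuming greedy decoding and deterministic kernels so the two schedules should be token-identical. `generate_baseline` and `generate_chunked` are hypothetical stand-ins for the two schedulers.

```python
def schedules_agree(prompts, generate_baseline, generate_chunked,
                    max_new_tokens=128):
    """Return the first prompt on which the two schedules diverge, or None.

    Both callables map (prompt, max_new_tokens) -> list of token ids and
    are assumed to run with greedy decoding for determinism.
    """
    for prompt in prompts:
        base = generate_baseline(prompt, max_new_tokens)
        mixed = generate_chunked(prompt, max_new_tokens)
        if base != mixed:
            return prompt, base, mixed
    return None
```

With sampling enabled or nondeterministic kernels, exact token equality is too strict; comparing per-token logits or final perplexity within a numerical tolerance is the fallback the phrasing above allows for.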

Original abstract

Large Language Model (LLM) inference consists of two distinct phases - prefill phase which processes the input prompt and decode phase which generates output tokens autoregressively. While the prefill phase effectively saturates GPU compute at small batch sizes, the decode phase results in low compute utilization as it generates one token at a time per request. The varying prefill and decode times also lead to imbalance across micro-batches when using pipeline parallelism, resulting in further inefficiency due to bubbles. We present SARATHI to address these challenges. SARATHI employs chunked-prefills, which splits a prefill request into equal sized chunks, and decode-maximal batching, which constructs a batch using a single prefill chunk and populates the remaining slots with decodes. During inference, the prefill chunk saturates GPU compute, while the decode requests 'piggyback' and cost up to an order of magnitude less compared to a decode-only batch. Chunked-prefills allows constructing multiple decode-maximal batches from a single prefill request, maximizing coverage of decodes that can piggyback. Furthermore, the uniform compute design of these batches ameliorates the imbalance between micro-batches, significantly reducing pipeline bubbles. Our techniques yield significant improvements in inference performance across models and hardware. For the LLaMA-13B model on A6000 GPU, SARATHI improves decode throughput by up to 10x, and accelerates end-to-end throughput by up to 1.33x. For LLaMa-33B on A100 GPU, we achieve 1.25x higher end-to-end-throughput and up to 4.25x higher decode throughput. When used with pipeline parallelism on GPT-3, SARATHI reduces bubbles by 6.29x, resulting in an end-to-end throughput improvement of 1.91x.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents SARATHI, a technique for LLM inference that splits prefill requests into equal-sized chunks and constructs decode-maximal batches consisting of one prefill chunk plus as many decode requests as possible. This allows decode requests to piggyback on the high-utilization prefill chunk, improving decode throughput while also reducing pipeline bubbles under pipeline parallelism. Concrete claims include up to 10x decode throughput and 1.33x end-to-end throughput for LLaMA-13B on A6000, 1.25x end-to-end and 4.25x decode for LLaMA-33B on A100, and 6.29x bubble reduction yielding 1.91x end-to-end for GPT-3 under pipeline parallelism.

Significance. If the mixed-batch correctness invariants hold, the approach offers a practical way to raise GPU utilization during the memory-bound decode phase and to balance micro-batch compute in pipeline-parallel serving without changing model architecture or requiring extra memory, which would be a useful systems contribution for production LLM inference.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (decode-maximal batching): the claim that a single forward pass containing one prefill chunk plus multiple decode requests produces exactly the same autoregressive token sequences as running the full prefill first requires that (a) KV-cache writes occur only for the prefill chunk's tokens, (b) rotary position embeddings and attention masks are computed correctly for heterogeneous sequence lengths, and (c) no cross-sequence attention occurs. No equations, pseudocode, or mask-construction details are supplied, leaving the central performance claims unverifiable.
  2. [§4] §4 (experimental evaluation): reported speedups (10x decode, 1.33–1.91x end-to-end) are presented without workload descriptions (prompt/output length distributions), number of runs, error bars, or ablations that isolate the contribution of chunk size versus batch composition; this makes it impossible to assess whether the gains are robust or sensitive to the unstated assumptions about mixed-batch correctness.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'up to' is used for all quantitative claims without stating the corresponding input conditions (e.g., batch size, sequence length) that achieve the maximum.
  2. [Throughout] Throughout: notation for batch composition (e.g., how many decode slots are filled per prefill chunk) is introduced informally and would benefit from a compact table or diagram.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback. We address each major comment below with clarifications and commit to revisions that strengthen the manuscript's verifiability without altering its core technical claims.

Point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (decode-maximal batching): the claim that a single forward pass containing one prefill chunk plus multiple decode requests produces exactly the same autoregressive token sequences as running the full prefill first requires that (a) KV-cache writes occur only for the prefill chunk's tokens, (b) rotary position embeddings and attention masks are computed correctly for heterogeneous sequence lengths, and (c) no cross-sequence attention occurs. No equations, pseudocode, or mask-construction details are supplied, leaving the central performance claims unverifiable.

    Authors: We agree that the absence of explicit implementation details leaves the mixed-batch correctness claims difficult to verify from the current text. The approach relies on per-sequence attention masks that prevent cross-sequence attention, KV-cache updates restricted exclusively to tokens in the current prefill chunk, and rotary embeddings applied using each token's position within its own sequence. In the revised manuscript we will add a dedicated subsection with pseudocode for batch construction, mask generation, and KV-cache handling, plus a short proof sketch showing that the resulting token sequences are identical to those from a standard prefill-then-decode schedule (an editorial sketch of such a mask construction follows these responses). This addition directly addresses the verifiability concern. revision: yes

  2. Referee: [§4] §4 (experimental evaluation): reported speedups (10x decode, 1.33–1.91x end-to-end) are presented without workload descriptions (prompt/output length distributions), number of runs, error bars, or ablations that isolate the contribution of chunk size versus batch composition; this makes it impossible to assess whether the gains are robust or sensitive to the unstated assumptions about mixed-batch correctness.

    Authors: The referee correctly notes that the experimental section lacks sufficient methodological detail. We will revise §4 to include: (i) explicit descriptions of the prompt and output length distributions used in each workload, (ii) the number of runs and any statistical measures (including error bars), and (iii) additional ablations that separately quantify the contributions of chunk size and decode-maximal batch composition. These changes will allow readers to evaluate robustness and sensitivity to the mixed-batch assumptions. revision: yes
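To make the mask construction promised in response 1 concrete, here is an editorial sketch, not the authors' code, of per-sequence masks and position ids for one mixed batch holding a prefill chunk plus several decodes.

```python
import torch

def mixed_batch_mask_and_positions(chunk_start, chunk_len, decode_ctx_lens):
    """Boolean attention mask (True = may attend) and position ids for a
    batch of one prefill chunk plus len(decode_ctx_lens) decode requests.

    Keys are laid out as [the chunk's sequence context incl. the chunk]
    followed by each decode's full context incl. its new token.
    """
    q_len = chunk_len + len(decode_ctx_lens)
    kv_lens = [chunk_start + chunk_len] + [c + 1 for c in decode_ctx_lens]
    mask = torch.zeros(q_len, sum(kv_lens), dtype=torch.bool)

    # Chunk queries: causal within their own sequence, nothing else.
    for i in range(chunk_len):
        mask[i, : chunk_start + i + 1] = True

    # Decode queries: each attends only to its own context plus itself.
    offset = kv_lens[0]
    for j, ctx in enumerate(decode_ctx_lens):
        mask[chunk_len + j, offset : offset + ctx + 1] = True
        offset += ctx + 1

    # Positions follow each token's index within its OWN sequence, so
    # rotary embeddings are unaffected by the batch layout.
    positions = torch.tensor(
        [chunk_start + i for i in range(chunk_len)] + list(decode_ctx_lens))
    return mask, positions
```

And for response 2, a sketch of the reporting discipline the revision promises: repeated runs with mean and spread rather than a single "up to" figure. `run_workload` is a hypothetical callable returning the number of tokens generated.

```python
import statistics
import time

def measure_throughput(run_workload, runs=5):
    """Tokens/second over several runs, reported as (mean, stdev)."""
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        tokens = run_workload()
        samples.append(tokens / (time.perf_counter() - t0))
    return statistics.mean(samples), statistics.stdev(samples)
```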

Circularity Check

0 steps flagged

No circularity; performance claims are direct empirical measurements

Full rationale

The paper introduces SARATHI via chunked-prefills and decode-maximal batching, reporting speedups (10x decode throughput, 1.33x end-to-end) as outcomes of direct GPU experiments on LLaMA-13B and GPT-3. No equations, fitted parameters, self-citations, or derivations appear in the provided text; the central claims rest on measured throughput and bubble reduction rather than on quantities the paper itself constructs. The evidence is grounded in hardware-specific timing results rather than external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; all claims rest on unstated implementation assumptions about batch construction and GPU scheduling.

pith-pipeline@v0.9.0 · 5673 in / 1038 out tokens · 34320 ms · 2026-05-16T06:27:19.762524+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith.Cost.JcostCore Jcost_nonneg · tag: unclear

    Linked paper passage:

    SARATHI employs chunked-prefills, which splits a prefill request into equal sized chunks, and decode-maximal batching, which constructs a batch using a single prefill chunk and populates the remaining slots with decodes.

  • IndisputableMonolith.Foundation.DimensionForcing eight_tick_forces_D3 · tag: unclear

    Linked paper passage:

    Chunked-prefills allows constructing multiple decode-maximal batches from a single prefill request, maximizing coverage of decodes that can piggyback.

What do these tags mean?

  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures

    cs.DC 2026-05 unverdicted novelty 7.0

    Power capping is illusory in LLM decode as memory-bound operation leaves power headroom untouched on 700 W GPUs, while SM clock locking saves up to 32% energy and three DVFS classes appear across attention types.

  2. Taming Request Imbalance: SLO-Aware Scheduling for Disaggregated LLM Inference

    cs.DC 2026-05 unverdicted novelty 7.0

    Kairos improves SLO attainment and throughput in LLM serving by adapting to request length imbalance with priority scheduling and adaptive batching.

  3. GhostServe: A Lightweight Checkpointing System in the Shadow for Fault-Tolerant LLM Serving

    cs.DC 2026-03 unverdicted novelty 7.0

    GhostServe applies erasure coding to KV cache in host memory for fast recovery from failures in LLM serving, cutting checkpointing latency up to 2.7x and recovery latency 2.1x versus prior methods.

  4. Training-Inference Consistent Segmented Execution for Long-Context LLMs

    cs.CL 2026-05 conditional novelty 6.0

    A training-inference consistent segmented execution framework for long-context LLMs matches full-context performance with substantially lower peak memory at very long lengths.

  5. SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference

    cs.DC 2026-05 conditional novelty 6.0

    SPECTRE delivers up to 2.28x speedup on large-model LLM inference by turning idle tail-model services into remote speculative drafters using hybrid parallel decoding and priority scheduling.

  6. SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference

    cs.DC 2026-05 unverdicted novelty 6.0

    SPECTRE achieves up to 2.28x speedup for large-model LLM serving by running speculative draft generation and target verification in parallel using idle tail-model services.

  7. ZeRO-Prefill: Zero Redundancy Overheads in MoE Prefill Serving

    cs.LG 2026-05 unverdicted novelty 6.0

    ZeRO-Prefill achieves 1.35-1.59x higher throughput for MoE prefill serving by replacing per-layer activation AllToAll with overlapped asynchronous weight AllGather and prefix-aware routing.

  8. Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving

    cs.LG 2026-04 unverdicted novelty 6.0

    SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.

  9. AMMA: A Multi-Chiplet Memory-Centric Architecture for Low-Latency 1M Context Attention Serving

    cs.AR 2026-04 unverdicted novelty 6.0

    AMMA is a memory-centric multi-chiplet architecture using HBM-PNM cubes, custom logic dies, hybrid parallelism, and reordered collectives that delivers 15.5X lower attention latency and 6.9X lower energy than NVIDIA H...

  10. Salca: A Sparsity-Aware Hardware Accelerator for Efficient Long-Context Attention Decoding

    cs.AR 2026-04 unverdicted novelty 6.0

    Salca is a new ASIC accelerator that achieves 3.82× speedup and 74.19× energy efficiency over A100 for long-context attention via dual-compression dynamic sparse attention and pipelined hardware.

  11. Efficient Mixture-of-Experts LLM Inference with Apple Silicon NPUs

    cs.LG 2026-04 unverdicted novelty 6.0

    NPUMoE accelerates MoE LLM inference on Apple Silicon NPUs via offline-calibrated static expert tiers, grouped execution, and load-aware graph residency, delivering 1.32x-5.55x lower latency and 1.81x-7.37x better ene...

  12. MARS: Efficient, Adaptive Co-Scheduling for Heterogeneous Agentic Systems

    cs.OS 2026-04 conditional novelty 6.0

    MARS coordinates heterogeneous GPU-CPU resources for agentic LLM workloads via decoupled admission control and agent-centric KV cache management, delivering up to 5.94x lower latency and 1.87x faster task completion.

  13. Flow-Controlled Scheduling for LLM Inference with Provable Stability Guarantees

    cs.LG 2026-04 unverdicted novelty 6.0

    A flow-control framework for LLM inference derives necessary and sufficient stability conditions and experimentally improves throughput, latency, and KV cache stability over common baselines.

  14. Valve: Production Online-Offline Inference Colocation with Jointly-Bounded Preemption Latency and Rate

    cs.OS 2026-04 unverdicted novelty 6.0

    Valve jointly bounds preemption latency and rate for online-offline LLM colocation on GPUs, delivering 34.6% higher cluster utilization and a 2,170-GPU saving in a production deployment of 8,054 GPUs with under 5% TTF...

  15. HybridFlow: A Flexible and Efficient RLHF Framework

    cs.LG 2024-09 unverdicted novelty 6.0

    HybridFlow combines single- and multi-controller paradigms with a 3D-HybridEngine to deliver 1.53x to 20.57x higher throughput for various RLHF algorithms compared to prior systems.

  16. PipeMax: Enhancing Offline LLM Inference on Commodity GPU Servers

    cs.DC 2026-05 unverdicted novelty 5.0

    PipeMax integrates pipeline parallelism with offloading to achieve up to 2.51x higher throughput than vLLM for offline LLM inference on commodity 8-GPU servers.

  17. EdgeFM: Efficient Edge Inference for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 5.0

    EdgeFM is an agent-driven framework that strips non-essential features from VLMs and packages reusable optimized kernels, achieving up to 1.49x speedup over TensorRT-Edge-LLM on NVIDIA Orin while enabling first end-to...

  18. A Survey on Efficient Inference for Large Language Models

    cs.CL 2024-04 accept novelty 3.0

    The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.

  19. RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference

    cs.LG 2025-05

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · cited by 18 Pith papers · 4 internal anchors
