pith. machine review for the scientific record.

arxiv: 2401.10774 · v3 · submitted 2024-01-19 · 💻 cs.LG · cs.CL

Recognition: 1 theorem link

· Lean Theorem

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

Deming Chen, Hongwu Peng, Jason D. Lee, Tianle Cai, Tri Dao, Yuhong Li, Zhengyang Geng

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 10:32 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords LLM inference · decoding acceleration · multiple heads · tree attention · speculative decoding · parallel prediction · fine-tuning · generation speedup

The pith

By adding multiple decoding heads to an LLM, Medusa predicts several future tokens in parallel and verifies them together in one step, reducing sequential decoding steps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models generate one token at a time, forcing repeated loading of the full model weights. Medusa attaches extra decoding heads that forecast the next several tokens at once. These forecasts are arranged into a tree of candidate sequences that the model checks simultaneously via a special attention pattern. When the heads are accurate, multiple tokens advance per iteration instead of one. The method offers two training paths: heads alone on a frozen backbone for safe acceleration, or joint training of heads and backbone for larger gains.
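The head mechanism can be sketched in a few lines. This is a minimal illustration under assumptions, not the paper's exact architecture (the real heads use a feed-forward layer with a residual connection, and sizes here are invented): each extra head maps the backbone's final hidden state to logits for one additional future position.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_heads(hidden_size, vocab_size, num_heads):
    # One projection matrix per extra head; a stand-in for the paper's
    # lightweight feed-forward heads (the real heads also apply a
    # nonlinearity and a residual connection).
    return [rng.standard_normal((hidden_size, vocab_size)) * 0.02
            for _ in range(num_heads)]

def predict_candidates(hidden_state, lm_head, medusa_heads, top_k=3):
    # hidden_state: (hidden_size,) vector from the backbone at the
    # current position. The original LM head predicts position t+1;
    # Medusa head k predicts position t+1+k. Returns top-k token ids
    # per predicted position.
    candidates = []
    for W in [lm_head] + medusa_heads:
        logits = hidden_state @ W
        candidates.append(np.argsort(logits)[-top_k:][::-1].tolist())
    return candidates

hidden_size, vocab = 16, 50
lm_head = rng.standard_normal((hidden_size, vocab)) * 0.02
heads = make_heads(hidden_size, vocab, num_heads=4)
cands = predict_candidates(rng.standard_normal(hidden_size), lm_head, heads)
```

With four Medusa heads, one forward pass yields candidate tokens for five positions at once, which the tree verification step then filters.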

Core claim

Medusa augments an LLM with additional decoding heads that output predictions for multiple subsequent tokens. These predictions form a tree of candidate continuations that are verified in parallel during each decoding step through a tree-based attention mask. This parallel verification replaces several sequential forward passes, yielding over 2.2x speedup when only the heads are fine-tuned and 2.3-3.6x speedup when the backbone is also updated, all while preserving the original generation quality.

What carries the argument

Extra decoding heads that predict logits for future positions, combined with a tree-structured attention mask that lets the model score multiple candidate token sequences in a single forward pass.
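The tree-structured mask that makes single-pass scoring possible can be built from parent pointers: each candidate position attends only to itself and its ancestors, so distinct branches never see each other. A minimal sketch (representation assumed, not taken from the paper's code):

```python
def tree_attention_mask(parents):
    # Boolean mask for tree attention: position i may attend to j iff
    # j is i itself or an ancestor of i in the candidate tree.
    # parents[i] is the parent index of node i (-1 for the root).
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:
            mask[i][j] = True
            j = parents[j]
    return mask

# A tree with root 0, children 1 and 2, and node 3 a child of node 1:
mask = tree_attention_mask([-1, 0, 0, 1])
# Node 3 sees {0, 1, 3}; node 2 sees only {0, 2} -- sibling branches
# stay isolated while sharing one forward pass.
```

Flattening the tree into one sequence with this mask is what lets all candidate continuations be scored together instead of one branch per forward pass.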

If this is right

  • The total number of sequential model calls drops because multiple tokens are accepted per step.
  • No separate draft model needs to be trained or maintained, unlike classic speculative decoding.
  • Medusa-1 keeps the backbone unchanged and still delivers over 2.2x speedup with unchanged output quality.
  • Medusa-2 reaches 2.3-3.6x speedup by jointly fine-tuning heads and backbone under a special training recipe.
  • Self-distillation allows the heads to be trained without external data while keeping quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could lower serving latency for interactive applications where each new token matters.
  • Similar head additions might speed up other autoregressive generators such as image or audio models.
  • Longer contexts may show different gains because prediction accuracy can vary with sequence length.
  • Pairing Medusa with quantization or KV-cache compression would likely multiply the observed speedups.

Load-bearing premise

The extra heads must generate predictions accurate enough that the tree verification accepts more than one token per step on average, outweighing the added computation cost.
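For greedy decoding, the acceptance logic behind this premise reduces to a prefix match: keep candidate tokens while they agree with what the backbone itself would have produced at each position (a simplified sketch; the paper's typical acceptance scheme is more permissive than exact matching):

```python
def accepted_prefix_length(candidate_tokens, backbone_argmax):
    # Walk the candidate sequence and keep tokens while each one matches
    # the backbone's own argmax prediction at that position; stop at the
    # first mismatch. One token always advances, because the backbone's
    # standard next-token prediction is accepted unconditionally.
    n = 0
    for cand, ref in zip(candidate_tokens, backbone_argmax):
        if cand != ref:
            break
        n += 1
    return n + 1

# Two head predictions verified, plus the guaranteed base token:
assert accepted_prefix_length([5, 7, 9], [5, 7, 2]) == 3
```

The return value is exactly the quantity the premise depends on: its average over decoding steps must stay well above 1 for the extra computation to pay off.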

What would settle it

Measure the average number of accepted tokens per decoding step on a fixed benchmark; if the effective rate stays near 1 after accounting for head overhead, the net speedup vanishes.
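The accounting behind that test is simple enough to state explicitly. Treating net speedup as accepted tokens per step divided by the relative cost of a Medusa step (an illustrative model with assumed numbers, not the paper's measurements):

```python
def net_speedup(avg_accepted_tokens, step_time_ratio):
    # avg_accepted_tokens: mean tokens accepted per Medusa step
    # (>= 1, since the base next-token prediction always advances).
    # step_time_ratio: time of one Medusa step divided by one plain
    # decoding step (> 1 due to extra heads and tree verification).
    return avg_accepted_tokens / step_time_ratio

# If ~2.5 tokens are accepted per step and each step costs 10% more:
speedup = net_speedup(2.5, 1.1)   # ~2.27x
# If acceptance collapses to ~1 token per step, the overhead wins:
no_gain = net_speedup(1.0, 1.1)   # below 1x
```

This is why the acceptance rate is load-bearing: with acceptance near 1, any per-step overhead pushes the ratio below break-even regardless of tree size.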

read the original abstract

Large Language Models (LLMs) employ auto-regressive decoding that requires sequential computation, with each step reliant on the previous one's output. This creates a bottleneck as each step necessitates moving the full model parameters from High-Bandwidth Memory (HBM) to the accelerator's cache. While methods such as speculative decoding have been suggested to address this issue, their implementation is impeded by the challenges associated with acquiring and maintaining a separate draft model. In this paper, we present Medusa, an efficient method that augments LLM inference by adding extra decoding heads to predict multiple subsequent tokens in parallel. Using a tree-based attention mechanism, Medusa constructs multiple candidate continuations and verifies them simultaneously in each decoding step. By leveraging parallel processing, Medusa substantially reduces the number of decoding steps required. We present two levels of fine-tuning procedures for Medusa to meet the needs of different use cases: Medusa-1: Medusa is directly fine-tuned on top of a frozen backbone LLM, enabling lossless inference acceleration. Medusa-2: Medusa is fine-tuned together with the backbone LLM, enabling better prediction accuracy of Medusa heads and higher speedup but needing a special training recipe that preserves the backbone model's capabilities. Moreover, we propose several extensions that improve or expand the utility of Medusa, including a self-distillation to handle situations where no training data is available and a typical acceptance scheme to boost the acceptance rate while maintaining generation quality. We evaluate Medusa on models of various sizes and training procedures. Our experiments demonstrate that Medusa-1 can achieve over 2.2x speedup without compromising generation quality, while Medusa-2 further improves the speedup to 2.3-3.6x.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Medusa, a framework to accelerate autoregressive LLM inference by attaching multiple lightweight decoding heads that predict several future tokens in parallel. A tree-based attention mechanism constructs and verifies multiple candidate continuations in a single forward pass per decoding step. Two fine-tuning regimes are defined: Medusa-1 trains only the heads on a frozen backbone (claimed to be lossless), while Medusa-2 jointly optimizes heads and backbone under a special recipe that preserves original capabilities. Extensions include self-distillation and a typical acceptance scheme. Experiments on models of varying sizes report speedups exceeding 2.2× for Medusa-1 and 2.3–3.6× for Medusa-2 while maintaining generation quality.

Significance. If the reported speedups are robust, Medusa offers a practical alternative to speculative decoding that avoids maintaining a separate draft model, lowering implementation complexity for practitioners. The ability to achieve >2× acceleration with only head fine-tuning (Medusa-1) or modest joint training (Medusa-2) would be valuable for latency-sensitive deployments across model scales.

major comments (3)
  1. [Abstract / Experiments] Abstract and Experiments section: the central claims of >2.2× (Medusa-1) and 2.3–3.6× (Medusa-2) speedups are presented without tabulated or plotted acceptance rates per decoding step, their variance across tasks, or error bars. Because net speedup is determined by the product of acceptance probability and tree branching factor minus the overhead of extra heads and tree attention, these quantities are load-bearing and must be shown explicitly to substantiate the acceleration numbers.
  2. [Method] Method description of Medusa-2: the 'special training recipe' that jointly fine-tunes the backbone while preserving its original distribution is referenced but not specified with loss terms, weighting schedules, or regularization details. Without these, it is impossible to assess whether the reported higher speedups come at the cost of distribution shift that would only appear on held-out or longer-context data.
  3. [Experiments] Experiments: direct head-to-head comparisons against established speculative decoding baselines (e.g., draft-model methods) are absent. Speedup and quality metrics should be reported on identical prompts and hardware so that the claimed simplicity advantage can be weighed against any difference in achieved tokens-per-second.
minor comments (2)
  1. [Figures] Figure captions and tree diagrams should explicitly label the branching factor and depth used in the reported runs so readers can reproduce the verification cost.
  2. [Extensions] The self-distillation procedure is mentioned as an extension but lacks a concise algorithmic description or pseudocode.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and will incorporate the suggested clarifications and additions in the revised manuscript to strengthen the presentation of our results.

read point-by-point responses
  1. Referee: [Abstract / Experiments] Abstract and Experiments section: the central claims of >2.2× (Medusa-1) and 2.3–3.6× (Medusa-2) speedups are presented without tabulated or plotted acceptance rates per decoding step, their variance across tasks, or error bars. Because net speedup is determined by the product of acceptance probability and tree branching factor minus the overhead of extra heads and tree attention, these quantities are load-bearing and must be shown explicitly to substantiate the acceleration numbers.

    Authors: We agree that explicit reporting of acceptance rates, their variance, and error bars would better substantiate the speedup claims by directly illustrating the contribution of acceptance probability and branching factor. In the revised manuscript, we will add a dedicated table in the Experiments section that reports per-step acceptance rates for Medusa-1 and Medusa-2 across all evaluated tasks, including standard deviations to capture variance. We will also augment the speedup plots with error bars derived from multiple runs. These additions will explicitly connect the measured end-to-end speedups to the underlying acceptance and tree statistics while accounting for overhead. revision: yes

  2. Referee: [Method] Method description of Medusa-2: the 'special training recipe' that jointly fine-tunes the backbone while preserving its original distribution is referenced but not specified with loss terms, weighting schedules, or regularization details. Without these, it is impossible to assess whether the reported higher speedups come at the cost of distribution shift that would only appear on held-out or longer-context data.

    Authors: We acknowledge that the description of the Medusa-2 training procedure lacks sufficient implementation detail. In the revised Method section, we will explicitly define the composite loss (standard next-token prediction loss on the backbone combined with a weighted Medusa-head prediction loss), the weighting schedule (e.g., linear ramp-up of the head-loss coefficient from 0.1 to 0.5 over the first 10% of training steps), and the regularization term (KL divergence between the fine-tuned backbone outputs and the original model outputs on a held-out calibration set). These additions will allow readers to verify that distribution shift is controlled and that the higher speedups do not compromise generalization on held-out or longer-context data. revision: yes

  3. Referee: [Experiments] Experiments: direct head-to-head comparisons against established speculative decoding baselines (e.g., draft-model methods) are absent. Speedup and quality metrics should be reported on identical prompts and hardware so that the claimed simplicity advantage can be weighed against any difference in achieved tokens-per-second.

    Authors: We agree that direct, apples-to-apples comparisons would strengthen the evaluation. In the revised Experiments section, we will add a new subsection presenting head-to-head results against representative speculative decoding baselines (e.g., the draft-model method of Leviathan et al.) using the exact same prompt sets, model sizes, and hardware configuration. We will report tokens-per-second, acceptance rates, and generation quality metrics side-by-side, enabling a clear assessment of the simplicity versus performance trade-off. revision: yes
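The composite objective described in response 2 can be sketched numerically. Everything here follows the simulated rebuttal's own description, with illustrative weights; the function and symbol names are invented for this sketch and do not come from the paper:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, target):
    # Negative log-likelihood of the target token under the logits.
    return float(-np.log(softmax(logits)[target]))

def kl_divergence(p_logits, q_logits):
    # KL(p || q) between the two softmax distributions.
    p, q = softmax(p_logits), softmax(q_logits)
    return float(np.sum(p * (np.log(p) - np.log(q))))

def medusa2_loss(backbone_logits, head_logits_list, targets,
                 frozen_logits, head_weight=0.2, kl_weight=0.1):
    # Composite objective as described in the rebuttal (weights are
    # illustrative assumptions): backbone next-token loss, weighted
    # Medusa-head losses, and a KL term tying the fine-tuned backbone
    # to the frozen original model's distribution.
    # targets[0] is the next token; targets[k + 1] is head k's target.
    loss = cross_entropy(backbone_logits, targets[0])
    for k, head_logits in enumerate(head_logits_list):
        loss += head_weight * cross_entropy(head_logits, targets[k + 1])
    loss += kl_weight * kl_divergence(frozen_logits, backbone_logits)
    return loss
```

The KL term vanishes when the fine-tuned backbone matches the original model exactly, which is the mechanism the rebuttal proposes for controlling distribution shift.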

Circularity Check

0 steps flagged

No circularity: speedups are empirical measurements on held-out data

full rationale

The paper's claims rest on measured inference speedups from adding and training extra decoding heads plus tree verification, evaluated against standard autoregressive baselines on held-out tasks. No equations reduce the reported gains (e.g., 2.2x or 2.3-3.6x) to quantities defined inside the paper by construction, and no self-citations or uniqueness theorems are invoked to force the central result. The training recipes and acceptance-rate improvements are presented as standard fine-tuning procedures whose net benefit is quantified externally rather than assumed tautologically.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The framework rests on the assumption that additional heads can learn useful multi-token predictions from standard next-token data and that tree verification overhead remains sub-linear in practice.

free parameters (2)
  • number of Medusa heads
    Chosen empirically to trade off prediction coverage against added compute; typical values implied by reported speedups.
  • tree depth and branching factor
    Hyperparameters that determine how many candidate sequences are generated and verified per step.
axioms (1)
  • domain assumption: The backbone LLM’s hidden states remain sufficiently informative for the added heads to predict future tokens accurately.
    Invoked when claiming that fine-tuning only the heads (Medusa-1) suffices for lossless acceleration.
invented entities (1)
  • Medusa decoding heads · no independent evidence
    purpose: Predict multiple future tokens in parallel from the same backbone representation.
    New architectural component introduced by the paper.
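The two free parameters jointly set the per-step verification cost. For a complete tree (an idealization; the paper's tree need not be complete), the candidate count grows geometrically:

```python
def num_tree_nodes(branching, depth):
    # Nodes in a complete candidate tree, excluding the root prompt:
    # sum of branching^d for d = 1 .. depth.
    return sum(branching ** d for d in range(1, depth + 1))

# branching 3, depth 4 -> 3 + 9 + 27 + 81 = 120 candidate positions
# that tree attention must score in a single forward pass.
assert num_tree_nodes(3, 4) == 120
```

This is the quantity that must stay small enough for verification overhead to remain sub-linear in practice, as the ledger's assumption requires.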

pith-pipeline@v0.9.0 · 5630 in / 1361 out tokens · 28653 ms · 2026-05-13T10:32:04.406047+00:00 · methodology

discussion (0)


Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding

    cs.LG 2026-05 unverdicted novelty 7.0

    SlimSpec replaces the standard LM-head in draft models with a low-rank version to deliver 4-5x faster speculative decoding while preserving full vocabulary and competitive acceptance rates.

  2. BubbleSpec: Turning Long-Tail Bubbles into Speculative Rollout Drafts for Synchronous Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    BubbleSpec exploits long-tail bubbles in synchronous RL by using faster ranks' idle time to pre-generate rollout drafts for speculative decoding, reducing steps by 50% and raising throughput up to 1.8x while preservin...

  3. FASER: Fine-Grained Phase Management for Speculative Decoding in Dynamic LLM Serving

    cs.DC 2026-04 unverdicted novelty 7.0

    FASER delivers up to 53% higher throughput and 1.92x lower latency in dynamic LLM serving by adjusting speculative lengths per request, early pruning of rejects, and overlapping draft/verification phases via frontiers.

  4. NI Sampling: Accelerating Discrete Diffusion Sampling by Token Order Optimization

    cs.LG 2026-04 unverdicted novelty 7.0

    NI Sampling accelerates discrete diffusion language models up to 14.3 times by training a neural indicator to select which tokens to sample at each step using a trajectory-preserving objective.

  5. WISV: Wireless-Informed Semantic Verification for Distributed Speculative Decoding in Device-Edge LLM Inference

    cs.IT 2026-04 unverdicted novelty 7.0

    WISV uses a channel-aware semantic acceptance policy on hidden representations to boost accepted sequence length by up to 60.8% and cut interaction rounds by 37.3% in distributed speculative decoding, with under 1% ac...

  6. Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding

    cs.CV 2026-04 unverdicted novelty 7.0

    Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.

  7. Sequential KV Cache Compression via Probabilistic Language Tries: Beyond the Per-Vector Shannon Limit

    cs.LG 2026-04 unverdicted novelty 7.0

    Sequential KV compression via probabilistic language tries and predictive delta coding achieves 3.3-4.3 bits per token entropy, yielding up to 914x better ratios than TurboQuant even with large overhead.

  8. Multi-Stream LLMs: Unblocking Language Models with Parallel Streams of Thoughts, Inputs and Outputs

    cs.LG 2026-05 unverdicted novelty 6.0

    Language models trained on parallel streams of computation can overcome single-stream bottlenecks in autonomous agents by enabling simultaneous reading, thinking, and acting.

  9. BitLM: Unlocking Multi-Token Language Generation with Bitwise Continuous Diffusion

    cs.CL 2026-05 unverdicted novelty 6.0

    BitLM replaces per-token softmax with bitwise continuous diffusion inside causal blocks to generate multiple tokens in parallel while preserving autoregressive structure.

  10. Edit-Based Refinement for Parallel Masked Diffusion Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    ME-DLM augments parallel masked diffusion models with edit-distance-supervised refinements to raise quality on coding and math benchmarks while using far fewer diffusion steps.

  11. PARD-2: Target-Aligned Parallel Draft Model for Dual-Mode Speculative Decoding

    cs.CL 2026-05 unverdicted novelty 6.0

    PARD-2 uses Confidence-Adaptive Token optimization to align draft model training with acceptance length in speculative decoding, enabling dual-mode operation and up to 6.94x lossless speedup on Llama3.1-8B.

  12. CASCADE: Context-Aware Relaxation for Speculative Image Decoding

    cs.CV 2026-05 unverdicted novelty 6.0

    CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to...

  13. WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs Inference Optimization

    cs.CV 2026-05 unverdicted novelty 6.0

    WindowQuant performs window-adaptive mixed-precision KV cache quantization guided by similarity to the text prompt, with reordering to enable efficient inference in VLMs.

  14. CoVSpec: Efficient Device-Edge Co-Inference for Vision-Language Models via Speculative Decoding

    cs.AI 2026-05 unverdicted novelty 6.0

    CoVSpec achieves up to 2.21x higher throughput and over 96% lower communication overhead for device-edge VLM inference via training-free visual token reduction, adaptive drafting, and decoupled parallel verification-c...

  15. Making Every Verified Token Count: Adaptive Verification for MoE Speculative Decoding

    cs.CL 2026-05 unverdicted novelty 6.0

    EVICT adaptively truncates draft trees in MoE speculative decoding by combining drafter signals with profiled costs to retain only cost-effective prefixes, delivering up to 2.35x speedup over autoregressive decoding.

  16. Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving

    cs.LG 2026-04 unverdicted novelty 6.0

    SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.

  17. RACER: Retrieval-Augmented Contextual Rapid Speculative Decoding

    cs.CL 2026-04 unverdicted novelty 6.0

    RACER unifies retrieval of exact matching patterns with logit-driven cues to produce better speculative drafts, achieving more than 2x speedup over autoregressive decoding and outperforming prior training-free specula...

  18. SMART: When is it Actually Worth Expanding a Speculative Tree?

    cs.DC 2026-04 unverdicted novelty 6.0

    SMART uses marginal benefit-cost analysis to dynamically build efficient speculative trees, achieving 15-20% additional speedup in LLM and MLLM inference.

  19. MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents

    cs.CL 2025-06 unverdicted novelty 6.0

    MEM1 uses end-to-end RL to learn constant-memory agents that update a shared state for memory and reasoning, delivering 3.5x better performance and 3.7x lower memory use than larger baselines on long-horizon QA and sh...

  20. SnapKV: LLM Knows What You are Looking for Before Generation

    cs.CL 2024-04 conditional novelty 6.0

    SnapKV selects clustered important KV positions per attention head from an observation window at the prompt end, yielding 3.6x faster generation and 8.2x better memory efficiency on 16K-token inputs with comparable pe...

  21. Cut Your Losses! Learning to Prune Paths Early for Efficient Parallel Reasoning

    cs.CL 2026-04 unverdicted novelty 5.0

    STOP is a new learnable internal path-pruning technique that improves efficiency and accuracy of parallel reasoning in LRMs under fixed compute budgets.

  22. Acceptance Dynamics Across Cognitive Domains in Speculative Decoding

    cs.AI 2026-04 unverdicted novelty 5.0

    Empirical measurements across four NLP domains show task type is a stronger predictor of speculative decoding acceptance than tree depth, with chat uniquely achieving expected accepted length over 1 token per step.

  23. DMax: Aggressive Parallel Decoding for dLLMs

    cs.LG 2026-04 unverdicted novelty 5.0

    DMax enables faster parallel decoding in diffusion language models by using on-policy training to recover from errors and soft embedding interpolations for iterative revision, boosting tokens per forward pass roughly ...

  24. Unlocking the Edge deployment and ondevice acceleration of multi-LoRA enabled one-for-all foundational LLM

    cs.DC 2026-04 unverdicted novelty 4.0

    A framework combines multi-LoRA runtime switching, multi-stream stylistic decoding, and Dynamic Self-Speculative Decoding with INT4 quantization to achieve 4-6x memory and latency gains for on-device inference of a on...

Reference graph

Works this paper leans on

289 extracted references · 289 canonical work pages · cited by 24 Pith papers · 28 internal anchors

  1. [2]

    Axolotl. Axolotl . https://github.com/OpenAccess-AI-Collective/axolotl, 2023

  2. [3]

    Mirostat: A neural text decoding algorithm that directly controls perplexity

    Basu, S., Ramachandran, G. S., Keskar, N. S., and Varshney, L. R. Mirostat: A neural text decoding algorithm that directly controls perplexity. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=W1G1JZEIy5_

  3. [4]

    Language models are few-shot learners

    Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877--1901, 2020

  4. [5]

    Accelerating Large Language Model Decoding with Speculative Sampling

    Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., and Jumper, J. Accelerating large language model decoding with speculative sampling. February 2023. doi:10.48550/ARXIV.2302.01318

  5. [6]

    Dissecting batching effects in gpt inference

    Chen, L. Dissecting batching effects in gpt inference. https://le.qun.ch/en/blog/2023/05/13/transformer-batching/, 2023. Blog

  6. [7]

    Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality

    Chiang, W.-L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J. E., Stoica, I., and Xing, E. P. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/

  7. [8]

    PaLM: Scaling Language Modeling with Pathways

    Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022

  8. [9]

    8-bit optimizers via block-wise quantization

    Dettmers, T., Lewis, M., Shleifer, S., and Zettlemoyer, L. 8-bit optimizers via block-wise quantization. International Conference on Learning Representations, 2021

  9. [12]

    Enhancing chat language models by scaling high-quality instructional conversations, 2023

    Ding, N., Chen, Y., Xu, B., Qin, Y., Zheng, Z., Hu, S., Liu, Z., Sun, M., and Zhou, B. Enhancing chat language models by scaling high-quality instructional conversations, 2023

  10. [13]

    Dubois, Y., Li, X., Taori, R., Zhang, T., Gulrajani, I., Ba, J., Guestrin, C., Liang, P., and Hashimoto, T. B. Alpacafarm: A simulation framework for methods that learn from human feedback, 2023

  11. [14]

    Sigmoid-weighted linear units for neural network function approximation in reinforcement learning

    Elfwing, S., Uchibe, E., and Doya, K. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 2017. doi:10.1016/j.neunet.2017.12.012

  12. [15]

    Hierarchical neural story generation

    Fan, A., Lewis, M., and Dauphin, Y. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2018. doi:10.18653/v1/p18-1082

  13. [17]

    Palm 2 technical report, 2023

    Google. Palm 2 technical report, 2023. URL https://ai.google/static/documents/palm2techreport.pdf

  14. [18]

    Truncation sampling as language model desmoothing

    Hewitt, J., Manning, C. D., and Liang, P. Truncation sampling as language model desmoothing. October 2022. doi:10.48550/ARXIV.2210.15191

  15. [19]

    Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. d. L., Hendricks, L. A., Welbl, J., Clark, A., et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022

  16. [20]

    The curious case of neural text degeneration

    Holtzman, A., Buys, J., Du, L., Forbes, M., and Choi, Y. The curious case of neural text degeneration. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rygGQyrFvH

  17. [21]

    LoRA: Low-rank adaptation of large language models

    Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., and Chen, W. Lora: Low-rank adaptation of large language models. ICLR, 2021

  18. [22]

    Assisted generation: a new direction toward low-latency text generation, 2023

    Gante, J. Assisted generation: a new direction toward low-latency text generation, 2023. URL https://huggingface.co/blog/assisted-generation

  19. [24]

    Sequence-level knowledge distillation

    Kim, Y. and Rush, A. M. Sequence-level knowledge distillation. EMNLP, 2016

  20. [25]

    Fine-tuning can distort pretrained features and underperform out-of-distribution

    Kumar, A., Raghunathan, A., Jones, R., Ma, T., and Liang, P. Fine-tuning can distort pretrained features and underperform out-of-distribution. International Conference on Learning Representations, 2022

  21. [26]

    Efficient memory management for large language model serving with PagedAttention

    Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

  22. [27]

    Fast inference from transformers via speculative decoding

    Leviathan, Y., Kalman, M., and Matias, Y. Fast inference from transformers via speculative decoding. November 2022. doi:10.48550/ARXIV.2211.17192

  23. [28]

    Li, X., Zhang, T., Dubois, Y., Taori, R., Gulrajani, I., Guestrin, C., Liang, P., and Hashimoto, T. B. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 2023

  24. [30]

    On the probability-quality paradox in language generation

    Meister, C., Wiher, G., Pimentel, T., and Cotterell, R. On the probability-quality paradox in language generation. March 2022. doi:10.48550/ARXIV.2203.17217

  25. [31]

    Locally typical sampling

    Meister, C., Pimentel, T., Wiher, G., and Cotterell, R. Locally typical sampling. Transactions of the Association for Computational Linguistics, 11:102--121, 2023

  26. [33]

    Nvidia a100 tensor core gpu

    NVIDIA. Nvidia a100 tensor core gpu

  27. [34]

    Gpt-4 technical report, 2023

    OpenAI. Gpt-4 technical report, 2023

  28. [35]

    Training language models to follow instructions with human feedback

    Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022

  29. [36]

    Tiny vicuna 1b

    Pan, J. Tiny vicuna 1b. https://huggingface.co/Jiayi-Pan/Tiny-Vicuna-1B, 2023

  30. [37]

    MAUVE : Measuring the gap between neural text and human text using divergence frontiers

    Pillutla, K., Swayamdipta, S., Zellers, R., Thickstun, J., Welleck, S., Choi, Y., and Harchaoui, Z. MAUVE : Measuring the gap between neural text and human text using divergence frontiers. In Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=Tqx7nJp7PR

  31. [38]

    Efficiently scaling transformer inference

    Pope, R., Douglas, S., Chowdhery, A., Devlin, J., Bradbury, J., Levskaya, A., Heek, J., Xiao, K., Agrawal, S., and Dean, J. Efficiently scaling transformer inference. November 2022. doi:10.48550/ARXIV.2211.05102

  32. [39]

    ShareGPT

    ShareGPT. ShareGPT . https://huggingface.co/datasets/Aeala/ShareGPT_Vicuna_unfiltered, 2023

  33. [42]

    Blockwise parallel decoding for deep autoregressive models

    Stern, M., Shazeer, N. M., and Uszkoreit, J. Blockwise parallel decoding for deep autoregressive models. Neural Information Processing Systems, 2018

  34. [44]

    Zephyr: Direct distillation of LM alignment

    Tunstall, L., Beeching, E., Lambert, N., Rajani, N., Rasul, K., Belkada, Y., Huang, S., von Werra, L., Fourrier, C., Habib, N., Sarrazin, N., Sanseviero, O., Rush, A. M., and Wolf, T. Zephyr: Direct distillation of lm alignment, 2023

  35. [45]

    Speculative decoding: Lossless speedup of autoregressive translation, 2023

    Xia, H., Ge, T., Chen, S.-Q., Wei, F., and Sui, Z. Speculative decoding: Lossless speedup of autoregressive translation, 2023. URL https://openreview.net/forum?id=H-VlwsYvVi

  36. [46]

    Smoothquant: Accurate and efficient post-training quantization for large language models

    Xiao, G., Lin, J., Seznec, M., Wu, H., Demouth, J., and Han, S. Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pp. 38087--38099. PMLR, 2023a

  37. [47]

    A survey on non-autoregressive generation for neural machine translation and beyond

    Xiao, Y., Wu, L., Guo, J., Li, J., Zhang, M., Qin, T., and Liu, T.-y. A survey on non-autoregressive generation for neural machine translation and beyond. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023b

  38. [48]

    Do transformers really perform badly for graph representation?

    Ying, C., Cai, T., Luo, S., Zheng, S., Ke, G., He, D., Shen, Y., and Liu, T.-Y. Do transformers really perform badly for graph representation? Advances in Neural Information Processing Systems, 34:28877--28888, 2021

  39. [49]

    Tinyllama: An open-source small language model

    Zhang, P., Zeng, G., Wang, T., and Lu, W. Tinyllama: An open-source small language model, 2024

  40. [50]

    OPT: Open Pre-trained Transformer Language Models

    Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X. V., et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022

  41. [52]

    Judging LLM-as-a-judge with MT-Bench and Chatbot Arena

    Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., and Stoica, I. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, 2023

  42. [53]

    Li, X., Zhang, T., Dubois, Y., Taori, R., Gulrajani, I., Guestrin, C., Liang, P., and Hashimoto, T. B. GitHub repository, 2023

  43. [54]

    Enhancing Chat Language Models by Scaling High-quality Instructional Conversations

    Enhancing Chat Language Models by Scaling High-quality Instructional Conversations, 2023

  44. [55]

    8-bit Optimizers via Block-wise Quantization

    8-bit Optimizers via Block-wise Quantization. In International Conference on Learning Representations, 2022

  45. [56]

    LoRA: Low-Rank Adaptation of Large Language Models

    LoRA: Low-Rank Adaptation of Large Language Models. In ICLR, 2022

  46. [57]

    Sequence-Level Knowledge Distillation

    Sequence-Level Knowledge Distillation, 2016

  48. [59]

    Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution

    Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution. In International Conference on Learning Representations, 2022

  49. [60]

    Efficient Memory Management for Large Language Model Serving with PagedAttention

    Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

  50. [61]

    DistillSpec: Improving Speculative Decoding via Knowledge Distillation

    DistillSpec: Improving Speculative Decoding via Knowledge Distillation, 2023

  51. [62]

    Online Speculative Decoding

    Online Speculative Decoding, 2023

  52. [63]

    The Synergy of Speculative Decoding and Batching in Serving Large Language Models

    The Synergy of Speculative Decoding and Batching in Serving Large Language Models. arXiv preprint arXiv:2310.18813, 2023

  53. [64]

    REST: Retrieval-Based Speculative Decoding

    REST: Retrieval-Based Speculative Decoding, 2023

  54. [65]

    Breaking the Sequential Dependency of LLM Inference Using Lookahead Decoding

    Fu, Y., Bailis, P., Stoica, I., and Zhang, H. Breaking the Sequential Dependency of LLM Inference Using Lookahead Decoding, 2023

  55. [66]

    AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback

    AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback, 2023

  57. [68]

    Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality

    Chiang, W.-L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J. E., Stoica, I., and Xing, E. P. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality, 2023

  60. [71]

    The Curious Case of Neural Text Degeneration

    The Curious Case of Neural Text Degeneration. In International Conference on Learning Representations, 2020

  61. [72]

    Language Model Evaluation Beyond Perplexity

    Language Model Evaluation Beyond Perplexity. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021

  62. [73]

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv preprint arXiv:2210.17323, 2022

  64. [75]

    AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

    AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. arXiv preprint arXiv:2306.00978, 2023

  66. [77]

    LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

    LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. arXiv preprint arXiv:2208.07339, 2022

  67. [78]

    QuIP: 2-Bit Quantization of Large Language Models With Guarantees

    QuIP: 2-Bit Quantization of Large Language Models With Guarantees. arXiv preprint arXiv:2307.13304, 2023

  68. [79]

    H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

    H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. arXiv preprint arXiv:2306.14048, 2023

  69. [80]

    GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. arXiv preprint arXiv:2305.13245, 2023

  70. [81]

    SqueezeLLM: Dense-and-Sparse Quantization

    SqueezeLLM: Dense-and-Sparse Quantization. arXiv preprint arXiv:2306.07629, 2023

  71. [82]

    Fast Transformer Decoding: One Write-Head is All You Need

    Fast Transformer Decoding: One Write-Head is All You Need. arXiv preprint arXiv:1911.02150, 2019

  72. [83]

    QLoRA: Efficient Finetuning of Quantized LLMs

    QLoRA: Efficient Finetuning of Quantized LLMs. arXiv preprint arXiv:2305.14314, 2023

  73. [84]

    Accelerating LLM Inference with Staged Speculative Decoding

    Spector, B., and Re, C. Accelerating LLM Inference with Staged Speculative Decoding. arXiv preprint arXiv:2308.04623, 2023

  74. [85]

    SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification

    SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification. arXiv preprint arXiv:2305.09781, 2023

  75. [86]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv preprint arXiv:2307.09288, 2023

  76. [87]

    GPTCache

    GPTCache, 2023

  77. [88]

    Language Models can Solve Computer Tasks

    Language Models can Solve Computer Tasks, 2023. doi:10.48550/arXiv.2303.17491

  78. [89]

    Voyager: An Open-Ended Embodied Agent with Large Language Models

    Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv preprint arXiv:2305.16291, 2023

  79. [90]

    Chen, L., 2023

  80. [91]

    Tool Learning with Foundation Models

    Tool Learning with Foundation Models. arXiv preprint arXiv:2304.08354, 2023

Showing first 80 references.