pith. machine review for the scientific record.

arxiv: 2401.10774 · v3 · submitted 2024-01-19 · 💻 cs.LG · cs.CL

Recognition: 1 theorem link

· Lean Theorem

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

Deming Chen, Hongwu Peng, Jason D. Lee, Tianle Cai, Tri Dao, Yuhong Li, Zhengyang Geng

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 10:32 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords LLM inference · decoding acceleration · multiple heads · tree attention · speculative decoding · parallel prediction · fine-tuning · generation speedup

The pith

By adding multiple decoding heads to an LLM, Medusa predicts several future tokens in parallel and verifies them together in one step, reducing sequential decoding steps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models generate one token at a time, forcing repeated loading of the full model weights. Medusa attaches extra decoding heads that forecast the next several tokens at once. These forecasts are arranged into a tree of candidate sequences that the model checks simultaneously via a special attention pattern. When the heads are accurate, multiple tokens advance per iteration instead of one. The method offers two training paths: heads alone on a frozen backbone for safe acceleration, or joint training of heads and backbone for larger gains.
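The head mechanism can be sketched in a few lines. This is a minimal illustration under assumptions, not the paper's exact architecture (the real heads use a feed-forward layer with a residual connection, and sizes here are invented): each extra head maps the backbone's final hidden state to logits for one additional future position.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_heads(hidden_size, vocab_size, num_heads):
    # One projection matrix per extra head; a stand-in for the paper's
    # lightweight feed-forward heads (the real heads also apply a
    # nonlinearity and a residual connection).
    return [rng.standard_normal((hidden_size, vocab_size)) * 0.02
            for _ in range(num_heads)]

def predict_candidates(hidden_state, lm_head, medusa_heads, top_k=3):
    # hidden_state: (hidden_size,) vector from the backbone at the
    # current position. The original LM head predicts position t+1;
    # Medusa head k predicts position t+1+k. Returns top-k token ids
    # per predicted position.
    candidates = []
    for W in [lm_head] + medusa_heads:
        logits = hidden_state @ W
        candidates.append(np.argsort(logits)[-top_k:][::-1].tolist())
    return candidates

hidden_size, vocab = 16, 50
lm_head = rng.standard_normal((hidden_size, vocab)) * 0.02
heads = make_heads(hidden_size, vocab, num_heads=4)
cands = predict_candidates(rng.standard_normal(hidden_size), lm_head, heads)
```

With four Medusa heads, one forward pass yields candidate tokens for five positions at once, which the tree verification step then filters.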

Core claim

Medusa augments an LLM with additional decoding heads that output predictions for multiple subsequent tokens. These predictions form a tree of candidate continuations that are verified in parallel during each decoding step through a tree-based attention mask. This parallel verification replaces several sequential forward passes, yielding over 2.2x speedup when only the heads are fine-tuned and 2.3-3.6x speedup when the backbone is also updated, all while preserving the original generation quality.

What carries the argument

Extra decoding heads that predict logits for future positions, combined with a tree-structured attention mask that lets the model score multiple candidate token sequences in a single forward pass.
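The tree-structured mask that makes single-pass scoring possible can be built from parent pointers: each candidate position attends only to itself and its ancestors, so distinct branches never see each other. A minimal sketch (representation assumed, not taken from the paper's code):

```python
def tree_attention_mask(parents):
    # Boolean mask for tree attention: position i may attend to j iff
    # j is i itself or an ancestor of i in the candidate tree.
    # parents[i] is the parent index of node i (-1 for the root).
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:
            mask[i][j] = True
            j = parents[j]
    return mask

# A tree with root 0, children 1 and 2, and node 3 a child of node 1:
mask = tree_attention_mask([-1, 0, 0, 1])
# Node 3 sees {0, 1, 3}; node 2 sees only {0, 2} -- sibling branches
# stay isolated while sharing one forward pass.
```

Flattening the tree into one sequence with this mask is what lets all candidate continuations be scored together instead of one branch per forward pass.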

If this is right

  • The total number of sequential model calls drops because multiple tokens are accepted per step.
  • No separate draft model needs to be trained or maintained, unlike classic speculative decoding.
  • Medusa-1 keeps the backbone unchanged and still delivers over 2.2x speedup with unchanged output quality.
  • Medusa-2 reaches 2.3-3.6x speedup by jointly fine-tuning heads and backbone under a special training recipe.
  • Self-distillation allows the heads to be trained without external data while keeping quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could lower serving latency for interactive applications where each new token matters.
  • Similar head additions might speed up other autoregressive generators such as image or audio models.
  • Longer contexts may show different gains because prediction accuracy can vary with sequence length.
  • Pairing Medusa with quantization or KV-cache compression would likely multiply the observed speedups.

Load-bearing premise

The extra heads must generate predictions accurate enough that the tree verification accepts more than one token per step on average, outweighing the added computation cost.
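For greedy decoding, the acceptance logic behind this premise reduces to a prefix match: keep candidate tokens while they agree with what the backbone itself would have produced at each position (a simplified sketch; the paper's typical acceptance scheme is more permissive than exact matching):

```python
def accepted_prefix_length(candidate_tokens, backbone_argmax):
    # Walk the candidate sequence and keep tokens while each one matches
    # the backbone's own argmax prediction at that position; stop at the
    # first mismatch. One token always advances, because the backbone's
    # standard next-token prediction is accepted unconditionally.
    n = 0
    for cand, ref in zip(candidate_tokens, backbone_argmax):
        if cand != ref:
            break
        n += 1
    return n + 1

# Two head predictions verified, plus the guaranteed base token:
assert accepted_prefix_length([5, 7, 9], [5, 7, 2]) == 3
```

The return value is exactly the quantity the premise depends on: its average over decoding steps must stay well above 1 for the extra computation to pay off.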

What would settle it

Measure the average number of accepted tokens per decoding step on a fixed benchmark; if the effective rate stays near 1 after accounting for head overhead, the net speedup vanishes.
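The accounting behind that test is simple enough to state explicitly. Treating net speedup as accepted tokens per step divided by the relative cost of a Medusa step (an illustrative model with assumed numbers, not the paper's measurements):

```python
def net_speedup(avg_accepted_tokens, step_time_ratio):
    # avg_accepted_tokens: mean tokens accepted per Medusa step
    # (>= 1, since the base next-token prediction always advances).
    # step_time_ratio: time of one Medusa step divided by one plain
    # decoding step (> 1 due to extra heads and tree verification).
    return avg_accepted_tokens / step_time_ratio

# If ~2.5 tokens are accepted per step and each step costs 10% more:
speedup = net_speedup(2.5, 1.1)   # ~2.27x
# If acceptance collapses to ~1 token per step, the overhead wins:
no_gain = net_speedup(1.0, 1.1)   # below 1x
```

This is why the acceptance rate is load-bearing: with acceptance near 1, any per-step overhead pushes the ratio below break-even regardless of tree size.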

read the original abstract

Large Language Models (LLMs) employ auto-regressive decoding that requires sequential computation, with each step reliant on the previous one's output. This creates a bottleneck as each step necessitates moving the full model parameters from High-Bandwidth Memory (HBM) to the accelerator's cache. While methods such as speculative decoding have been suggested to address this issue, their implementation is impeded by the challenges associated with acquiring and maintaining a separate draft model. In this paper, we present Medusa, an efficient method that augments LLM inference by adding extra decoding heads to predict multiple subsequent tokens in parallel. Using a tree-based attention mechanism, Medusa constructs multiple candidate continuations and verifies them simultaneously in each decoding step. By leveraging parallel processing, Medusa substantially reduces the number of decoding steps required. We present two levels of fine-tuning procedures for Medusa to meet the needs of different use cases: Medusa-1: Medusa is directly fine-tuned on top of a frozen backbone LLM, enabling lossless inference acceleration. Medusa-2: Medusa is fine-tuned together with the backbone LLM, enabling better prediction accuracy of Medusa heads and higher speedup but needing a special training recipe that preserves the backbone model's capabilities. Moreover, we propose several extensions that improve or expand the utility of Medusa, including a self-distillation to handle situations where no training data is available and a typical acceptance scheme to boost the acceptance rate while maintaining generation quality. We evaluate Medusa on models of various sizes and training procedures. Our experiments demonstrate that Medusa-1 can achieve over 2.2x speedup without compromising generation quality, while Medusa-2 further improves the speedup to 2.3-3.6x.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Medusa, a framework to accelerate autoregressive LLM inference by attaching multiple lightweight decoding heads that predict several future tokens in parallel. A tree-based attention mechanism constructs and verifies multiple candidate continuations in a single forward pass per decoding step. Two fine-tuning regimes are defined: Medusa-1 trains only the heads on a frozen backbone (claimed to be lossless), while Medusa-2 jointly optimizes heads and backbone under a special recipe that preserves original capabilities. Extensions include self-distillation and a typical acceptance scheme. Experiments on models of varying sizes report speedups exceeding 2.2× for Medusa-1 and 2.3–3.6× for Medusa-2 while maintaining generation quality.

Significance. If the reported speedups are robust, Medusa offers a practical alternative to speculative decoding that avoids maintaining a separate draft model, lowering implementation complexity for practitioners. The ability to achieve >2× acceleration with only head fine-tuning (Medusa-1) or modest joint training (Medusa-2) would be valuable for latency-sensitive deployments across model scales.

major comments (3)
  1. [Abstract / Experiments] Abstract and Experiments section: the central claims of >2.2× (Medusa-1) and 2.3–3.6× (Medusa-2) speedups are presented without tabulated or plotted acceptance rates per decoding step, their variance across tasks, or error bars. Because net speedup is determined by the product of acceptance probability and tree branching factor minus the overhead of extra heads and tree attention, these quantities are load-bearing and must be shown explicitly to substantiate the acceleration numbers.
  2. [Method] Method description of Medusa-2: the 'special training recipe' that jointly fine-tunes the backbone while preserving its original distribution is referenced but not specified with loss terms, weighting schedules, or regularization details. Without these, it is impossible to assess whether the reported higher speedups come at the cost of distribution shift that would only appear on held-out or longer-context data.
  3. [Experiments] Experiments: direct head-to-head comparisons against established speculative decoding baselines (e.g., draft-model methods) are absent. Speedup and quality metrics should be reported on identical prompts and hardware so that the claimed simplicity advantage can be weighed against any difference in achieved tokens-per-second.
minor comments (2)
  1. [Figures] Figure captions and tree diagrams should explicitly label the branching factor and depth used in the reported runs so readers can reproduce the verification cost.
  2. [Extensions] The self-distillation procedure is mentioned as an extension but lacks a concise algorithmic description or pseudocode.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and will incorporate the suggested clarifications and additions in the revised manuscript to strengthen the presentation of our results.

read point-by-point responses
  1. Referee: [Abstract / Experiments] Abstract and Experiments section: the central claims of >2.2× (Medusa-1) and 2.3–3.6× (Medusa-2) speedups are presented without tabulated or plotted acceptance rates per decoding step, their variance across tasks, or error bars. Because net speedup is determined by the product of acceptance probability and tree branching factor minus the overhead of extra heads and tree attention, these quantities are load-bearing and must be shown explicitly to substantiate the acceleration numbers.

    Authors: We agree that explicit reporting of acceptance rates, their variance, and error bars would better substantiate the speedup claims by directly illustrating the contribution of acceptance probability and branching factor. In the revised manuscript, we will add a dedicated table in the Experiments section that reports per-step acceptance rates for Medusa-1 and Medusa-2 across all evaluated tasks, including standard deviations to capture variance. We will also augment the speedup plots with error bars derived from multiple runs. These additions will explicitly connect the measured end-to-end speedups to the underlying acceptance and tree statistics while accounting for overhead. revision: yes

  2. Referee: [Method] Method description of Medusa-2: the 'special training recipe' that jointly fine-tunes the backbone while preserving its original distribution is referenced but not specified with loss terms, weighting schedules, or regularization details. Without these, it is impossible to assess whether the reported higher speedups come at the cost of distribution shift that would only appear on held-out or longer-context data.

    Authors: We acknowledge that the description of the Medusa-2 training procedure lacks sufficient implementation detail. In the revised Method section, we will explicitly define the composite loss (standard next-token prediction loss on the backbone combined with a weighted Medusa-head prediction loss), the weighting schedule (e.g., linear ramp-up of the head-loss coefficient from 0.1 to 0.5 over the first 10% of training steps), and the regularization term (KL divergence between the fine-tuned backbone outputs and the original model outputs on a held-out calibration set). These additions will allow readers to verify that distribution shift is controlled and that the higher speedups do not compromise generalization on held-out or longer-context data. revision: yes

  3. Referee: [Experiments] Experiments: direct head-to-head comparisons against established speculative decoding baselines (e.g., draft-model methods) are absent. Speedup and quality metrics should be reported on identical prompts and hardware so that the claimed simplicity advantage can be weighed against any difference in achieved tokens-per-second.

    Authors: We agree that direct, apples-to-apples comparisons would strengthen the evaluation. In the revised Experiments section, we will add a new subsection presenting head-to-head results against representative speculative decoding baselines (e.g., the draft-model method of Leviathan et al.) using the exact same prompt sets, model sizes, and hardware configuration. We will report tokens-per-second, acceptance rates, and generation quality metrics side-by-side, enabling a clear assessment of the simplicity versus performance trade-off. revision: yes
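The composite objective described in response 2 can be sketched numerically. Everything here follows the simulated rebuttal's own description, with illustrative weights; the function and symbol names are invented for this sketch and do not come from the paper:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, target):
    # Negative log-likelihood of the target token under the logits.
    return float(-np.log(softmax(logits)[target]))

def kl_divergence(p_logits, q_logits):
    # KL(p || q) between the two softmax distributions.
    p, q = softmax(p_logits), softmax(q_logits)
    return float(np.sum(p * (np.log(p) - np.log(q))))

def medusa2_loss(backbone_logits, head_logits_list, targets,
                 frozen_logits, head_weight=0.2, kl_weight=0.1):
    # Composite objective as described in the rebuttal (weights are
    # illustrative assumptions): backbone next-token loss, weighted
    # Medusa-head losses, and a KL term tying the fine-tuned backbone
    # to the frozen original model's distribution.
    # targets[0] is the next token; targets[k + 1] is head k's target.
    loss = cross_entropy(backbone_logits, targets[0])
    for k, head_logits in enumerate(head_logits_list):
        loss += head_weight * cross_entropy(head_logits, targets[k + 1])
    loss += kl_weight * kl_divergence(frozen_logits, backbone_logits)
    return loss
```

The KL term vanishes when the fine-tuned backbone matches the original model exactly, which is the mechanism the rebuttal proposes for controlling distribution shift.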

Circularity Check

0 steps flagged

No circularity: speedups are empirical measurements on held-out data

full rationale

The paper's claims rest on measured inference speedups from adding and training extra decoding heads plus tree verification, evaluated against standard autoregressive baselines on held-out tasks. No equations reduce the reported gains (e.g., 2.2x or 2.3-3.6x) to quantities defined inside the paper by construction, and no self-citations or uniqueness theorems are invoked to force the central result. The training recipes and acceptance-rate improvements are presented as standard fine-tuning procedures whose net benefit is quantified externally rather than assumed tautologically.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The framework rests on the assumption that additional heads can learn useful multi-token predictions from standard next-token data and that tree verification overhead remains sub-linear in practice.

free parameters (2)
  • number of Medusa heads
    Chosen empirically to trade off prediction coverage against added compute; typical values implied by reported speedups.
  • tree depth and branching factor
    Hyperparameters that determine how many candidate sequences are generated and verified per step.
axioms (1)
  • domain assumption: The backbone LLM’s hidden states remain sufficiently informative for the added heads to predict future tokens accurately.
    Invoked when claiming that fine-tuning only the heads (Medusa-1) suffices for lossless acceleration.
invented entities (1)
  • Medusa decoding heads · no independent evidence
    purpose: Predict multiple future tokens in parallel from the same backbone representation.
    New architectural component introduced by the paper.
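The two free parameters jointly set the per-step verification cost. For a complete tree (an idealization; the paper's tree need not be complete), the candidate count grows geometrically:

```python
def num_tree_nodes(branching, depth):
    # Nodes in a complete candidate tree, excluding the root prompt:
    # sum of branching^d for d = 1 .. depth.
    return sum(branching ** d for d in range(1, depth + 1))

# branching 3, depth 4 -> 3 + 9 + 27 + 81 = 120 candidate positions
# that tree attention must score in a single forward pass.
assert num_tree_nodes(3, 4) == 120
```

This is the quantity that must stay small enough for verification overhead to remain sub-linear in practice, as the ledger's assumption requires.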

pith-pipeline@v0.9.0 · 5630 in / 1361 out tokens · 28653 ms · 2026-05-13T10:32:04.406047+00:00 · methodology

discussion (0)


Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding

    cs.LG 2026-05 unverdicted novelty 7.0

    SlimSpec replaces the standard LM-head in draft models with a low-rank version to deliver 4-5x faster speculative decoding while preserving full vocabulary and competitive acceptance rates.

  2. BubbleSpec: Turning Long-Tail Bubbles into Speculative Rollout Drafts for Synchronous Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    BubbleSpec exploits long-tail bubbles in synchronous RL by using faster ranks' idle time to pre-generate rollout drafts for speculative decoding, reducing steps by 50% and raising throughput up to 1.8x while preservin...

  3. FASER: Fine-Grained Phase Management for Speculative Decoding in Dynamic LLM Serving

    cs.DC 2026-04 unverdicted novelty 7.0

    FASER delivers up to 53% higher throughput and 1.92x lower latency in dynamic LLM serving by adjusting speculative lengths per request, early pruning of rejects, and overlapping draft/verification phases via frontiers.

  4. NI Sampling: Accelerating Discrete Diffusion Sampling by Token Order Optimization

    cs.LG 2026-04 unverdicted novelty 7.0

    NI Sampling accelerates discrete diffusion language models up to 14.3 times by training a neural indicator to select which tokens to sample at each step using a trajectory-preserving objective.

  5. WISV: Wireless-Informed Semantic Verification for Distributed Speculative Decoding in Device-Edge LLM Inference

    cs.IT 2026-04 unverdicted novelty 7.0

    WISV uses a channel-aware semantic acceptance policy on hidden representations to boost accepted sequence length by up to 60.8% and cut interaction rounds by 37.3% in distributed speculative decoding, with under 1% ac...

  6. Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding

    cs.CV 2026-04 unverdicted novelty 7.0

    Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.

  7. Sequential KV Cache Compression via Probabilistic Language Tries: Beyond the Per-Vector Shannon Limit

    cs.LG 2026-04 unverdicted novelty 7.0

    Sequential KV compression via probabilistic language tries and predictive delta coding achieves 3.3-4.3 bits per token entropy, yielding up to 914x better ratios than TurboQuant even with large overhead.

  8. Multi-Stream LLMs: Unblocking Language Models with Parallel Streams of Thoughts, Inputs and Outputs

    cs.LG 2026-05 unverdicted novelty 6.0

    Language models trained on parallel streams of computation can overcome single-stream bottlenecks in autonomous agents by enabling simultaneous reading, thinking, and acting.

  9. BitLM: Unlocking Multi-Token Language Generation with Bitwise Continuous Diffusion

    cs.CL 2026-05 unverdicted novelty 6.0

    BitLM replaces per-token softmax with bitwise continuous diffusion inside causal blocks to generate multiple tokens in parallel while preserving autoregressive structure.

  10. Edit-Based Refinement for Parallel Masked Diffusion Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    ME-DLM augments parallel masked diffusion models with edit-distance-supervised refinements to raise quality on coding and math benchmarks while using far fewer diffusion steps.

  11. PARD-2: Target-Aligned Parallel Draft Model for Dual-Mode Speculative Decoding

    cs.CL 2026-05 unverdicted novelty 6.0

    PARD-2 uses Confidence-Adaptive Token optimization to align draft model training with acceptance length in speculative decoding, enabling dual-mode operation and up to 6.94x lossless speedup on Llama3.1-8B.

  12. CASCADE: Context-Aware Relaxation for Speculative Image Decoding

    cs.CV 2026-05 unverdicted novelty 6.0

    CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to...

  13. WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs Inference Optimization

    cs.CV 2026-05 unverdicted novelty 6.0

    WindowQuant performs window-adaptive mixed-precision KV cache quantization guided by similarity to the text prompt, with reordering to enable efficient inference in VLMs.

  14. CoVSpec: Efficient Device-Edge Co-Inference for Vision-Language Models via Speculative Decoding

    cs.AI 2026-05 unverdicted novelty 6.0

    CoVSpec achieves up to 2.21x higher throughput and over 96% lower communication overhead for device-edge VLM inference via training-free visual token reduction, adaptive drafting, and decoupled parallel verification-c...

  15. Making Every Verified Token Count: Adaptive Verification for MoE Speculative Decoding

    cs.CL 2026-05 unverdicted novelty 6.0

    EVICT adaptively truncates draft trees in MoE speculative decoding by combining drafter signals with profiled costs to retain only cost-effective prefixes, delivering up to 2.35x speedup over autoregressive decoding.

  16. Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving

    cs.LG 2026-04 unverdicted novelty 6.0

    SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.

  17. RACER: Retrieval-Augmented Contextual Rapid Speculative Decoding

    cs.CL 2026-04 unverdicted novelty 6.0

    RACER unifies retrieval of exact matching patterns with logit-driven cues to produce better speculative drafts, achieving more than 2x speedup over autoregressive decoding and outperforming prior training-free specula...

  18. SMART: When is it Actually Worth Expanding a Speculative Tree?

    cs.DC 2026-04 unverdicted novelty 6.0

    SMART uses marginal benefit-cost analysis to dynamically build efficient speculative trees, achieving 15-20% additional speedup in LLM and MLLM inference.

  19. MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents

    cs.CL 2025-06 unverdicted novelty 6.0

    MEM1 uses end-to-end RL to learn constant-memory agents that update a shared state for memory and reasoning, delivering 3.5x better performance and 3.7x lower memory use than larger baselines on long-horizon QA and sh...

  20. SnapKV: LLM Knows What You are Looking for Before Generation

    cs.CL 2024-04 conditional novelty 6.0

    SnapKV selects clustered important KV positions per attention head from an observation window at the prompt end, yielding 3.6x faster generation and 8.2x better memory efficiency on 16K-token inputs with comparable pe...

  21. Cut Your Losses! Learning to Prune Paths Early for Efficient Parallel Reasoning

    cs.CL 2026-04 unverdicted novelty 5.0

    STOP is a new learnable internal path-pruning technique that improves efficiency and accuracy of parallel reasoning in LRMs under fixed compute budgets.

  22. Acceptance Dynamics Across Cognitive Domains in Speculative Decoding

    cs.AI 2026-04 unverdicted novelty 5.0

    Empirical measurements across four NLP domains show task type is a stronger predictor of speculative decoding acceptance than tree depth, with chat uniquely achieving expected accepted length over 1 token per step.

  23. DMax: Aggressive Parallel Decoding for dLLMs

    cs.LG 2026-04 unverdicted novelty 5.0

    DMax enables faster parallel decoding in diffusion language models by using on-policy training to recover from errors and soft embedding interpolations for iterative revision, boosting tokens per forward pass roughly ...

  24. Unlocking the Edge deployment and ondevice acceleration of multi-LoRA enabled one-for-all foundational LLM

    cs.DC 2026-04 unverdicted novelty 4.0

    A framework combines multi-LoRA runtime switching, multi-stream stylistic decoding, and Dynamic Self-Speculative Decoding with INT4 quantization to achieve 4-6x memory and latency gains for on-device inference of a on...

Reference graph

Works this paper leans on

289 extracted references · 289 canonical work pages · cited by 24 Pith papers · 28 internal anchors

  1. [2]

    Axolotl. Axolotl . https://github.com/OpenAccess-AI-Collective/axolotl, 2023

  2. [3]

    Mirostat: A neural text decoding algorithm that directly controls perplexity

    Basu, S., Ramachandran, G. S., Keskar, N. S., and Varshney, L. R. Mirostat: A neural text decoding algorithm that directly controls perplexity. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=W1G1JZEIy5_

  3. [4]

    Language models are few-shot learners

    Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877--1901, 2020

  4. [5]

    Accelerating Large Language Model Decoding with Speculative Sampling

    Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., and Jumper, J. Accelerating large language model decoding with speculative sampling. February 2023. doi:10.48550/ARXIV.2302.01318

  5. [6]

    Dissecting batching effects in gpt inference

    Chen, L. Dissecting batching effects in gpt inference. https://le.qun.ch/en/blog/2023/05/13/transformer-batching/, 2023. Blog

  6. [7]

    Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality

    Chiang, W.-L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J. E., Stoica, I., and Xing, E. P. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/

  7. [8]

    PaLM: Scaling Language Modeling with Pathways

    Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022

  8. [9]

    8-bit optimizers via block-wise quantization

    Dettmers, T., Lewis, M., Shleifer, S., and Zettlemoyer, L. 8-bit optimizers via block-wise quantization. International Conference on Learning Representations, 2021

  9. [12]

    Enhancing chat language models by scaling high-quality instructional conversations, 2023

    Ding, N., Chen, Y., Xu, B., Qin, Y., Zheng, Z., Hu, S., Liu, Z., Sun, M., and Zhou, B. Enhancing chat language models by scaling high-quality instructional conversations, 2023

  10. [13]

    Dubois, Y., Li, X., Taori, R., Zhang, T., Gulrajani, I., Ba, J., Guestrin, C., Liang, P., and Hashimoto, T. B. Alpacafarm: A simulation framework for methods that learn from human feedback, 2023

  11. [14]

    Sigmoid-weighted linear units for neural network function approximation in reinforcement learning

    Elfwing, S., Uchibe, E., and Doya, K. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 2017. doi:10.1016/j.neunet.2017.12.012

  12. [15]

    Hierarchical neural story generation

    Fan, A., Lewis, M., and Dauphin, Y. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2018. doi:10.18653/v1/p18-1082

  13. [17]

    Palm 2 technical report, 2023

    Google. Palm 2 technical report, 2023. URL https://ai.google/static/documents/palm2techreport.pdf

  14. [18]

    Truncation sampling as language model desmoothing

    Hewitt, J., Manning, C. D., and Liang, P. Truncation sampling as language model desmoothing. October 2022. doi:10.48550/ARXIV.2210.15191

  15. [19]

    Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. d. L., Hendricks, L. A., Welbl, J., Clark, A., et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022

  16. [20]

    The curious case of neural text degeneration

    Holtzman, A., Buys, J., Du, L., Forbes, M., and Choi, Y. The curious case of neural text degeneration. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rygGQyrFvH

  17. [21]

    LoRA: Low-rank adaptation of large language models

    Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., and Chen, W. Lora: Low-rank adaptation of large language models. ICLR, 2021

  18. [22]

    Assisted generation: a new direction toward low-latency text generation, 2023

    Gante, J. Assisted generation: a new direction toward low-latency text generation, 2023. URL https://huggingface.co/blog/assisted-generation

  19. [24]

    Sequence-level knowledge distillation

    Kim, Y. and Rush, A. M. Sequence-level knowledge distillation. EMNLP, 2016

  20. [25]

    Fine-tuning can distort pretrained features and underperform out-of-distribution

    Kumar, A., Raghunathan, A., Jones, R., Ma, T., and Liang, P. Fine-tuning can distort pretrained features and underperform out-of-distribution. International Conference on Learning Representations, 2022

  21. [26]

    Efficient memory management for large language model serving with PagedAttention

    Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

  22. [27]

    Fast inference from transformers via speculative decoding

    Leviathan, Y., Kalman, M., and Matias, Y. Fast inference from transformers via speculative decoding. November 2022. doi:10.48550/ARXIV.2211.17192

  23. [28]

    Li, X., Zhang, T., Dubois, Y., Taori, R., Gulrajani, I., Guestrin, C., Liang, P., and Hashimoto, T. B. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 2023

  24. [30]

    On the probability-quality paradox in language generation

    Meister, C., Wiher, G., Pimentel, T., and Cotterell, R. On the probability-quality paradox in language generation. March 2022. doi:10.48550/ARXIV.2203.17217

  25. [31]

    Locally typical sampling

    Meister, C., Pimentel, T., Wiher, G., and Cotterell, R. Locally typical sampling. Transactions of the Association for Computational Linguistics, 11:102--121, 2023

  26. [33]

    Nvidia a100 tensor core gpu

    NVIDIA. Nvidia a100 tensor core gpu

  27. [34]

    Gpt-4 technical report, 2023

    OpenAI. Gpt-4 technical report, 2023

  28. [35]

    Training language models to follow instructions with human feedback

    Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022

  29. [36]

    Tiny vicuna 1b

    Pan, J. Tiny vicuna 1b. https://huggingface.co/Jiayi-Pan/Tiny-Vicuna-1B, 2023

  30. [37]

    MAUVE : Measuring the gap between neural text and human text using divergence frontiers

    Pillutla, K., Swayamdipta, S., Zellers, R., Thickstun, J., Welleck, S., Choi, Y., and Harchaoui, Z. MAUVE : Measuring the gap between neural text and human text using divergence frontiers. In Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=Tqx7nJp7PR

  31. [38]

    Efficiently scaling transformer inference

    Pope, R., Douglas, S., Chowdhery, A., Devlin, J., Bradbury, J., Levskaya, A., Heek, J., Xiao, K., Agrawal, S., and Dean, J. Efficiently scaling transformer inference. November 2022. doi:10.48550/ARXIV.2211.05102

  32. [39]

    ShareGPT

    ShareGPT. ShareGPT . https://huggingface.co/datasets/Aeala/ShareGPT_Vicuna_unfiltered, 2023

  33. [42]

    Blockwise parallel decoding for deep autoregressive models

    Stern, M., Shazeer, N. M., and Uszkoreit, J. Blockwise parallel decoding for deep autoregressive models. Neural Information Processing Systems, 2018

  34. [44]

    Zephyr: Direct distillation of LM alignment

    Tunstall, L., Beeching, E., Lambert, N., Rajani, N., Rasul, K., Belkada, Y., Huang, S., von Werra, L., Fourrier, C., Habib, N., Sarrazin, N., Sanseviero, O., Rush, A. M., and Wolf, T. Zephyr: Direct distillation of lm alignment, 2023

  35. [45]

    Speculative decoding: Lossless speedup of autoregressive translation, 2023

    Xia, H., Ge, T., Chen, S.-Q., Wei, F., and Sui, Z. Speculative decoding: Lossless speedup of autoregressive translation, 2023. URL https://openreview.net/forum?id=H-VlwsYvVi

  36. [46]

    Smoothquant: Accurate and efficient post-training quantization for large language models

    Xiao, G., Lin, J., Seznec, M., Wu, H., Demouth, J., and Han, S. Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pp. 38087--38099. PMLR, 2023a

  37. [47]

    A survey on non-autoregressive generation for neural machine translation and beyond

    Xiao, Y., Wu, L., Guo, J., Li, J., Zhang, M., Qin, T., and Liu, T.-y. A survey on non-autoregressive generation for neural machine translation and beyond. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023b

  38. [48]

    Do transformers really perform badly for graph representation?

    Ying, C., Cai, T., Luo, S., Zheng, S., Ke, G., He, D., Shen, Y., and Liu, T.-Y. Do transformers really perform badly for graph representation? Advances in Neural Information Processing Systems, 34:28877--28888, 2021

  39. [49]

    Tinyllama: An open-source small language model

    Zhang, P., Zeng, G., Wang, T., and Lu, W. Tinyllama: An open-source small language model, 2024

  40. [50]

    OPT: Open Pre-trained Transformer Language Models

    Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X. V., et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022

  41. [52]

    Judging LLM-as-a-judge with MT-Bench and Chatbot Arena

    Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., and Stoica, I. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, 2023

  42. [53]

    Li, X., Zhang, T., Dubois, Y., Taori, R., Gulrajani, I., Guestrin, C., Liang, P., and Hashimoto, T. B. GitHub repository, 2023

  43. [54]

    Enhancing Chat Language Models by Scaling High-quality Instructional Conversations

    Enhancing Chat Language Models by Scaling High-quality Instructional Conversations, 2023

  44. [55]

    8-bit Optimizers via Block-wise Quantization

    8-bit Optimizers via Block-wise Quantization. In International Conference on Learning Representations, 2022

  45. [56]

    LoRA: Low-Rank Adaptation of Large Language Models

    LoRA: Low-Rank Adaptation of Large Language Models. In ICLR, 2022

  46. [57]

    Sequence-Level Knowledge Distillation

    Sequence-Level Knowledge Distillation, 2016

  48. [59]

    Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution

    Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution. In International Conference on Learning Representations, 2022

  49. [60]

    Efficient Memory Management for Large Language Model Serving with PagedAttention

    Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

  50. [61]

    DistillSpec: Improving Speculative Decoding via Knowledge Distillation

    DistillSpec: Improving Speculative Decoding via Knowledge Distillation, 2023

  51. [62]

    Online Speculative Decoding

    Online Speculative Decoding, 2023

  52. [63]

    The Synergy of Speculative Decoding and Batching in Serving Large Language Models

    The Synergy of Speculative Decoding and Batching in Serving Large Language Models. arXiv preprint arXiv:2310.18813, 2023

  53. [64]

    REST: Retrieval-Based Speculative Decoding

    REST: Retrieval-Based Speculative Decoding, 2023

  54. [65]

    Breaking the Sequential Dependency of LLM Inference Using Lookahead Decoding

    Fu, Y., Bailis, P., Stoica, I., and Zhang, H. Breaking the Sequential Dependency of LLM Inference Using Lookahead Decoding, 2023

  55. [66]

    AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback

    AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback, 2023

  57. [68]

    Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality

    Chiang, W.-L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J. E., Stoica, I., and Xing, E. P. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality, 2023

  60. [71]

    The Curious Case of Neural Text Degeneration

    The Curious Case of Neural Text Degeneration. In International Conference on Learning Representations, 2020

  61. [72]

    Language Model Evaluation Beyond Perplexity

    Language Model Evaluation Beyond Perplexity. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021

  62. [73]

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv preprint arXiv:2210.17323, 2022

  64. [75]

    AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

    AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. arXiv preprint arXiv:2306.00978, 2023

  66. [77]

    LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

    LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. arXiv preprint arXiv:2208.07339, 2022

  67. [78]

    QuIP: 2-Bit Quantization of Large Language Models With Guarantees

    QuIP: 2-Bit Quantization of Large Language Models With Guarantees. arXiv preprint arXiv:2307.13304, 2023

  68. [79]

    H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

    H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. arXiv preprint arXiv:2306.14048, 2023

  69. [80]

    GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. arXiv preprint arXiv:2305.13245, 2023

  70. [81]

    SqueezeLLM: Dense-and-Sparse Quantization

    SqueezeLLM: Dense-and-Sparse Quantization. arXiv preprint arXiv:2306.07629, 2023

  71. [82]

    Fast Transformer Decoding: One Write-Head is All You Need

    Fast Transformer Decoding: One Write-Head is All You Need. arXiv preprint arXiv:1911.02150, 2019

  72. [83]

    QLoRA: Efficient Finetuning of Quantized LLMs

    QLoRA: Efficient Finetuning of Quantized LLMs. arXiv preprint arXiv:2305.14314, 2023

  73. [84]

    Accelerating LLM Inference with Staged Speculative Decoding

    Spector, B., and Re, C. Accelerating LLM Inference with Staged Speculative Decoding. arXiv preprint arXiv:2308.04623, 2023

  74. [85]

    SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification

    SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification. arXiv preprint arXiv:2305.09781, 2023

  75. [86]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv preprint arXiv:2307.09288, 2023

  76. [87]

    GPTCache

    GPTCache, 2023

  77. [88]

    Language Models can Solve Computer Tasks

    Language Models can Solve Computer Tasks, 2023. doi:10.48550/arXiv.2303.17491

  78. [89]

    Voyager: An Open-Ended Embodied Agent with Large Language Models

    Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv preprint arXiv:2305.16291, 2023

  79. [90]

    Chen, L., 2023

  80. [91]

    Tool Learning with Foundation Models

    Tool Learning with Foundation Models. arXiv preprint arXiv:2304.08354, 2023

Showing first 80 references.