Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Pith reviewed 2026-05-13 10:32 UTC · model grok-4.3
The pith
By adding multiple decoding heads to an LLM, Medusa predicts several future tokens in parallel and verifies them together in one step, reducing sequential decoding steps.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Medusa augments an LLM with additional decoding heads that output predictions for multiple subsequent tokens. These predictions form a tree of candidate continuations that are verified in parallel during each decoding step through a tree-based attention mask. This parallel verification replaces several sequential forward passes, yielding over 2.2x speedup when only the heads are fine-tuned and 2.3-3.6x speedup when the backbone is also updated, all while preserving the original generation quality.
What carries the argument
Extra decoding heads that predict logits for future positions, combined with a tree-structured attention mask that lets the model score multiple candidate token sequences in a single forward pass.
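To make the machinery concrete, here is a minimal sketch of both pieces: a head that maps the backbone's final hidden state to logits for a later position, and a tree-structured mask in which each candidate token attends only to itself and its ancestors, so sibling continuations are scored independently within one forward pass. The module layout, names, and shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MedusaStyleHead(nn.Module):
    """One extra decoding head: a small residual block over the backbone's
    final hidden state, then a projection to vocabulary logits. Head k would
    be trained to predict the token k positions beyond the one the original
    LM head predicts. (Hypothetical architecture, for illustration only.)"""
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return self.lm_head(hidden + F.silu(self.proj(hidden)))

def tree_attention_mask(parent):
    """Boolean mask over candidate-tree positions. parent[i] is the parent
    index of node i (-1 for a root); node i may attend to itself and its
    ancestors only, so distinct branches never see each other."""
    n = len(parent)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:
            mask[i, j] = True
            j = parent[j]
    return mask

# A root with two children, the first of which has a child of its own.
print(tree_attention_mask([-1, 0, 0, 1]).int())
```

During verification this tree mask is appended to the ordinary causal mask over the already-accepted prefix, which is what lets every node of the candidate tree be scored in a single backbone call.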
If this is right
- The total number of sequential model calls drops because multiple tokens are accepted per step.
- No separate draft model needs to be trained or maintained, unlike classic speculative decoding.
- Medusa-1 keeps the backbone unchanged and still delivers over 2.2x speedup with unchanged output quality.
- Medusa-2 reaches 2.3-3.6x speedup by jointly fine-tuning heads and backbone under a special training recipe.
- Self-distillation allows the heads to be trained without external data while keeping quality.
Where Pith is reading between the lines
- The approach could lower serving latency for interactive applications where each new token matters.
- Similar head additions might speed up other autoregressive generators such as image or audio models.
- Longer contexts may show different gains because prediction accuracy can vary with sequence length.
- Pairing Medusa with quantization or KV-cache compression would likely multiply the observed speedups.
Load-bearing premise
The extra heads must generate predictions accurate enough that the tree verification accepts more than one token per step on average, outweighing the added computation cost.
What would settle it
Measure the average number of accepted tokens per decoding step on a fixed benchmark; if the effective rate stays near 1 after accounting for head overhead, the net speedup vanishes.
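A back-of-the-envelope version of that check, assuming per-step accepted-token counts and step timings have already been logged (function and variable names here are illustrative):

```python
def effective_speedup(accepted_per_step, medusa_step_seconds, baseline_step_seconds):
    """Net speedup ~ (mean accepted tokens per step) x (baseline step time /
    Medusa step time). If acceptance stays near 1 while each Medusa step is
    slower (extra heads plus tree attention), the ratio collapses toward 1."""
    mean_accepted = sum(accepted_per_step) / len(accepted_per_step)
    return mean_accepted * baseline_step_seconds / medusa_step_seconds

# 2.6 tokens accepted per step on average, each Medusa step 10% slower: ~2.4x.
print(effective_speedup([3, 2, 4, 2, 2], 0.022, 0.020))
# Acceptance barely above 1 wipes out almost all of the gain: ~1.1x.
print(effective_speedup([1, 1, 2, 1, 1], 0.022, 0.020))
```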
read the original abstract
Large Language Models (LLMs) employ auto-regressive decoding that requires sequential computation, with each step reliant on the previous one's output. This creates a bottleneck as each step necessitates moving the full model parameters from High-Bandwidth Memory (HBM) to the accelerator's cache. While methods such as speculative decoding have been suggested to address this issue, their implementation is impeded by the challenges associated with acquiring and maintaining a separate draft model. In this paper, we present Medusa, an efficient method that augments LLM inference by adding extra decoding heads to predict multiple subsequent tokens in parallel. Using a tree-based attention mechanism, Medusa constructs multiple candidate continuations and verifies them simultaneously in each decoding step. By leveraging parallel processing, Medusa substantially reduces the number of decoding steps required. We present two levels of fine-tuning procedures for Medusa to meet the needs of different use cases: Medusa-1: Medusa is directly fine-tuned on top of a frozen backbone LLM, enabling lossless inference acceleration. Medusa-2: Medusa is fine-tuned together with the backbone LLM, enabling better prediction accuracy of Medusa heads and higher speedup but needing a special training recipe that preserves the backbone model's capabilities. Moreover, we propose several extensions that improve or expand the utility of Medusa, including a self-distillation to handle situations where no training data is available and a typical acceptance scheme to boost the acceptance rate while maintaining generation quality. We evaluate Medusa on models of various sizes and training procedures. Our experiments demonstrate that Medusa-1 can achieve over 2.2x speedup without compromising generation quality, while Medusa-2 further improves the speedup to 2.3-3.6x.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Medusa, a framework to accelerate autoregressive LLM inference by attaching multiple lightweight decoding heads that predict several future tokens in parallel. A tree-based attention mechanism constructs and verifies multiple candidate continuations in a single forward pass per decoding step. Two fine-tuning regimes are defined: Medusa-1 trains only the heads on a frozen backbone (claimed to be lossless), while Medusa-2 jointly optimizes heads and backbone under a special recipe that preserves original capabilities. Extensions include self-distillation and a typical acceptance scheme. Experiments on models of varying sizes report speedups exceeding 2.2× for Medusa-1 and 2.3–3.6× for Medusa-2 while maintaining generation quality.
Significance. If the reported speedups are robust, Medusa offers a practical alternative to speculative decoding that avoids maintaining a separate draft model, lowering implementation complexity for practitioners. The ability to achieve >2× acceleration with only head fine-tuning (Medusa-1) or modest joint training (Medusa-2) would be valuable for latency-sensitive deployments across model scales.
major comments (3)
- [Abstract / Experiments] Abstract and Experiments section: the central claims of >2.2× (Medusa-1) and 2.3–3.6× (Medusa-2) speedups are presented without tabulated or plotted acceptance rates per decoding step, their variance across tasks, or error bars. Because net speedup is determined by the product of acceptance probability and tree branching factor minus the overhead of extra heads and tree attention, these quantities are load-bearing and must be shown explicitly to substantiate the acceleration numbers.
- [Method] Method description of Medusa-2: the 'special training recipe' that jointly fine-tunes the backbone while preserving its original distribution is referenced but not specified with loss terms, weighting schedules, or regularization details. Without these, it is impossible to assess whether the reported higher speedups come at the cost of distribution shift that would only appear on held-out or longer-context data.
- [Experiments] Experiments: direct head-to-head comparisons against established speculative decoding baselines (e.g., draft-model methods) are absent. Speedup and quality metrics should be reported on identical prompts and hardware so that the claimed simplicity advantage can be weighed against any difference in achieved tokens-per-second.
minor comments (2)
- [Figures] Figure captions and tree diagrams should explicitly label the branching factor and depth used in the reported runs so readers can reproduce the verification cost.
- [Extensions] The self-distillation procedure is mentioned as an extension but lacks a concise algorithmic description or pseudocode.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment below and will incorporate the suggested clarifications and additions in the revised manuscript to strengthen the presentation of our results.
read point-by-point responses
-
Referee: [Abstract / Experiments] Abstract and Experiments section: the central claims of >2.2× (Medusa-1) and 2.3–3.6× (Medusa-2) speedups are presented without tabulated or plotted acceptance rates per decoding step, their variance across tasks, or error bars. Because net speedup is determined by the product of acceptance probability and tree branching factor minus the overhead of extra heads and tree attention, these quantities are load-bearing and must be shown explicitly to substantiate the acceleration numbers.
Authors: We agree that explicit reporting of acceptance rates, their variance, and error bars would better substantiate the speedup claims by directly illustrating the contribution of acceptance probability and branching factor. In the revised manuscript, we will add a dedicated table in the Experiments section that reports per-step acceptance rates for Medusa-1 and Medusa-2 across all evaluated tasks, including standard deviations to capture variance. We will also augment the speedup plots with error bars derived from multiple runs. These additions will explicitly connect the measured end-to-end speedups to the underlying acceptance and tree statistics while accounting for overhead. revision: yes
-
Referee: [Method] Method description of Medusa-2: the 'special training recipe' that jointly fine-tunes the backbone while preserving its original distribution is referenced but not specified with loss terms, weighting schedules, or regularization details. Without these, it is impossible to assess whether the reported higher speedups come at the cost of distribution shift that would only appear on held-out or longer-context data.
Authors: We acknowledge that the description of the Medusa-2 training procedure lacks sufficient implementation detail. In the revised Method section, we will explicitly define the composite loss (standard next-token prediction loss on the backbone combined with a weighted Medusa-head prediction loss), the weighting schedule (e.g., linear ramp-up of the head-loss coefficient from 0.1 to 0.5 over the first 10% of training steps), and the regularization term (KL divergence between the fine-tuned backbone outputs and the original model outputs on a held-out calibration set). These additions will allow readers to verify that distribution shift is controlled and that the higher speedups do not compromise generalization on held-out or longer-context data. revision: yes
-
Referee: [Experiments] Experiments: direct head-to-head comparisons against established speculative decoding baselines (e.g., draft-model methods) are absent. Speedup and quality metrics should be reported on identical prompts and hardware so that the claimed simplicity advantage can be weighed against any difference in achieved tokens-per-second.
Authors: We agree that direct, apples-to-apples comparisons would strengthen the evaluation. In the revised Experiments section, we will add a new subsection presenting head-to-head results against representative speculative decoding baselines (e.g., the draft-model method of Leviathan et al.) using the exact same prompt sets, model sizes, and hardware configuration. We will report tokens-per-second, acceptance rates, and generation quality metrics side-by-side, enabling a clear assessment of the simplicity versus performance trade-off. revision: yes
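Two sketches that would make the promised revisions concrete. First, reading the Medusa-2 response above literally, the composite objective could take roughly this form; every coefficient, name, and shape below is an illustrative assumption, not the paper's exact recipe.

```python
import torch.nn.functional as F

def medusa2_loss(backbone_logits, head_logits, frozen_logits, labels,
                 step, total_steps, lam_start=0.1, lam_end=0.5, kl_weight=0.1):
    """Composite objective: next-token loss on the backbone, a ramped
    Medusa-head loss, and a KL term toward the frozen original model.
    backbone_logits, frozen_logits: (batch, seq, vocab); head_logits: one
    (batch, seq, vocab) tensor per head, head k supervised k+1 tokens past
    the ordinary next token. Illustrative only."""
    vocab = backbone_logits.size(-1)
    # Standard next-token prediction loss on the backbone.
    lm_loss = F.cross_entropy(backbone_logits[:, :-1].reshape(-1, vocab),
                              labels[:, 1:].reshape(-1))
    # Each head is supervised with labels shifted further into the future.
    head_loss = 0.0
    for k, logits_k in enumerate(head_logits):
        shift = k + 2  # head k targets the token (k + 2) positions ahead
        head_loss = head_loss + F.cross_entropy(
            logits_k[:, :-shift].reshape(-1, vocab),
            labels[:, shift:].reshape(-1))
    # Linear ramp of the head-loss coefficient over the first 10% of training.
    warmup = max(1, total_steps // 10)
    lam = lam_start + (lam_end - lam_start) * min(step, warmup) / warmup
    # KL toward the frozen original backbone to limit distribution shift.
    kl = F.kl_div(F.log_softmax(backbone_logits, dim=-1),
                  F.log_softmax(frozen_logits, dim=-1),
                  reduction="batchmean", log_target=True)
    return lm_loss + lam * head_loss + kl_weight * kl
```

Second, the head-to-head comparison the referee asks for only requires that every method run through the same harness on the same prompts and hardware; `generate` stands for any decoding strategy under test (a hypothetical callable, not an existing API).

```python
import time

def tokens_per_second(generate, prompts, max_new_tokens=256):
    """Time a generation callable on a fixed prompt set. generate(prompt,
    max_new_tokens) -> list of generated token ids. Running the baseline,
    Medusa, and a draft-model speculative decoder through this same loop
    keeps the tokens-per-second comparison apples-to-apples."""
    total_tokens = 0
    start = time.perf_counter()
    for prompt in prompts:
        total_tokens += len(generate(prompt, max_new_tokens))
    return total_tokens / (time.perf_counter() - start)
```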
Circularity Check
No circularity: speedups are empirical measurements on held-out data
full rationale
The paper's claims rest on measured inference speedups from adding and training extra decoding heads plus tree verification, evaluated against standard autoregressive baselines on held-out tasks. No equations reduce the reported gains (e.g., 2.2x or 2.3-3.6x) to quantities defined inside the paper by construction, and no self-citations or uniqueness theorems are invoked to force the central result. The training recipes and acceptance-rate improvements are presented as standard fine-tuning procedures whose net benefit is quantified externally rather than assumed tautologically.
Axiom & Free-Parameter Ledger
free parameters (2)
- number of Medusa heads
- tree depth and branching factor
axioms (1)
- domain assumption: The backbone LLM’s hidden states remain sufficiently informative for the added heads to predict future tokens accurately.
invented entities (1)
- Medusa decoding heads (no independent evidence)
Forward citations
Cited by 24 Pith papers
-
SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding
SlimSpec replaces the standard LM-head in draft models with a low-rank version to deliver 4-5x faster speculative decoding while preserving full vocabulary and competitive acceptance rates.
-
BubbleSpec: Turning Long-Tail Bubbles into Speculative Rollout Drafts for Synchronous Reinforcement Learning
BubbleSpec exploits long-tail bubbles in synchronous RL by using faster ranks' idle time to pre-generate rollout drafts for speculative decoding, reducing steps by 50% and raising throughput up to 1.8x while preservin...
-
FASER: Fine-Grained Phase Management for Speculative Decoding in Dynamic LLM Serving
FASER delivers up to 53% higher throughput and 1.92x lower latency in dynamic LLM serving by adjusting speculative lengths per request, early pruning of rejects, and overlapping draft/verification phases via frontiers.
-
NI Sampling: Accelerating Discrete Diffusion Sampling by Token Order Optimization
NI Sampling accelerates discrete diffusion language models up to 14.3 times by training a neural indicator to select which tokens to sample at each step using a trajectory-preserving objective.
-
WISV: Wireless-Informed Semantic Verification for Distributed Speculative Decoding in Device-Edge LLM Inference
WISV uses a channel-aware semantic acceptance policy on hidden representations to boost accepted sequence length by up to 60.8% and cut interaction rounds by 37.3% in distributed speculative decoding, with under 1% ac...
-
Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding
Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.
-
Sequential KV Cache Compression via Probabilistic Language Tries: Beyond the Per-Vector Shannon Limit
Sequential KV compression via probabilistic language tries and predictive delta coding achieves 3.3-4.3 bits per token entropy, yielding up to 914x better ratios than TurboQuant even with large overhead.
-
Multi-Stream LLMs: Unblocking Language Models with Parallel Streams of Thoughts, Inputs and Outputs
Language models trained on parallel streams of computation can overcome single-stream bottlenecks in autonomous agents by enabling simultaneous reading, thinking, and acting.
-
BitLM: Unlocking Multi-Token Language Generation with Bitwise Continuous Diffusion
BitLM replaces per-token softmax with bitwise continuous diffusion inside causal blocks to generate multiple tokens in parallel while preserving autoregressive structure.
-
Edit-Based Refinement for Parallel Masked Diffusion Language Models
ME-DLM augments parallel masked diffusion models with edit-distance-supervised refinements to raise quality on coding and math benchmarks while using far fewer diffusion steps.
-
PARD-2: Target-Aligned Parallel Draft Model for Dual-Mode Speculative Decoding
PARD-2 uses Confidence-Adaptive Token optimization to align draft model training with acceptance length in speculative decoding, enabling dual-mode operation and up to 6.94x lossless speedup on Llama3.1-8B.
-
CASCADE: Context-Aware Relaxation for Speculative Image Decoding
CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to...
-
WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs Inference Optimization
WindowQuant performs window-adaptive mixed-precision KV cache quantization guided by similarity to the text prompt, with reordering to enable efficient inference in VLMs.
-
CoVSpec: Efficient Device-Edge Co-Inference for Vision-Language Models via Speculative Decoding
CoVSpec achieves up to 2.21x higher throughput and over 96% lower communication overhead for device-edge VLM inference via training-free visual token reduction, adaptive drafting, and decoupled parallel verification-c...
-
Making Every Verified Token Count: Adaptive Verification for MoE Speculative Decoding
EVICT adaptively truncates draft trees in MoE speculative decoding by combining drafter signals with profiled costs to retain only cost-effective prefixes, delivering up to 2.35x speedup over autoregressive decoding.
-
Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving
SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.
-
RACER: Retrieval-Augmented Contextual Rapid Speculative Decoding
RACER unifies retrieval of exact matching patterns with logit-driven cues to produce better speculative drafts, achieving more than 2x speedup over autoregressive decoding and outperforming prior training-free specula...
-
SMART: When is it Actually Worth Expanding a Speculative Tree?
SMART uses marginal benefit-cost analysis to dynamically build efficient speculative trees, achieving 15-20% additional speedup in LLM and MLLM inference.
-
MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents
MEM1 uses end-to-end RL to learn constant-memory agents that update a shared state for memory and reasoning, delivering 3.5x better performance and 3.7x lower memory use than larger baselines on long-horizon QA and sh...
-
SnapKV: LLM Knows What You are Looking for Before Generation
SnapKV selects clustered important KV positions per attention head from an observation window at the prompt end, yielding 3.6x faster generation and 8.2x better memory efficiency on 16K-token inputs with comparable pe...
-
Cut Your Losses! Learning to Prune Paths Early for Efficient Parallel Reasoning
STOP is a new learnable internal path-pruning technique that improves efficiency and accuracy of parallel reasoning in LRMs under fixed compute budgets.
-
Acceptance Dynamics Across Cognitive Domains in Speculative Decoding
Empirical measurements across four NLP domains show task type is a stronger predictor of speculative decoding acceptance than tree depth, with chat uniquely achieving expected accepted length over 1 token per step.
-
DMax: Aggressive Parallel Decoding for dLLMs
DMax enables faster parallel decoding in diffusion language models by using on-policy training to recover from errors and soft embedding interpolations for iterative revision, boosting tokens per forward pass roughly ...
-
Unlocking the Edge deployment and ondevice acceleration of multi-LoRA enabled one-for-all foundational LLM
A framework combines multi-LoRA runtime switching, multi-stream stylistic decoding, and Dynamic Self-Speculative Decoding with INT4 quantization to achieve 4-6x memory and latency gains for on-device inference of a on...
Reference graph
Works this paper leans on
- [2] Axolotl. https://github.com/OpenAccess-AI-Collective/axolotl, 2023.
- [3] Basu, S., Ramachandran, G. S., Keskar, N. S., and Varshney, L. R. Mirostat: A neural text decoding algorithm that directly controls perplexity. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=W1G1JZEIy5_
- [4] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877-1901, 2020.
- [5] Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., and Jumper, J. Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318, 2023.
- [6] Chen, L. Dissecting batching effects in GPT inference. https://le.qun.ch/en/blog/2023/05/13/transformer-batching/, 2023. Blog.
- [7] Chiang, W.-L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J. E., Stoica, I., and Xing, E. P. Vicuna: An open-source chatbot impressing GPT-4 with 90% quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/
- [8] Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
- [9] Dettmers, T., Lewis, M., Shleifer, S., and Zettlemoyer, L. 8-bit optimizers via block-wise quantization. International Conference on Learning Representations, 2021.
- [12] Ding, N., Chen, Y., Xu, B., Qin, Y., Zheng, Z., Hu, S., Liu, Z., Sun, M., and Zhou, B. Enhancing chat language models by scaling high-quality instructional conversations, 2023.
- [13] Dubois, Y., Li, X., Taori, R., Zhang, T., Gulrajani, I., Ba, J., Guestrin, C., Liang, P., and Hashimoto, T. B. AlpacaFarm: A simulation framework for methods that learn from human feedback, 2023.
- [14] Elfwing, S., Uchibe, E., and Doya, K. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 2017. doi:10.1016/j.neunet.2017.12.012
- [15] Fan, A., Lewis, M., and Dauphin, Y. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2018. doi:10.18653/v1/p18-1082
- [17] Google. PaLM 2 technical report, 2023. URL https://ai.google/static/documents/palm2techreport.pdf
- [18] Hewitt, J., Manning, C. D., and Liang, P. Truncation sampling as language model desmoothing. October 2022. doi:10.48550/arXiv.2210.15191
- [19] Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. d. L., Hendricks, L. A., Welbl, J., Clark, A., et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
- [20] Holtzman, A., Buys, J., Du, L., Forbes, M., and Choi, Y. The curious case of neural text degeneration. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rygGQyrFvH
- [21] Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., and Chen, W. LoRA: Low-rank adaptation of large language models. ICLR, 2021.
- [22] Joao Gante. Assisted generation: a new direction toward low-latency text generation, 2023. URL https://huggingface.co/blog/assisted-generation
- [24] Kim, Y. and Rush, A. M. Sequence-level knowledge distillation. EMNLP, 2016.
- [25] Kumar, A., Raghunathan, A., Jones, R., Ma, T., and Liang, P. Fine-tuning can distort pretrained features and underperform out-of-distribution. International Conference on Learning Representations, 2022.
- [26] Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
- [27] Leviathan, Y., Kalman, M., and Matias, Y. Fast inference from transformers via speculative decoding. November 2022. doi:10.48550/arXiv.2211.17192
- [28] Li, X., Zhang, T., Dubois, Y., Taori, R., Gulrajani, I., Guestrin, C., Liang, P., and Hashimoto, T. B. AlpacaEval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 2023.
- [30] Meister, C., Wiher, G., Pimentel, T., and Cotterell, R. On the probability-quality paradox in language generation. March 2022. doi:10.48550/arXiv.2203.17217
- [31] Meister, C., Pimentel, T., Wiher, G., and Cotterell, R. Locally typical sampling. Transactions of the Association for Computational Linguistics, 11:102-121, 2023.
- [33]
- [34]
- [35] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022.
- [36] Pan, J. Tiny Vicuna 1B. https://huggingface.co/Jiayi-Pan/Tiny-Vicuna-1B, 2023.
- [37] Pillutla, K., Swayamdipta, S., Zellers, R., Thickstun, J., Welleck, S., Choi, Y., and Harchaoui, Z. MAUVE: Measuring the gap between neural text and human text using divergence frontiers. In Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=Tqx7nJp7PR
- [38] Pope, R., Douglas, S., Chowdhery, A., Devlin, J., Bradbury, J., Levskaya, A., Heek, J., Xiao, K., Agrawal, S., and Dean, J. Efficiently scaling transformer inference. November 2022. doi:10.48550/arXiv.2211.05102
- [39]
- [42] Stern, M., Shazeer, N. M., and Uszkoreit, J. Blockwise parallel decoding for deep autoregressive models. Neural Information Processing Systems, 2018.
- [44] Tunstall, L., Beeching, E., Lambert, N., Rajani, N., Rasul, K., Belkada, Y., Huang, S., von Werra, L., Fourrier, C., Habib, N., Sarrazin, N., Sanseviero, O., Rush, A. M., and Wolf, T. Zephyr: Direct distillation of LM alignment, 2023.
- [45] Xia, H., Ge, T., Chen, S.-Q., Wei, F., and Sui, Z. Speculative decoding: Lossless speedup of autoregressive translation, 2023. URL https://openreview.net/forum?id=H-VlwsYvVi
- [46] Xiao, G., Lin, J., Seznec, M., Wu, H., Demouth, J., and Han, S. SmoothQuant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pp. 38087-38099. PMLR, 2023.
- [47] Xiao, Y., Wu, L., Guo, J., Li, J., Zhang, M., Qin, T., and Liu, T.-Y. A survey on non-autoregressive generation for neural machine translation and beyond. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
- [48] Ying, C., Cai, T., Luo, S., Zheng, S., Ke, G., He, D., Shen, Y., and Liu, T.-Y. Do transformers really perform badly for graph representation? Advances in Neural Information Processing Systems, 34:28877-28888, 2021.
- [49] Zhang, P., Zeng, G., Wang, T., and Lu, W. TinyLlama: An open-source small language model, 2024.
- [50] Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X. V., et al. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
- [52] Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., and Stoica, I. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, 2023.
- [53] Li, X., Zhang, T., Dubois, Y., Taori, R., Gulrajani, I., Guestrin, C., Liang, P., and Hashimoto, T. B. GitHub repository, 2023.
- [54] Enhancing chat language models by scaling high-quality instructional conversations, 2023.
- [55] 8-bit optimizers via block-wise quantization. International Conference on Learning Representations.
- [56] LoRA: Low-rank adaptation of large language models. ICLR.
- [57]
- [58]
- [59] Fine-tuning can distort pretrained features and underperform out-of-distribution. International Conference on Learning Representations.
- [60] Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.
- [61] DistillSpec: Improving speculative decoding via knowledge distillation, 2023.
- [62]
- [63] The synergy of speculative decoding and batching in serving large language models. arXiv preprint arXiv:2310.18813.
- [64] REST: Retrieval-based speculative decoding, 2023.
- [65] Fu, Y., Bailis, P., Stoica, I., and Zhang, H. Breaking the sequential dependency of LLM inference using lookahead decoding.
- [66] AlpacaFarm: A simulation framework for methods that learn from human feedback, 2023.
- [67] Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, 2023.
- [68] Chiang, W.-L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J. E., Stoica, I., and Xing, E. P. Vicuna: An open-source chatbot impressing GPT-4 with 90% quality.
- [69] Do transformers really perform badly for graph representation? Advances in Neural Information Processing Systems.
- [70] Pillutla, K., Swayamdipta, S., Zellers, R., Thickstun, J., Welleck, S., Choi, Y., and Harchaoui, Z. MAUVE: Measuring the gap between neural text and human text using divergence frontiers, 2021.
- [71] The curious case of neural text degeneration. International Conference on Learning Representations.
- [72] Language model evaluation beyond perplexity. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers).
- [73] GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323.
- [74] A survey on non-autoregressive generation for neural machine translation and beyond. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- [75] AWQ: Activation-aware weight quantization for LLM compression and acceleration. arXiv preprint arXiv:2306.00978.
- [76] SmoothQuant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, 2023.
- [77] LLM.int8(): 8-bit matrix multiplication for transformers at scale. arXiv preprint arXiv:2208.07339.
- [78] QuIP: 2-bit quantization of large language models with guarantees. arXiv preprint arXiv:2307.13304.
- [79] H2O: Heavy-hitter oracle for efficient generative inference of large language models. arXiv preprint arXiv:2306.14048.
- [80] GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245.
- [81] SqueezeLLM: Dense-and-sparse quantization. arXiv preprint arXiv:2306.07629.
- [82] Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150.
- [83] QLoRA: Efficient finetuning of quantized LLMs. arXiv preprint arXiv:2305.14314.
- [84]
- [85] SpecInfer: Accelerating generative LLM serving with speculative inference and token tree verification. arXiv preprint arXiv:2305.09781.
- [86] Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- [87]
- [88] Language models can solve computer tasks. doi:10.48550/arXiv.2303.17491
- [89] Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291.
- [90]
- [91] Tool learning with foundation models. arXiv preprint arXiv:2304.08354.