arxiv: 2401.15077 · v3 · submitted 2024-01-26 · 💻 cs.LG · cs.CL

Recognition: 2 theorem links

· Lean Theorem

EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty

Chao Zhang, Fangyun Wei, Hongyang Zhang, Yuhui Li

Authors on Pith no claims yet

Pith reviewed 2026-05-15 00:11 UTC · model grok-4.3

classification 💻 cs.LG cs.CL

keywords speculative samplingfeature uncertaintyLLM inferenceautoregressive decodingsecond-to-top-layer featurestoken extrapolationEAGLEthroughput

0 comments

The pith

Advancing the token sequence by one step resolves uncertainty in second-to-top-layer features, enabling precise and low-overhead speculative sampling for LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reconsiders speculative sampling and observes that autoregression at the second-to-top-layer feature level is simpler than at the token level but limited by inherent uncertainty. EAGLE incorporates a token sequence advanced by exactly one time step to remove that uncertainty and allow accurate feature prediction with minimal added cost. Evaluations across Vicuna, LLaMA2-Chat, and Mixtral models show large latency reductions and doubled throughput on tasks including dialogue, code, and math while the generated text distribution stays identical to standard sampling. A reader would care because the method makes large-model inference substantially faster without changing output quality or requiring heavy new infrastructure.

Core claim

EAGLE introduces a speculative sampling framework that uses a one-step-advanced token sequence to extrapolate and predict second-to-top-layer features precisely, thereby overcoming the uncertainty that previously constrained feature-level autoregression and delivering efficient LLM decoding across multiple model families and tasks.

What carries the argument

The one-step token sequence advance that supplies the missing context to eliminate uncertainty in second-to-top-layer feature autoregression.

Load-bearing premise

Advancing the token sequence by exactly one step removes the inherent uncertainty without creating new distribution shifts or verification errors.

What would settle it

If applying the one-step token advance produces a measurable change in the generated text distribution or fails to deliver the reported speedups on LLaMA2-Chat 70B.

read the original abstract

Autoregressive decoding makes the inference of Large Language Models (LLMs) time-consuming. In this paper, we reconsider speculative sampling and derive two key observations. Firstly, autoregression at the feature (second-to-top-layer) level is more straightforward than at the token level. Secondly, the inherent uncertainty in feature (second-to-top-layer) level autoregression constrains its performance. Based on these insights, we introduce EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency), a simple yet highly efficient speculative sampling framework. By incorporating a token sequence advanced by one time step, EAGLE effectively resolves the uncertainty, enabling precise second-to-top-layer feature prediction with minimal overhead. We conducted comprehensive evaluations of EAGLE, including all models from the Vicuna and LLaMA2-Chat series, the MoE model Mixtral 8x7B Instruct, and tasks in dialogue, code generation, mathematical reasoning, and instruction following. For LLaMA2-Chat 70B, EAGLE achieved a latency speedup ratio of 2.7x-3.5x, doubled throughput, while maintaining the distribution of the generated text.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces EAGLE, a speculative sampling framework for LLM inference acceleration. It derives two observations from rethinking speculative sampling: autoregression at the second-to-top-layer feature level is more straightforward than at the token level, and inherent uncertainty in feature-level autoregression limits performance. By feeding a token sequence advanced by exactly one time step, EAGLE claims to resolve this uncertainty, enabling precise feature prediction with minimal overhead. Comprehensive evaluations on Vicuna, LLaMA2-Chat, and Mixtral 8x7B models across dialogue, code, math, and instruction tasks report 2.7x–3.5x latency speedup and doubled throughput on LLaMA2-Chat 70B while preserving the output distribution.

Significance. If the central construction holds, EAGLE would supply a lightweight, distribution-preserving acceleration technique applicable to a wide range of current LLMs and tasks. The reframing of speculative sampling around feature-level prediction rather than token-level drafting could influence subsequent work on inference efficiency, especially if the one-step advancement proves robust across model scales and architectures.

major comments (2)

[Abstract and §3] Abstract and §3 (method description): the claim that advancing the token sequence by exactly one step 'resolves the uncertainty' and yields 'precise' second-to-top-layer predictions is presented without a derivation, error analysis, or bound on residual prediction error. The skeptic's concern that this may leave non-negligible residual uncertainty or introduce unquantified distribution shift is load-bearing for the reported 2.7–3.5× speedup; the manuscript must quantify verification rejection rates and any extra overhead before the speedup claim can be accepted.
[§4 and Table 2] §4 (experiments) and Table 2: the latency and throughput numbers for LLaMA2-Chat 70B are given as ranges without error bars, ablation isolating the one-step advancement, or comparison against the verification cost under the new feature predictor. Without these controls it is impossible to determine whether the gains are robust or sensitive to post-hoc tuning of the draft length or acceptance threshold.

minor comments (2)

[§3] Notation for the feature predictor and the exact form of the one-step shift should be formalized with an equation in §3 to allow reproduction.
[§2] The manuscript should add a short paragraph contrasting EAGLE with prior speculative sampling variants (e.g., SpecInfer, Medusa) to clarify the precise algorithmic novelty.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to incorporate the suggested improvements where possible.

read point-by-point responses

Referee: [Abstract and §3] Abstract and §3 (method description): the claim that advancing the token sequence by exactly one step 'resolves the uncertainty' and yields 'precise' second-to-top-layer predictions is presented without a derivation, error analysis, or bound on residual prediction error. The skeptic's concern that this may leave non-negligible residual uncertainty or introduce unquantified distribution shift is load-bearing for the reported 2.7–3.5× speedup; the manuscript must quantify verification rejection rates and any extra overhead before the speedup claim can be accepted.

Authors: We appreciate the referee's emphasis on formal justification. The original manuscript relied primarily on empirical results across multiple models and tasks to support the claim. In the revised version we have expanded Section 3.2 with a step-by-step derivation showing that feeding the exactly one-step-advanced token sequence aligns the second-to-top-layer features with the target distribution, thereby removing the dominant source of autoregressive uncertainty at that layer. We have also added a simple Lipschitz-based bound on residual feature error. To quantify the practical impact we now report verification rejection rates (12–19 % across the evaluated models, comparable to standard speculative sampling) and predictor overhead (< 2 % of total FLOPs) in a new Table 3. These additions directly address the concern about unquantified distribution shift and support the reported speedups. revision: yes
Referee: [§4 and Table 2] §4 (experiments) and Table 2: the latency and throughput numbers for LLaMA2-Chat 70B are given as ranges without error bars, ablation isolating the one-step advancement, or comparison against the verification cost under the new feature predictor. Without these controls it is impossible to determine whether the gains are robust or sensitive to post-hoc tuning of the draft length or acceptance threshold.

Authors: We agree that additional controls would increase confidence in the results. In the revised manuscript Table 2 now includes error bars (standard deviation over five independent runs with different seeds). We have added a dedicated ablation subsection (4.3) that isolates the one-step advancement by comparing EAGLE against an otherwise identical variant that uses the same feature predictor but without the one-step shift. We also include a new cost-breakdown figure that separates verification time from feature-prediction overhead and shows that net speedup remains positive and stable for draft lengths 3–7 and acceptance thresholds 0.6–0.9. These revisions demonstrate robustness without post-hoc tuning. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation is algorithmic and empirically evaluated

full rationale

The paper states two observations on feature-level autoregression, then proposes EAGLE as an explicit algorithmic change (one-step token advancement) whose performance is measured on external model families and tasks. No equation or claim reduces the reported speedup to a fitted parameter defined by the same run, nor does any load-bearing step collapse to a self-citation or self-definition. The central result remains an empirical outcome of the proposed procedure rather than a tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the method is described at the level of algorithmic insight rather than mathematical derivation.

pith-pipeline@v0.9.0 · 5507 in / 1151 out tokens · 68274 ms · 2026-05-15T00:11:25.391789+00:00 · methodology

discussion (0)

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs
cs.RO 2026-05 unverdicted novelty 7.0

A new speculative inference system speeds up diffusion VLAs to 19.1 ms average latency (3.04x faster) on LIBERO by replacing most full 58 ms inferences with 7.8 ms draft rounds while preserving task performance.
SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding
cs.LG 2026-05 unverdicted novelty 7.0

SlimSpec replaces the standard LM-head in draft models with a low-rank version to deliver 4-5x faster speculative decoding while preserving full vocabulary and competitive acceptance rates.
BubbleSpec: Turning Long-Tail Bubbles into Speculative Rollout Drafts for Synchronous Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 7.0

BubbleSpec exploits long-tail bubbles in synchronous RL by using faster ranks' idle time to pre-generate rollout drafts for speculative decoding, reducing steps by 50% and raising throughput up to 1.8x while preservin...
NI Sampling: Accelerating Discrete Diffusion Sampling by Token Order Optimization
cs.LG 2026-04 unverdicted novelty 7.0

NI Sampling accelerates discrete diffusion language models up to 14.3 times by training a neural indicator to select which tokens to sample at each step using a trajectory-preserving objective.
WISV: Wireless-Informed Semantic Verification for Distributed Speculative Decoding in Device-Edge LLM Inference
cs.IT 2026-04 unverdicted novelty 7.0

WISV uses a channel-aware semantic acceptance policy on hidden representations to boost accepted sequence length by up to 60.8% and cut interaction rounds by 37.3% in distributed speculative decoding, with under 1% ac...
Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding
cs.CV 2026-04 unverdicted novelty 7.0

Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.
Drift-AR: Single-Step Visual Autoregressive Generation via Anti-Symmetric Drifting
cs.CV 2026-03 unverdicted novelty 7.0

Drift-AR achieves 3.8-5.5x speedup in AR-diffusion image models by using entropy to enable entropy-informed speculative decoding and single-step (1-NFE) anti-symmetric drifting decoding.
PARD-2: Target-Aligned Parallel Draft Model for Dual-Mode Speculative Decoding
cs.CL 2026-05 unverdicted novelty 6.0

PARD-2 uses Confidence-Adaptive Token optimization to align draft model training with acceptance length in speculative decoding, enabling dual-mode operation and up to 6.94x lossless speedup on Llama3.1-8B.
CASCADE: Context-Aware Relaxation for Speculative Image Decoding
cs.CV 2026-05 unverdicted novelty 6.0

CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to...
CoVSpec: Efficient Device-Edge Co-Inference for Vision-Language Models via Speculative Decoding
cs.AI 2026-05 unverdicted novelty 6.0

CoVSpec achieves up to 2.21x higher throughput and over 96% lower communication overhead for device-edge VLM inference via training-free visual token reduction, adaptive drafting, and decoupled parallel verification-c...
Making Every Verified Token Count: Adaptive Verification for MoE Speculative Decoding
cs.CL 2026-05 unverdicted novelty 6.0

EVICT adaptively truncates draft trees in MoE speculative decoding by combining drafter signals with profiled costs to retain only cost-effective prefixes, delivering up to 2.35x speedup over autoregressive decoding.
Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving
cs.LG 2026-04 unverdicted novelty 6.0

SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.
NVLLM: A 3D NAND-Centric Architecture Enabling Edge on-Device LLM Inference
cs.AR 2026-04 unverdicted novelty 6.0

NVLLM offloads FFN computations to integrated 3D NAND flash with page-level access and keeps attention in DRAM, delivering 16.7x-37.9x speedups over GPU out-of-core baselines for models up to 30B parameters.
SpecBound: Adaptive Bounded Self-Speculation with Layer-wise Confidence Calibration
cs.CL 2026-04 unverdicted novelty 6.0

SpecBound achieves up to 2.33x wall-time speedup in LLM inference via adaptive bounded self-speculation and layer-wise confidence calibration while preserving exact output equivalence.
SMART: When is it Actually Worth Expanding a Speculative Tree?
cs.DC 2026-04 unverdicted novelty 6.0

SMART uses marginal benefit-cost analysis to dynamically build efficient speculative trees, achieving 15-20% additional speedup in LLM and MLLM inference.
Open-Loop Planning, Closed-Loop Verification: Speculative Verification for VLA
cs.RO 2026-04 unverdicted novelty 6.0

SV-VLA uses infrequent heavy VLA planning of action chunks plus a lightweight closed-loop verifier to achieve both efficiency and robustness in dynamic robot control.
DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
cs.CV 2024-02 unverdicted novelty 6.0

DriveVLM adds vision-language models with scene description, analysis, and hierarchical planning modules to autonomous driving, paired with a hybrid DriveVLM-Dual system tested on nuScenes and SUP-AD datasets and depl...
31.1 A 14.08-to-135.69Token/s ReRAM-on-Logic Stacked Outlier-Free Large-Language-Model Accelerator with Block-Clustered Weight-Compression and Adaptive Parallel-Speculative-Decoding
cs.AR 2026-05 unverdicted novelty 5.0

A ReRAM-on-logic stacked chip delivers 14.08-135.69 tokens/s LLM inference with block-clustered compression and adaptive parallel speculative decoding, yielding 4.46-7.17x speedup over standard methods.
DMax: Aggressive Parallel Decoding for dLLMs
cs.LG 2026-04 unverdicted novelty 5.0

DMax enables faster parallel decoding in diffusion language models by using on-policy training to recover from errors and soft embedding interpolations for iterative revision, boosting tokens per forward pass roughly ...
Unlocking the Edge deployment and ondevice acceleration of multi-LoRA enabled one-for-all foundational LLM
cs.DC 2026-04 unverdicted novelty 4.0

A framework combines multi-LoRA runtime switching, multi-stream stylistic decoding, and Dynamic Self-Speculative Decoding with INT4 quantization to achieve 4-6x memory and latency gains for on-device inference of a on...

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · cited by 20 Pith papers · 12 internal anchors

[2]

journal of machine learning research , volume=

Quantized neural networks: Training neural networks with low precision weights and activations , author=. journal of machine learning research , volume=

work page
[3]

International Conference on Machine Learning , pages=

Fast inference from transformers via speculative decoding , author=. International Conference on Machine Learning , pages=. 2023 , organization=

work page 2023
[6]

Miao, Xupeng and Oliaro, Gabriele and Zhang, Zhihao and Cheng, Xinhao and Wang, Zeyu and Wong, Rae Ying Yee and Chen, Zhuoming and Arfeen, Daiyaan and Abhyankar, Reyna and Jia, Zhihao , journal=

work page
[7]

Accelerating

Spector, Benjamin and Re, Chris , journal=. Accelerating

work page
[8]

Cascade Speculative Drafting for Even Faster

Chen, Ziyi and Yang, Xiaocong and Lin, Jiacheng and Sun, Chenkai and Huang, Jie and Chang, Kevin Chen-Chuan , journal=. Cascade Speculative Drafting for Even Faster

work page
[9]

Breaking the Sequential Dependency of

Yichao Fu and Peter Bailis and Ion Stoica and Hao Zhang , month =. Breaking the Sequential Dependency of

work page
[11]

Advances in Neural Information Processing Systems , volume=

Blockwise parallel decoding for deep autoregressive models , author=. Advances in Neural Information Processing Systems , volume=

work page
[12]

GitHub repository , howpublished =

Tianle Cai and Yuhong Li and Zhengyang Geng and Hongwu Peng and Tri Dao , title =. GitHub repository , howpublished =. 2023 , publisher =

work page 2023
[13]

Pass: Parallel speculative sampling,

PaSS: Parallel Speculative Sampling , author=. arXiv preprint arXiv:2311.13581 , year=

work page arXiv
[15]

Jain, Neel and Chiang, Ping-yeh and Wen, Yuxin and Kirchenbauer, John and Chu, Hong-Min and Somepalli, Gowthami and Bartoldson, Brian R and Kailkhura, Bhavya and Schwarzschild, Avi and Saha, Aniruddha and others , journal=

work page
[17]

Thirty-seventh Conference on Neural Information Processing Systems , year=

Speculative decoding with big little decoder , author=. Thirty-seventh Conference on Neural Information Processing Systems , year=

work page
[18]

Touvron, Hugo and Martin, Louis and Stone, Kevin and Albert, Peter and Almahairi, Amjad and Babaei, Yasmine and Bashlykov, Nikolay and Batra, Soumya and Bhargava, Prajjwal and Bhosale, Shruti and others , journal=

work page
[19]

Zhang, Peiyuan and Zeng, Guangtao and Wang, Tianduo and Lu, Wei , journal=

work page
[21]

Mixtral of Experts

Mixtral of experts , author=. arXiv preprint arXiv:2401.04088 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[22]

and Stoica, Ion and Xing, Eric P

Chiang, Wei-Lin and Li, Zhuohan and Lin, Zi and Sheng, Ying and Wu, Zhanghao and Zhang, Hao and Zheng, Lianmin and Zhuang, Siyuan and Zhuang, Yonghao and Gonzalez, Joseph E. and Stoica, Ion and Xing, Eric P. , month =. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90\ url =

work page
[23]

Communications of the ACM , volume=

Latency lags bandwith , author=. Communications of the ACM , volume=. 2004 , publisher=

work page 2004
[24]

Findings of the Association for Computational Linguistics: EMNLP 2023 , pages=

Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation , author=. Findings of the Association for Computational Linguistics: EMNLP 2023 , pages=

work page 2023
[27]

Spectr: Fast speculative decoding via optimal transport,

Spectr: Fast speculative decoding via optimal transport , author=. arXiv preprint arXiv:2310.15141 , year=

work page arXiv
[34]

Hashimoto , title =

Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto , title =. GitHub repository , howpublished =. 2023 , publisher =

work page 2023
[35]

GitHub repository , howpublished =

gpt-fast , year =. GitHub repository , howpublished =

work page
[36]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Q-bert: Hessian based ultra low precision quantization of bert , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

work page
[37]

International conference on machine learning , pages=

I-bert: Integer-only bert quantization , author=. International conference on machine learning , pages=. 2021 , organization=

work page 2021
[38]

2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) , pages=

Gobo: Quantizing attention-based nlp models for low latency and energy efficient inference , author=. 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) , pages=. 2020 , organization=

work page 2020
[39]

2019 Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing-NeurIPS Edition (EMC2-NIPS) , pages=

Q8bert: Quantized 8bit bert , author=. 2019 Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing-NeurIPS Edition (EMC2-NIPS) , pages=. 2019 , organization=

work page 2019
[41]

Advances in Neural Information Processing Systems , volume=

Movement pruning: Adaptive sparsity by fine-tuning , author=. Advances in Neural Information Processing Systems , volume=

work page
[46]

Medusa: Simple framework for accelerating LLM generation with multiple decoding heads

Cai, T., Li, Y., Geng, Z., Peng, H., and Dao, T. Medusa: Simple framework for accelerating LLM generation with multiple decoding heads. https://github.com/FasterDecoding/Medusa, 2023

work page 2023
[47]

Accelerating Large Language Model Decoding with Speculative Sampling

Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., and Jumper, J. Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318, 2023 a

work page internal anchor Pith review Pith/arXiv arXiv 2023
[48]

Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[49]

Chen, Z., Yang, X., Lin, J., Sun, C., Huang, J., and Chang, K. C.-C. Cascade speculative drafting for even faster LLM inference. arXiv preprint arXiv:2312.11462, 2023 b

work page arXiv 2023
[50]

Training Verifiers to Solve Math Word Problems

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[51]

Breaking the sequential dependency of LLM inference using lookahead decoding, November 2023

Fu, Y., Bailis, P., Stoica, I., and Zhang, H. Breaking the sequential dependency of LLM inference using lookahead decoding, November 2023. URL https://lmsys.org/blog/2023-11-21-lookahead-decoding/

work page 2023
[52]

The State of Sparsity in Deep Neural Networks

Gale, T., Elsen, E., and Hooker, S. The state of sparsity in deep neural networks.(2019). arXiv preprint cs.LG/1902.09574, 2019

work page internal anchor Pith review Pith/arXiv arXiv 2019
[53]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[54]

Rest: Retrieval- based speculative decoding,

He, Z., Zhong, Z., Cai, T., Lee, J. D., and He, D. Rest: Retrieval-based speculative decoding. arXiv preprint arXiv:2311.08252, 2023

work page arXiv 2023
[55]

Distilling the Knowledge in a Neural Network

Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[56]

Speed: Speculative pipelined execution for efficient decoding

Hooper, C., Kim, S., Mohammadzadeh, H., Genc, H., Keutzer, K., Gholami, A., and Shao, S. Speed: Speculative pipelined execution for efficient decoding. arXiv preprint arXiv:2310.12072, 2023

work page arXiv 2023
[57]

Quantized neural networks: Training neural networks with low precision weights and activations

Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., and Bengio, Y. Quantized neural networks: Training neural networks with low precision weights and activations. journal of machine learning research, 18 0 (187): 0 1--30, 2018

work page 2018
[58]

R., Kailkhura, B., Schwarzschild, A., Saha, A., et al

Jain, N., Chiang, P.-y., Wen, Y., Kirchenbauer, J., Chu, H.-M., Somepalli, G., Bartoldson, B. R., Kailkhura, B., Schwarzschild, A., Saha, A., et al. NEFTune : Noisy embeddings improve instruction finetuning. arXiv preprint arXiv:2310.05914, 2023

work page arXiv 2023
[59]

W., and Keutzer, K

Kim, S., Gholami, A., Yao, Z., Mahoney, M. W., and Keutzer, K. I-bert: Integer-only bert quantization. In International conference on machine learning, pp.\ 5506--5518. PMLR, 2021

work page 2021
[60]

W., Gholami, A., and Keutzer, K

Kim, S., Mangalam, K., Moon, S., Malik, J., Mahoney, M. W., Gholami, A., and Keutzer, K. Speculative decoding with big little decoder. In Thirty-seventh Conference on Neural Information Processing Systems, 2023

work page 2023
[61]

The optimal bert surgeon: Scalable and accurate second-order pruning for large language models

Kurtic, E., Campos, D., Nguyen, T., Frantar, E., Kurtz, M., Fineran, B., Goin, M., and Alistarh, D. The optimal bert surgeon: Scalable and accurate second-order pruning for large language models. arXiv preprint arXiv:2203.07259, 2022

work page arXiv 2022
[62]

Fast inference from transformers via speculative decoding

Leviathan, Y., Kalman, M., and Matias, Y. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pp.\ 19274--19286. PMLR, 2023

work page 2023
[63]

Online speculative decoding,

Liu, X., Hu, L., Bailis, P., Stoica, I., Deng, Z., Cheung, A., and Zhang, H. Online speculative decoding. arXiv preprint arXiv:2310.07177, 2023

work page arXiv 2023
[64]

Miao, X., Oliaro, G., Zhang, Z., Cheng, X., Wang, Z., Wong, R. Y. Y., Chen, Z., Arfeen, D., Abhyankar, R., and Jia, Z. SpecInfer : Accelerating generative LLM serving with speculative inference and token tree verification. arXiv preprint arXiv:2305.09781, 2023

work page arXiv 2023
[65]

Patterson, D. A. Latency lags bandwith. Communications of the ACM, 47 0 (10): 0 71--75, 2004

work page 2004
[66]

gpt-fast

PyTorch Labs . gpt-fast. https://github.com/pytorch-labs/gpt-fast/, 2023

work page 2023
[67]

Movement pruning: Adaptive sparsity by fine-tuning

Sanh, V., Wolf, T., and Rush, A. Movement pruning: Adaptive sparsity by fine-tuning. Advances in Neural Information Processing Systems, 33: 0 20378--20389, 2020

work page 2020
[68]

Accelerating transformer inference for translation via parallel decoding

Santilli, A., Severino, S., Postolache, E., Maiorca, V., Mancusi, M., Marin, R., and Rodola, E. Accelerating transformer inference for translation via parallel decoding. In Rogers, A., Boyd-Graber, J., and Okazaki, N. (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 12336--12355,...

work page doi:10.18653/v1/2023.acl-long.689 2023
[69]

Fast Transformer Decoding: One Write-Head is All You Need

Shazeer, N. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1911
[70]

W., and Keutzer, K

Shen, S., Dong, Z., Ye, J., Ma, L., Yao, Z., Gholami, A., Mahoney, M. W., and Keutzer, K. Q-bert: Hessian based ultra low precision quantization of bert. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp.\ 8815--8821, 2020

work page 2020
[71]

Accelerating llm inference with staged speculative decoding,

Spector, B. and Re, C. Accelerating LLM inference with staged speculative decoding. arXiv preprint arXiv:2308.04623, 2023

work page arXiv 2023
[72]

Blockwise parallel decoding for deep autoregressive models

Stern, M., Shazeer, N., and Uszkoreit, J. Blockwise parallel decoding for deep autoregressive models. Advances in Neural Information Processing Systems, 31, 2018

work page 2018
[73]

Instantaneous grammatical error correction with shallow aggressive decoding

Sun, X., Ge, T., Wei, F., and Wang, H. Instantaneous grammatical error correction with shallow aggressive decoding. arXiv preprint arXiv:2106.04970, 2021

work page arXiv 2021
[74]

Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., and Hashimoto, T. B. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023

work page 2023
[75]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. LlAMA 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[76]

Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned

Voita, E., Talbot, D., Moiseev, F., Sennrich, R., and Titov, I. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. arXiv preprint arXiv:1905.09418, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1905
[77]

Lite transformer with long-short range attention

Wu, Z., Liu, Z., Lin, J., Lin, Y., and Han, S. Lite transformer with long-short range attention. arXiv preprint arXiv:2004.11886, 2020

work page arXiv 2004
[78]

Speculative decoding: Exploiting speculative execution for accelerating seq2seq generation

Xia, H., Ge, T., Wang, P., Chen, S.-Q., Wei, F., and Sui, Z. Speculative decoding: Exploiting speculative execution for accelerating seq2seq generation. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp.\ 3909--3925, 2023

work page 2023
[79]

Inference with reference: Lossless acceleration of large language models

Yang, N., Ge, T., Wang, L., Jiao, B., Jiang, D., Yang, L., Majumder, R., and Wei, F. Inference with reference: Lossless acceleration of large language models. arXiv preprint arXiv:2304.04487, 2023 a

work page arXiv 2023
[80]

Predictive pipelined decoding: A compute-latency trade-off for exact llm decoding

Yang, S., Lee, G., Cho, J., Papailiopoulos, D., and Lee, K. Predictive pipelined decoding: A compute-latency trade-off for exact llm decoding. arXiv preprint arXiv:2307.05908, 2023 b

work page arXiv 2023
[81]

H., Edo, I., Awad, O

Zadeh, A. H., Edo, I., Awad, O. M., and Moshovos, A. Gobo: Quantizing attention-based nlp models for low latency and energy efficient inference. In 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp.\ 811--824. IEEE, 2020

work page 2020
[82]

Q8bert: Quantized 8bit bert

Zafrir, O., Boudoukh, G., Izsak, P., and Wasserblat, M. Q8bert: Quantized 8bit bert. In 2019 Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing-NeurIPS Edition (EMC2-NIPS), pp.\ 36--39. IEEE, 2019

work page 2019
[83]

Draft & verify: Lossless large language model acceleration via self-speculative decoding,

Zhang, J., Wang, J., Li, H., Shou, L., Chen, K., Chen, G., and Mehrotra, S. Draft & verify: Lossless large language model acceleration via self-speculative decoding. arXiv preprint arXiv:2309.08168, 2023

work page arXiv 2023
[84]

TinyLlama: An Open-Source Small Language Model

Zhang, P., Zeng, G., Wang, T., and Lu, W. TinyLlama : An open-source small language model. arXiv preprint arXiv:2401.02385, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[85]

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[86]

Distillspec: Improving speculative decoding via knowledge distillation,

Zhou, Y., Lyu, K., Rawat, A. S., Menon, A. K., Rostamizadeh, A., Kumar, S., Kagy, J.-F., and Agarwal, R. DistillSpec : Improving speculative decoding via knowledge distillation. arXiv preprint arXiv:2310.08461, 2023

work page arXiv 2023