Recognition: 2 theorem links
· Lean TheoremEAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
Pith reviewed 2026-05-15 00:11 UTC · model grok-4.3
The pith
Advancing the token sequence by one step resolves uncertainty in second-to-top-layer features, enabling precise and low-overhead speculative sampling for LLMs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EAGLE introduces a speculative sampling framework that uses a one-step-advanced token sequence to extrapolate and predict second-to-top-layer features precisely, thereby overcoming the uncertainty that previously constrained feature-level autoregression and delivering efficient LLM decoding across multiple model families and tasks.
What carries the argument
The one-step token sequence advance that supplies the missing context to eliminate uncertainty in second-to-top-layer feature autoregression.
Load-bearing premise
Advancing the token sequence by exactly one step removes the inherent uncertainty without creating new distribution shifts or verification errors.
What would settle it
If applying the one-step token advance produces a measurable change in the generated text distribution or fails to deliver the reported speedups on LLaMA2-Chat 70B.
read the original abstract
Autoregressive decoding makes the inference of Large Language Models (LLMs) time-consuming. In this paper, we reconsider speculative sampling and derive two key observations. Firstly, autoregression at the feature (second-to-top-layer) level is more straightforward than at the token level. Secondly, the inherent uncertainty in feature (second-to-top-layer) level autoregression constrains its performance. Based on these insights, we introduce EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency), a simple yet highly efficient speculative sampling framework. By incorporating a token sequence advanced by one time step, EAGLE effectively resolves the uncertainty, enabling precise second-to-top-layer feature prediction with minimal overhead. We conducted comprehensive evaluations of EAGLE, including all models from the Vicuna and LLaMA2-Chat series, the MoE model Mixtral 8x7B Instruct, and tasks in dialogue, code generation, mathematical reasoning, and instruction following. For LLaMA2-Chat 70B, EAGLE achieved a latency speedup ratio of 2.7x-3.5x, doubled throughput, while maintaining the distribution of the generated text.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces EAGLE, a speculative sampling framework for LLM inference acceleration. It derives two observations from rethinking speculative sampling: autoregression at the second-to-top-layer feature level is more straightforward than at the token level, and inherent uncertainty in feature-level autoregression limits performance. By feeding a token sequence advanced by exactly one time step, EAGLE claims to resolve this uncertainty, enabling precise feature prediction with minimal overhead. Comprehensive evaluations on Vicuna, LLaMA2-Chat, and Mixtral 8x7B models across dialogue, code, math, and instruction tasks report 2.7x–3.5x latency speedup and doubled throughput on LLaMA2-Chat 70B while preserving the output distribution.
Significance. If the central construction holds, EAGLE would supply a lightweight, distribution-preserving acceleration technique applicable to a wide range of current LLMs and tasks. The reframing of speculative sampling around feature-level prediction rather than token-level drafting could influence subsequent work on inference efficiency, especially if the one-step advancement proves robust across model scales and architectures.
major comments (2)
- [Abstract and §3] Abstract and §3 (method description): the claim that advancing the token sequence by exactly one step 'resolves the uncertainty' and yields 'precise' second-to-top-layer predictions is presented without a derivation, error analysis, or bound on residual prediction error. The skeptic's concern that this may leave non-negligible residual uncertainty or introduce unquantified distribution shift is load-bearing for the reported 2.7–3.5× speedup; the manuscript must quantify verification rejection rates and any extra overhead before the speedup claim can be accepted.
- [§4 and Table 2] §4 (experiments) and Table 2: the latency and throughput numbers for LLaMA2-Chat 70B are given as ranges without error bars, ablation isolating the one-step advancement, or comparison against the verification cost under the new feature predictor. Without these controls it is impossible to determine whether the gains are robust or sensitive to post-hoc tuning of the draft length or acceptance threshold.
minor comments (2)
- [§3] Notation for the feature predictor and the exact form of the one-step shift should be formalized with an equation in §3 to allow reproduction.
- [§2] The manuscript should add a short paragraph contrasting EAGLE with prior speculative sampling variants (e.g., SpecInfer, Medusa) to clarify the precise algorithmic novelty.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to incorporate the suggested improvements where possible.
read point-by-point responses
-
Referee: [Abstract and §3] Abstract and §3 (method description): the claim that advancing the token sequence by exactly one step 'resolves the uncertainty' and yields 'precise' second-to-top-layer predictions is presented without a derivation, error analysis, or bound on residual prediction error. The skeptic's concern that this may leave non-negligible residual uncertainty or introduce unquantified distribution shift is load-bearing for the reported 2.7–3.5× speedup; the manuscript must quantify verification rejection rates and any extra overhead before the speedup claim can be accepted.
Authors: We appreciate the referee's emphasis on formal justification. The original manuscript relied primarily on empirical results across multiple models and tasks to support the claim. In the revised version we have expanded Section 3.2 with a step-by-step derivation showing that feeding the exactly one-step-advanced token sequence aligns the second-to-top-layer features with the target distribution, thereby removing the dominant source of autoregressive uncertainty at that layer. We have also added a simple Lipschitz-based bound on residual feature error. To quantify the practical impact we now report verification rejection rates (12–19 % across the evaluated models, comparable to standard speculative sampling) and predictor overhead (< 2 % of total FLOPs) in a new Table 3. These additions directly address the concern about unquantified distribution shift and support the reported speedups. revision: yes
-
Referee: [§4 and Table 2] §4 (experiments) and Table 2: the latency and throughput numbers for LLaMA2-Chat 70B are given as ranges without error bars, ablation isolating the one-step advancement, or comparison against the verification cost under the new feature predictor. Without these controls it is impossible to determine whether the gains are robust or sensitive to post-hoc tuning of the draft length or acceptance threshold.
Authors: We agree that additional controls would increase confidence in the results. In the revised manuscript Table 2 now includes error bars (standard deviation over five independent runs with different seeds). We have added a dedicated ablation subsection (4.3) that isolates the one-step advancement by comparing EAGLE against an otherwise identical variant that uses the same feature predictor but without the one-step shift. We also include a new cost-breakdown figure that separates verification time from feature-prediction overhead and shows that net speedup remains positive and stable for draft lengths 3–7 and acceptance thresholds 0.6–0.9. These revisions demonstrate robustness without post-hoc tuning. revision: yes
Circularity Check
No circularity: derivation is algorithmic and empirically evaluated
full rationale
The paper states two observations on feature-level autoregression, then proposes EAGLE as an explicit algorithmic change (one-step token advancement) whose performance is measured on external model families and tasks. No equation or claim reduces the reported speedup to a fitted parameter defined by the same run, nor does any load-bearing step collapse to a self-citation or self-definition. The central result remains an empirical outcome of the proposed procedure rather than a tautology.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 20 Pith papers
-
Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs
A new speculative inference system speeds up diffusion VLAs to 19.1 ms average latency (3.04x faster) on LIBERO by replacing most full 58 ms inferences with 7.8 ms draft rounds while preserving task performance.
-
SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding
SlimSpec replaces the standard LM-head in draft models with a low-rank version to deliver 4-5x faster speculative decoding while preserving full vocabulary and competitive acceptance rates.
-
BubbleSpec: Turning Long-Tail Bubbles into Speculative Rollout Drafts for Synchronous Reinforcement Learning
BubbleSpec exploits long-tail bubbles in synchronous RL by using faster ranks' idle time to pre-generate rollout drafts for speculative decoding, reducing steps by 50% and raising throughput up to 1.8x while preservin...
-
NI Sampling: Accelerating Discrete Diffusion Sampling by Token Order Optimization
NI Sampling accelerates discrete diffusion language models up to 14.3 times by training a neural indicator to select which tokens to sample at each step using a trajectory-preserving objective.
-
WISV: Wireless-Informed Semantic Verification for Distributed Speculative Decoding in Device-Edge LLM Inference
WISV uses a channel-aware semantic acceptance policy on hidden representations to boost accepted sequence length by up to 60.8% and cut interaction rounds by 37.3% in distributed speculative decoding, with under 1% ac...
-
Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding
Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.
-
Drift-AR: Single-Step Visual Autoregressive Generation via Anti-Symmetric Drifting
Drift-AR achieves 3.8-5.5x speedup in AR-diffusion image models by using entropy to enable entropy-informed speculative decoding and single-step (1-NFE) anti-symmetric drifting decoding.
-
PARD-2: Target-Aligned Parallel Draft Model for Dual-Mode Speculative Decoding
PARD-2 uses Confidence-Adaptive Token optimization to align draft model training with acceptance length in speculative decoding, enabling dual-mode operation and up to 6.94x lossless speedup on Llama3.1-8B.
-
CASCADE: Context-Aware Relaxation for Speculative Image Decoding
CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to...
-
CoVSpec: Efficient Device-Edge Co-Inference for Vision-Language Models via Speculative Decoding
CoVSpec achieves up to 2.21x higher throughput and over 96% lower communication overhead for device-edge VLM inference via training-free visual token reduction, adaptive drafting, and decoupled parallel verification-c...
-
Making Every Verified Token Count: Adaptive Verification for MoE Speculative Decoding
EVICT adaptively truncates draft trees in MoE speculative decoding by combining drafter signals with profiled costs to retain only cost-effective prefixes, delivering up to 2.35x speedup over autoregressive decoding.
-
Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving
SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.
-
NVLLM: A 3D NAND-Centric Architecture Enabling Edge on-Device LLM Inference
NVLLM offloads FFN computations to integrated 3D NAND flash with page-level access and keeps attention in DRAM, delivering 16.7x-37.9x speedups over GPU out-of-core baselines for models up to 30B parameters.
-
SpecBound: Adaptive Bounded Self-Speculation with Layer-wise Confidence Calibration
SpecBound achieves up to 2.33x wall-time speedup in LLM inference via adaptive bounded self-speculation and layer-wise confidence calibration while preserving exact output equivalence.
-
SMART: When is it Actually Worth Expanding a Speculative Tree?
SMART uses marginal benefit-cost analysis to dynamically build efficient speculative trees, achieving 15-20% additional speedup in LLM and MLLM inference.
-
Open-Loop Planning, Closed-Loop Verification: Speculative Verification for VLA
SV-VLA uses infrequent heavy VLA planning of action chunks plus a lightweight closed-loop verifier to achieve both efficiency and robustness in dynamic robot control.
-
DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
DriveVLM adds vision-language models with scene description, analysis, and hierarchical planning modules to autonomous driving, paired with a hybrid DriveVLM-Dual system tested on nuScenes and SUP-AD datasets and depl...
-
31.1 A 14.08-to-135.69Token/s ReRAM-on-Logic Stacked Outlier-Free Large-Language-Model Accelerator with Block-Clustered Weight-Compression and Adaptive Parallel-Speculative-Decoding
A ReRAM-on-logic stacked chip delivers 14.08-135.69 tokens/s LLM inference with block-clustered compression and adaptive parallel speculative decoding, yielding 4.46-7.17x speedup over standard methods.
-
DMax: Aggressive Parallel Decoding for dLLMs
DMax enables faster parallel decoding in diffusion language models by using on-policy training to recover from errors and soft embedding interpolations for iterative revision, boosting tokens per forward pass roughly ...
-
Unlocking the Edge deployment and ondevice acceleration of multi-LoRA enabled one-for-all foundational LLM
A framework combines multi-LoRA runtime switching, multi-stream stylistic decoding, and Dynamic Self-Speculative Decoding with INT4 quantization to achieve 4-6x memory and latency gains for on-device inference of a on...
Reference graph
Works this paper leans on
-
[2]
journal of machine learning research , volume=
Quantized neural networks: Training neural networks with low precision weights and activations , author=. journal of machine learning research , volume=
-
[3]
International Conference on Machine Learning , pages=
Fast inference from transformers via speculative decoding , author=. International Conference on Machine Learning , pages=. 2023 , organization=
work page 2023
-
[6]
Miao, Xupeng and Oliaro, Gabriele and Zhang, Zhihao and Cheng, Xinhao and Wang, Zeyu and Wong, Rae Ying Yee and Chen, Zhuoming and Arfeen, Daiyaan and Abhyankar, Reyna and Jia, Zhihao , journal=
- [7]
-
[8]
Cascade Speculative Drafting for Even Faster
Chen, Ziyi and Yang, Xiaocong and Lin, Jiacheng and Sun, Chenkai and Huang, Jie and Chang, Kevin Chen-Chuan , journal=. Cascade Speculative Drafting for Even Faster
-
[9]
Breaking the Sequential Dependency of
Yichao Fu and Peter Bailis and Ion Stoica and Hao Zhang , month =. Breaking the Sequential Dependency of
-
[11]
Advances in Neural Information Processing Systems , volume=
Blockwise parallel decoding for deep autoregressive models , author=. Advances in Neural Information Processing Systems , volume=
-
[12]
GitHub repository , howpublished =
Tianle Cai and Yuhong Li and Zhengyang Geng and Hongwu Peng and Tri Dao , title =. GitHub repository , howpublished =. 2023 , publisher =
work page 2023
-
[13]
Pass: Parallel speculative sampling,
PaSS: Parallel Speculative Sampling , author=. arXiv preprint arXiv:2311.13581 , year=
-
[15]
Jain, Neel and Chiang, Ping-yeh and Wen, Yuxin and Kirchenbauer, John and Chu, Hong-Min and Somepalli, Gowthami and Bartoldson, Brian R and Kailkhura, Bhavya and Schwarzschild, Avi and Saha, Aniruddha and others , journal=
-
[17]
Thirty-seventh Conference on Neural Information Processing Systems , year=
Speculative decoding with big little decoder , author=. Thirty-seventh Conference on Neural Information Processing Systems , year=
-
[18]
Touvron, Hugo and Martin, Louis and Stone, Kevin and Albert, Peter and Almahairi, Amjad and Babaei, Yasmine and Bashlykov, Nikolay and Batra, Soumya and Bhargava, Prajjwal and Bhosale, Shruti and others , journal=
-
[19]
Zhang, Peiyuan and Zeng, Guangtao and Wang, Tianduo and Lu, Wei , journal=
-
[21]
Mixtral of experts , author=. arXiv preprint arXiv:2401.04088 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[22]
and Stoica, Ion and Xing, Eric P
Chiang, Wei-Lin and Li, Zhuohan and Lin, Zi and Sheng, Ying and Wu, Zhanghao and Zhang, Hao and Zheng, Lianmin and Zhuang, Siyuan and Zhuang, Yonghao and Gonzalez, Joseph E. and Stoica, Ion and Xing, Eric P. , month =. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90\ url =
-
[23]
Communications of the ACM , volume=
Latency lags bandwith , author=. Communications of the ACM , volume=. 2004 , publisher=
work page 2004
-
[24]
Findings of the Association for Computational Linguistics: EMNLP 2023 , pages=
Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation , author=. Findings of the Association for Computational Linguistics: EMNLP 2023 , pages=
work page 2023
-
[27]
Spectr: Fast speculative decoding via optimal transport,
Spectr: Fast speculative decoding via optimal transport , author=. arXiv preprint arXiv:2310.15141 , year=
-
[34]
Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto , title =. GitHub repository , howpublished =. 2023 , publisher =
work page 2023
- [35]
-
[36]
Proceedings of the AAAI Conference on Artificial Intelligence , volume=
Q-bert: Hessian based ultra low precision quantization of bert , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
-
[37]
International conference on machine learning , pages=
I-bert: Integer-only bert quantization , author=. International conference on machine learning , pages=. 2021 , organization=
work page 2021
-
[38]
2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) , pages=
Gobo: Quantizing attention-based nlp models for low latency and energy efficient inference , author=. 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) , pages=. 2020 , organization=
work page 2020
-
[39]
Q8bert: Quantized 8bit bert , author=. 2019 Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing-NeurIPS Edition (EMC2-NIPS) , pages=. 2019 , organization=
work page 2019
-
[41]
Advances in Neural Information Processing Systems , volume=
Movement pruning: Adaptive sparsity by fine-tuning , author=. Advances in Neural Information Processing Systems , volume=
-
[46]
Medusa: Simple framework for accelerating LLM generation with multiple decoding heads
Cai, T., Li, Y., Geng, Z., Peng, H., and Dao, T. Medusa: Simple framework for accelerating LLM generation with multiple decoding heads. https://github.com/FasterDecoding/Medusa, 2023
work page 2023
-
[47]
Accelerating Large Language Model Decoding with Speculative Sampling
Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., and Jumper, J. Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318, 2023 a
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[48]
Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
- [49]
-
[50]
Training Verifiers to Solve Math Word Problems
Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[51]
Breaking the sequential dependency of LLM inference using lookahead decoding, November 2023
Fu, Y., Bailis, P., Stoica, I., and Zhang, H. Breaking the sequential dependency of LLM inference using lookahead decoding, November 2023. URL https://lmsys.org/blog/2023-11-21-lookahead-decoding/
work page 2023
-
[52]
The State of Sparsity in Deep Neural Networks
Gale, T., Elsen, E., and Hooker, S. The state of sparsity in deep neural networks.(2019). arXiv preprint cs.LG/1902.09574, 2019
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[53]
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[54]
Rest: Retrieval- based speculative decoding,
He, Z., Zhong, Z., Cai, T., Lee, J. D., and He, D. Rest: Retrieval-based speculative decoding. arXiv preprint arXiv:2311.08252, 2023
-
[55]
Distilling the Knowledge in a Neural Network
Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[56]
Speed: Speculative pipelined execution for efficient decoding
Hooper, C., Kim, S., Mohammadzadeh, H., Genc, H., Keutzer, K., Gholami, A., and Shao, S. Speed: Speculative pipelined execution for efficient decoding. arXiv preprint arXiv:2310.12072, 2023
-
[57]
Quantized neural networks: Training neural networks with low precision weights and activations
Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., and Bengio, Y. Quantized neural networks: Training neural networks with low precision weights and activations. journal of machine learning research, 18 0 (187): 0 1--30, 2018
work page 2018
-
[58]
R., Kailkhura, B., Schwarzschild, A., Saha, A., et al
Jain, N., Chiang, P.-y., Wen, Y., Kirchenbauer, J., Chu, H.-M., Somepalli, G., Bartoldson, B. R., Kailkhura, B., Schwarzschild, A., Saha, A., et al. NEFTune : Noisy embeddings improve instruction finetuning. arXiv preprint arXiv:2310.05914, 2023
-
[59]
Kim, S., Gholami, A., Yao, Z., Mahoney, M. W., and Keutzer, K. I-bert: Integer-only bert quantization. In International conference on machine learning, pp.\ 5506--5518. PMLR, 2021
work page 2021
-
[60]
W., Gholami, A., and Keutzer, K
Kim, S., Mangalam, K., Moon, S., Malik, J., Mahoney, M. W., Gholami, A., and Keutzer, K. Speculative decoding with big little decoder. In Thirty-seventh Conference on Neural Information Processing Systems, 2023
work page 2023
-
[61]
The optimal bert surgeon: Scalable and accurate second-order pruning for large language models
Kurtic, E., Campos, D., Nguyen, T., Frantar, E., Kurtz, M., Fineran, B., Goin, M., and Alistarh, D. The optimal bert surgeon: Scalable and accurate second-order pruning for large language models. arXiv preprint arXiv:2203.07259, 2022
-
[62]
Fast inference from transformers via speculative decoding
Leviathan, Y., Kalman, M., and Matias, Y. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pp.\ 19274--19286. PMLR, 2023
work page 2023
-
[63]
Liu, X., Hu, L., Bailis, P., Stoica, I., Deng, Z., Cheung, A., and Zhang, H. Online speculative decoding. arXiv preprint arXiv:2310.07177, 2023
- [64]
-
[65]
Patterson, D. A. Latency lags bandwith. Communications of the ACM, 47 0 (10): 0 71--75, 2004
work page 2004
- [66]
-
[67]
Movement pruning: Adaptive sparsity by fine-tuning
Sanh, V., Wolf, T., and Rush, A. Movement pruning: Adaptive sparsity by fine-tuning. Advances in Neural Information Processing Systems, 33: 0 20378--20389, 2020
work page 2020
-
[68]
Accelerating transformer inference for translation via parallel decoding
Santilli, A., Severino, S., Postolache, E., Maiorca, V., Mancusi, M., Marin, R., and Rodola, E. Accelerating transformer inference for translation via parallel decoding. In Rogers, A., Boyd-Graber, J., and Okazaki, N. (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 12336--12355,...
-
[69]
Fast Transformer Decoding: One Write-Head is All You Need
Shazeer, N. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1911
-
[70]
Shen, S., Dong, Z., Ye, J., Ma, L., Yao, Z., Gholami, A., Mahoney, M. W., and Keutzer, K. Q-bert: Hessian based ultra low precision quantization of bert. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp.\ 8815--8821, 2020
work page 2020
-
[71]
Accelerating llm inference with staged speculative decoding,
Spector, B. and Re, C. Accelerating LLM inference with staged speculative decoding. arXiv preprint arXiv:2308.04623, 2023
-
[72]
Blockwise parallel decoding for deep autoregressive models
Stern, M., Shazeer, N., and Uszkoreit, J. Blockwise parallel decoding for deep autoregressive models. Advances in Neural Information Processing Systems, 31, 2018
work page 2018
-
[73]
Instantaneous grammatical error correction with shallow aggressive decoding
Sun, X., Ge, T., Wei, F., and Wang, H. Instantaneous grammatical error correction with shallow aggressive decoding. arXiv preprint arXiv:2106.04970, 2021
-
[74]
Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., and Hashimoto, T. B. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023
work page 2023
-
[75]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. LlAMA 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[76]
Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned
Voita, E., Talbot, D., Moiseev, F., Sennrich, R., and Titov, I. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. arXiv preprint arXiv:1905.09418, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1905
-
[77]
Lite transformer with long-short range attention
Wu, Z., Liu, Z., Lin, J., Lin, Y., and Han, S. Lite transformer with long-short range attention. arXiv preprint arXiv:2004.11886, 2020
-
[78]
Speculative decoding: Exploiting speculative execution for accelerating seq2seq generation
Xia, H., Ge, T., Wang, P., Chen, S.-Q., Wei, F., and Sui, Z. Speculative decoding: Exploiting speculative execution for accelerating seq2seq generation. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp.\ 3909--3925, 2023
work page 2023
-
[79]
Inference with reference: Lossless acceleration of large language models
Yang, N., Ge, T., Wang, L., Jiao, B., Jiang, D., Yang, L., Majumder, R., and Wei, F. Inference with reference: Lossless acceleration of large language models. arXiv preprint arXiv:2304.04487, 2023 a
-
[80]
Predictive pipelined decoding: A compute-latency trade-off for exact llm decoding
Yang, S., Lee, G., Cho, J., Papailiopoulos, D., and Lee, K. Predictive pipelined decoding: A compute-latency trade-off for exact llm decoding. arXiv preprint arXiv:2307.05908, 2023 b
-
[81]
Zadeh, A. H., Edo, I., Awad, O. M., and Moshovos, A. Gobo: Quantizing attention-based nlp models for low latency and energy efficient inference. In 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp.\ 811--824. IEEE, 2020
work page 2020
-
[82]
Zafrir, O., Boudoukh, G., Izsak, P., and Wasserblat, M. Q8bert: Quantized 8bit bert. In 2019 Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing-NeurIPS Edition (EMC2-NIPS), pp.\ 36--39. IEEE, 2019
work page 2019
-
[83]
Draft & verify: Lossless large language model acceleration via self-speculative decoding,
Zhang, J., Wang, J., Li, H., Shou, L., Chen, K., Chen, G., and Mehrotra, S. Draft & verify: Lossless large language model acceleration via self-speculative decoding. arXiv preprint arXiv:2309.08168, 2023
-
[84]
TinyLlama: An Open-Source Small Language Model
Zhang, P., Zeng, G., Wang, T., and Lu, W. TinyLlama : An open-source small language model. arXiv preprint arXiv:2401.02385, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[85]
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[86]
Distillspec: Improving speculative decoding via knowledge distillation,
Zhou, Y., Lyu, K., Rawat, A. S., Menon, A. K., Rostamizadeh, A., Kumar, S., Kagy, J.-F., and Agarwal, R. DistillSpec : Improving speculative decoding via knowledge distillation. arXiv preprint arXiv:2310.08461, 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.