Pith · machine review for the scientific record

arxiv: 2604.26469 · v3 · submitted 2026-04-29 · 💻 cs.SE

Recognition: unknown

An Empirical Study of Speculative Decoding on Software Engineering Tasks

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 13:32 UTC · model grok-4.3

classification 💻 cs.SE
keywords: speculative decoding · large language models · software engineering · code generation · inference acceleration · empirical study · model-based · model-free

The pith

Speculative decoding accelerates inference for software engineering tasks with larger gains on smaller models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models for software engineering tasks are limited by slow autoregressive inference. Speculative decoding addresses this by using draft generations that the main model verifies in batches. This paper benchmarks both model-based and model-free variants across code generation, editing, and repair scenarios. It finds higher speedups for smaller models, task-specific preferences for each variant, and that code's repetitive structure allows more aggressive settings than in natural language. The results yield guidelines for efficient deployment in software engineering.
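The scaling intuition behind these findings can be made concrete with the standard expected-throughput estimate from the speculative sampling literature (Leviathan et al. [34]). As a sketch, under the usual simplifying assumption that each drafted token is accepted independently with probability α, where n is the speculation length (the two quantities in the paper's n–α curves):

```latex
\mathbb{E}[\text{tokens accepted per target pass}] = \frac{1 - \alpha^{\,n+1}}{1 - \alpha}
```

When α is high, as the paper argues it is for repetitive, predictable code, this expectation keeps growing with n, which is why more aggressive speculation lengths pay off on SE tasks than on natural language.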

Core claim

Our empirical results indicate that SD demonstrates clear potential for accelerating inference, particularly for smaller models that achieve higher speedups than those of their larger counterparts. We find that the effectiveness of SD methods varies across different task scenarios. Model-based approaches are well-suited for code generation, whereas model-free methods are better adapted to repository-level repair and editing scenarios. Furthermore, we observe that the repetitiveness of SE tasks improves the performance of model-free methods. In contrast to natural language tasks, the higher predictability of SE tasks allows for more aggressive hyperparameters.

What carries the argument

Speculative decoding, a technique where a smaller draft model proposes multiple tokens for parallel verification by the target large language model.
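The draft-then-verify loop can be sketched in a few lines. This is a toy greedy illustration, not the paper's implementation: `target` and `draft` are hypothetical stand-ins for next-token calls, and real systems verify all draft positions in one batched forward pass (counted here as a single target call).

```python
def speculative_decode(target, draft, prefix, n_new, k=4):
    """Generate n_new tokens after prefix; return (tokens, target_calls).

    target/draft: callables mapping a token list to the next token.
    k: number of tokens the draft proposes per verification round.
    """
    out = list(prefix)
    target_calls = 0
    while len(out) - len(prefix) < n_new:
        # 1. Draft proposes k tokens autoregressively (cheap).
        ctx = list(out)
        proposal = []
        for _ in range(k):
            proposal.append(draft(ctx))
            ctx.append(proposal[-1])
        # 2. Target checks every proposed position; conceptually this is
        #    one batched forward pass, hence a single counted call.
        target_calls += 1
        ctx = list(out)
        for t in proposal:
            pred = target(ctx)
            if pred != t:
                # Mismatch: keep the target's own token and start a new round.
                ctx.append(pred)
                break
            ctx.append(t)
        else:
            # Whole draft accepted: the same pass yields one bonus token.
            ctx.append(target(ctx))
        out = ctx
    return out[len(prefix):len(prefix) + n_new], target_calls
```

With a draft that always agrees, k+1 tokens land per target pass; with a useless draft, the loop degrades gracefully to one token per pass — exactly the acceptance-rate trade-off the paper measures.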

If this is right

  • Smaller models obtain higher speedups from speculative decoding than larger models.
  • Model-based speculative decoding is effective for function-level code generation tasks.
  • Model-free speculative decoding is more suitable for repository-level repair and editing tasks.
  • The repetitiveness of software engineering tasks boosts the performance of model-free methods.
  • Software engineering tasks permit more aggressive hyperparameter settings than natural language tasks due to higher predictability.
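The repetitiveness point (fourth bullet) is easy to see in a model-free draft. Below is a minimal sketch in the spirit of Prompt Lookup Decoding [50]; the function name and parameters are illustrative, and real implementations work on token IDs with tuned n-gram sizes.

```python
def prompt_lookup_draft(tokens, ngram=3, k=8):
    """Model-free draft: find the most recent earlier occurrence of the
    trailing `ngram` tokens in the context and propose the up-to-k tokens
    that followed that occurrence."""
    if len(tokens) < ngram:
        return []
    key = tokens[-ngram:]
    # Scan backwards so the most recent repetition wins; start just before
    # the trailing n-gram itself.
    for i in range(len(tokens) - ngram - 1, -1, -1):
        if tokens[i:i + ngram] == key:
            return tokens[i + ngram:i + ngram + k]
    return []
```

Because code repeats identifiers and boilerplate far more often than prose does, such lookups hit frequently — the mechanism behind the paper's finding that model-free methods thrive on repository-level editing and repair.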

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If adopted, this would mean faster response times in AI coding assistants for developers.
  • Hybrid approaches combining model-based and model-free methods could optimize performance across mixed task types.
  • These guidelines may apply to other structured prediction tasks beyond software engineering.
  • Testing on additional models and larger codebases would help confirm the generalizability of the task preferences.

Load-bearing premise

The selected tasks, models, and evaluation metrics are representative of real-world software engineering workflows and the observed differences will generalize.

What would settle it

Showing that, on a new set of repository-level tasks, a larger model achieves higher speedups with model-based methods than with model-free ones, or that smaller models do not obtain higher speedups, would falsify the central findings.

Figures

Figures reproduced from arXiv: 2604.26469 by Junkai Chen, Xing Hu, Xin Xia, Yijia Li.

Figure 1
Figure 1. Overview of the studied SD methods. view at source ↗
Figure 2
Figure 2. The n–α curves on (a) LiveCodeBench, (b) SWE-bench, and (c) Aider Polyglot. view at source ↗
Figure 3
Figure 3. Illustration of the agentic loop (a) and the statistics of infinite loops observed across different models. view at source ↗
Figure 4
Figure 4. The n–α curves on (a) LiveCodeBench and (b) MT-bench. view at source ↗
read the original abstract

Large Language Models (LLMs) have become widely used for Software Engineering (SE) tasks, spanning from function-level code generation to complex repository-level workflows. However, the high latency of autoregressive inference remains a significant bottleneck, hindering their deployment in interactive environments. While Speculative Decoding (SD) offers a promising technique for lossless acceleration, prior research on long-context repository-level tasks and complex agentic interactions remains limited. To bridge this gap, we present the first systematic empirical study to evaluate the effectiveness of SD in SE tasks. We systematically benchmark a comprehensive spectrum of strategies, encompassing both model-based and model-free methods, across representative generation, editing, and repair scenarios. Our empirical results indicate that SD demonstrates clear potential for accelerating inference, particularly for smaller models that achieve higher speedups than those of their larger counterparts. We find that the effectiveness of SD methods varies across different task scenarios. Model-based approaches are well-suited for code generation, whereas model-free methods are better adapted to repository-level repair and editing scenarios. Furthermore, we observe that the repetitiveness of SE tasks improves the performance of model-free methods. In contrast to natural language tasks, the higher predictability of SE tasks allows for more aggressive hyperparameters. Our findings are summarized as guidelines to help increase inference efficiency for SE scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript presents the first systematic empirical study of speculative decoding (SD) for accelerating LLM inference on software engineering tasks. It benchmarks model-based and model-free SD strategies across code generation, editing, and repository-level repair scenarios using multiple models, reporting that SD yields speedups (higher for smaller models), that model-based methods suit generation while model-free methods suit repair/editing, that SE repetitiveness and predictability enable more aggressive settings than in NL tasks, and that these observations yield practical guidelines for SE inference efficiency.

Significance. If the empirical patterns hold, the work is significant for filling a gap in SD research by focusing on long-context, agentic SE workflows where latency is a deployment barrier. The multi-strategy, multi-task benchmark and distillation into guidelines provide concrete, actionable value for practitioners using LLMs in interactive SE tools, while highlighting how domain-specific properties (repetitiveness, predictability) interact with acceleration techniques.

major comments (2)
  1. Experimental results section: the reported speedups and task-specific differences lack accompanying statistical tests, variance measures, or error analysis, which is load-bearing for the central claim that 'effectiveness of SD methods varies across different task scenarios' and that patterns are 'consistent'.
  2. Task and model selection (methodology section): the assumption that the chosen tasks, models, and metrics are representative of real-world SE workflows is not supported by ablation studies or discussion of selection biases, undermining the generalizability of the guidelines to other codebases and models.
minor comments (3)
  1. The abstract and conclusion refer to 'guidelines' but these should be explicitly enumerated in a dedicated table or subsection for easy reference by readers.
  2. Figures showing speedups would be clearer if they included non-SD baseline latencies and confidence intervals alongside the reported values.
  3. Ensure all model sizes, exact hyperparameter settings for 'aggressive' configurations, and dataset statistics are tabulated in the experimental setup for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and recommendation of minor revision. We have addressed the concerns regarding statistical rigor and generalizability by incorporating additional analyses and discussions in the revised manuscript.

read point-by-point responses
  1. Referee: Experimental results section: the reported speedups and task-specific differences lack accompanying statistical tests, variance measures, or error analysis, which is load-bearing for the central claim that 'effectiveness of SD methods varies across different task scenarios' and that patterns are 'consistent'.

    Authors: We concur that the inclusion of statistical tests and variance measures would bolster the reliability of our findings. In the revised version, we now report speedups with standard deviations computed over multiple runs and have included results from statistical significance tests (using paired t-tests) to validate the task-specific differences. Furthermore, we have added an error analysis to examine cases of inconsistency, thereby supporting the central claims more robustly. revision: yes

  2. Referee: Task and model selection (methodology section): the assumption that the chosen tasks, models, and metrics are representative of real-world SE workflows is not supported by ablation studies or discussion of selection biases, undermining the generalizability of the guidelines to other codebases and models.

    Authors: We recognize the importance of justifying our selections for broader applicability. Although we did not perform dedicated ablation studies on the choice of tasks and models in the original manuscript, we have now augmented the methodology section with explicit criteria for selection, drawing from widely-used SE benchmarks. A dedicated subsection on threats to validity and limitations has been introduced to discuss potential biases and the scope of generalizability. This addresses the concern without requiring extensive new experiments. revision: partial

Circularity Check

0 steps flagged

No significant circularity: direct empirical measurements

full rationale

The paper is a standard multi-model, multi-task empirical benchmark study reporting observed speedups, acceptance rates, and task-specific differences for speculative decoding on SE workloads. No equations, fitted parameters, or derived predictions are present; all central claims are direct observational results from experiments. No self-citation chains support load-bearing premises, and no quantities are defined in terms of themselves or renamed as novel predictions. The work is self-contained against external benchmarks because performance metrics are measured on public models and tasks without reduction to author-specific constructs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical benchmarking study; no free parameters are fitted to support a central claim, no domain axioms beyond standard ML evaluation assumptions are invoked, and no new entities are postulated.

pith-pipeline@v0.9.0 · 5527 in / 1098 out tokens · 53663 ms · 2026-05-07T13:32:08.911846+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

61 extracted references · 17 canonical work pages · 10 internal anchors

  1. [1]

    Aider AI. 2024. polyglot-benchmark: Coding problems used in aider's polyglot benchmark. https://github.com/Aider-AI/polyglot-benchmark

  2. [2]

    Zachary Ankner, Rishab Parthasarathy, Aniruddha Nrusimha, Christopher Rinard, Jonathan Ragan-Kelley, and William Brandon. 2024. Hydra: Sequentially-Dependent Draft Heads for Medusa Decoding. arXiv:2402.05109 [cs.LG] https://arxiv.org/abs/2402.05109

  3. [3]

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. 2021. Program Synthesis with Large Language Models. arXiv:2108.07732 [cs.PL] https://arxiv.org/abs/2108.07732

  4. [4]

    Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D Lee, Deming Chen, and Tri Dao. [n. d.]. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. In Forty-first International Conference on Machine Learning

  5. [5]

    Federico Cassano, Luisa Li, Akul Sethi, Noah Shinn, Abby Brennan-Jones, Jacob Ginesin, Edward Berman, George Chakhnashvili, Anton Lozhkov, Carolyn Jane Anderson, et al. 2023. Can It Edit? Evaluating the Ability of Large Language Models to Follow Code Editing Instructions. In First Conference on Language Modeling

  6. [6]

    Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. 2023. Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318 (2023)

  7. [7]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

  8. [8]

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training Verifiers to Solve Math Word Problems. arXiv:2110.14168 [cs.LG] https://arxiv.org/abs/2110.14168

  9. [9]

    DeepSeek-AI, Qihao Zhu, Daya Guo, Zhihong Shao, Dejian Yang, Peiyi Wang, Runxin Xu, Y. Wu, Yukun Li, Huazuo Gao, Shirong Ma, Wangding Zeng, Xiao Bi, Zihui Gu, Hanwei Xu, Damai Dai, Kai Dong, Liyue Zhang, Yishi Piao, Zhibin Gou, Zhenda Xie, Zhewen Hao, Bingxuan Wang, Junxiao Song, Deli Chen, Xin Xie, Kang Guan, Yuxiang You, Aixin Liu, Qiushi Du, Wenjun Gao...

  10. [10]

    Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, et al. 2025. SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? arXiv preprint arXiv:2509.16941 (2025)

  11. [11]

    Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. 2022. GPT3.int8(): 8-bit matrix multiplication for transformers at scale. Advances in Neural Information Processing Systems 35 (2022), 30318–30332

  12. [12]

    Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. 2023. Enhancing chat language models by scaling high-quality instructional conversations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 3029–3051

  13. [13]

    Yangruibo Ding, Zijian Wang, Wasi Ahmad, Hantian Ding, Ming Tan, Nihal Jain, Murali Krishna Ramanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, et al. 2023. CrossCodeEval: A diverse and multilingual benchmark for cross-file code completion. Advances in Neural Information Processing Systems 36 (2023), 46701–46723

  14. [14]

    Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, et al. 2024. LayerSkip: Enabling early exit inference and self-speculative decoding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 12622–12642

  15. [15]

    Elias Frantar and Dan Alistarh. 2023. SparseGPT: Massive language models can be accurately pruned in one-shot. In International Conference on Machine Learning. PMLR, 10323–10337

  16. [16]

    Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. [n. d.]. OPTQ: Accurate Quantization for Generative Pre-trained Transformers. In The Eleventh International Conference on Learning Representations

  17. [17]

    Yichao Fu, Peter Bailis, Ion Stoica, and Hao Zhang. [n. d.]. Break the Sequential Dependency of LLM Inference Using Lookahead Decoding. In Forty-first International Conference on Machine Learning

  18. [18]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)

  19. [19]

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. 2025. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature 645, 8081 (2025), 633–638

  20. [20]

    Zhenyu He, Zexuan Zhong, Tianle Cai, Jason D Lee, and Di He. 2024. REST: Retrieval-Based Speculative Decoding. In 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2024. Association for Computational Linguistics (ACL), 1582–1595

  21. [21]

    Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. Advances in Neural Information Processing Systems 28 (2015)

  22. [22]

    Abram Hindle, Earl T Barr, Mark Gabel, Zhendong Su, and Premkumar Devanbu. 2016. On the naturalness of software. Commun. ACM 59, 5 (2016), 122–131

  23. [23]

    Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. 2024. Large language models for software engineering: A systematic literature review. ACM Transactions on Software Engineering and Methodology 33, 8 (2024), 1–79

  24. [24]

    Xing Hu, Feifei Niu, Junkai Chen, Xin Zhou, Junwei Zhang, Junda He, Xin Xia, and David Lo. 2025. Assessing and advancing benchmarks for evaluating large language models in software engineering tasks. ACM Transactions on Software Engineering and Methodology (2025)

  25. [25]

    Hugging Face. 2024. Text Generation Inference. https://github.com/huggingface/text-generation-inference. Production-ready inference server supporting speculative decoding

  26. [26]

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. 2024. OpenAI o1 system card. arXiv preprint arXiv:2412.16720 (2024)

  27. [27]

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. [n. d.]. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code. In The Thirteenth International Conference on Learning Representations

  28. [28]

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. [n. d.]. SWE-bench: Can Language Models Resolve Real-world GitHub Issues? In The Twelfth International Conference on Learning Representations

  29. [29]

    René Just, Darioush Jalali, and Michael D Ernst. 2014. Defects4J: A database of existing faults to enable controlled testing studies for Java programs. In Proceedings of the 2014 International Symposium on Software Testing and Analysis. 437–440

  30. [30]

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling Laws for Neural Language Models. arXiv:2001.08361 [cs.LG] https://arxiv.org/abs/2001.08361

  31. [31]

    Mahsa Khoshnoodi, Vinija Jain, Mingye Gao, Malavika Srikanth, and Aman Chadha. 2024. A Comprehensive Survey of Accelerated Generation Techniques in Large Language Models. arXiv:2405.13019 [cs.CL] https://arxiv.org/abs/2405.13019

  32. [32]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. CoRR (2023)

  33. [33]

    LeetCode Inc. 2026. LeetCode. https://leetcode.com. Accessed: 2026-01-27

  34. [34]

    Yaniv Leviathan, Matan Kalman, and Yossi Matias. 2023. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning. PMLR, 19274–19286

  35. [35]

    Rongao Li, Jie Fu, Bo-Wen Zhang, Tao Huang, Zhihong Sun, Chen Lyu, Guang Liu, Zhi Jin, and Ge Li. 2023. TACO: Topics in Algorithmic COde generation dataset. CoRR (2023)

  36. [36]

    Shenggui Li, Yikai Zhu, Chao Wang, Fan Yin, Shuai Shi, Yubo Wang, Yi Zhang, Yingyi Huang, Haoshuai Zheng, and Yineng Zhang. 2025. SpecForge: Train speculative decoding models effortlessly. https://github.com/sgl-project/specforge

  37. [37]

    Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. [n. d.]. EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty. In Forty-first International Conference on Machine Learning

  38. [38]

    Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. 2024. EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees. arXiv:2406.16858 [cs.CL] https://arxiv.org/abs/2406.16858

  39. [39]

    Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. 2025. EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test. arXiv:2503.01840 [cs.CL] https://arxiv.org/abs/2503.01840

  40. [40]

    Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. 2024. AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. Proceedings of Machine Learning and Systems 6 (2024), 87–100

  41. [41]

    Tianyang Liu, Canwen Xu, and Julian McAuley. [n. d.]. RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems. In The Twelfth International Conference on Learning Representations

  42. [42]

    Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, Evgenii Z...

  43. [43]

    Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, et al. 2024. SpecInfer: Accelerating large language model serving with tree-based speculative inference and verification. In Proceedings of the 29th ACM International Conference on Architectural Support for Programmi...

  44. [44]

    Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru Tang, Leandro Von Werra, and Shayne Longpre. [n. d.]. Octopack: Instruction tuning code large language models. In NeurIPS 2023 workshop on instruction tuning and instruction following

  45. [45]

    NVIDIA. 2024. TensorRT-LLM: A TensorRT Toolbox for Large Language Model Inference. https://github.com/NVIDIA/TensorRT-LLM. High-performance inference library

  46. [46]

    Gabriele Oliaro, Zhihao Jia, Daniel F Campos, and Aurick Qiao. [n. d.]. SuffixDecoding: Extreme speculative decoding for emerging AI applications. In The Thirty-ninth Annual Conference on Neural Information Processing Systems

  47. [47]

    Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, and Yizhe Zhang. [n. d.]. Training Software Engineering Agents and Verifiers with SWE-Gym. In Forty-second International Conference on Machine Learning

  48. [48]

    Peiding Wang, Fang Liu, Yinghao Zhu, Wang Xu, Lin Shi, Xiaoli Lian, Minxiao Li, Bo Shen, An Fu, and Li Zhang. [n. d.]. EfficientEdit: Accelerating Code Editing via Edit-Oriented Speculative Decoding. ([n. d.])

  49. [49]

    Andrea Santilli, Silvio Severino, Emilian Postolache, Valentino Maiorca, Michele Mancusi, Riccardo Marin, and Emanuele Rodola. 2023. Accelerating Transformer Inference for Translation via Parallel Decoding. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Lingu...

  50. [50]

    Apoorv Saxena. 2023. Prompt Lookup Decoding. https://github.com/apoorvumang/prompt-lookup-decoding/

  51. [51]

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An Instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca

  52. [52]

    Qwen Team. 2025. Qwen3 Technical Report. arXiv:2505.09388 [cs.CL] https://arxiv.org/abs/2505.09388

  53. [53]

    vLLM Team. 2025. vLLM v0.12.0 Release Notes. https://github.com/vllm-project/vllm/releases/tag/v0.12.0. Accessed: 2026-01-27

  54. [54]

    Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. [n. d.]. OpenHands: An Open Platform for AI Software Developers as Generalist Agents. In The Thirteenth International Conference on Learning Representations

  55. [55]

    Davis Wertheimer, Joshua Rosenkranz, Thomas Parnell, Sahil Suneja, Pavithra Ranganathan, Raghu Ganti, and Mudhakar Srivatsa. 2024. Accelerating Production LLMs with Combined Token/Embedding Speculators. arXiv:2404.19124 [cs.CL] https://arxiv.org/abs/2404.19124

  56. [56]

    Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, and Zhifang Sui. 2024. Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding. In ACL (Findings)

  57. [57]

    John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. SWE-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems 37 (2024), 50528–50652

  58. [58]

    Penghui Yang, Cunxiao Du, Fengzhuo Zhang, Haonan Wang, Tianyu Pang, Chao Du, and Bo An. 2025. LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification. In ES-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models

  59. [59]

    Lefan Zhang, Xiaodan Wang, Yanhua Huang, and Ruiwen Xu. [n. d.]. Learning Harmonized Representations for Speculative Sampling. In The Thirteenth International Conference on Learning Representations

  60. [60]

    Qianhui Zhao, Li Zhang, Fang Liu, Xiaoli Lian, Qiaoyuanhe Meng, Ziqian Jiao, Zetong Zhou, Jia Li, and Lin Shi. [n. d.]. FastCoder: Accelerating Repository-level Code Generation via Efficient Retrieval and Verification. ([n. d.])

  61. [61]

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems 36 (2023), 46595–46623