pith. machine review for the scientific record.

arxiv: 2605.06914 · v1 · submitted 2026-05-07 · 💻 cs.DC · cs.AI · cs.CL

Recognition: 2 theorem links


Regulating Branch Parallelism in LLM Serving

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 00:57 UTC · model grok-4.3

classification 💻 cs.DC · cs.AI · cs.CL
keywords LLM serving · branch parallelism · admission control · goodput · SLO attainment · intra-request parallelism

The pith

A per-step admission controller admits extra LLM output branches only when their predicted externality fits the current slack budget.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM serving systems can now generate multiple independent branches per request to increase throughput. Eager execution of all branches inflates shared decode steps and slows other requests, while fixed caps waste the parallelism that motivated exposing branches. The paper shows that the safe branch width changes continuously with batch composition, context lengths, and accumulated slack. TAPER predicts the extra step latency each new branch would impose and admits it only if that cost fits inside the batch's slack, treating branches as opportunistic work. Because branches share the request's prefix KV cache, changing width requires no memory reclamation, which makes per-step control practical.

Core claim

TAPER treats extra branches as opportunistic work, admitted only when the predicted branch externality fits within the batch's current slack budget. Branch-level scheduling decouples compute from memory because branches share the request's prefix KV, so expanding or contracting width requires no memory reclamation.
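Read as an editorial sketch rather than the paper's own notation, the claim suggests a per-step rule of roughly the following form; the symbols (a predicted externality, an accumulated slack, a per-step budget) are introduced here and are not taken from the paper.

```latex
% Editorial formalization of the admission rule; symbols are assumptions, not the paper's notation.
% \hat{e}(b \mid B_t): predicted externality (extra step latency) of branch b given batch B_t
% S_t: slack accumulated by step t; d: per-step latency budget implied by the SLO; \ell_t: measured step latency
\[
  \text{admit } b \text{ at step } t \iff \hat{e}(b \mid B_t) \le S_t,
  \qquad
  S_{t+1} = \max\bigl(0,\, S_t + d - \ell_t\bigr).
\]
```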

What carries the argument

TAPER, the per-step admission controller that decides branch admissions by comparing predicted branch externality against the batch's accumulated slack.
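A minimal sketch of what such a controller's inner loop could look like, with hypothetical names (`predict_externality`, `run_decode_step`, `slo_step_budget`) standing in for mechanisms the abstract does not spell out:

```python
# Editorial sketch of a slack-budgeted branch admission loop; not the paper's code.
# `predict_externality` and `run_decode_step` are hypothetical callables.

def admit_branches(candidates, batch, slack, predict_externality):
    """Admit opportunistic branches while their predicted extra latency fits the slack."""
    admitted = []
    for branch in candidates:
        cost = predict_externality(branch, batch)  # predicted extra step latency
        if cost <= slack:
            admitted.append(branch)
            batch.append(branch)   # branches share their request's prefix KV cache
            slack -= cost          # spend slack on the admitted branch
    return admitted, slack


def serve_step(batch, candidates, slack, slo_step_budget,
               predict_externality, run_decode_step):
    """One decode step: admit what fits, run the step, then refresh the slack budget."""
    admitted, slack = admit_branches(candidates, batch, slack, predict_externality)
    step_latency = run_decode_step(batch)  # measured latency of the shared decode step
    slack = max(0.0, slack + slo_step_budget - step_latency)
    return admitted, slack
```

The point the sketch tries to carry over from the paper is that width changes are compute-only: because branches share the prefix KV cache, growing or shrinking width needs no memory reclamation.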

If this is right

  • On Qwen3-32B, goodput rises 1.77× over no parallelism and 1.48× over eager execution.
  • SLO attainment remains above 95%.
  • Dynamic width adjustment is feasible because branches share prefix KV cache.
  • Regulation prevents eager admission from inflating shared decode steps for co-batched requests.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same slack-based admission logic could apply to other speculative or tree-structured generation methods.
  • Serving frameworks might embed this controller to replace static caps across different parallelism exposures.
  • More accurate externality models could further increase admitted branches without SLO risk.

Load-bearing premise

Branch externality can be predicted accurately enough from batch composition, context lengths, and accumulated slack to enable safe per-step admission without unaccounted delays.
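If that premise holds, even a simple regressor over observable batch state might be enough. A minimal sketch, assuming an illustrative feature set (batch size, total and per-branch context tokens) that is not taken from the paper:

```python
# Editorial sketch: fit a linear predictor of branch externality from offline traces.
# The feature set and training setup are illustrative assumptions, not the paper's.
import numpy as np

def featurize(batch_size, total_context_tokens, branch_context_tokens):
    return np.array([1.0, batch_size, total_context_tokens, branch_context_tokens])

def fit_externality_model(trace):
    """trace: iterable of ((batch_size, total_ctx, branch_ctx), measured_extra_latency)."""
    X = np.stack([featurize(*features) for features, _ in trace])
    y = np.array([extra_latency for _, extra_latency in trace])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares fit on offline measurements
    return coef

def predict_externality(coef, batch_size, total_context_tokens, branch_context_tokens):
    """Predicted extra step latency if one more branch with this context is admitted."""
    return float(coef @ featurize(batch_size, total_context_tokens, branch_context_tokens))
```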

What would settle it

A workload trace on which TAPER's externality predictions cause either SLO violations or lower goodput than both the no-parallelism and eager baselines.

Figures

Figures reproduced from arXiv: 2605.06914 by Christos Kozyrakis, Siva Hari, Swapnil Gandhi, William J. Dally.

Figure 1. Intra-request parallelism across workloads. Proportion of decomposable requests (PDR), parallel token share (PTS), and average branch fanout (ABF) for three datasets.
Figure 2. The throughput trap and its resolution. Four fixed step-width policies and TAPER on a mixed workload. IRP-EAGER raises throughput but collapses goodput and SLO attainment under load; the cost falls asymmetrically on requests in serial stages. TAPER dynamically adjusts its branch admission rate (panel (i)), retaining most of eager's throughput while protecting SLO attainment.
read the original abstract

Recent methods expose intra-request parallelism in LLM outputs, allowing independent branches to decode concurrently. Existing serving systems execute these branches eagerly or under fixed caps. We show that both are brittle: eager admission inflates the shared decode step, degrading co-batched requests in serial stages, while conservative fixed caps forgo the throughput that motivated exposing branches in the first place. We call the excess step latency caused by admitted branches the branch externality and show that the safe width depends on batch composition, context lengths, and accumulated slack, all of which change continuously over a workload trace. We introduce TAPER, a per-step admission controller that treats extra branches as opportunistic work, admitted only when the predicted branch externality fits within the batch's current slack budget. Per-step regulation is practical because branch-level scheduling decouples compute from memory: branches share the request's prefix KV, so expanding or contracting width requires no memory reclamation. On Qwen3-32B, TAPER improves goodput by $1.77\times$ over IRP-Off and by $1.48\times$ over IRP-Eager, while maintaining over $95\%$ SLO attainment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that eager or fixed-cap execution of intra-request branches in LLM serving is brittle because it either inflates shared decode steps or forgoes throughput. It introduces TAPER, a per-step admission controller that treats extra branches as opportunistic work admitted only when the predicted branch externality fits the batch's current slack budget (derived from batch composition, context lengths, and accumulated slack). Branch-level scheduling is enabled by shared prefix KV caches. On Qwen3-32B the system reports 1.77× goodput over IRP-Off and 1.48× over IRP-Eager while maintaining >95% SLO attainment.

Significance. If the central results hold, the work provides a concrete mechanism for dynamically regulating branch parallelism in LLM serving, addressing a practical tension between throughput and latency that existing systems handle poorly. The approach of predicting externality from observable batch state and using shared KV to decouple width changes from memory management is a pragmatic contribution that could be adopted in production serving stacks. The evaluation supplies concrete speedups on a public model against external baselines.

major comments (2)
  1. [Evaluation] Evaluation section: The reported 1.77× and 1.48× goodput gains at >95% SLO attainment rest on the accuracy of the per-step branch externality predictor, yet no prediction-error metrics, training procedure, or robustness results under workload shift are supplied; without these the speedups cannot be verified as arising from safe admission rather than optimistic prediction.
  2. [§3] §3 (Design): The slack-budget calculation and admission rule are described only at the level of 'predicted externality fits within the batch's current slack budget'; no equation, pseudocode, or precise definition of how batch composition, context lengths, and accumulated slack are combined into a numeric budget is given, making the controller non-reproducible from the text.
minor comments (2)
  1. [Abstract] The abstract states 'over 95% SLO attainment' without defining the exact SLO (e.g., per-token latency threshold) or the measurement window used in the experiments.
  2. [Evaluation] Figure captions and axis labels in the evaluation would benefit from explicit units and a brief description of the workload trace used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The two major comments identify areas where additional detail will improve verifiability and reproducibility. We address each point below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Evaluation] Evaluation section: The reported 1.77× and 1.48× goodput gains at >95% SLO attainment rest on the accuracy of the per-step branch externality predictor, yet no prediction-error metrics, training procedure, or robustness results under workload shift are supplied; without these the speedups cannot be verified as arising from safe admission rather than optimistic prediction.

    Authors: We agree that the predictor's accuracy is essential to attributing the gains to safe admission. The current evaluation focuses on end-to-end goodput and SLO attainment under the tested traces, but we will add a new subsection in the evaluation that reports the predictor's training procedure (supervised regression on offline execution traces), quantitative error metrics (MAE, over-prediction rate, and calibration plots), and robustness results under workload shifts (different arrival rates, context-length distributions, and model sizes). These additions will allow readers to verify that the observed improvements arise from conservative, accurate externality estimates rather than optimistic predictions. revision: yes

  2. Referee: [§3] §3 (Design): The slack-budget calculation and admission rule are described only at the level of 'predicted externality fits within the batch's current slack budget'; no equation, pseudocode, or precise definition of how batch composition, context lengths, and accumulated slack are combined into a numeric budget is given, making the controller non-reproducible from the text.

    Authors: We acknowledge that the current prose description in §3 is insufficiently precise for reproducibility. In the revision we will replace the high-level description with explicit equations that define the slack budget as a function of current batch composition (per-request decode costs and KV-cache occupancy), context lengths, and accumulated slack from prior steps. We will also insert pseudocode for the per-step admission decision, showing how the predicted externality is compared to the budget and how width is adjusted. These changes will make the TAPER controller fully reproducible from the text while preserving the original design intent. revision: yes

Circularity Check

0 steps flagged

No circularity: TAPER controller and reported gains are externally evaluated design choices

full rationale

The paper presents TAPER as a new per-step admission controller that admits branches only when predicted externality fits the current slack budget derived from observable batch state. The central claims are empirical goodput improvements (1.77× over IRP-Off, 1.48× over IRP-Eager at >95% SLO) measured on Qwen3-32B against external baselines. No equations, derivations, or self-citations in the abstract reduce these gains to quantities defined inside the same model or to fitted parameters; the predictor is treated as a practical implementation detail whose accuracy is validated by the end-to-end results rather than presupposed by construction. The design is evaluated against external benchmarks with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the ability to predict branch externality and estimate slack from observable batch state. The abstract states that branch-level scheduling decouples compute from memory via shared prefix KV, which is treated as a domain fact. No explicit free parameters or invented physical entities are named.

axioms (1)
  • domain assumption Branch-level scheduling decouples compute from memory because branches share the request's prefix KV cache
    Invoked to justify why per-step width changes are practical without memory reclamation.
invented entities (1)
  • branch externality (no independent evidence)
    purpose: Quantify the excess step latency imposed on co-batched requests by admitted branches
    New term introduced to explain why eager admission degrades serial stages.
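Read operationally, the term admits a one-line definition. This is an editorial rendering, not an equation from the paper:

```latex
% Editorial rendering of "branch externality" as excess shared-step latency.
% T_step(B): latency of one shared decode step over batch B; b: a candidate branch.
\[
  \operatorname{ext}(b \mid B) = T_{\text{step}}\bigl(B \cup \{b\}\bigr) - T_{\text{step}}(B).
\]
```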

pith-pipeline@v0.9.0 · 5509 in / 1464 out tokens · 81565 ms · 2026-05-11T00:57:49.395548+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

46 extracted references · 28 canonical work pages · 6 internal anchors

  1. [1]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V . Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. InProceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY , USA, 2022. Curran Associates Inc. ISBN 9781713871088

  2. [2]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.arXiv preprint arXiv:2501.12948, 2025

  3. [3]

    ReAct: Synergizing Reasoning and Acting in Language Models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models, 2023. URL https: //arxiv.org/abs/2210.03629

  4. [4]

    Skeleton-of-thought: Prompting LLMs for efficient parallel generation

    Xuefei Ning, Zinan Lin, Zixuan Zhou, Zifu Wang, Huazhong Yang, and Yu Wang. Skeleton-of- thought: Prompting llms for efficient parallel generation, 2024. URL https://arxiv.org/ abs/2307.15337

  5. [5]

    Learning to keep a promise: Scaling language model decoding parallelism with learned asynchronous decoding

    Tian Jin, Ellie Y . Cheng, Zack Ankner, Nikunj Saunshi, Blake M. Elias, Amir Yazdanbakhsh, Jonathan Ragan-Kelley, Suvinay Subramanian, and Michael Carbin. Learning to keep a promise: Scaling language model decoding parallelism with learned asynchronous decoding, 2025. URL https://arxiv.org/abs/2502.11517

  6. [6]

    Apar: Llms can do auto-parallel auto-regressive decoding, 2024

    Mingdao Liu, Aohan Zeng, Bowen Wang, Peng Zhang, Jie Tang, and Yuxiao Dong. Apar: Llms can do auto-parallel auto-regressive decoding, 2024. URL https://arxiv.org/abs/2401. 06761

  7. [7]

    Learning adaptive parallel reasoning with language models, 2025

    Jiayi Pan, Xiuyu Li, Long Lian, Charlie Snell, Yifei Zhou, Adam Yala, Trevor Darrell, Kurt Keutzer, and Alane Suhr. Learning adaptive parallel reasoning with language models, 2025. URLhttps://arxiv.org/abs/2504.15466

  8. [8]

    Aspd: Unlocking adaptive serial-parallel decoding by exploring intrinsic parallelism in llms, 2025

    Keyu Chen, Zhifeng Shen, Daohai Yu, Haoqian Wu, Wei Wen, Jianfeng He, Ruizhi Qiao, and Xing Sun. Aspd: Unlocking adaptive serial-parallel decoding by exploring intrinsic parallelism in llms, 2025. URLhttps://arxiv.org/abs/2508.08895

  9. [9]

    Multiverse: Your language models secretly decide how to parallelize and merge generation

    Xinyu Yang, Yuwei An, Hongyi Liu, Tianqi Chen, and Beidi Chen. Multiverse: Your language models secretly decide how to parallelize and merge generation, 2025. URL https://arxiv. org/abs/2506.09991

  10. [10]

    Threadweaver: Adaptive threading for efficient parallel reasoning in language models, 2025

    Long Lian, Sida Wang, Felix Juefei-Xu, Tsu-Jui Fu, Xiuyu Li, Adam Yala, Trevor Darrell, Alane Suhr, Yuandong Tian, and Xi Victoria Lin. Threadweaver: Adaptive threading for efficient parallel reasoning in language models, 2025. URLhttps://arxiv.org/abs/2512.07843

  11. [11]

    Parallelprompt: Extracting parallelism from large language model queries, 2025

    Steven Kolawole, Keshav Santhanam, Virginia Smith, and Pratiksha Thaker. Parallelprompt: Extracting parallelism from large language model queries, 2025. URL https://arxiv.org/ abs/2506.18728

  12. [12]

    A Survey on Parallel Text Generation: From Parallel Decoding to Diffusion Language Models

    Lingzhe Zhang, Liancheng Fang, Chiming Duan, Minghua He, Leyi Pan, Pei Xiao, Shiyu Huang, Yunpeng Zhai, Xuming Hu, Philip S. Yu, and Aiwei Liu. A Survey on Parallel Text Generation: From Parallel Decoding to Diffusion Language Models, 2026. URL https://arxiv.org/abs/2508.08712

  13. [13]

    Orca: A distributed serving system for Transformer-Based generative models

    Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for Transformer-Based generative models. In16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 521–538, Carlsbad, CA, July 2022. USENIX Association. ISBN 978-1-939133-28-1. URL https: //www.usenix.org/conferen...

  14. [14]

    ServeGen: Workload characterization and generation of large language model serving in production

    Yuxing Xiang, Xue Li, Kun Qian, Yan Zhang, Wenyuan Yu, Ennan Zhai, Xin Jin, and Jingren Zhou. ServeGen: Workload characterization and generation of large language model serving in production. In23rd USENIX Symposium on Networked Systems Design and Implementation (NSDI 26), pages 1845–1859, Renton, WA, May 2026. USENIX Association. ISBN 978- 1-939133-54-0....

  15. [15]

    NanoFlow: Towards optimal large language model serving throughput

    Kan Zhu, Yufei Gao, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu, Dedong Xie, Zihao Ye, Keisuke Kamahori, Chien-Yu Lin, Ziren Wang, Stephanie Wang, Arvind Krishnamurthy, and Baris Kasikci. NanoFlow: Towards optimal large language model serving throughput. In 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25), pages 749–765, Bo...

  16. [16]

    Fairness in Serving Large Language Models

    Ying Sheng, Shiyi Cao, Dacheng Li, Banghua Zhu, Zhuohan Li, Danyang Zhuo, Joseph E. Gonzalez, and Ion Stoica. Fairness in Serving Large Language Models. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 965–988, Santa Clara, CA, July 2024. USENIX Association. ISBN 978-1-939133-40-3. URL https: //www.usenix.org/confere...

  17. [17]

    Llumnix: Dynamic scheduling for large language model serving

    Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, and Wei Lin. Llumnix: Dynamic scheduling for large language model serving. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 173–191, Santa Clara, CA, July 2024. USENIX Association. ISBN 978-1-939133-40-3. URL https: //www.usenix.org/conference/os...

  18. [18]

    Efficient memory management for large language model serving with PagedAttention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th Symposium on Operating Systems Principles, SOSP ’23, page 611–626, New York, NY , USA, 2023. Association for Computing Machin...

  19. [19]

    SGLang: Efficient execution of structured language model programs

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. Sglang: efficient execution of structured language model programs. InProceedings of the 38th International Conference on Neural Information Processing Systems, NIPS ’24, Red Hook, N...

  20. [20]

    Judging LLM-as-a-judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric. P Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023

  21. [21]

    Retrieval-Augmented Generation (RAG) Dataset 12000

    Neural Bridge. Retrieval-Augmented Generation (RAG) Dataset 12000. https:// huggingface.co/datasets/neural-bridge/rag-dataset-12000, 2023

  22. [22]

    OpenR1-Math-220k

    Open-R1. OpenR1-Math-220k. https://huggingface.co/datasets/open-r1/ OpenR1-Math-220k, 2025

  23. [23]

    Azure LLM inference trace 2023

    Microsoft. Azure LLM inference trace 2023. https://github.com/Azure/ AzurePublicDataset/blob/master/AzureLLMInferenceDataset2023.md, 2024

  24. [24]

    The Economics of Welfare

    Arthur Pigou.The economics of welfare. Routledge, 2017

  25. [25]

    Sprint: Enabling interleaved planning and parallelized execution in reasoning models, 2025

    Emil Biju, Shayan Talaei, Zhemin Huang, Mohammadreza Pourreza, Azalia Mirhoseini, and Amin Saberi. Sprint: Enabling interleaved planning and parallelized execution in reasoning models, 2025. URL https://arxiv.org/abs/2506.05745

  26. [26]

    Qwen2.5 Technical Report

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 Technical Report.arXiv preprint arXiv:2412.15115, 2024

  27. [27]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  28. [28]

    Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

    Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V . Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling, 2024. URLhttps://arxiv.org/abs/2407.21787

  29. [29]

    Efficient beam search for large language models using trie-based decoding, 2025

    Brian J Chan, MaoXun Huang, Jui-Hung Cheng, Chao-Ting Chen, and Hen-Hsen Huang. Efficient beam search for large language models using trie-based decoding, 2025. URL https: //arxiv.org/abs/2502.00085

  30. [30]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models, 2023. URLhttps://arxiv.org/abs/2203.11171

  31. [31]

    Universal self-consistency for large language model generation

    Xinyun Chen, Renat Aksitov, Uri Alon, Jie Ren, Kefan Xiao, Pengcheng Yin, Sushant Prakash, Charles Sutton, Xuezhi Wang, and Denny Zhou. Universal self-consistency for large language model generation, 2023. URLhttps://arxiv.org/abs/2311.17311

  32. [32]

    Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B,

    Di Zhang, Xiaoshui Huang, Dongzhan Zhou, Yuqiang Li, and Wanli Ouyang. Accessing gpt-4 level mathematical olympiad solutions via monte carlo tree self-refine with llama-3 8b, 2024. URLhttps://arxiv.org/abs/2406.07394

  33. [33]

    Tree of thoughts: Deliberate problem solving with large language models

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: deliberate problem solving with large language models. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY , USA, 2023. Curran Associates Inc

  34. [34]

    Hogwild! Inference: Parallel LLM generation via concurrent attention

    Gleb Rodionov, Roman Garipov, Alina Shutova, George Yakushev, Erik Schultheis, Vage Egiazarian, Anton Sinitsin, Denis Kuznedelev, and Dan Alistarh. Hogwild! inference: Parallel llm generation via concurrent attention, 2025. URL https://arxiv.org/abs/2504.06261

  35. [35]

    Parallel-r1: Towards parallel thinking via reinforcement learning

    Tong Zheng, Hongming Zhang, Wenhao Yu, Xiaoyang Wang, Runpeng Dai, Rui Liu, Huiwen Bao, Chengsong Huang, Heng Huang, and Dong Yu. Parallel-r1: Towards parallel thinking via reinforcement learning, 2025. URLhttps://arxiv.org/abs/2509.07980

  36. [36]

    Fairbatching: Fairness-aware batch formation for llm inference, 2025

    Hongtao Lyu, Boyue Liu, Mingyu Wu, and Haibo Chen. Fairbatching: Fairness-aware batch formation for llm inference, 2025. URLhttps://arxiv.org/abs/2510.14392

  37. [37]

    Fast inference from transformers via speculative decoding

    Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. InProceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023

  38. [38]

    MEDUSA: Simple LLM inference acceleration framework with multiple decoding heads

    Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri Dao. MEDUSA: Simple LLM inference acceleration framework with multiple decoding heads. InProceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024

  39. [39]

    EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test

    Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test. In Annual Conference on Neural Information Processing Systems, 2025.

  40. [40]

    Optimizing speculative decoding for serving large language models using goodput,

    Xiaoxuan Liu, Jongseok Park, Langxiang Hu, Woosuk Kwon, Zhuohan Li, Chen Zhang, Kuntai Du, Xiangxi Mo, Kaichao You, Alvin Cheung, Zhijie Deng, Ion Stoica, and Hao Zhang. Turbospec: Closed-loop speculation control system for optimizing llm serving goodput, 2025. URLhttps://arxiv.org/abs/2406.14066

  41. [41]

    Adaspec: Adaptive speculative decoding for fast, slo-aware large language model serving

    Kaiyu Huang, Hao Wu, Zhubo Shi, Han Zou, Minchen Yu, and Qingjiang Shi. Adaspec: Adaptive speculative decoding for fast, slo-aware large language model serving. In Proceedings of the 2025 ACM Symposium on Cloud Computing, SoCC '25, page 361–374. ACM, November 2025. doi: 10.1145/3772052.3772239. URL http://dx.doi.org/10.1145/3772052.3772239

  42. [42]

    Adaserve: Accelerating multi-SLO LLM serving with SLO-customized speculative decoding

    Zikun Li, Zhuofu Chen, Remi Delacourt, Gabriele Oliaro, Zeyu Wang, Qinghan Chen, Shuhuai Lin, April Yang, Zhihao Zhang, Zhuoming Chen, Sean Lai, Xinhao Cheng, Xupeng Miao, and Zhihao Jia. Adaserve: Accelerating multi-slo llm serving with slo-customized speculative decoding, 2025. URLhttps://arxiv.org/abs/2501.12162

  43. [43]

    Nightjar: Dynamic adaptive speculative decoding for large language models serving, 2026

    Rui Li, Zhaoning Zhang, Libo Zhang, Huaimin Wang, Xiang Fu, and Zhiquan Lai. Nightjar: Dynamic adaptive speculative decoding for large language models serving, 2026. URL https: //arxiv.org/abs/2512.22420

  44. [44]

    Andes: Defining and Enhancing Quality-of-Experience in LLM-Based Text Streaming Services,

    Jiachen Liu, Jae-Won Chung, Zhiyu Wu, Fan Lai, Myungjin Lee, and Mosharaf Chowdhury. Andes: Defining and Enhancing Quality-of-Experience in LLM-Based Text Streaming Services,

  45. [45]

    URL https://arxiv.org/abs/2404.16283

  46. [46]

    A predictive and synergistic two-layer scheduling framework for llm serving, 2025

    Yue Zhang, Yuansheng Chen, Xuan Mo, Alex Xi, Jialun Li, and WeiGang Wu. A predictive and synergistic two-layer scheduling framework for llm serving, 2025. URL https://arxiv.org/abs/2509.23384