arxiv: 2605.10195 · v1 · submitted 2026-05-11 · 💻 cs.LG

Recognition: no theorem link

Breaking the Reward Barrier: Accelerating Tree-of-Thought Reasoning via Speculative Exploration

Haochen Huang, Meng Li, Pengfei Zuo, Runsheng Wang, Shengxuan Qiu, Shuzhang Zhong

Authors on Pith no claims yet

Pith reviewed 2026-05-12 03:48 UTC · model grok-4.3

classification 💻 cs.LG

keywords Tree-of-Thoughtspeculative explorationLLM inferencereasoning accelerationreward-guided searchearly terminationparallel searchspeculative decoding

0 comments

The pith

Speculative path selection and early termination break the reward synchronization barrier to accelerate Tree-of-Thought reasoning 1.2 to 3 times.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Tree-of-Thought reasoning turns LLM inference into a tree search that shows promise on math and programming problems but runs slowly because each branch expansion must wait for a reward signal. The paper introduces SPEX to explore paths speculatively by predicting promising branches ahead of rewards, balancing compute budgets across multiple queries, and pruning redundant deep branches early. This removes the sequential bottleneck that has kept ToT from scaling in practice. A reader would care because the same reward-guided search that improves answer quality now becomes fast enough to combine with other inference tricks and run on real workloads.

Core claim

SPEX uses intra-query speculative path selection to expand high-potential branches, inter-query budget allocation to distribute resources, and adaptive early termination to cut unneeded depth, delivering 1.2 to 3 times speedup across ToT algorithms and up to 4.1 times when paired with token-level speculative decoding, all while preserving final answer quality.

What carries the argument

The SPEX speculative exploration mechanism, which predicts and expands high-potential reasoning branches before reward verification and applies dynamic allocation plus pruning to keep the search tree efficient.

If this is right

ToT search can expand more branches in parallel without proportional increases in latency.
Complex reasoning tasks become viable under tighter response-time limits.
Existing ToT implementations gain immediate efficiency by adding the three SPEX techniques.
Token-level speculative decoding and tree-level speculation multiply to produce cumulative gains up to 4.1 times.
Reward model calls can occur less frequently while still guiding effective search.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same prediction-and-prune pattern could apply to other tree or graph search methods used with language models.
Serving systems might learn to adjust how aggressively they speculate based on observed task difficulty.
Fewer reward evaluations per query could reduce total energy use in large-scale LLM deployments.
The integration shown with SGLang indicates the methods can be added to production inference engines without major rewrites.

Load-bearing premise

That early predictions of promising branches and early termination of others will not miss correct solutions or lower final answer accuracy.

What would settle it

A controlled run on a standard math or code benchmark where SPEX at the reported speedups produces measurably lower accuracy than baseline ToT on the same model and task.

Figures

Figures reproduced from arXiv: 2605.10195 by Haochen Huang, Meng Li, Pengfei Zuo, Runsheng Wang, Shengxuan Qiu, Shuzhang Zhong.

**Figure 1.** Figure 1: Illustration of ToT and Reward Barrier. compute budgets [3, 8, 26, 47, 50–52, 69]. Two main approaches have been constructed, including Chain-of-Thought (CoT) [13, 41, 58, 72] and Tree-of-Thought (ToT) [2, 11, 16, 19, 54, 66]. CoT prompts LLMs to reason over a coherent sequence of tokens (denoted as “thought”) that serve as intermediate steps toward problem solving. ToT further generalizes CoT by mainta… view at source ↗

**Figure 2.** Figure 2: Classification of different ToT algorithms. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Example of (a) reward barrier and (b) our proposed [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 5.** Figure 5: (a) Roofline model analysis; (b) Intensity degrada [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗

**Figure 7.** Figure 7: (a) SPEX struggles with skewed trees under the RE [PITH_FULL_IMAGE:figures/full_fig_p005_7.png] view at source ↗

**Figure 8.** Figure 8: The architectural overview of SPEX. speculative resources based on query probability, KV cache reuse potential, and hardware capabilities (§4.3). Next, for each query, SPEX employs the intra-query speculative selection scheme to predict and expand the most promising future branches for speculative exploration (§4.2). Finally, to handle skewed trees, SPEX integrates the adaptive early termination strategy,… view at source ↗

**Figure 9.** Figure 9: SPEX for (a) DFS and (b) BFS. (a) SPEX simulates the next [PITH_FULL_IMAGE:figures/full_fig_p007_9.png] view at source ↗

**Figure 10.** Figure 10: The speedup and throughput (finished questions per minute) for different ToT reasoning tasks. [PITH_FULL_IMAGE:figures/full_fig_p010_10.png] view at source ↗

**Figure 11.** Figure 11: Ablation study of three techniques in SPEX for (a) [PITH_FULL_IMAGE:figures/full_fig_p010_11.png] view at source ↗

**Figure 13.** Figure 13: Overhead analysis of SPEX. necks and can be effectively composed. The combination (Baseline+MTP+SPEX) consistently yields the highest throughput. Notably, for RSTAR-10 at BS=1, combining SPEX with MTP boosts the speedup from ∼ 2.0× (MTP only) and ∼ 3.1× (SPEX only) to a remarkably ∼ 4.1×. As the batch size increases, the system transitions from memorybound to compute-bound. Consequently, the marginal g… view at source ↗

**Figure 12.** Figure 12: Orthogonality analysis of SPEX and MTP for (a) [PITH_FULL_IMAGE:figures/full_fig_p011_12.png] view at source ↗

**Figure 14.** Figure 14: Prediction accuracy evaluation. (a) Hit rate of [PITH_FULL_IMAGE:figures/full_fig_p012_14.png] view at source ↗

**Figure 15.** Figure 15: (a) Probability of speculative explorations reaching [PITH_FULL_IMAGE:figures/full_fig_p012_15.png] view at source ↗

read the original abstract

Tree-of-Thought (ToT) reasoning structures Large Language Model (LLM) inference as a tree-based search, demonstrating strong potential for solving complex mathematical and programming tasks. However, its efficiency is constrained by the reward dependency barrier -- a synchronization bottleneck caused by sequential reward-guided exploration that limits search parallelism and introduces substantial latency. Prior system optimizations, mainly designed for linear Chain-of-Thought (CoT) reasoning, cannot address these challenges, leaving the efficiency of ToT underexplored. To enhance ToT reasoning efficiency, we observe that the reasoning paths can be explored speculatively to break the reward synchronization barrier. Therefore, in this paper, we propose SPEX and introduce three key techniques: (i) intra-query speculative path selection to predict and expand high-potential branches of ToT, (ii) inter-query budget allocation to balance speculative resource allocation across queries dynamically, and (iii) adaptive early termination to prune deep and redundant branches for a skewed search tree. We implement SPEX on top of the SGLang framework and evaluate it across diverse ToT algorithms and LLMs. Extensive experiments show that SPEX achieves $1.2 \sim 3 \times$ speedup for different ToT reasoning algorithms. Moreover, SPEX synergizes with token-level speculative decoding, achieving cumulative speedups of up to $4.1\times$. Ablation studies further confirm the contributions of each technique. Overall, SPEX represents a significant step toward efficient and scalable ToT reasoning, unlocking the parallelism required for high-performance inference-time scaling for LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SPEX adds three targeted speculative tricks to Tree-of-Thought that deliver real measured speedups, but the quality preservation case still needs tighter numbers.

read the letter

The paper's main contribution is a set of three techniques—speculative intra-query path selection, dynamic inter-query budget allocation, and adaptive early termination—that let ToT break the sequential reward synchronization bottleneck. They report 1.2-3x speedups across different ToT algorithms and up to 4.1x when stacked with token-level speculative decoding, all implemented on SGLang. That combination is new enough for the ToT setting and directly attacks a latency issue that linear CoT optimizations left alone. The experiments cover multiple LLMs and algorithms, which gives the results some breadth, and the ablations are at least mentioned as confirming each piece adds something. Practical engineering work like this is useful when it ships measurable gains on real frameworks. The soft spot is exactly the one the stress-test note flags. Speculative selection and early termination are presented as pruning only redundant or low-potential branches, yet the text does not show per-task accuracy deltas, worst-case path-loss rates, or any formal argument that the predictor never drops the optimal solution under the original reward function. If the speculation model has even moderate error at deeper levels, or if thresholds are tuned per dataset, the latency wins could come with hidden quality cost or extra verification overhead. Without those numbers the speedup claims are harder to trust at face value. This paper is for people already working on inference-time scaling for reasoning models—engineers who need faster ToT without rewriting the whole search. A serious referee should see it because the problem is real, the implementation is concrete, and the quality question is fixable with more data rather than fatal to the idea.

Referee Report

3 major / 2 minor

Summary. The paper proposes SPEX to accelerate Tree-of-Thought (ToT) reasoning by breaking the reward synchronization barrier. It introduces three techniques: intra-query speculative path selection to expand high-potential branches, inter-query budget allocation for dynamic resource balancing across queries, and adaptive early termination to prune deep/redundant branches in skewed trees. Implemented on SGLang, experiments across ToT algorithms and LLMs report 1.2–3× speedups, up to 4.1× when combined with token-level speculative decoding, with ablations confirming each component's contribution.

Significance. If the reported speedups hold while preserving final answer correctness under the original ToT reward functions, this would represent a meaningful advance in inference-time scaling for complex reasoning tasks. The work directly targets the parallelism limitations of tree-based search that prior CoT-focused optimizations leave unaddressed, and the synergy with speculative decoding is a practical strength.

major comments (3)

[§4] §4 (experimental evaluation): the 1.2–3× and 4.1× speedup claims are presented without per-task accuracy deltas, error bars, or worst-case path-loss statistics comparing SPEX outputs to baseline ToT under the original reward model. This is load-bearing because adaptive early termination and speculative selection could prune viable paths; the skeptic concern that quality equivalence is not demonstrated therefore remains unaddressed by the reported ablations.
[§3.3] §3.3 (adaptive early termination): the technique is described as pruning 'deep and redundant branches' and 'high-potential branches' via prediction, yet no formal argument, invariant, or exhaustive enumeration is supplied showing that pruned branches never contain the optimal solution under the original ToT reward function. If the predictor has even modest error at deeper levels, the latency gains could be offset by hidden verification cost or accuracy loss.
[§3.1–3.2] §3.1–3.2 (speculative path selection and budget allocation): the intra-query and inter-query mechanisms rely on predictors whose error rates and fallback costs are not quantified; without these measurements it is impossible to verify that the claimed cumulative speedups are net-positive once any required re-verification or re-expansion is included.

minor comments (2)

[Abstract] The abstract and introduction would benefit from an explicit statement of the baseline ToT implementations and reward models used, to allow readers to assess the generality of the 1.2–3× range.
[§3] Notation for the speculative predictor and early-termination threshold is introduced without a consolidated table of symbols or hyperparameters, making it harder to reproduce the exact conditions under which the speedups were measured.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential of SPEX to address parallelism limitations in Tree-of-Thought reasoning. We address each major comment below and describe the revisions we will make to provide stronger empirical validation and analysis.

read point-by-point responses

Referee: [§4] §4 (experimental evaluation): the 1.2–3× and 4.1× speedup claims are presented without per-task accuracy deltas, error bars, or worst-case path-loss statistics comparing SPEX outputs to baseline ToT under the original reward model. This is load-bearing because adaptive early termination and speculative selection could prune viable paths; the skeptic concern that quality equivalence is not demonstrated therefore remains unaddressed by the reported ablations.

Authors: We agree that explicit demonstration of quality equivalence is essential. While our internal evaluations confirmed that final answers match baseline ToT under the original reward model across tasks, the manuscript reports only aggregate speedups and component ablations without per-task accuracy deltas, error bars, or worst-case path statistics. In the revised version we will add tables reporting accuracy for each task and LLM, standard deviations from repeated runs, and analysis of path divergence cases to confirm that correctness is preserved. revision: yes
Referee: [§3.3] §3.3 (adaptive early termination): the technique is described as pruning 'deep and redundant branches' and 'high-potential branches' via prediction, yet no formal argument, invariant, or exhaustive enumeration is supplied showing that pruned branches never contain the optimal solution under the original ToT reward function. If the predictor has even modest error at deeper levels, the latency gains could be offset by hidden verification cost or accuracy loss.

Authors: The adaptive early termination employs a learned predictor to estimate whether further expansion of a branch is unlikely to improve the reward. A strict formal invariant guaranteeing that no optimal path is pruned is difficult to establish without restrictive assumptions on the reward function and LLM behavior. However, our experiments across diverse tasks show no accuracy degradation. We will expand §3.3 with additional details on predictor training, validation metrics, and a discussion of empirical safety conditions and fallback costs. revision: partial
Referee: [§3.1–3.2] §3.1–3.2 (speculative path selection and budget allocation): the intra-query and inter-query mechanisms rely on predictors whose error rates and fallback costs are not quantified; without these measurements it is impossible to verify that the claimed cumulative speedups are net-positive once any required re-verification or re-expansion is included.

Authors: We acknowledge that the error rates of the predictors and the overhead of any corrective re-expansions or re-verifications were not quantified in the initial submission. The manuscript focuses on end-to-end speedups and ablations, but additional breakdown is needed to confirm net gains. In the revision we will include new measurements of predictor accuracy, frequency of fallback events, and their latency contributions, allowing verification that the reported speedups remain positive after accounting for these costs. revision: yes

Circularity Check

0 steps flagged

No significant circularity in algorithmic proposal or empirical claims

full rationale

The paper proposes SPEX as a set of new algorithmic techniques (intra-query speculative path selection, inter-query budget allocation, adaptive early termination) implemented on SGLang and evaluated empirically on ToT algorithms and LLMs. Speedup claims (1.2-3x, up to 4.1x cumulative) rest on reported measurements and ablation studies rather than any equations, fitted parameters, or derivations. No self-citations are invoked as load-bearing premises, no uniqueness theorems are imported, and no ansatz or renaming of known results occurs. The central claims are self-contained experimental results without reduction to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the contribution is presented as algorithmic engineering rather than new theoretical primitives.

pith-pipeline@v0.9.0 · 5603 in / 1085 out tokens · 68623 ms · 2026-05-12T03:48:00.148699+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · 7 internal anchors

[1]

Llemma: An open language model for mathematics

Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen McAleer, Albert Q Jiang, Jia Deng, Stella Biderman, and Sean Welleck. Llemma: An open language model for mathematics.arXiv preprint arXiv:2310.10631, 2023

work page arXiv 2023
[2]

Scal- ing test-time compute with open models

Edward Beeching, Lewis Tunstall, and Sasha Rush. Scal- ing test-time compute with open models. 2024. Hugging Face Technical Report

work page 2024
[3]

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher Ré, and Azalia Mirho- seini. Large language monkeys: Scaling inference compute with repeated sampling.arXiv preprint arXiv:2407.21787, 2024

work page internal anchor Pith review arXiv 2024
[4]

brown university math olympiad 2025

BRUMO. brown university math olympiad 2025. https://www.brumo.org/, 2025

work page 2025
[5]

Monte-carlo tree search: A new frame- work for game ai

Guillaume Chaslot, Sander Bakkes, Istvan Szita, and Pieter Spronck. Monte-carlo tree search: A new frame- work for game ai. InProceedings of the AAAI Confer- ence on Artificial Intelligence and Interactive Digital Entertainment, volume 4, pages 216–217, 2008

work page 2008
[6]

Accelerating Large Language Model Decoding with Speculative Sampling

Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with specu- lative sampling.arXiv preprint arXiv:2302.01318, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[7]

Mcc-kd: Multi-cot consistent knowledge distillation.arXiv preprint arXiv:2310.14747, 2023

Hongzhan Chen, Siyue Wu, Xiaojun Quan, Rui Wang, Ming Yan, and Ji Zhang. Mcc-kd: Multi-cot consistent knowledge distillation.arXiv preprint arXiv:2310.14747, 2023

work page arXiv 2023
[8]

Are more llm calls all you need? towards the scaling properties of compound ai systems.Advances in Neu- ral Information Processing Systems, 37:45767–45790, 2024

Lingjiao Chen, Jared Quincy Davis, Boris Hanin, Peter Bailis, Ion Stoica, Matei A Zaharia, and James Y Zou. Are more llm calls all you need? towards the scaling properties of compound ai systems.Advances in Neu- ral Information Processing Systems, 37:45767–45790, 2024

work page 2024
[9]

Skip-thinking: Chunk-wise chain- of-thought distillation enable smaller language mod- els to reason better and faster.arXiv preprint arXiv:2505.18642, 2025

Xiao Chen, Sihang Zhou, Ke Liang, Xiaoyu Sun, and Xinwang Liu. Skip-thinking: Chunk-wise chain- of-thought distillation enable smaller language mod- els to reason better and faster.arXiv preprint arXiv:2505.18642, 2025

work page arXiv 2025
[10]

Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs

Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jian- hui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, et al. Do not think that much for 2+ 3=? on the overthinking of o1-like llms.arXiv preprint arXiv:2412.21187, 2024

work page internal anchor Pith review arXiv 2024
[11]

Self-playing adversarial language game enhances llm reasoning

Pengyu Cheng, Yong Dai, Tianhao Hu, Han Xu, Zhisong Zhang, Lei Han, Nan Du, and Xiaolong Li. Self-playing adversarial language game enhances llm reasoning. Advances in Neural Information Processing Systems, 37:126515–126543, 2024

work page 2024
[12]

Speculative monte-carlo tree search

Scott Cheng, Mahmut Taylan Kandemir, and Ding-Yong Hong. Speculative monte-carlo tree search. InAdvances in Neural Information Processing Systems, pages 88664– 88683, 2024

work page 2024
[13]

Navigate through enigmatic labyrinth a survey of chain of thought reasoning: Advances, frontiers and future,

Zheng Chu, Jingchang Chen, Qianglong Chen, Wei- jiang Yu, Tao He, Haotian Wang, Weihua Peng, Ming Liu, Bing Qin, and Ting Liu. Navigate through enig- matic labyrinth a survey of chain of thought reason- ing: Advances, frontiers and future.arXiv preprint arXiv:2309.15402, 2023

work page arXiv 2023
[14]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plap- pert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[15]

Flashattention: Fast and memory- efficient exact attention with io-awareness.Advances in neural information processing systems, 35:16344– 16359, 2022

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory- efficient exact attention with io-awareness.Advances in neural information processing systems, 35:16344– 16359, 2022

work page 2022
[16]

arXiv preprint arXiv:2309.17179 , year=

Xidong Feng, Ziyu Wan, Muning Wen, Stephen Mar- cus McAleer, Ying Wen, Weinan Zhang, and Jun Wang. Alphazero-like tree-search can guide large lan- guage model decoding and training.arXiv preprint arXiv:2309.17179, 2023

work page arXiv 2023
[17]

Efficiently serving llm reasoning programs with certaindex

Yichao Fu, Junda Chen, Siqi Zhu, Zheyu Fu, Zhong- dongming Dai, Yonghao Zhuang, Yian Ma, Aurick Qiao, Tajana Rosing, Ion Stoica, et al. Efficiently scal- ing llm reasoning with certaindex.arXiv preprint arXiv:2412.20993, 2024

work page arXiv 2024
[18]

Deep think with confidence, 2025

Yichao Fu, Xuewei Wang, Yuandong Tian, and Jiawei Zhao. Deep think with confidence, 2025

work page 2025
[19]

Interpretable contrastive monte carlo tree search reasoning

Zitian Gao, Boye Niu, Xuzheng He, Haotian Xu, Hongzhang Liu, Aiwei Liu, Xuming Hu, and Lijie Wen. Interpretable contrastive monte carlo tree search reason- ing.arXiv preprint arXiv:2410.01707, 2024

work page arXiv 2024
[20]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025. 13

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset.arXiv preprint arXiv:2103.03874, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[22]

Harvard-MIT Mathematics Tour- nament (HMMT) February 2025

HMMT Organization. Harvard-MIT Mathematics Tour- nament (HMMT) February 2025. https://www.hmmt. org, 2025

work page 2025
[23]

Ets: Efficient tree search for inference- time scaling.arXiv preprint arXiv:2502.13575, 2025

Coleman Hooper, Sehoon Kim, Suhong Moon, Kerem Dilmen, Monishwaran Maheswaran, Nicholas Lee, Michael W Mahoney, Sophia Shao, Kurt Keutzer, and Amir Gholami. Ets: Efficient tree search for inference- time scaling.arXiv preprint arXiv:2502.13575, 2025

work page arXiv 2025
[24]

arXiv preprint arXiv:2504.01296 , year=

Bairu Hou, Yang Zhang, Jiabao Ji, Yujian Liu, Kaizhi Qian, Jacob Andreas, and Shiyu Chang. Thinkprune: Pruning long chain-of-thought of llms via reinforcement learning.arXiv preprint arXiv:2504.01296, 2025

work page arXiv 2025
[25]

Specmcts: Accelerating monte carlo tree search us- ing speculative tree traversal.IEEE Access, 9:142195– 142205, 2021

Juhwan Kim, Byeongmin Kang, and Hyungmin Cho. Specmcts: Accelerating monte carlo tree search us- ing speculative tree traversal.IEEE Access, 9:142195– 142205, 2021

work page 2021
[26]

Large language models are zero-shot reasoners.Advances in neural in- formation processing systems, 35:22199–22213, 2022

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yu- taka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners.Advances in neural in- formation processing systems, 35:22199–22213, 2022

work page 2022
[27]

Efficient memory manage- ment for large language model serving with pagedatten- tion

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory manage- ment for large language model serving with pagedatten- tion. InProceedings of the 29th symposium on operating systems principles, pages 611–626, 2023

work page 2023
[28]

Fast inference from transformers via speculative decoding

Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pages 19274–19286. PMLR, 2023

work page 2023
[29]

Large language model inference acceler- ation: A comprehensive hardware perspective.arXiv preprint arXiv:2410.04466, 2024

Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, et al. Large language model inference acceler- ation: A comprehensive hardware perspective.arXiv preprint arXiv:2410.04466, 2024

work page arXiv 2024
[30]

Orches: Or- chestrated test-time-compute-based llm reasoning on collaborative gpu-pim heterogeneous system

Sixu Li, Yuzhou Chen, Chaojian Li, Yonggan Fu, Zheng Wang, Zhongzhi Yu, Haoran You, Zhifan Ye, Wei Zhou, Yongan Zhang, and Yingyan (Celine) Lin. Orches: Or- chestrated test-time-compute-based llm reasoning on collaborative gpu-pim heterogeneous system. InPro- ceedings of the 58th IEEE/ACM International Sympo- sium on Microarchitecture, MICRO ’25, page 476...

work page 2025
[31]

Making language models better reasoners with step-aware verifier

Yifei Li, Zeqi Lin, Shizhuo Zhang, Qiang Fu, Bei Chen, Jian-Guang Lou, and Weizhu Chen. Making language models better reasoners with step-aware verifier. InPro- ceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5315–5333, 2023

work page 2023
[32]

Escape sky-high cost: Early-stopping self-consistency for multi-step reasoning.arXiv preprint arXiv:2401.10480, 2024

Yiwei Li, Peiwen Yuan, Shaoxiong Feng, Boyuan Pan, Xinglin Wang, Bin Sun, Heda Wang, and Kan Li. Escape sky-high cost: Early-stopping self-consistency for multi- step reasoning.arXiv preprint arXiv:2401.10480, 2024

work page arXiv 2024
[33]

Reward-guided speculative decoding for efficient llm reasoning

Baohao Liao, Yuhui Xu, Hanze Dong, Junnan Li, Christof Monz, Silvio Savarese, Doyen Sahoo, and Caiming Xiong. Reward-guided speculative de- coding for efficient llm reasoning.arXiv preprint arXiv:2501.19324, 2025

work page arXiv 2025
[34]

Let’s verify step by step

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InThe Twelfth International Conference on Learning Representations, 2023

work page 2023
[35]

Trimr: Verifier-based training-free thinking compression for efficient test-time scaling

Weizhe Lin, Xing Li, Zhiyuan Yang, Xiaojin Fu, Hui- Ling Zhen, Yaoyuan Wang, Xianzhi Yu, Wulong Liu, Xiaosong Li, and Mingxuan Yuan. Trimr: Verifier-based training-free thinking compression for efficient test-time scaling.arXiv preprint arXiv:2505.17155, 2025

work page arXiv 2025
[36]

Can 1b llm surpass 405b llm? rethinking compute-optimal test-time scaling.arXiv preprint arXiv:2502.06703, 2025

Runze Liu, Junqi Gao, Jian Zhao, Kaiyan Zhang, Xiu Li, Biqing Qi, Wanli Ouyang, and Bowen Zhou. Can 1b llm surpass 405b llm? rethinking compute-optimal test-time scaling.arXiv preprint arXiv:2502.06703, 2025

work page arXiv 2025
[37]

Prolonged reasoning is not all you need: Certainty-based adaptive routing for efficient LLM/MLLM reasoning.arXiv preprint arXiv:2505.15154, 2025

Jinghui Lu, Haiyang Yu, Siliang Xu, Shiwei Ran, Guozhi Tang, Siqi Wang, Bin Shan, Teng Fu, Hao Feng, Jingqun Tang, et al. Prolonged reasoning is not all you need: Certainty-based adaptive routing for efficient llm/mllm reasoning.arXiv preprint arXiv:2505.15154, 2025

work page arXiv 2025
[38]

Autol2s: Auto long-short reasoning for efficient large language models

Feng Luo, Yu-Neng Chuang, Guanchu Wang, Hoang Anh Duy Le, Shaochen Zhong, Hongyi Liu, Jiayi Yuan, Yang Sui, Vladimir Braverman, Vipin Chaudhary, et al. Autol2s: Auto long-short reasoning for efficient large language models.arXiv preprint arXiv:2505.22662, 2025

work page arXiv 2025
[39]

American Invita- tional Mathematics Examination (AIME) 2024

Mathematical Association of America. American Invita- tional Mathematics Examination (AIME) 2024. https: //www.maa.org/math-competitions/aime, 2024

work page 2024
[40]

American Invita- tional Mathematics Examination (AIME) 2025

Mathematical Association of America. American Invita- tional Mathematics Examination (AIME) 2025. https: //www.maa.org/math-competitions/aime, 2025. 14

work page 2025
[41]

Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gus- tavo Hernández Ábrego, Ji Ma, Vincent Y

Xuefei Ning, Zinan Lin, Zixuan Zhou, Zifu Wang, Huazhong Yang, and Yu Wang. Skeleton-of-thought: Prompting llms for efficient parallel generation.arXiv preprint arXiv:2307.15337, 2023

work page arXiv 2023
[42]

Learning to reason with llms

OpenAI. Learning to reason with llms. Technical report, OpenAI, 2024. Technical Report

work page 2024
[43]

OpenAI O3-mini System Card

OpenAI. OpenAI O3-mini System Card. Technical report, OpenAI, January 2025. Technical Report

work page 2025
[44]

Specrea- son: Fast and accurate inference-time compute via speculative reasoning

Rui Pan, Yinwei Dai, Zhihao Zhang, Gabriele Oliaro, Zhihao Jia, and Ravi Netravali. Specreason: Fast and ac- curate inference-time compute via speculative reasoning. arXiv preprint arXiv:2504.07891, 2025

work page arXiv 2025
[45]

Optimizing anytime reasoning via budget relative policy optimization

Penghui Qi, Zichen Liu, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Optimizing anytime rea- soning via budget relative policy optimization.arXiv preprint arXiv:2505.13438, 2025

work page arXiv 2025
[46]

Mutual reasoning makes smaller llms stronger problem-solvers,

Zhenting Qi, Mingyuan Ma, Jiahang Xu, Li Lyna Zhang, Fan Yang, and Mao Yang. Mutual reasoning makes smaller llms stronger problem-solvers.arXiv preprint arXiv:2408.06195, 2024

work page arXiv 2024
[47]

TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling,

Jiahao Qiu, Yifu Lu, Yifan Zeng, Jiacheng Guo, Jiayi Geng, Huazheng Wang, Kaixuan Huang, Yue Wu, and Mengdi Wang. Treebon: Enhancing inference-time alignment with speculative tree-search and best-of-n sampling.arXiv preprint arXiv:2410.16033, 2024

work page arXiv 2024
[48]

Scalable fsm parallelization via path fusion and higher-order speculation

Junqiao Qiu, Xiaofan Sun, Amir Hossein Nodehi Sabet, and Zhijia Zhao. Scalable fsm parallelization via path fusion and higher-order speculation. InProceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’21, page 887–901, New York, NY , USA, 2021. Association for Computing Machinery

work page 2021
[49]

Speccot: Accelerating chain-of-thought reasoning through speculative exploration

Junhan Shi, Yijia Zhu, Zhenning Shi, Dan Zhao, Qing Li, and Yong Jiang. Speccot: Accelerating chain-of-thought reasoning through speculative exploration. InES-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models

work page
[50]

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Ku- mar. Scaling llm test-time compute optimally can be more effective than scaling model parameters.arXiv preprint arXiv:2408.03314, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[51]

Sky-t1: Train your own o1 preview model within $450, 2025

NovaSky Team. Sky-t1: Train your own o1 preview model within $450, 2025

work page 2025
[52]

Qwq: Reflect deeply on the boundaries of the unknown.Hugging Face, 2024

Qwen Team. Qwq: Reflect deeply on the boundaries of the unknown.Hugging Face, 2024

work page 2024
[53]

Adapthink: Adaptive thinking pref- erences for reasoning language model.arXiv preprint arXiv:2506.18237, 2025

Xu Wan, Wei Wang, Wenyue Xu, Wotao Yin, Jie Song, and Mingyang Sun. Adapthink: Adaptive thinking pref- erences for reasoning language model.arXiv preprint arXiv:2506.18237, 2025

work page arXiv 2025
[54]

Q*: Improving multi-step reasoning for llms with deliberative planning,

Chaojie Wang, Yanchen Deng, Zhiyi Lyu, Liang Zeng, Jujie He, Shuicheng Yan, and Bo An. Q*: Improving multi-step reasoning for llms with deliberative planning. arXiv preprint arXiv:2406.14283, 2024

work page arXiv 2024
[55]

Openr: An open source framework for advanced reasoning with large language models.arXiv preprint arXiv:2410.09671, 2024

Jun Wang, Meng Fang, Ziyu Wan, Muning Wen, Jiachen Zhu, Anjie Liu, Ziqin Gong, Yan Song, Lei Chen, Li- onel M Ni, et al. Openr: An open source framework for advanced reasoning with large language models.arXiv preprint arXiv:2410.09671, 2024

work page arXiv 2024
[56]

R1-compress: Long chain-of-thought compression via chunk compression and search

Yibo Wang, Li Shen, Huanjin Yao, Tiansheng Huang, Rui Liu, Naiqiang Tan, Jiaxing Huang, Kai Zhang, and Dacheng Tao. R1-compress: Long chain-of-thought compression via chunk compression and search.arXiv preprint arXiv:2505.16838, 2025

work page arXiv 2025
[57]

Adaptive deep reasoning: Triggering deep thinking when needed.arXiv preprint arXiv:2505.20101, 2025

Yunhao Wang, Yuhao Zhang, Tinghao Yu, Can Xu, Feng Zhang, and Fengzong Lian. Adaptive deep reasoning: Triggering deep thinking when needed.arXiv preprint arXiv:2505.20101, 2025

work page arXiv 2025
[58]

Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information pro- cessing systems, 35:24824–24837, 2022

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information pro- cessing systems, 35:24824–24837, 2022

work page 2022
[59]

Inference scaling laws: An empirical analysis of compute-optimal inference for llm problem- solving

Yangzhen Wu, Zhiqing Sun, Shanda Li, Sean Welleck, and Yiming Yang. Inference scaling laws: An empirical analysis of compute-optimal inference for llm problem- solving. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[60]

Heming Xia, Yongqi Li, Jun Zhang, Cunxiao Du, and Wenjie Li

Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, and Zhifang Sui. Unlocking efficiency in large language model infer- ence: A comprehensive survey of speculative decoding. arXiv preprint arXiv:2401.07851, 2024

work page arXiv 2024
[61]

Lillicrap, Kenji Kawaguchi, and Michael Shieh

Yuxi Xie, Anirudh Goyal, Wenyue Zheng, Min-Yen Kan, Timothy P Lillicrap, Kenji Kawaguchi, and Michael Shieh. Monte carlo tree search boosts reasoning via iterative preference learning.arXiv preprint arXiv:2405.00451, 2024

work page arXiv 2024
[62]

Qwen3 technical report, 2025

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chen- gen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jian- wei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren 15 Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin...

work page 2025
[63]

Speculative thinking: En- hancing small-model reasoning with large model guidance at inference time

Wang Yang, Xiang Yue, Vipin Chaudhary, and Xiao- tian Han. Speculative thinking: Enhancing small-model reasoning with large model guidance at inference time. arXiv preprint arXiv:2504.12329, 2025

work page arXiv 2025
[64]

Tree of thoughts: Deliberate problem solving with large lan- guage models.Advances in neural information process- ing systems, 36:11809–11822, 2023

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large lan- guage models.Advances in neural information process- ing systems, 36:11809–11822, 2023

work page 2023
[65]

Flashin- fer: Efficient and customizable attention engine for llm inference serving.arXiv preprint arXiv:2501.01005, 2025

Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, et al. Flashin- fer: Efficient and customizable attention engine for llm inference serving.arXiv preprint arXiv:2501.01005, 2025

work page arXiv 2025
[66]

Advancing llm reasoning generalists with preference trees

Lifan Yuan, Ganqu Cui, Hanbin Wang, Ning Ding, Xingyao Wang, Jia Deng, Boji Shan, Huimin Chen, Ruobing Xie, Yankai Lin, et al. Advancing llm rea- soning generalists with preference trees.arXiv preprint arXiv:2404.02078, 2024

work page arXiv 2024
[67]

Don’t overthink it: A survey of ef- ficient r1-style large reasoning models.arXiv preprint arXiv:2508.02120, 2025

Linan Yue, Yichao Du, Yizhi Wang, Weibo Gao, Fangzhou Yao, Li Wang, Ye Liu, Ziyu Xu, Qi Liu, Shimin Di, et al. Don’t overthink it: A survey of ef- ficient r1-style large reasoning models.arXiv preprint arXiv:2508.02120, 2025

work page arXiv 2025
[68]

Rest-mcts*: Llm self- training via process reward guided tree search.Advances in Neural Information Processing Systems, 37:64735– 64772, 2024

Dan Zhang, Sining Zhoubian, Ziniu Hu, Yisong Yue, Yuxiao Dong, and Jie Tang. Rest-mcts*: Llm self- training via process reward guided tree search.Advances in Neural Information Processing Systems, 37:64735– 64772, 2024

work page 2024
[69]

More agents is all you need.Transactions on Machine Learning Research

Qin Zhang, Yangbin Yu, QIANG FU, Deheng Ye, et al. More agents is all you need.Transactions on Machine Learning Research

work page
[70]

Can pruning improve reasoning? revisiting long-cot compression with capability in mind

Shangziqi Zhao, Jiahao Yuan, Guisong Yang, and Us- man Naseem. Can pruning improve reasoning? revisit- ing long-cot compression with capability in mind for bet- ter reasoning.arXiv preprint arXiv:2505.14582, 2025

work page arXiv 2025
[71]

Sglang: Efficient execution of structured language model programs.Advances in neural information pro- cessing systems, 37:62557–62583, 2024

Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Livia Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. Sglang: Efficient execution of structured language model programs.Advances in neural information pro- cessing systems, 37:62557–62583, 2024

work page 2024
[72]

Generalizable chain-of-thought prompting in mixed-task scenarios with large language models.arXiv preprint arXiv:2310.06692, 2023

Anni Zou, Zhuosheng Zhang, Hai Zhao, and Xiangru Tang. Generalizable chain-of-thought prompting in mixed-task scenarios with large language models.arXiv preprint arXiv:2310.06692, 2023. 16

work page arXiv 2023