pith. machine review for the scientific record.

arxiv: 2605.11186 · v1 · submitted 2026-05-11 · 💻 cs.LG · cs.AI

Recognition: no theorem link

CATS: Cascaded Adaptive Tree Speculation for Memory-Limited LLM Inference Acceleration

Dylan Zhao, Jingwei Sun, Yangchenchen Jin, Yuning Han

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 03:58 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords speculative decoding · LLM acceleration · memory-constrained inference · edge devices · cascaded verification · self-speculative decoding · token acceptance · parameter offloading

The pith

CATS achieves up to a 5.08x wall-clock speedup for LLM inference on memory-limited edge devices via cascaded adaptive tree speculation, without increasing peak memory usage or degrading generation quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the memory bottleneck in auto-regressive LLM decoding on edge platforms, where the fast memory available for model weights is limited and parameters must be offloaded to slower storage. It proposes CATS, a self-speculative decoding method that performs cascaded verification and correction of draft tokens, tailored to the device's memory budget and parameter offloading patterns. This allows multiple tokens to be checked in parallel without reserving space for a separate draft model, keeping peak memory at the level of the target model alone. Experiments on real edge devices with several models and benchmarks demonstrate up to a 5.08x speedup in wall-clock time with unchanged generation quality, exceeding state-of-the-art approaches by up to 1.45x. Readers would care because it makes high-quality LLM inference feasible on hardware previously too constrained for efficient speculative methods.

Core claim

CATS is a self-speculative decoding framework for memory-limited devices that conducts cascaded verification and correction based on the memory budget and parameter offloading patterns. This design maximizes token acceptance rate and end-to-end speedup while keeping the peak memory footprint on the device equal to that of the target model alone. Evaluations show a wall-clock speedup of up to 5.08x with no degradation in generation quality, outperforming the SOTA method by up to 1.45x under edge memory constraints.

What carries the argument

Cascaded adaptive tree speculation: a staged verification process adapted to memory constraints and offloading patterns that enables self-speculation by maximizing accepted draft tokens per target model invocation.
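
To make the mechanism concrete, here is a minimal sketch, in Python, of what a cascaded self-speculative decoding cycle could look like. This is not the authors' code: draft_tokens, shallow_verify, and target_verify are hypothetical stand-ins for the draft sub-network, the intermediate shallow-verification layers, and the full target model described in the paper's Figure 4, and the real system additionally schedules flash↔DRAM weight transfers around these calls.

from typing import Callable, List

def cascaded_self_speculative_step(
    context: List[int],
    draft_tokens: Callable[[List[int], int], List[int]],    # draft sub-network (early layers)
    shallow_verify: Callable[[List[int], List[int]], int],   # intermediate layers: length of prefix they endorse
    target_verify: Callable[[List[int], List[int]], tuple],  # full model: (accepted prefix length, correction token)
    num_draft: int = 8,
) -> List[int]:
    """One decoding cycle: draft cheaply, prune with a shallow pass, then
    run the expensive target model once over the surviving candidates."""
    # Stage 1: the cheap draft sub-network proposes candidate tokens.
    drafts = draft_tokens(context, num_draft)

    # Stage 2: intermediate layers discard unlikely suffixes early, so the
    # target model only verifies candidates that are likely to be accepted.
    survivors = drafts[: shallow_verify(context, drafts)]

    # Stage 3: a single target-model call verifies all survivors in parallel
    # and supplies one corrected token after the accepted prefix, keeping the
    # output consistent with standard decoding.
    accepted_len, correction = target_verify(context, survivors)
    return survivors[:accepted_len] + [correction]

Every cycle emits at least one token (the correction), so in the worst case it degrades to ordinary decoding; every additional accepted draft token is one full-model pass over the offloaded weights amortized away.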

If this is right

  • Fewer invocations of the full target model are needed per output token due to higher acceptance rates (quantified in the note after this list).
  • Inference runs faster in real time on edge devices without hardware upgrades.
  • Output quality stays the same as standard decoding since incorrect drafts are corrected.
  • The framework works across multiple LLM architectures and evaluation benchmarks.
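
The first point above can be quantified with the standard chain-speculation analysis of Leviathan et al. (reference [3] below); this is background, not a derivation from the CATS paper. If each drafted token is accepted independently with probability α and k tokens are drafted per cycle, the expected number of tokens emitted per target-model call is

    \mathbb{E}[\text{tokens per target call}] \;=\; \frac{1 - \alpha^{k+1}}{1 - \alpha},

so raising the acceptance rate α, which is what the cascaded verification is designed to do, directly reduces how often the full model and its offloaded weights must be touched per generated token.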

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar cascaded approaches might improve efficiency in other generative models facing memory limits, such as diffusion models.
  • Integrating with dynamic offloading could further optimize for fluctuating memory availability during long generations.
  • This opens possibilities for deploying larger LLMs on consumer hardware by reducing the effective memory requirement for acceleration techniques.

Load-bearing premise

That conducting cascaded verification and correction based on the memory budget and parameter offloading patterns maximizes token acceptance rate and end-to-end speedup while keeping the peak memory footprint on the device equal to that of the target model alone.

What would settle it

Deploying CATS on a specific edge device with a known memory limit, running inference on a benchmark, and verifying whether the measured wall-clock speedup approaches 5x, whether peak memory usage stays within the target model's footprint, and whether quality metrics such as perplexity remain equivalent to the baseline model.
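
A minimal sketch of that check, in Python, assuming hypothetical callables generate_baseline and generate_cats (prompt in, generated token ids out), a peak_memory_bytes probe for the device, and a known target_model_bytes footprint; none of these names come from the paper.

import time
from typing import Callable, List

def timed_generate(generate_fn: Callable[[str], List[int]], prompts: List[str]):
    """Run a generation function over prompts, returning outputs and wall-clock seconds."""
    start = time.perf_counter()
    outputs = [generate_fn(p) for p in prompts]
    return outputs, time.perf_counter() - start

def settle_it(generate_baseline, generate_cats, peak_memory_bytes, prompts, target_model_bytes):
    # 1) Wall-clock speedup: does it approach the reported ~5x on this device?
    base_out, base_t = timed_generate(generate_baseline, prompts)
    cats_out, cats_t = timed_generate(generate_cats, prompts)
    speedup = base_t / cats_t

    # 2) Memory: peak usage during CATS decoding should not exceed the
    #    footprint of the target model alone.
    memory_ok = peak_memory_bytes() <= target_model_bytes

    # 3) Quality: under strict (lossless) acceptance and greedy decoding the
    #    outputs should match the baseline token-for-token; under sampling or
    #    relaxed acceptance one would compare perplexity or benchmark scores.
    quality_ok = all(b == c for b, c in zip(base_out, cats_out))

    return {"speedup": speedup, "memory_ok": memory_ok, "quality_ok": quality_ok}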

Figures

Figures reproduced from arXiv: 2605.11186 by Dylan Zhao, Jingwei Sun, Yangchenchen Jin, Yuning Han.

Figure 1. End-to-end speedup across five …
Figure 2. Memory hierarchy on server vs. edge, with measured per-token latency breakdown for …
Figure 3. End-to-end speedup of speculative decoding methods on B200 vs. Jetson AGX Orin.
Figure 4. Pipeline of CATS. Left: three-stage decoding cycle showing the interleaved flash↔DRAM transfers and GPU computation. The bottommost blue block (draft sub-network) and the intermediate SV layers (middle blue block) are streamed from flash once per full inference cycle during the drafting and shallow verification process (regarded as a SD process); the target layers (top blue block, rest of the models) are s…
Figure 5. MT-Bench quality–speed comparison under relaxed acceptance. Quality is measured by …
read the original abstract

Auto-regressive decoding in Large Language Models (LLMs) is inherently memory-bound: every generation step requires loading the model weights and intermediate results from memory (e.g., High-Bandwidth Memory (HBM) for GPU servers), making throughput bottlenecked by memory bandwidth rather than compute. Speculative decoding addresses this by enabling parallel verification of multiple draft tokens, effectively amortizing the cost of each target-model call. However, existing speculative decoding methods are designed under the assumption that HBM is sufficiently large to hold both the target model and an auxiliary draft model simultaneously -- an assumption that breaks down on memory-constrained devices such as edge platforms with limited DRAM. We analyze the inference bottleneck in this memory-limited regime and propose CATS, a self-speculative decoding framework that conducts cascaded verification and correction based on the memory budget and parameter offloading patterns on memory-limited devices. This design maximizes token acceptance rate and end-to-end speedup while keeping the peak memory footprint on the device equal to that of the target model alone. We evaluate CATS on different models across five benchmarks on real edge devices. CATS can achieve a wall-clock speedup of up to 5.08x with no degradation in generation quality, outperforming the SOTA method by up to 1.45x under edge memory constraints.
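
For context on why this family of methods can claim no quality degradation: lossless speculative decoding relies on the accept/correct rule of Leviathan et al. (reference [3] below), which exact self-speculative verification inherits. A draft token x sampled from the draft distribution q is accepted with probability \min(1, p(x)/q(x)), where p is the target-model distribution; on rejection, a replacement token is drawn from the residual distribution

    p'(x) \;=\; \frac{\max\bigl(0,\, p(x) - q(x)\bigr)}{\sum_{x'} \max\bigl(0,\, p(x') - q(x')\bigr)},

which together yield exact samples from p: the acceptance rate affects speed, not the output distribution. (Figure 5 above concerns a relaxed-acceptance setting, where that exactness is deliberately traded for additional speed.)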

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces CATS, a self-speculative decoding framework for memory-limited LLM inference that performs cascaded verification and correction guided by memory budget and parameter offloading patterns. It claims this maximizes token acceptance rate while ensuring the device's peak memory footprint equals that of the target model alone, yielding up to 5.08x wall-clock speedup with no quality degradation and 1.45x improvement over SOTA methods on edge devices across five benchmarks.

Significance. If the memory-equality claim and empirical speedups hold under rigorous verification, the work would be significant for practical LLM deployment on DRAM-constrained edge platforms, where standard speculative decoding fails due to simultaneous model storage requirements. The empirical focus on real devices and parameter-free adaptation to offloading patterns strengthens its potential impact.

major comments (3)
  1. [Abstract, §4] Abstract and §4 (method): The central claim that 'peak memory footprint on the device equal to that of the target model alone' is load-bearing for all reported speedups, yet no memory breakdown, profiling results, or ablation isolating overheads from the adaptive tree, draft-token storage, verification states, or correction logic is provided. This leaves the skeptic's concern about auxiliary buffers unaddressed.
  2. [§5] §5 (experiments): The reported 5.08x wall-clock speedup and 1.45x over SOTA lack details on experimental setup, exact baselines, acceptance-rate calculation, statistical measures (e.g., variance across runs), or how generation quality was assessed (e.g., perplexity, human eval). This makes the 'no degradation' claim difficult to verify.
  3. [§3] §3 (analysis): The assumption that cascaded verification based on offloading patterns always maximizes acceptance rate without increasing peak memory is not supported by a concrete memory model or counter-example analysis; a modest temporary allocation for tree nodes could force weight eviction in the target regime.
minor comments (2)
  1. [§4] Notation for the cascaded tree structure and offloading schedule could be clarified with a small diagram or pseudocode in §4.
  2. [Abstract] The abstract mentions 'five benchmarks' but does not name them; listing them explicitly would aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We appreciate the referee's thorough review and constructive suggestions. We address each of the major comments below and plan to revise the manuscript to incorporate additional details and clarifications as outlined.

read point-by-point responses
  1. Referee: [Abstract, §4] Abstract and §4 (method): The central claim that 'peak memory footprint on the device equal to that of the target model alone' is load-bearing for all reported speedups, yet no memory breakdown, profiling results, or ablation isolating overheads from the adaptive tree, draft-token storage, verification states, or correction logic is provided. This leaves the skeptic's concern about auxiliary buffers unaddressed.

    Authors: We agree with the referee that a memory breakdown is necessary to support the central claim. Although the CATS framework is designed such that all operations occur within the memory footprint of the target model by adapting to offloading patterns and using cascaded structures that reuse memory, we did not include explicit profiling in the original submission. In the revised version, we will add a memory usage analysis, including breakdowns of peak memory during different phases and ablations on auxiliary overheads, to confirm no additional memory is required beyond the target model. revision: yes

  2. Referee: [§5] §5 (experiments): The reported 5.08x wall-clock speedup and 1.45x over SOTA lack details on experimental setup, exact baselines, acceptance-rate calculation, statistical measures (e.g., variance across runs), or how generation quality was assessed (e.g., perplexity, human eval). This makes the 'no degradation' claim difficult to verify.

    Authors: The experimental details were summarized due to space constraints, but we acknowledge the need for more transparency. We will revise §5 to provide full details on the experimental setup (including device specs, baseline code references if available, acceptance rate computation formula), report variance across runs, and specify quality metrics used (perplexity and sample outputs). This will substantiate the speedup claims and no-degradation assertion. revision: yes

  3. Referee: [§3] §3 (analysis): The assumption that cascaded verification based on offloading patterns always maximizes acceptance rate without increasing peak memory is not supported by a concrete memory model or counter-example analysis; a modest temporary allocation for tree nodes could force weight eviction in the target regime.

    Authors: We will strengthen §3 by introducing a formal memory model that accounts for temporary allocations during tree speculation and verification. This model will demonstrate that under the offloading patterns, the peak memory does not exceed the target model's requirement, as temporary buffers are allocated in memory freed by offloaded weights. We will also discuss potential counter-examples and why they do not apply in our cascaded adaptive approach. revision: yes
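
One way to read the constraint this rebuttal promises to formalize (Pith's paraphrase, not the authors' model): writing W_res(s) for the weight memory resident in DRAM during phase s of the decoding cycle and B(s) for the temporary buffers (tree nodes, draft logits, verification state) live in that phase, the claim to be shown is

    \max_s \bigl[ W_{\mathrm{res}}(s) + B(s) \bigr] \;\le\; M_{\mathrm{target}},

i.e., in every phase the temporary allocations fit inside the DRAM that offloaded weights have vacated, B(s) \le M_{\mathrm{target}} - W_{\mathrm{res}}(s).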

Circularity Check

0 steps flagged

No circularity detected; empirical framework with independent evaluation

full rationale

The paper introduces CATS as a practical algorithmic adaptation of speculative decoding for memory-limited edge devices, relying on cascaded verification/correction tuned to offloading patterns. All reported speedups (up to 5.08x) and memory claims are grounded in direct wall-clock measurements on real hardware across benchmarks, not in any closed-loop fitting, self-referential definitions, or load-bearing self-citations. The design goal of matching the target model's peak memory footprint is stated as an engineering constraint that the method satisfies experimentally, without equations or derivations that reduce to their own inputs by construction. No self-citation chains or ansatzes are invoked to justify core results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are explicitly introduced in the abstract beyond standard assumptions of LLM auto-regressive inference and speculative decoding.

pith-pipeline@v0.9.0 · 5540 in / 1130 out tokens · 98417 ms · 2026-05-13T03:58:45.867075+00:00 · methodology


Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 4 internal anchors

  1. [1]

    Prabhu Vellaisamy, Suresh Tripathi, Vignesh Natarajan, Surya Santhana Thenrasu, Shawn Blanton, and John P. Shen. TaxBreak: Unmasking the hidden costs of LLM inference through overhead decomposition. arXiv preprint arXiv:2603.12465, 2026

  2. [2]

    LLM in a flash: Efficient large language model inference with limited memory

    Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, S. Karen Khatamifard, Minsik Cho, Carlo C. Del Mundo, Mohammad Rastegari, and Mehrdad Farajtabar. LLM in a flash: Efficient large language model inference with limited memory. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Co...

  3. [3]

    Fast inference from transformers via speculative decoding

    Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. InProceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research. PMLR, 2023

  4. [4]

    Accelerating Large Language Model Decoding with Speculative Sampling

    Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling.arXiv preprint arXiv:2302.01318, 2023

  5. [5]

    Blockwise parallel decoding for deep autoregressive models

    Mitchell Stern, Noam Shazeer, and Jakob Uszkoreit. Blockwise parallel decoding for deep autoregressive models. InAdvances in Neural Information Processing Systems, volume 31, 2018

  6. [6]

    Draft & verify: Lossless large language model acceleration via self-speculative decoding

    Jun Zhang, Jue Zeng, Huizhhen Wang, Linjun Hu, Heming Xia, Tao Ge, and Furu Wei. Draft & verify: Lossless large language model acceleration via self-speculative decoding. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2024

  7. [7]

    LayerSkip: Enabling early exit inference and self-speculative decoding

    Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Kelly Ma, and Elias Aly. LayerSkip: Enabling early exit inference and self-speculative decoding. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pa...

  8. [8]

    SWIFT: On-the-fly self-speculative decoding for LLM inference acceleration

    Heming Xia, Yongqi Li, Jun Zhang, Cunxiao Du, and Wenjie Li. SWIFT: On-the-fly self-speculative decoding for LLM inference acceleration. InThe Thirteenth International Conference on Learning Representations, 2025

  9. [9]

    Kangaroo: Lossless self-speculative decoding via double early exiting

    Fangcheng Liu, Yehui Tang, Zhenhua Liu, Yunsheng Ni, Kai Han, and Yunhe Wang. Kangaroo: Lossless self-speculative decoding via double early exiting. InProceedings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of Machine Learning Research. PMLR, 2024

  10. [10]

    Medusa: Simple LLM inference acceleration framework with multiple decoding heads

    Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri Dao. Medusa: Simple LLM inference acceleration framework with multiple decoding heads. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research. PMLR, 2024

  11. [11]

    EAGLE: Speculative sampling requires rethinking feature uncertainty

    Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. EAGLE: Speculative sampling requires rethinking feature uncertainty. InProceedings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of Machine Learning Research. PMLR, 2024

  12. [12]

    Speculative decoding: Exploiting speculative execution for accelerating Seq2seq generation

    Heming Xia, Tao Ge, Peiyi Wang, Si-Qing Chen, Furu Wei, and Zhifang Sui. Speculative decoding: Exploiting speculative execution for accelerating Seq2seq generation. InFindings of the Association for Computational Linguistics: EMNLP 2023. Association for Computational Linguistics, 2023

  13. [13]

    SpecInfer: Accelerating large language model serving with tree-based speculative inference and verification

    Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, Chunan Shi, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, and Zhihao Jia. SpecInfer: Accelerating large language model serving with tree-based speculative inference and verification. InProceedings of the 29th ACM I...

  14. [14]

    Sequoia: Scalable, robust, and hardware-aware speculative decoding

    Zhuoming Chen, Avner May, Ruslan Svirschevski, Yuhsun Huang, Max Ryabinin, Zhihao Jia, and Beidi Chen. Sequoia: Scalable, robust, and hardware-aware speculative decoding. InAdvances in Neural Information Processing Systems, volume 37, 2024

  15. [15]

    OPT-Tree: Speculative decoding with adaptive draft tree structure.arXiv preprint arXiv:2406.17276, 2024

    Jikai Wang, Yi Su, Juntao Li, Qingrong Xia, Zi Ye, Xinyu Duan, Zhefeng Wang, and Min Zhang. OPT-Tree: Speculative decoding with adaptive draft tree structure.arXiv preprint arXiv:2406.17276, 2024

  16. [16]

    DySpec: Faster speculative decoding with dynamic token tree structure.arXiv preprint arXiv:2410.11744, 2024

    Yunfan Xiong, Ruoyu Zhang, Yanzeng Li, Tianhao Wu, and Lei Zou. DySpec: Faster speculative decoding with dynamic token tree structure.arXiv preprint arXiv:2410.11744, 2024

  17. [17]

    DistillSpec: Improving speculative decoding via knowledge distillation

    Yongchao Zhou, Kaifeng Lyu, Ankit Singh Rawat, Aditya Krishna Menon, Afshin Rostamizadeh, Sanjiv Kumar, Jean-François Kagy, and Rishabh Agarwal. DistillSpec: Improving speculative decoding via knowledge distillation. InThe Twelfth International Conference on Learning Representations, 2024

  18. [18]

    Online speculative decoding

    Xiaoxuan Liu, Lanxiang Qian, Ying Ye, Qinghao Zhao, Joseph E. Gonzalez, Ion Stoica, and Hao Zhang. Online speculative decoding. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research. PMLR, 2024

  19. [19]

    GliDe with a CaPE: A low-hassle method to accelerate speculative decoding

    Cunxiao Du, Jing Jiang, Yuankai Xu, Jiawei Wu, Sicheng Yu, Yongqi Li, Shenggui Li, Kai Xu, Liqiang Nie, Zhaopeng Tu, and Yang You. GliDe with a CaPE: A low-hassle method to accelerate speculative decoding. InProceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research. PMLR, 2024

  20. [20]

    REST: Retrieval-based speculative decoding

    Zhenyu He, Zexuan Zhong, Tianle Cai, Jason D. Lee, and Di He. REST: Retrieval-based speculative decoding. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2024. Association for Computational Linguistics, 2024

  21. [21]

    Nearest neighbor speculative decoding for LLM generation and attribution

    Minghan Li, Xilun Chen, Ari Holtzman, Beidi Chen, Jimmy Lin, Wen-tau Yih, and Xi Victoria Lin. Nearest neighbor speculative decoding for LLM generation and attribution. InAdvances in Neural Information Processing Systems, volume 37, 2024

  22. [22]

    SuffixDecoding: A model-free approach to speeding up large language model inference

    Gabriele Oliaro, Zhihao Jia, Daniel Campos, and Aurick Qiao. SuffixDecoding: A model-free approach to speeding up large language model inference. InAdvances in Neural Information Processing Systems, volume 38, 2025

  23. [23]

    Break the sequential dependency of LLM inference using lookahead decoding.arXiv preprint arXiv:2402.02057, 2024

    Yichao Fu, Peter Bailis, Ion Stoica, and Hao Zhang. Break the sequential dependency of LLM inference using lookahead decoding.arXiv preprint arXiv:2402.02057, 2024

  24. [24]

    Unlocking efficiency in large language model inference: A comprehensive survey of speculative decoding

    Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, and Zhifang Sui. Unlocking efficiency in large language model inference: A comprehensive survey of speculative decoding. InFindings of the Association for Computational Linguistics: ACL 2024. Association for Computational Linguistics, 2024

  25. [25]

    Speculative decoding with big little decoder

    Sehoon Kim, Karttikeya Mangalam, Suhong Moon, Jitendra Malik, Michael W. Mahoney, Amir Gholami, and Kurt Keutzer. Speculative decoding with big little decoder. In Advances in Neural Information Processing Systems, volume 36, 2023

  26. [26]

    Cascade speculative drafting for even faster LLM inference

    Ziyi Chen, Xiaocong Yang, Jiacheng Lin, Chenkai Sun, Kevin Chen-Chuan Chang, and Jie He. Cascade speculative drafting for even faster LLM inference. In Advances in Neural Information Processing Systems, volume 37, 2024

  27. [27]

    Ouroboros: Speculative decoding with large model enhanced drafting

    Weilin Zhao, Yuxiang Huang, Xu Han, Wang Xu, Chaojun Xiao, Zhiyuan Liu, and Maosong Sun. Ouroboros: Speculative decoding with large model enhanced drafting. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2024

  28. [28]

    Speculative RAG: Enhancing retrieval augmented generation through drafting

    Zilong Wang, Zifeng Wang, Long Le, Huaixiu Steven Zheng, Swaroop Mishra, Vincent Perot, Yuwei Zhang, Anush Goldie, Jingbo Shang, Chenguang Zhu, Chen-Yu Lee, and Tomas Pfister. Speculative RAG: Enhancing retrieval augmented generation through drafting. InThe Thirteenth International Conference on Learning Representations, 2025

  29. [29]

    Speeding up speculative decoding via sequential approximate verification

    Meiyu Zhong, Noel Teku, and Ravi Tandon. Speeding up speculative decoding via sequential approximate verification. InProceedings of the 3rd Efficient Systems for Foundation Models Workshop at the 42nd International Conference on Machine Learning, volume 267 ofProceedings of Machine Learning Research. PMLR, 2025

  30. [30]

    SpecTr: Fast speculative decoding via optimal transport

    Ziteng Sun, Ananda Theertha Suresh, Jae Hun Ro, Ahmad Beirami, Himanshu Jain, and Felix Yu. SpecTr: Fast speculative decoding via optimal transport. InAdvances in Neural Information Processing Systems, volume 36, 2023

  31. [31]

    Block verification accelerates speculative decoding

    Ziteng Sun, Jae Hun Ro, Ahmad Beirami, and Ananda Theertha Suresh. Block verification accelerates speculative decoding. InThe Thirteenth International Conference on Learning Representations, 2025

  32. [32]

    Judge decoding: Faster speculative sampling requires going beyond model alignment

    Gregor Bachmann, Sotiris Anagnostidis, Albert Pumarola, Markos Georgopoulos, Artsiom Sanakoyeu, Yuming Du, Edgar Schönfeld, Ali Thabet, and Jonas Kohler. Judge decoding: Faster speculative sampling requires going beyond model alignment. In The Thirteenth International Conference on Learning Representations, 2025

  33. [33]

    Think before you accept: Semantic reflective verification for faster speculative decoding.arXiv preprint arXiv:2505.18629, 2025

    Yixuan Wang, Yijun Liu, Shiyu Ji, Yuzhuang Xu, Yang Xu, Qingfu Zhu, and Wanxiang Che. Think before you accept: Semantic reflective verification for faster speculative decoding.arXiv preprint arXiv:2505.18629, 2025

  34. [34]

    TriForce: Lossless acceleration of long sequence generation with hierarchical speculative decoding

    Hanshi Sun, Zhuoming Chen, Xinyu Yang, Yuandong Tian, and Beidi Chen. TriForce: Lossless acceleration of long sequence generation with hierarchical speculative decoding. InFirst Conference on Language Modeling, 2024

  35. [35]

    MagicDec: Breaking the latency-throughput tradeoff for long context generation with speculative decoding

    Jian Chen, Vashisth Tiwari, Ranajoy Sadhukhan, Zhuoming Chen, Jinyuan Shi, Ian En-Hsu Yen, and Beidi Chen. MagicDec: Breaking the latency-throughput tradeoff for long context generation with speculative decoding. InThe Thirteenth International Conference on Learning Representations, 2025

  36. [36]

    KnapSpec: Self-speculative decoding via adaptive layer selection as a knapsack problem.arXiv preprint arXiv:2602.20217, 2026

    Seongjin Cha, Gyuwan Kim, Dongsu Han, Tao Yang, and Insu Han. KnapSpec: Self-speculative decoding via adaptive layer selection as a knapsack problem.arXiv preprint arXiv:2602.20217, 2026

  37. [37]

    Hydra: Sequentially-dependent draft heads for Medusa decoding

    Zachary Ankner, Rishab Parthasarathy, Aniruddha Nrusimha, Christopher Rinard, Jonathan Ragan-Kelley, and William Brandon. Hydra: Sequentially-dependent draft heads for Medusa decoding. InFirst Conference on Language Modeling, 2024

  38. [38]

    EAGLE-2: Faster inference of language models with dynamic draft trees

    Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. EAGLE-2: Faster inference of language models with dynamic draft trees. InThe Thirteenth International Conference on Learning Representations, 2025

  39. [39]

    EAGLE-3: Scaling up inference acceleration of large language models via training-time test

    Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. EAGLE-3: Scaling up inference acceleration of large language models via training-time test. InAdvances in Neural Information Processing Systems, volume 38, 2025

  40. [40]

    CLLMs: Consistency large language models

    Siqi Kou, Lanxiang Hu, Zhezhi He, Zhijie Deng, and Hao Zhang. CLLMs: Consistency large language models. InProceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research. PMLR, 2024

  41. [41]

    Self-speculative decoding in any-order and any-subset autoregressive models

    Gabe Guo and Stefano Ermon. Self-speculative decoding in any-order and any-subset autoregressive models. InStructured Probabilistic Inference and Generative Modeling Workshop at NeurIPS 2025, 2025

  42. [42]

    Better & faster large language models via multi-token prediction

    Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, and Gabriel Synnaeve. Better & faster large language models via multi-token prediction. InProceedings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of Machine Learning Research. PMLR, 2024

  43. [43]

    Multi-token joint speculative decoding for accelerating large language model inference

    Zongyue Qin, Ziniu Hu, Zifan He, Neha Prakriya, Jason Cong, and Yizhou Sun. Multi-token joint speculative decoding for accelerating large language model inference. arXiv preprint arXiv:2407.09722, 2024

  44. [44]

    PaSS: Parallel speculative sampling

    Giovanni Monea, Armand Joulin, and Edouard Grave. PaSS: Parallel speculative sampling. InEfficient Natural Language and Speech Processing Workshop at NeurIPS 2023, 2023

  45. [45]

    PowerInfer: Fast large language model serving with a consumer-grade GPU

    Yixin Song, Zeyu Mi, Haotong Xie, and Haibo Chen. PowerInfer: Fast large language model serving with a consumer-grade GPU. InProceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles. ACM, 2024

  46. [46]

    An I/O characterizing study of offloading LLM models and KV caches to NVMe SSD

    Zebin Ren, Krijn Doekemeijer, Tiziano De Matteis, Christian Pinto, Radu Stoica, and Animesh Trivedi. An I/O characterizing study of offloading LLM models and KV caches to NVMe SSD. InProceedings of the Twentieth European Conference on Computer Systems, EuroSys ’25. ACM, 2025

  47. [47]

    Inference performance evaluation for LLMs on edge devices with a novel benchmarking framework and metric

    Hao Chen, Cong Tian, Zixuan He, Bin Yu, Yepang Liu, and Jialun Cao. Inference performance evaluation for LLMs on edge devices with a novel benchmarking framework and metric. arXiv preprint arXiv:2508.11269, 2025

  48. [48]

    Vicuna: An open-source chatbot impressing GPT-4 with 90% ChatGPT quality

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing GPT-4 with 90% ChatGPT quality. https://lmsys.org/blog/2023-03-30-vicuna/, 2023

  49. [49]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023

  50. [50]

    Judging LLM-as-a-judge with MT-Bench and chatbot arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and chatbot arena. In Advances in Neural Information Processing Systems, volume 36, 2023

  51. [51]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

  52. [52]

    Stanford Alpaca: An instruction-following LLaMA model

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023

  53. [53]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021