Pard: Accelerating llm inference with low-cost parallel draft model adaptation

Zihao An, Huajun Bai, Ziqiong Liu, Dong Li, Emad Barsoum · 2025 · arXiv 2504.18583

6 Pith papers cite this work. Polarity classification is still indexing.

6 Pith papers citing it

read on arXiv browse 6 citing papers

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Bastion: Budget-Aware Speculative Decoding with Tree-structured Block Diffusion Drafting

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

BASTION is a budget-aware speculative decoding framework with adaptive tree-structured block diffusion drafting that reports up to 6.61x speedup and 39% improvement over block-diffusion baselines.

SpecBlock: Block-Iterative Speculative Decoding with Dynamic Tree Drafting

cs.CL · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

SpecBlock achieves 8-13% higher mean speedup than EAGLE-3 at 44-52% drafting cost via block-iterative drafting with hidden-state inheritance, dynamic rank-head branching, valid-prefix masking, and optional cost-aware bandit adaptation.

An Interpretable Latency Model for Speculative Decoding in LLM Serving

cs.LG · 2026-05-14 · unverdicted · novelty 6.0

The paper presents an interpretable latency model for speculative decoding that infers effective batch size via Little's Law and decomposes demand to predict and explain performance across serving loads, validated on vLLM measurements.

Test-Time Speculation

cs.CL · 2026-05-10 · unverdicted · novelty 6.0 · 2 refs

TTS adapts speculator models online via target model verifications to improve acceptance lengths by up to 72% over prior methods, with gains increasing for longer generations.

PARD-2: Target-Aligned Parallel Draft Model for Dual-Mode Speculative Decoding

cs.CL · 2026-05-09 · unverdicted · novelty 6.0

PARD-2 uses Confidence-Adaptive Token optimization to align draft model training with acceptance length in speculative decoding, enabling dual-mode operation and up to 6.94x lossless speedup on Llama3.1-8B.

SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding

cs.DC · 2026-02-10

citing papers explorer

Showing 2 of 2 citing papers after filters.

Bastion: Budget-Aware Speculative Decoding with Tree-structured Block Diffusion Drafting cs.LG · 2026-05-28 · unverdicted · none · ref 2
BASTION is a budget-aware speculative decoding framework with adaptive tree-structured block diffusion drafting that reports up to 6.61x speedup and 39% improvement over block-diffusion baselines.
An Interpretable Latency Model for Speculative Decoding in LLM Serving cs.LG · 2026-05-14 · unverdicted · none · ref 2
The paper presents an interpretable latency model for speculative decoding that infers effective batch size via Little's Law and decomposes demand to predict and explain performance across serving loads, validated on vLLM measurements.

Pard: Accelerating llm inference with low-cost parallel draft model adaptation

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer