BASTION is a budget-aware speculative decoding framework with adaptive tree-structured block diffusion drafting that reports up to 6.61x speedup and 39% improvement over block-diffusion baselines.
Pard: Accelerating llm inference with low-cost parallel draft model adaptation
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 6roles
background 2polarities
background 2representative citing papers
SpecBlock achieves 8-13% higher mean speedup than EAGLE-3 at 44-52% drafting cost via block-iterative drafting with hidden-state inheritance, dynamic rank-head branching, valid-prefix masking, and optional cost-aware bandit adaptation.
The paper presents an interpretable latency model for speculative decoding that infers effective batch size via Little's Law and decomposes demand to predict and explain performance across serving loads, validated on vLLM measurements.
TTS adapts speculator models online via target model verifications to improve acceptance lengths by up to 72% over prior methods, with gains increasing for longer generations.
PARD-2 uses Confidence-Adaptive Token optimization to align draft model training with acceptance length in speculative decoding, enabling dual-mode operation and up to 6.94x lossless speedup on Llama3.1-8B.
citing papers explorer
-
Bastion: Budget-Aware Speculative Decoding with Tree-structured Block Diffusion Drafting
BASTION is a budget-aware speculative decoding framework with adaptive tree-structured block diffusion drafting that reports up to 6.61x speedup and 39% improvement over block-diffusion baselines.
-
An Interpretable Latency Model for Speculative Decoding in LLM Serving
The paper presents an interpretable latency model for speculative decoding that infers effective batch size via Little's Law and decomposes demand to predict and explain performance across serving loads, validated on vLLM measurements.