citation dossier

Eagle: Speculative sampling requires rethinking feature uncertainty

Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang · 2024 · arXiv 2401.15077

18 Pith papers citing it

18 reference links

cs.LG · top field · 5 papers

UNVERDICTED · top verdict bucket · 18 papers

This arXiv-backed work is queued for full Pith review when it crosses the high-inbound sweep. That review runs reader · skeptic · desk-editor · referee · rebuttal · circularity · lean confirmation · RS check · pith extraction.


why this work matters in Pith

Pith has found this work in 18 reviewed papers. Its strongest current cluster is cs.LG (5 papers). The largest review-status bucket among citing papers is UNVERDICTED (18 papers). For highly cited works, this page shows a dossier first and a bounded explorer second; it never tries to render every citing paper at once.

representative citing papers

SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

SlimSpec replaces the standard LM-head in draft models with a low-rank version to deliver 4-5x faster speculative decoding while preserving full vocabulary and competitive acceptance rates.

BubbleSpec: Turning Long-Tail Bubbles into Speculative Rollout Drafts for Synchronous Reinforcement Learning

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

BubbleSpec exploits long-tail bubbles in synchronous RL by using faster ranks' idle time to pre-generate rollout drafts for speculative decoding, reducing steps by 50% and raising throughput up to 1.8x while preserving exact synchrony.

NI Sampling: Accelerating Discrete Diffusion Sampling by Token Order Optimization

cs.LG · 2026-04-20 · unverdicted · novelty 7.0

NI Sampling accelerates discrete diffusion language models up to 14.3 times by training a neural indicator to select which tokens to sample at each step using a trajectory-preserving objective.

WISV: Wireless-Informed Semantic Verification for Distributed Speculative Decoding in Device-Edge LLM Inference

cs.IT · 2026-04-20 · unverdicted · novelty 7.0

WISV uses a channel-aware semantic acceptance policy on hidden representations to boost accepted sequence length by up to 60.8% and cut interaction rounds by 37.3% in distributed speculative decoding, with under 1% accuracy loss.

Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding

cs.CV · 2026-04-14 · unverdicted · novelty 7.0

Visual token pruning in MLLMs fails on complex reasoning because of Relevant Visual Information Shift during decoding; the training-free DSTP framework fixes this across models.

PARD-2: Target-Aligned Parallel Draft Model for Dual-Mode Speculative Decoding

cs.CL · 2026-05-09 · unverdicted · novelty 6.0

PARD-2 uses Confidence-Adaptive Token optimization to align draft model training with acceptance length in speculative decoding, enabling dual-mode operation and up to 6.94x lossless speedup on Llama3.1-8B.

CASCADE: Context-Aware Relaxation for Speculative Image Decoding

cs.CV · 2026-05-08 · unverdicted · novelty 6.0

CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to-image models without quality loss.

CoVSpec: Efficient Device-Edge Co-Inference for Vision-Language Models via Speculative Decoding

cs.AI · 2026-05-04 · unverdicted · novelty 6.0

CoVSpec achieves up to 2.21x higher throughput and over 96% lower communication overhead for device-edge VLM inference via training-free visual token reduction, adaptive drafting, and decoupled parallel verification-correction in speculative decoding.

Making Every Verified Token Count: Adaptive Verification for MoE Speculative Decoding

cs.CL · 2026-05-01 · unverdicted · novelty 6.0

EVICT adaptively truncates draft trees in MoE speculative decoding by combining drafter signals with profiled costs to retain only cost-effective prefixes, delivering up to 2.35x speedup over autoregressive decoding.

Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving

cs.LG · 2026-04-29 · unverdicted · novelty 6.0

SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.

NVLLM: A 3D NAND-Centric Architecture Enabling Edge on-Device LLM Inference

cs.AR · 2026-04-28 · unverdicted · novelty 6.0

NVLLM offloads FFN computations to integrated 3D NAND flash with page-level access and keeps attention in DRAM, delivering 16.7x-37.9x speedups over GPU out-of-core baselines for models up to 30B parameters.

SpecBound: Adaptive Bounded Self-Speculation with Layer-wise Confidence Calibration

cs.CL · 2026-04-14 · unverdicted · novelty 6.0

SpecBound achieves up to 2.33x wall-time speedup in LLM inference via adaptive bounded self-speculation and layer-wise confidence calibration while preserving exact output equivalence.

SMART: When is it Actually Worth Expanding a Speculative Tree?

cs.DC · 2026-04-09 · unverdicted · novelty 6.0

SMART uses marginal benefit-cost analysis to dynamically build efficient speculative trees, achieving 15-20% additional speedup in LLM and MLLM inference.

Open-Loop Planning, Closed-Loop Verification: Speculative Verification for VLA

cs.RO · 2026-04-03 · unverdicted · novelty 6.0

SV-VLA uses infrequent heavy VLA planning of action chunks plus a lightweight closed-loop verifier to achieve both efficiency and robustness in dynamic robot control.

DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

cs.CV · 2024-02-19 · unverdicted · novelty 6.0

DriveVLM adds vision-language models with scene description, analysis, and hierarchical planning modules to autonomous driving, paired with a hybrid DriveVLM-Dual system tested on nuScenes and SUP-AD datasets and deployed on a production vehicle.

31.1 A 14.08-to-135.69Token/s ReRAM-on-Logic Stacked Outlier-Free Large-Language-Model Accelerator with Block-Clustered Weight-Compression and Adaptive Parallel-Speculative-Decoding

cs.AR · 2026-05-10 · unverdicted · novelty 5.0

A ReRAM-on-logic stacked chip delivers 14.08-135.69 tokens/s LLM inference with block-clustered compression and adaptive parallel speculative decoding, yielding 4.46-7.17x speedup over standard methods.

DMax: Aggressive Parallel Decoding for dLLMs

cs.LG · 2026-04-09 · unverdicted · novelty 5.0

DMax enables faster parallel decoding in diffusion language models by using on-policy training to recover from errors and soft embedding interpolations for iterative revision, boosting tokens per forward pass roughly 2-3x on benchmarks while preserving accuracy.

Unlocking the Edge deployment and ondevice acceleration of multi-LoRA enabled one-for-all foundational LLM

cs.DC · 2026-04-20 · unverdicted · novelty 4.0

A framework combines multi-LoRA runtime switching, multi-stream stylistic decoding, and Dynamic Self-Speculative Decoding with INT4 quantization to achieve 4-6x memory and latency gains for on-device inference of a one-for-all foundational LLM on Qualcomm chipsets.

citing papers explorer

Showing 18 of 18 citing papers.

SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding · cs.LG · 2026-05-11 · unverdicted · none · ref 7
BubbleSpec: Turning Long-Tail Bubbles into Speculative Rollout Drafts for Synchronous Reinforcement Learning · cs.LG · 2026-05-09 · unverdicted · none · ref 12
NI Sampling: Accelerating Discrete Diffusion Sampling by Token Order Optimization · cs.LG · 2026-04-20 · unverdicted · none · ref 29
WISV: Wireless-Informed Semantic Verification for Distributed Speculative Decoding in Device-Edge LLM Inference · cs.IT · 2026-04-20 · unverdicted · none · ref 26
Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding · cs.CV · 2026-04-14 · unverdicted · none · ref 26
PARD-2: Target-Aligned Parallel Draft Model for Dual-Mode Speculative Decoding · cs.CL · 2026-05-09 · unverdicted · none · ref 19
CASCADE: Context-Aware Relaxation for Speculative Image Decoding · cs.CV · 2026-05-08 · unverdicted · none · ref 24
CoVSpec: Efficient Device-Edge Co-Inference for Vision-Language Models via Speculative Decoding · cs.AI · 2026-05-04 · unverdicted · none · ref 7
Making Every Verified Token Count: Adaptive Verification for MoE Speculative Decoding · cs.CL · 2026-05-01 · unverdicted · none · ref 3
Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving · cs.LG · 2026-04-29 · unverdicted · none · ref 33
NVLLM: A 3D NAND-Centric Architecture Enabling Edge on-Device LLM Inference · cs.AR · 2026-04-28 · unverdicted · none · ref 18
SpecBound: Adaptive Bounded Self-Speculation with Layer-wise Confidence Calibration · cs.CL · 2026-04-14 · unverdicted · none · ref 3
SMART: When is it Actually Worth Expanding a Speculative Tree? · cs.DC · 2026-04-09 · unverdicted · none · ref 20
Open-Loop Planning, Closed-Loop Verification: Speculative Verification for VLA · cs.RO · 2026-04-03 · unverdicted · none · ref 21
DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models · cs.CV · 2024-02-19 · unverdicted · none · ref 72
31.1 A 14.08-to-135.69Token/s ReRAM-on-Logic Stacked Outlier-Free Large-Language-Model Accelerator with Block-Clustered Weight-Compression and Adaptive Parallel-Speculative-Decoding · cs.AR · 2026-05-10 · unverdicted · none · ref 9
DMax: Aggressive Parallel Decoding for dLLMs · cs.LG · 2026-04-09 · unverdicted · none · ref 43
Unlocking the Edge deployment and ondevice acceleration of multi-LoRA enabled one-for-all foundational LLM · cs.DC · 2026-04-20 · unverdicted · none · ref 15
