Graft combines pruning and retrieval in a sequential mechanism to build hybrid draft trees for speculative decoding, delivering up to 5.41× speedup and 21.8% better average speedup than EAGLE-3 on large models.
TriForce: Lossless acceleration of long sequence generation with hierarchical speculative decoding. arXiv preprint arXiv:2404.11912
6 Pith papers cite this work.
representative citing papers
DREAM-S combines neural architecture search, target-aware supernet training, and attention-entropy-guided distillation to accelerate speculative decoding in VLMs, reporting up to 3.85× speedup over standard methods.
Ada-KV is the first head-wise adaptive KV cache budget allocator for LLMs, using a theoretical loss upper bound to allocate eviction differently per attention head and yielding higher quality than uniform methods on long-context benchmarks.
K-VEC is a coverage-aware KV-cache eviction strategy using cross-head and cross-layer modules that improves performance by up to 10.35 points over prior methods on LongBench subsets at a fixed memory budget.
BudgetDraft applies multi-view sparse training with an acceptance-aware full-cache loss branch to produce one budget-robust drafter that recovers acceptance rates across sparsity levels in speculative decoding for 4K-16K contexts.
DREAM-R introduces RL-based draft alignment, ratio-based verification, and parallel execution to accelerate speculative reasoning in multimodal models while preserving accuracy.
citing papers explorer
Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference