Triforce: Lossless acceleration of long sequence generation with hierarchical speculative decoding. arXiv preprint arXiv:2404.11912, 2024.

6 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding

cs.LG · 2026-05-19 · unverdicted · novelty 7.0

Graft combines pruning and retrieval in a sequential mechanism to build hybrid draft trees for speculative decoding, delivering up to 5.41× speedup and 21.8% better average speedup than EAGLE-3 on large models.

DREAM-S: Speculative Decoding with Searchable Drafting and Target-Aware Refinement for Multimodal Generation

cs.LG · 2026-05-30 · unverdicted · novelty 6.0

DREAM-S combines neural architecture search, target-aware supernet training, and attention-entropy-guided distillation to accelerate speculative decoding in VLMs, reporting up to 3.85× speedup over standard methods.

Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference

cs.CL · 2024-07-16 · accept · novelty 6.0

Ada-KV is the first head-wise adaptive KV cache budget allocator for LLMs, using a theoretical loss upper bound to allocate eviction differently per attention head and yielding higher quality than uniform methods on long-context benchmarks.

Coverage-Driven KV Cache Eviction for Efficient and Improved Inference of LLM

cs.CL · 2026-06-28 · unverdicted · novelty 5.0

K-VEC is a coverage-aware KV-cache eviction strategy that uses cross-head and cross-layer modules, improving performance by up to 10.35 points over prior methods on LongBench subsets at a fixed memory budget.

BudgetDraft: Acceptance-Aware Multi-View Training for Sparse-KV Speculative Decoding

cs.LG · 2026-05-29 · unverdicted · novelty 5.0

BudgetDraft applies multi-view sparse training with an acceptance-aware full-cache loss branch to produce one budget-robust drafter that recovers acceptance rates across sparsity levels in speculative decoding for 4K-16K contexts.

DREAM-R: Multimodal Speculative Reasoning with RL-Based Refined Drafting, Precise Verification, and Fully Parallel Execution

cs.AI · 2026-05-27 · unverdicted · novelty 4.0

DREAM-R introduces RL-based draft alignment, ratio-based verification, and parallel execution to accelerate speculative reasoning in multimodal models while preserving accuracy.

