Fast inference from transformers via speculative decoding
15 Pith papers cite this work. Polarity classification is still indexing.
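For reference, the hub paper's core procedure is a draft-then-verify loop: a small draft model proposes a few tokens, the target model scores them all in a single forward pass, each drafted token is accepted with probability min(1, p_target/p_draft), and the first rejection triggers a corrected resample from the residual distribution. A minimal sketch of that loop, assuming draft_probs_fn and target_probs_fn are caller-supplied functions (illustrative names, not from any released implementation) that map a token-id tensor to per-position next-token probability distributions:

import torch

def speculative_step(prefix, draft_probs_fn, target_probs_fn, gamma=4):
    # One draft-then-verify step. Both *_probs_fn map a 1-D LongTensor of
    # token ids to a (seq_len, vocab) tensor of next-token probabilities.
    drafted, q_dists = [], []
    ctx = prefix.clone()
    for _ in range(gamma):                      # draft gamma tokens autoregressively
        q = draft_probs_fn(ctx)[-1]
        tok = torch.multinomial(q, 1)
        drafted.append(tok.item())
        q_dists.append(q)
        ctx = torch.cat([ctx, tok])

    p_all = target_probs_fn(ctx)                # one target pass scores every position

    out = prefix.clone()
    for i, tok in enumerate(drafted):
        p, q = p_all[len(prefix) - 1 + i], q_dists[i]
        if torch.rand(()) < min(1.0, (p[tok] / q[tok]).item()):
            out = torch.cat([out, torch.tensor([tok])])   # accept the drafted token
        else:
            residual = torch.clamp(p - q, min=0.0)        # corrected resample on rejection
            out = torch.cat([out, torch.multinomial(residual / residual.sum(), 1)])
            return out
    bonus = torch.multinomial(p_all[-1], 1)     # all accepted: one free token from the target
    return torch.cat([out, bonus])

Each call returns the prefix extended by one to gamma+1 tokens; the output distribution matches sampling from the target model alone, which is why the citing papers below can focus on making the draft step cheaper or the acceptance rate higher.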
citing papers explorer
- SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding
SlimSpec replaces the standard LM-head in draft models with a low-rank version, delivering 4-5x faster speculative decoding while preserving the full vocabulary and competitive acceptance rates (a rough sketch of a low-rank head follows this list).
- Speculative Decoding for Autoregressive Video Generation
A training-free speculative decoding method for block-based autoregressive video diffusion uses a quality router on worst-frame ImageReward scores to accept drafter proposals, achieving up to 2.09x speedup at 95.7% quality retention.
- Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding
Visual token pruning in MLLMs fails on complex reasoning because of Relevant Visual Information Shift during decoding; the training-free DSTP framework mitigates this across models.
- Sequential KV Cache Compression via Probabilistic Language Tries: Beyond the Per-Vector Shannon Limit
Sequential KV compression via probabilistic language tries and predictive delta coding reaches an entropy of 3.3-4.3 bits per token, yielding compression ratios up to 914x better than TurboQuant even with its large overhead included.
- Drift-AR: Single-Step Visual Autoregressive Generation via Anti-Symmetric Drifting
Drift-AR achieves 3.8-5.5x speedup in AR-diffusion image models by combining entropy-informed speculative decoding with single-step (1-NFE) anti-symmetric drifting decoding.
- Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
- Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion
Orthrus unifies autoregressive and diffusion views on a shared KV cache to deliver lossless parallel token generation with up to 7.8x speedup and O(1) memory overhead.
- Micro Language Models Enable Instant Responses
Ultra-compact 8-30M parameter models start contextually grounded responses on-device while cloud models seamlessly continue them, enabling responsive AI on power-constrained hardware.
- Open-TQ-Metal: Fused Compressed-Domain Attention for Long-Context LLM Inference on Apple Silicon
Fused compressed-domain int4 attention on Apple Silicon delivers 48x speedup and 3.2x KV cache compression for 128K-context 70B models while matching FP16 token predictions.
- DualDiffusion: A Speculative Decoding Strategy for Masked Diffusion Models
DualDiffusion pairs a lightweight approximate drafter with a full verifier to reduce generation steps in masked diffusion models while maintaining accuracy on MMLU and GSM8K.
- GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
Uptraining multi-head transformer checkpoints into grouped-query attention models achieves near multi-head quality at multi-query inference speeds using roughly 5% of the original pre-training compute (see the head-pooling sketch after this list).
- 31.1 A 14.08-to-135.69Token/s ReRAM-on-Logic Stacked Outlier-Free Large-Language-Model Accelerator with Block-Clustered Weight-Compression and Adaptive Parallel-Speculative-Decoding
A ReRAM-on-logic stacked chip delivers 14.08-135.69 tokens/s LLM inference with block-clustered compression and adaptive parallel speculative decoding, yielding 4.46-7.17x speedup over standard methods.
- Complexity Horizons of Compressed Models in Analog Circuit Analysis
Prerequisite graphs map the performance boundaries of compressed LLMs in analog circuit analysis, allowing selection of the smallest viable model for a given task complexity.
- EdgeFM: Efficient Edge Inference for Vision-Language Models
EdgeFM is an agent-driven framework that strips non-essential features from VLMs and packages reusable optimized kernels, achieving up to 1.49x speedup over TensorRT-Edge-LLM on NVIDIA Orin and enabling the first end-to-end deployment on Horizon Journey hardware.
- SOLARIS: Speculative Offloading of Latent-bAsed Representation for Inference Scaling
SOLARIS speculatively precomputes user-item latent representations to decouple large-model inference from real-time serving, delivering a 0.67% revenue gain when deployed in Meta's ad system.
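The SlimSpec entry above describes replacing the draft model's dense vocabulary projection with a low-rank one. A rough illustration of that kind of factorization (not SlimSpec's actual code; the rank, module names, and sizes below are assumptions):

import torch.nn as nn

class LowRankLMHead(nn.Module):
    # Factorized LM head: hidden -> rank -> vocab. Parameter count drops from
    # d_model * vocab to rank * (d_model + vocab), shrinking both memory and
    # the output-projection matmul while the full vocabulary stays reachable.
    def __init__(self, d_model, vocab_size, rank=256):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)
        self.up = nn.Linear(rank, vocab_size, bias=False)

    def forward(self, hidden_states):
        return self.up(self.down(hidden_states))  # logits over the full vocabulary

# e.g. d_model=4096, vocab=128k: dense head ~524M params, rank-256 head ~34M.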
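The GQA entry above refers to converting existing multi-head checkpoints into grouped-query models; the paper constructs each group's shared key/value head by mean-pooling the original heads in that group before a short uptraining run. A minimal sketch of that pooling step (the tensor layout here is an assumption):

import torch

def pool_kv_heads(w_kv, num_heads, num_groups):
    # w_kv: K or V projection weights of shape (num_heads, head_dim, d_model).
    # Heads are split into contiguous groups and mean-pooled so each group
    # shares one K/V head, as in the GQA uptraining recipe.
    assert num_heads % num_groups == 0
    grouped = w_kv.view(num_groups, num_heads // num_groups, *w_kv.shape[1:])
    return grouped.mean(dim=1)                  # (num_groups, head_dim, d_model)

# Example: 32 key heads pooled into 8 groups.
w = torch.randn(32, 128, 4096)
print(pool_kv_heads(w, num_heads=32, num_groups=8).shape)  # torch.Size([8, 128, 4096])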