Beyond next-token prediction: A performance characterization of diffusion versus autoregressive language models. arXiv preprint arXiv:2510.04146, 2025
3 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (3)
Verdicts: UNVERDICTED (3)
Representative citing papers:
Introduces a sketch-based watermarking method for masked diffusion language models providing an order-agnostic detection statistic decoupled from local context.
EfficientRollout applies self-speculative decoding with quantized drafter induction and system-aware acceptance policies to cut RL rollout latency by up to 19.6% while preserving final model quality.
citing papers explorer
- HERALD: High-Throughput Block Diffusion LLM Serving via CPU-GPU Cooperative KV Cache Retrieval
  HERALD enables near-lossless accuracy at a 5-10% KV budget for block dLLMs by amortizing top-k selection across denoising steps and overlapping CPU-GPU retrieval, yielding up to 2.47x higher throughput than GPU-only inference.
- EfficientRollout: System-Aware Self-Speculative Decoding for RL Rollouts
  EfficientRollout applies self-speculative decoding with quantized drafter induction and system-aware acceptance policies to cut RL rollout latency by up to 19.6% while preserving final model quality.