Recognition: 1 theorem link
EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test
Pith reviewed 2026-05-15 18:48 UTC · model grok-4.3
The pith
By switching from feature prediction to direct token prediction and fusing features from multiple target-model layers, EAGLE-3 lets draft models keep improving as training data scales, yielding faster LLM inference.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EAGLE-3 abandons feature prediction in favor of direct token prediction and replaces reliance on top-layer features with multi-layer feature fusion via a technique named training-time test. These changes significantly enhance performance and let the draft model fully benefit from scaling up training data, yielding speedups of up to 6.5x, about 1.4x higher than EAGLE-2.
What carries the argument
Training-time test that performs multi-layer feature fusion to support direct token prediction in the draft model.
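A minimal sketch of that mechanism as the summary describes it, in PyTorch: features tapped from several target-model layers are fused, passed through a small draft decoder, and mapped straight to token logits instead of being regressed back onto target features. The module name, the three-tap choice, and the concatenate-then-project fusion are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class FusedDraftHead(nn.Module):
    """Illustrative draft module: fuse multi-layer target features,
    then predict the next token directly (no feature regression)."""
    def __init__(self, hidden: int, vocab: int, taps: int = 3):
        super().__init__()
        # Project concatenated low/mid/high-layer features to hidden size.
        self.fuse = nn.Linear(taps * hidden, hidden)
        # One decoder layer stands in for the small draft transformer.
        self.decoder = nn.TransformerDecoderLayer(
            d_model=hidden, nhead=8, batch_first=True)
        # Direct token prediction: a plain LM head over the vocabulary.
        self.lm_head = nn.Linear(hidden, vocab, bias=False)

    def forward(self, tapped: list[torch.Tensor]) -> torch.Tensor:
        fused = self.fuse(torch.cat(tapped, dim=-1))      # [B, T, hidden]
        return self.lm_head(self.decoder(fused, fused))   # token logits

# Toy usage: three feature taps from a hypothetical 4096-dim target.
taps = [torch.randn(1, 16, 4096) for _ in range(3)]
print(FusedDraftHead(4096, 32000)(taps).shape)  # torch.Size([1, 16, 32000])
```

Training such a head with plain cross-entropy on next tokens, rather than an L1/L2 loss on future features, is what removes the feature-consistency constraint the core claim blames for poor data scaling.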
If this is right
- Inference runs up to 6.5 times faster than standard autoregressive decoding on chat and reasoning tasks (a back-of-envelope for how such ratios arise follows this list).
- The draft model delivers roughly 1.4 times the speedup of EAGLE-2 when both use the same training scale.
- Throughput rises 1.38 times in the SGLang framework at a batch size of 64.
- Gains appear consistently across both chat-oriented and reasoning-oriented target models on five separate benchmarks.
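A hedged back-of-envelope for how such ratios arise in speculative decoding generally, not the paper's own derivation: each verification cycle commits the accepted draft tokens plus one token from the target itself, at the cost of one target pass and several cheap draft passes. All numbers below are made up for illustration.

```python
# Illustrative speculative-decoding speedup arithmetic (assumed cost
# model): per cycle the target runs once and commits (tau + 1) tokens
# after gamma draft passes, each costing a fraction of a target pass.
def speedup(tau: float, gamma: int, draft_cost: float) -> float:
    tokens_per_cycle = tau + 1.0               # accepted drafts + 1 target token
    cost_per_cycle = 1.0 + gamma * draft_cost  # 1 target pass + gamma draft passes
    return tokens_per_cycle / cost_per_cycle   # vs. 1 token per target pass

# Hypothetical values: 6 drafted tokens, ~5.8 accepted on average,
# draft pass at 5% of a target pass -> ballpark of the reported range.
print(round(speedup(tau=5.8, gamma=6, draft_cost=0.05), 2))  # 5.23
```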
Where Pith is reading between the lines
- The same direct-prediction shift could be tested in other speculative-sampling methods that currently rely on feature matching.
- If multi-layer fusion proves stable, it may allow smaller target models to be paired with stronger drafts without losing end-to-end quality.
- Pairing the technique with quantization or other compression methods could produce compounded reductions in latency and memory.
Load-bearing premise
Direct token prediction combined with multi-layer feature fusion will remove prior constraints on scaling training data without introducing new accuracy or stability problems in the draft model.
What would settle it
Train an EAGLE-3 draft model on substantially more data than the EAGLE-2 baseline and track the measured speedup ratio as data grows; if acceptance rates and speedup stay flat or drop, the claim does not hold.
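A sketch of that settling experiment with stand-in numbers: train draft checkpoints at increasing data scales for both methods and test whether EAGLE-3's acceptance curve keeps rising while the feature-prediction baseline's plateaus. `train_and_eval` is a hypothetical placeholder for the released training and benchmark scripts, and its values here are synthetic.

```python
# Sketch of the scaling check; train_and_eval is a hypothetical stand-in
# for "train a draft at this data scale, then benchmark acceptance", and
# the hard-coded numbers exist only to exercise the pass/fail logic.
DATA_SCALES = [1, 2, 4, 8]  # relative training-data volume

def train_and_eval(method: str, scale: int) -> float:
    synthetic = {
        "feature_prediction":      {1: 3.2, 2: 3.3, 4: 3.3, 8: 3.3},
        "direct_token_prediction": {1: 3.4, 2: 3.9, 4: 4.5, 8: 5.1},
    }
    return synthetic[method][scale]  # mean acceptance length (fake)

def strictly_rising(curve: list[float]) -> bool:
    return all(b > a for a, b in zip(curve, curve[1:]))

eagle3 = [train_and_eval("direct_token_prediction", s) for s in DATA_SCALES]
eagle2 = [train_and_eval("feature_prediction", s) for s in DATA_SCALES]
print("claim holds" if strictly_rising(eagle3) and not strictly_rising(eagle2)
      else "claim fails")
```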
Original abstract
The sequential nature of modern LLMs makes them expensive and slow, and speculative sampling has proven to be an effective solution to this problem. Methods like EAGLE perform autoregression at the feature level, reusing top-layer features from the target model to achieve better results than vanilla speculative sampling. A growing trend in the LLM community is scaling up training data to improve model intelligence without increasing inference costs. However, we observe that scaling up data provides limited improvements for EAGLE. We identify that this limitation arises from EAGLE's feature prediction constraints. In this paper, we introduce EAGLE-3, which abandons feature prediction in favor of direct token prediction and replaces reliance on top-layer features with multi-layer feature fusion via a technique named training-time test. These improvements significantly enhance performance and enable the draft model to fully benefit from scaling up training data. Our experiments include both chat models and reasoning models, evaluated on five tasks. The results show that EAGLE-3 achieves a speedup ratio up to 6.5x, with about 1.4x improvement over EAGLE-2. In the SGLang framework, EAGLE-3 achieves a 1.38x throughput improvement at a batch size of 64. The code is available at https://github.com/SafeAILab/EAGLE.
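For readers new to the mechanism the abstract takes as given, here is a minimal draft-then-verify loop with greedy acceptance for brevity; real speculative sampling replaces the equality test with a rejection step that preserves the target distribution exactly. `draft_next` and `target_next` are hypothetical single-token predictors, and the stubs at the bottom exist only to make the example runnable.

```python
# Minimal draft-then-verify loop (greedy acceptance for brevity; true
# speculative sampling uses rejection sampling to keep the target
# distribution exact). draft_next / target_next are hypothetical stubs.
def speculative_step(prefix, draft_next, target_next, gamma=4):
    # 1. The cheap draft model proposes gamma tokens autoregressively.
    drafts = []
    for _ in range(gamma):
        drafts.append(draft_next(prefix + drafts))
    # 2. The target verifies them (in practice in ONE parallel pass),
    #    keeping the longest prefix on which the two models agree.
    accepted = []
    for tok in drafts:
        if target_next(prefix + accepted) == tok:
            accepted.append(tok)
        else:
            break
    # 3. The target always contributes one token, so progress is >= 1.
    accepted.append(target_next(prefix + accepted))
    return accepted

# Toy stubs: the draft agrees with the target except every 5th position.
target = lambda p: len(p) % 7
draft = lambda p: len(p) % 7 if len(p) % 5 else (len(p) % 7) + 1
print(speculative_step([0, 1], draft, target))  # [2, 3, 4, 5]
```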
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces EAGLE-3 as an extension of prior EAGLE speculative sampling for LLM inference acceleration. It replaces feature prediction with direct token prediction and introduces multi-layer feature fusion via a 'training-time test' technique, claiming this removes prior constraints and allows the draft model to fully benefit from scaling training data. Experiments on chat and reasoning models across five tasks report speedups up to 6.5x (1.4x over EAGLE-2) and 1.38x throughput improvement in SGLang at batch size 64, with code released.
Significance. If the core claims hold, the work offers a practical advance in speculative decoding by addressing data-scaling limitations in draft models, with potential impact on efficient LLM deployment. Strengths include empirical evaluation across multiple model types and tasks plus public code release, which supports reproducibility and follow-up work.
major comments (2)
- Abstract: the central claim that abandoning feature prediction and adding multi-layer fusion 'enable the draft model to fully benefit from scaling up training data' lacks supporting evidence; no scaling curves, performance-vs-data-volume plots, or ablations isolating direct token prediction (while holding fusion fixed) are presented, so observed gains could stem from hyperparameter changes or fusion alone rather than removal of the feature-prediction bottleneck.
- Experiments section: reported speedups (6.5x, 1.4x over EAGLE-2) and throughput numbers provide no details on exact baselines, statistical significance, error bars, data splits, or variance across runs, preventing full verification of the performance claims.
minor comments (1)
- Abstract: the phrase 'training-time test' is used without a concise definition or pointer to its implementation details; add a brief description or section reference for clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and the recommendation for major revision. We address each major comment below and commit to revising the manuscript to strengthen the evidence and clarity of our claims.
Point-by-point responses
- Referee: Abstract: the central claim that abandoning feature prediction and adding multi-layer fusion 'enable the draft model to fully benefit from scaling up training data' lacks supporting evidence; no scaling curves, performance-vs-data-volume plots, or ablations isolating direct token prediction (while holding fusion fixed) are presented, so observed gains could stem from hyperparameter changes or fusion alone rather than removal of the feature-prediction bottleneck.
  Authors: We acknowledge that the current manuscript does not present explicit scaling curves, performance-vs-data-volume plots, or ablations that isolate direct token prediction while holding multi-layer fusion fixed. Our central claim rests on the empirical observation that prior EAGLE variants (relying on feature prediction) exhibit limited gains from increased training data, whereas EAGLE-3 shows substantial improvements over EAGLE-2. To address the concern that gains may arise from other factors, we will add dedicated ablations and scaling plots in the revised version. These additions will directly test the contribution of abandoning feature prediction. Revision: yes.
- Referee: Experiments section: reported speedups (6.5x, 1.4x over EAGLE-2) and throughput numbers provide no details on exact baselines, statistical significance, error bars, data splits, or variance across runs, preventing full verification of the performance claims.
  Authors: We agree that the experimental section requires more precise reporting to enable verification. In the revision we will explicitly list the exact baseline implementations and versions (including EAGLE-2), report statistical significance tests, include error bars derived from multiple independent runs, detail the data splits used for training and evaluation, and quantify variance across runs. The public code release already supports reproducibility, but the text will be updated to include these details. Revision: yes.
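One way the promised ablation could be laid out (a sketch under assumed names, not the authors' actual experimental code): cross the prediction target with the fusion scheme, so the effect of dropping feature prediction can be read off one axis while fusion is held fixed, exactly as the referee requests.

```python
# Hypothetical 2x2 ablation grid matching the rebuttal's commitment:
# vary the prediction target and the fusion scheme independently.
from itertools import product

prediction_targets = ["feature_regression", "direct_token"]
fusion_schemes = ["top_layer_only", "multi_layer"]

for pred, fusion in product(prediction_targets, fusion_schemes):
    # Each configuration would train one draft model and report its
    # acceptance length and speedup at a fixed training-data scale.
    print({"prediction": pred, "fusion": fusion, "data_scale": "8x"})
```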
Circularity Check
No significant circularity; empirical speedups are measured outcomes, not reductions to fitted inputs or self-citations.
full rationale
The paper's central claims rest on experimental measurements of speedup (up to 6.5x, and 1.4x over EAGLE-2) across five tasks after introducing direct token prediction and training-time-test fusion. No equations are presented that define the reported ratios in terms of internally fitted parameters, and no uniqueness theorems or ansatzes are imported via self-citation to force the architecture. The scaling-data benefit is asserted from observed performance differences rather than derived by construction from the method's own definitions. The claims are therefore grounded in external benchmark measurements rather than in a self-referential derivation chain.
Axiom & Free-Parameter Ledger
free parameters (1)
- draft model training hyperparameters
axioms (1)
- domain assumption: Speculative sampling with a draft model reduces wall-clock latency while preserving the output distribution (the acceptance rule behind the preservation guarantee is restated below).
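The distribution-preservation half of that assumption is the standard speculative-sampling acceptance rule from the literature (stated here for reference, not derived in this page): with target distribution $p$ and draft distribution $q$, a drafted token $x \sim q$ is kept with probability $\min(1, p(x)/q(x))$, and on rejection a replacement is drawn from the normalized residual, which makes every committed token exactly $p$-distributed:

$$\Pr[\text{accept } x] = \min\!\left(1, \frac{p(x)}{q(x)}\right), \qquad x_{\text{resample}} \sim \frac{\max\bigl(0,\, p(\cdot) - q(\cdot)\bigr)}{\sum_{x'} \max\bigl(0,\, p(x') - q(x')\bigr)}.$$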
Lean theorems connected to this paper
- HierarchyEmergence.hierarchy_emergence_forces_phi (tagged: unclear)
  Unclear: the relation between the paper passage and the cited Recognition theorem.
  Linked passage: EAGLE-3 abandons feature prediction in favor of direct token prediction and replaces reliance on top-layer features with multi-layer feature fusion via training-time test, significantly enhancing performance and enabling the draft model to fully benefit from scaling up training data
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 19 Pith papers
- Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs. A new speculative inference system speeds up diffusion VLAs to 19.1 ms average latency (3.04x faster) on LIBERO by replacing most full 58 ms inferences with 7.8 ms draft rounds while preserving task performance.
- SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding. SlimSpec replaces the standard LM-head in draft models with a low-rank version to deliver 4-5x faster speculative decoding while preserving full vocabulary and competitive acceptance rates.
- Test-Time Speculation. Test-Time Speculation adapts draft models online via target-model verifications to sustain high acceptance lengths during long LLM generations.
- BubbleSpec: Turning Long-Tail Bubbles into Speculative Rollout Drafts for Synchronous Reinforcement Learning. BubbleSpec exploits long-tail bubbles in synchronous RL by using faster ranks' idle time to pre-generate rollout drafts for speculative decoding, reducing steps by 50% and raising throughput up to 1.8x while preservin...
- An Empirical Study of Speculative Decoding on Software Engineering Tasks. Speculative decoding accelerates LLM inference on SE tasks without accuracy loss, with model-based methods suiting code generation and model-free methods suiting repository-level repair and editing.
- NI Sampling: Accelerating Discrete Diffusion Sampling by Token Order Optimization. NI Sampling accelerates discrete diffusion language models up to 14.3 times by training a neural indicator to select which tokens to sample at each step using a trajectory-preserving objective.
- WISV: Wireless-Informed Semantic Verification for Distributed Speculative Decoding in Device-Edge LLM Inference. WISV uses a channel-aware semantic acceptance policy on hidden representations to boost accepted sequence length by up to 60.8% and cut interaction rounds by 37.3% in distributed speculative decoding, with under 1% ac...
- Drift-AR: Single-Step Visual Autoregressive Generation via Anti-Symmetric Drifting. Drift-AR achieves 3.8-5.5x speedup in AR-diffusion image models by using entropy to enable entropy-informed speculative decoding and single-step (1-NFE) anti-symmetric drifting decoding.
- KERV: Kinematic-Rectified Speculative Decoding for Embodied VLA Models. KERV integrates kinematic Kalman Filter predictions with speculative decoding in VLA models to achieve 27-37% faster inference while maintaining nearly the same task success rates.
- Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion. Orthrus unifies autoregressive and diffusion views on a shared KV cache to deliver lossless parallel token generation with up to 7.8x speedup and O(1) memory overhead.
- Enabling Performant and Flexible Model-Internal Observability for LLM Inference. DMI-Lib delivers 0.4-6.8% overhead for offline batch LLM inference and ~6% for moderate online serving while exposing rich internal signals across backends, cutting latency overhead 2-15x versus prior observability baselines.
- PARD-2: Target-Aligned Parallel Draft Model for Dual-Mode Speculative Decoding. PARD-2 uses Confidence-Adaptive Token optimization to align draft model training with acceptance length in speculative decoding, enabling dual-mode operation and up to 6.94x lossless speedup on Llama3.1-8B.
- SpecBlock: Block-Iterative Speculative Decoding with Dynamic Tree Drafting. SpecBlock achieves 8-19% higher speedup than EAGLE-3 in LLM speculative decoding by using repeated block expansions with hidden-state inheritance, a dynamic rank head, and a valid-prefix training mask.
- CASCADE: Context-Aware Relaxation for Speculative Image Decoding. CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to...
- Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving. SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.
- RACER: Retrieval-Augmented Contextual Rapid Speculative Decoding. RACER unifies retrieval of exact matching patterns with logit-driven cues to produce better speculative drafts, achieving more than 2x speedup over autoregressive decoding and outperforming prior training-free specula...
- ELMoE-3D: Leveraging Intrinsic Elasticity of MoE for Hybrid-Bonding-Enabled Self-Speculative Decoding in On-Premises Serving. ELMoE-3D achieves 6.6x average speedup and 4.4x energy efficiency gain for MoE serving on 3D hardware by scaling expert and bit elasticity for elastic self-speculative decoding.
- SMART: When is it Actually Worth Expanding a Speculative Tree? SMART uses marginal benefit-cost analysis to dynamically build efficient speculative trees, achieving 15-20% additional speedup in LLM and MLLM inference.
- Rethinking LLMOps for Fraud and AML: Building a Compliance-Grade LLM Serving Stack. Workload-aware optimizations for LLM serving in AML and fraud detection yield substantial gains in throughput, latency, and GPU utilization on synthetic compliance prompts.
discussion (0)