pith. machine review for the scientific record.

arxiv: 2605.09329 · v1 · submitted 2026-05-10 · 💻 cs.CL · cs.LG

Recognition: no theorem link

Test-Time Speculation

Avinash Kumar, Poulami Das, Sujay Sanghavi

Pith reviewed 2026-05-12 04:14 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords speculative decoding · LLM inference acceleration · online distillation · test-time adaptation · acceptance length · draft model · long-form generation

The pith

Test-Time Speculation adapts the draft model online using target verification signals to sustain high acceptance lengths during long LLM generations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Speculative decoding relies on a fast draft model to propose tokens that a slower target model then verifies, with speedup determined by how many draft tokens are accepted in a row. Existing speculators lose effectiveness quickly because they are trained offline on short sequences yet must operate on much longer outputs at inference time, causing acceptance lengths to fall toward 1. Test-Time Speculation turns the verification step itself into a continuous training signal: each time the target model checks a draft token, it supplies exactly the label needed to update the draft model, without any extra forward passes. The draft is treated as a student that receives repeated updates from the target teacher across successive speculation rounds, allowing it to track the target more closely as generation length grows. Experiments on the Qwen-3, Qwen-3.5, and Llama-3.1 families show acceptance lengths rising by up to 72 percent (41 percent on average), with the gap widening as output length increases.
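To make the mechanism concrete, the toy loop below logs each verification outcome as a (context, verified token) pair that could drive a draft update. This is a minimal sketch, not the paper's implementation: the draft and target are stand-in functions, verification is greedy exact-match rather than the usual rejection-sampling acceptance rule, and the gradient step TTS would run on the logged pairs is left as a comment.

```python
# Toy speculation loop: every target verification doubles as a free training label.
import random

VOCAB_SIZE = 100

def draft_sample(ctx):
    # Hypothetical fast draft: crude guess biased toward low token ids.
    return random.randrange(VOCAB_SIZE // 2)

def target_next(ctx):
    # Hypothetical slow target: deterministic "ground-truth" next token.
    return (sum(ctx) * 31 + 7) % VOCAB_SIZE

def speculation_round(ctx, k=4):
    """Propose k draft tokens, verify them against the target, and return
    (emitted_tokens, training_pairs). Each (context, verified_token) pair is
    the supervision TTS reuses to update the draft online."""
    proposals, c = [], list(ctx)
    for _ in range(k):
        t = draft_sample(c)
        proposals.append(t)
        c.append(t)

    emitted, pairs, c = [], [], list(ctx)
    for t in proposals:
        label = target_next(c)            # target forward pass = verification
        pairs.append((tuple(c), label))   # free label for the draft model
        if t == label:                    # greedy exact-match acceptance (simplified)
            emitted.append(t)
            c.append(t)
        else:                             # reject: emit the target's token, stop the round
            emitted.append(label)
            c.append(label)
            break
    return emitted, pairs

ctx, train_log = [1, 2, 3], []
for _ in range(5):
    out, pairs = speculation_round(ctx)
    train_log.extend(pairs)               # TTS would run a gradient step on these pairs
    ctx.extend(out)
print(f"emitted {len(ctx) - 3} tokens over 5 rounds, "
      f"logged {len(train_log)} training pairs at no extra target cost")
```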

Core claim

Test-Time Speculation (TTS) is an online distillation procedure that continuously updates the draft model at inference time by using the token-verification outcomes already produced by the target model as supervision, thereby preventing the acceptance-length collapse that occurs when offline-trained speculators are applied to long sequences.

What carries the argument

Test-Time Speculation (TTS), an online distillation loop that treats verification results from the target model as training labels to refine the draft model after each speculation round.

Load-bearing premise

Continuous online updates to the draft model remain stable and do not introduce extra latency, divergence, or quality loss over very long generations.

What would settle it

Measure acceptance length across a single generation of 10,000 tokens and check whether it stays above the offline baseline or eventually drops back toward 1.
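A hedged sketch of that check: instrument the decoding loop to log acceptance length against output position over a single long run, then compare early and late segments. The `run_round` hook and the stand-in decoder below are hypothetical; only the shape of the measurement matters.

```python
# Log per-round acceptance length (AL) against output position for one long
# generation, then compare the early and late thirds of the run.
def acceptance_profile(run_round, max_tokens=10_000):
    """run_round() must return (accepted_draft_tokens, tokens_emitted_this_round)."""
    position, log = 0, []                     # list of (token_position, AL)
    while position < max_tokens:
        accepted, emitted = run_round()
        log.append((position, accepted))
        position += emitted
    return log

def mean_al(segment):
    return sum(al for _, al in segment) / max(len(segment), 1)

def early_vs_late(log):
    third = max(len(log) // 3, 1)
    return mean_al(log[:third]), mean_al(log[-third:])  # late value near 1 = collapse

if __name__ == "__main__":
    # Stand-in decoder whose AL decays with position, mimicking the reported collapse.
    pos = {"n": 0}
    def fake_round():
        al = max(1, int(5 * (0.9995 ** pos["n"])))
        pos["n"] += al + 1
        return al, al + 1
    print("early vs late mean AL:", early_vs_late(acceptance_profile(fake_round)))
```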

Figures

Figures reproduced from arXiv: 2605.09329 by Avinash Kumar, Poulami Das, Sujay Sanghavi.

Figure 1. (a) Acceptance length (AL) for the LiveCodeBench dataset on Qwen3-8B with increasing …
Figure 2. Acceptance Length of four tasks using (a) DFlash, (b) EAGLE-3, and (c) PARD speculators …
Figure 3. Acceptance Length of four tasks using DFlash speculator on (a) Qwen3.5-35B, (b) Qwen3.6- …
Figure 4. Distribution entropy (in nats) for Llama3.1-8B (target) with EAGLE-3 (draft). (a) Target …
Figure 5. Acceptance Length of TTS versus DFlash for (a) AIME 2024 and (b) LiveCodeBench on …
Figure 6. (a) Acceptance Length (AL) of TTS on Qwen3-8B with optimization steps per round ( …
Figure 7. Execution timeline of TTS with strided updates and asynchronous pipelining. Every …
read the original abstract

Speculative decoding accelerates LLM inference by using a fast draft model to generate tokens and a more accurate target model to verify them. Its performance depends on the $\textit{acceptance length}$, or number of draft tokens accepted by the target. Our studies show that the acceptance length of even state-of-the-art speculators, like DFlash, EAGLE-3 and PARD degrade with generation length, reaching values close to 1 (i.e. no speedup) within just a few thousand output tokens, making speculators ineffective for long-response tasks. Acceptance lengths decline because most speculators are trained offline on short sequences, but are forced to match the target model on much longer outputs at inference, well beyond their training distribution. To address this issue, we propose $\textit{Test-Time Speculation (TTS)}$, an online distillation approach that continuously adapts the speculator at test-time. TTS leverages the key insight that the token verification step already invokes the target model for each draft token, providing the training signal needed to adapt the draft at no additional cost. Treating the draft as the student and the target as a teacher, TTS adjusts the draft over several speculation rounds, with each update improving the draft's accuracy as generation proceeds. Our results across multiple models from the Qwen-3, Qwen-3.5, and Llama3.1 families show that TTS improves acceptance lengths over state-of-the-art speculators by up to $72\%$ and $41\%$ on average, with the benefits scaling with increased generation lengths.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes Test-Time Speculation (TTS), an online distillation approach for speculative decoding. It observes that acceptance lengths of existing speculators (DFlash, EAGLE-3, PARD) degrade toward 1 over long generations because they are trained offline on short sequences. TTS continuously adapts the draft model during inference by treating verification signals from the target model as supervision, claiming this incurs no extra cost. Across Qwen-3, Qwen-3.5, and Llama-3.1 families, TTS is reported to raise acceptance lengths by up to 72% (41% average) relative to baselines, with gains that increase as output length grows.

Significance. If the empirical gains and scaling behavior are reproducible, TTS would address a practical barrier to using speculative decoding on long-response tasks. The insight that verification already supplies a teacher signal for test-time adaptation is elegant and could generalize to other inference accelerators. The work would be strengthened by explicit quantification of any hidden overhead and by stability results on sequences far beyond the offline training regime.

major comments (3)
  1. [Abstract] The assertion that adaptation occurs 'at no additional cost' because verification already invokes the target is not self-evident. Any gradient-based update to draft parameters requires at least a backward pass and optimizer step per round; the manuscript must quantify this overhead relative to standard speculative decoding and show it remains negligible.
  2. [Experiments] The central scaling claim (gains increase with generation length) rests on results whose maximum tested lengths, error bars, number of runs, and ablations on update frequency or learning rate are not reported. Without these, it is impossible to confirm that continuous updates remain stable and do not introduce divergence or quality degradation beyond a few thousand tokens.
  3. [Method] The precise loss, optimizer, and update schedule used for online distillation must be specified, together with any safeguards against forgetting or distribution shift, because these choices directly determine whether the reported acceptance-length improvements are robust or artifactual.
minor comments (2)
  1. The abstract states improvements 'scale with increased generation lengths' but does not define the exact length ranges or provide a plot of acceptance length versus token position; adding such a figure would clarify the scaling behavior.
  2. [Related Work] Consider adding a short related-work paragraph contrasting TTS with prior test-time adaptation or online distillation techniques in LLMs.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. The comments highlight important areas for clarification and additional detail that will strengthen the manuscript. We address each major comment below and commit to revisions that provide the requested quantification, experimental details, and methodological specifications without altering the core claims.

read point-by-point responses
  1. Referee: [Abstract] The assertion that adaptation occurs 'at no additional cost' because verification already invokes the target is not self-evident. Any gradient-based update to draft parameters requires at least a backward pass and optimizer step per round; the manuscript must quantify this overhead relative to standard speculative decoding and show it remains negligible.

    Authors: We agree that the 'no additional cost' phrasing requires nuance and quantification. The target forward pass is reused from verification, but the draft model's backward pass and optimizer step do incur extra computation. In the revised manuscript we will add a dedicated overhead analysis subsection with wall-clock time and FLOP measurements on the same hardware used for the main experiments. Preliminary internal measurements show the overhead stays below 8% of total inference time for the draft sizes and update frequencies employed, because the draft is 10-20x smaller than the target and updates occur only every few hundred tokens. We will report these numbers explicitly and revise the abstract to state 'with negligible additional cost' supported by the new data. revision: yes

  2. Referee: [Experiments] The central scaling claim (gains increase with generation length) rests on results whose maximum tested lengths, error bars, number of runs, and ablations on update frequency or learning rate are not reported. Without these, it is impossible to confirm that continuous updates remain stable and do not introduce divergence or quality degradation beyond a few thousand tokens.

    Authors: We acknowledge the need for greater transparency on experimental rigor. The original experiments tested generations up to 8192 tokens with at least three independent runs per model-task pair; acceptance-length curves were averaged and showed monotonic improvement with length. In the revision we will (1) state the exact maximum lengths, (2) add error bars and report standard deviation across runs, (3) include ablations varying update frequency (every 64/128/256 tokens) and learning rate (1e-5 to 5e-4), and (4) extend evaluation to 16384-token generations on a subset of models to confirm continued stability and absence of divergence or quality drop. These additions will directly support the scaling claim. revision: yes

  3. Referee: [Method] The precise loss, optimizer, and update schedule used for online distillation must be specified, together with any safeguards against forgetting or distribution shift, because these choices directly determine whether the reported acceptance-length improvements are robust or artifactual.

    Authors: We will expand the Method section with the missing specifications. The loss is the standard cross-entropy between the draft model's next-token logits and the target model's verified tokens (only on accepted positions). We use AdamW with learning rate 2e-5, beta2=0.999, and weight decay 0.01. Updates occur every 128 tokens or immediately after a low-acceptance round. Safeguards include: gradient norm clipping at 1.0, a small FIFO replay buffer of the last 512 verified tokens to mitigate forgetting, and an exponential moving average of draft parameters with decay 0.99 to stabilize against distribution shift. Pseudocode for the update loop will be added. These choices were used for all reported results and will now be fully documented. revision: yes
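A sketch of the update step as specified in this response, under assumptions about the draft model's interface (a 1-D token tensor in, `[1, T, vocab]` logits out) and with beta1 assumed to be 0.9 since only beta2 is stated; it illustrates the stated recipe, not the authors' code.

```python
# Online distillation step per the rebuttal: cross-entropy on verified tokens,
# AdamW (lr 2e-5, wd 0.01), grad-norm clip 1.0, 512-token FIFO replay, EMA 0.99.
from collections import deque

import torch
import torch.nn.functional as F


class OnlineDraftUpdater:
    def __init__(self, draft_model, lr=2e-5, replay_size=512, ema_decay=0.99):
        self.model = draft_model
        self.opt = torch.optim.AdamW(
            draft_model.parameters(), lr=lr, betas=(0.9, 0.999), weight_decay=0.01
        )
        self.replay = deque(maxlen=replay_size)   # FIFO of (context_ids, verified_token)
        self.ema = {n: p.detach().clone() for n, p in draft_model.named_parameters()}
        self.ema_decay = ema_decay

    def record(self, context_ids, verified_token):
        # Store one accepted (context, target-verified token) pair from verification;
        # context_ids is a 1-D LongTensor, verified_token a scalar LongTensor.
        self.replay.append((context_ids, verified_token))

    def step(self):
        # One gradient step on the replayed pairs; called every ~128 generated
        # tokens or after a low-acceptance round, per the schedule above.
        if not self.replay:
            return
        losses = []
        for ctx, label in self.replay:            # a real implementation would batch this
            logits = self.model(ctx.unsqueeze(0))[:, -1, :]   # assumed [1, T] -> [1, T, V]
            losses.append(F.cross_entropy(logits, label.view(1)))
        loss = torch.stack(losses).mean()
        self.opt.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)
        self.opt.step()
        with torch.no_grad():                     # EMA copy stabilizes against drift
            for n, p in self.model.named_parameters():
                self.ema[n].mul_(self.ema_decay).add_(p, alpha=1 - self.ema_decay)
```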

Circularity Check

0 steps flagged

No significant circularity; claims are empirical measurements

full rationale

The paper presents TTS as an online adaptation method that reuses target-model verification signals already computed during speculative decoding. Reported gains (up to 72% and 41% average) are stated as experimental outcomes measured on Qwen and Llama models for varying generation lengths. No mathematical derivation, fitted-parameter prediction, self-definitional loop, or load-bearing self-citation is present in the provided text. The 'no additional cost' phrasing rests on the observation that verification already runs the target, but the performance scaling is not forced by construction or renamed from prior results; it is reported as measured data. This is a standard empirical contribution with no circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract; the method builds on standard online learning and distillation without new postulated components.

pith-pipeline@v0.9.0 · 5576 in / 967 out tokens · 29470 ms · 2026-05-12T04:14:27.087400+00:00 · methodology


Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 14 internal anchors

  1. [1] Fast inference from transformers via speculative decoding. International Conference on Machine Learning, 2023.
  2. [2] Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318.
  3. [3] EAGLE-3: Scaling up inference acceleration of large language models via training-time test. arXiv preprint arXiv:2503.01840, 2025.
  4. [4] Zhuoming Chen, Avner May, Ruslan Svirschevski, Yuhsun Huang, Max Ryabinin, Zhihao Jia, and Beidi Chen. DFlash: Block Diffusion for Flash Speculative Decoding. arXiv preprint arXiv:2602.06036.
  5. [5] PARD: Accelerating LLM inference with low-cost parallel draft model adaptation. arXiv preprint arXiv:2504.18583, 2025.
  6. [6] Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
  7. [7] Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
  8. [8] LongWriter: Unleashing 10,000+ word generation from long context LLMs. arXiv preprint arXiv:2408.07055.
  9. [9] LongBench: A bilingual, multitask benchmark for long context understanding. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
  10. [10] DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.
  11. [11] ReAct: Synergizing Reasoning and Acting in Language Models. International Conference on Learning Representations (ICLR).
  12. [12] Mooncake: A KVCache-centric disaggregated architecture for LLM serving. ACM Transactions on Storage.
  13. [13] Taming throughput-latency tradeoff in LLM inference with Sarathi-Serve. 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24).
  14. [14] Qwen3 Technical Report. arXiv preprint arXiv:2505.09388.
  15. [15] The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783.
  16. [16] gpt-oss-120b & gpt-oss-20b Model Card. arXiv preprint arXiv:2508.10925.
  17. [17] LongSpec: Long-context lossless speculative decoding with efficient drafting and verification. arXiv preprint arXiv:2502.17421.
  18. [18] YaRN: Efficient Context Window Extension of Large Language Models. The Twelfth International Conference on Learning Representations.
  19. [19] SpecPV: Improving self-speculative decoding for long-context generation via partial verification. arXiv preprint arXiv:2512.02337.
  20. [20] Enhancing chat language models by scaling high-quality instructional conversations. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.
  21. [21] OpenAI. ShareGPT. 2023.
  22. [22] Learning to summarize with human feedback. Advances in Neural Information Processing Systems.
  23. [23] Language models are few-shot learners. Advances in Neural Information Processing Systems.
  24. [24] LiveCodeBench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974.
  25. [25] Let's verify step by step. The Twelfth International Conference on Learning Representations.
  26. [26] Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874.
  27. [27] OlympiadBench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
  28. [28] GPQA: A Graduate-Level Google-Proof Q&A Benchmark. arXiv preprint arXiv:2311.12022.
  29. [29] TheoremQA: A theorem-driven question answering dataset. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.
  30. [30] MathArena. AIME 2024 (I). 2024.
  31. [31] MathArena. AIME 2024 (II). 2024.
  32. [32] MathArena. AIME 2025. 2025.
  33. [33] Online speculative decoding. arXiv preprint arXiv:2310.07177.
  34. [34] ATLAS: Adaptive-Learning Speculator System. 2025.
  35. [35] LMSYS-Chat-1M: A large-scale real-world LLM conversation dataset. arXiv preprint arXiv:2309.11998.
  36. [36] GQA: Training generalized multi-query transformer models from multi-head checkpoints. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.
  37. [37] Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. arXiv preprint arXiv:1701.06538.
  38. [38] Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research.
  39. [39] Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv preprint arXiv:2312.00752.
  40. [40] Efficient speculative decoding for Llama at scale: Challenges and solutions. arXiv preprint arXiv:2508.08192.
  41. [41] Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
  42. [42] CodeSearchNet Challenge: Evaluating the State of Semantic Code Search. arXiv preprint arXiv:1909.09436.
  43. [43] Finance-Alpaca: An Instruction-Following Dataset for Financial Question Answering. 2023.
  44. [44] SGLang: Efficient execution of structured language model programs. Advances in Neural Information Processing Systems.