Component-Aware Self-Speculative Decoding in Hybrid Language Models
Pith reviewed 2026-05-09 18:36 UTC · model grok-4.3
The pith
How a hybrid language model combines its components determines whether component-level self-speculation can accelerate its inference.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Component-aware self-speculative decoding isolates the SSM or linear-attention subgraph inside a hybrid model and uses it as a zero-cost draft model that proposes tokens for parallel verification by the full target. Parallel hybrids integrate the subgraphs so that the draft stays distributionally close to the target, producing high acceptance; sequential hybrids interleave the layers in ways that make the same subgraph a poor match, producing near-zero acceptance. The gap is reproducible at different sizes and correlates directly with the perplexity penalty observed in an ablation that removes the draft component.
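A minimal sketch of the decode loop this claim describes, under greedy decoding as in the abstract. The draft_logits and target_logits interfaces, and the assumption that the target can score all drafted positions in one parallel pass, are illustrative stand-ins, not the paper's actual code.

```python
# Sketch of component-aware self-speculative decoding under greedy decoding.
# Assumptions (not the paper's code): draft_logits(prefix) runs only the
# SSM/linear-attention subgraph; target_logits(prefix) runs the full hybrid
# and returns one logit vector for each of the k drafted positions.
from typing import Callable, List

def generate(prefix: List[int],
             draft_logits: Callable[[List[int]], List[float]],
             target_logits: Callable[[List[int]], List[List[float]]],
             k: int = 2,
             max_new_tokens: int = 128,
             eos_id: int = 0) -> List[int]:
    tokens = list(prefix)
    produced = 0
    while produced < max_new_tokens:
        # 1. Draft k tokens greedily with the internal subgraph only.
        draft, ctx = [], list(tokens)
        for _ in range(k):
            logits = draft_logits(ctx)
            nxt = max(range(len(logits)), key=logits.__getitem__)
            draft.append(nxt)
            ctx.append(nxt)
        # 2. Verify all k drafted positions in one pass of the full model.
        verify = target_logits(tokens + draft)
        for i, tok in enumerate(draft):
            target_tok = max(range(len(verify[i])), key=verify[i].__getitem__)
            if target_tok == tok:
                tokens.append(tok)          # draft accepted
                produced += 1
            else:
                tokens.append(target_tok)   # first mismatch: keep target token
                produced += 1
                break
        if tokens and tokens[-1] == eos_id:
            break
    return tokens
```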
What carries the argument
Component-aware self-speculative decoding, which treats the SSM/linear-attention subgraph as an internal draft model whose proposals are verified in parallel by the remaining target layers.
If this is right
- Parallel hybrid designs support effective self-speculation without needing a separate smaller model.
- Perplexity degradation after removing the draft component can forecast speculative acceptance rates before any speculative run is performed (see the sketch after this list).
- Sequential hybrids require alternative acceleration methods such as LayerSkip to reach usable acceptance rates.
- The observed acceptance rates remain consistent when model size increases from 0.5B to 3B parameters.
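A minimal sketch of that perplexity-based forecast, assuming mean negative-log-likelihood measurements for the intact and ablated models are already available; the 10x cut-off is illustrative, chosen only to separate the abstract's reported 3.15x and 81.96x ratios, not a calibrated rule from the paper.

```python
import math

def perplexity(avg_nll: float) -> float:
    # Perplexity is exp of the mean token-level negative log-likelihood.
    return math.exp(avg_nll)

def speculative_viability(nll_full: float, nll_ablated: float) -> str:
    """Heuristic read-out of the perplexity-degradation signal.

    nll_full    -- mean NLL of the intact hybrid model on held-out text
    nll_ablated -- mean NLL after removing/masking the draft component
    Ratios near 3.15x (Falcon-H1 in the abstract) suggest useful acceptance;
    ratios near 81.96x (Qwen3.5) suggest near-zero acceptance.
    """
    ratio = perplexity(nll_ablated) / perplexity(nll_full)
    if ratio < 10.0:  # illustrative threshold, not from the paper
        return f"ratio {ratio:.2f}x: draft stays close to target, likely viable"
    return f"ratio {ratio:.2f}x: draft diverges from target, likely not viable"

# The abstract's ratios, back-solved into NLL space for a quick check:
print(speculative_viability(nll_full=2.0, nll_ablated=2.0 + math.log(3.15)))
print(speculative_viability(nll_full=2.0, nll_ablated=2.0 + math.log(81.96)))
```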
Where Pith is reading between the lines
- Designers of new hybrid architectures may need to weigh speculative-decoding compatibility when choosing parallel versus sequential component layouts.
- The same composition test could be applied to other inference techniques that exploit internal subgraphs, such as early-exit or mixture-of-experts routing.
- If the zero-cost assumption holds in practice, the method could be combined with existing speculative frameworks to reduce memory traffic further.
Load-bearing premise
Isolating the SSM or linear-attention subgraph as a draft adds no extra computation or synchronization cost and leaves the target model's output distribution unchanged.
What would settle it
Measure acceptance rates when the same component-aware method is applied to a new hybrid architecture whose layers mix SSM and attention in a third pattern not tested here; if rates fall between the parallel and sequential extremes, the composition-pattern claim is supported.
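A minimal sketch of how such an acceptance-rate measurement could be run under greedy decoding, with draft_next and target_next as hypothetical greedy next-token functions for the isolated subgraph and the full hybrid; a real harness would verify all k drafts in one batched target pass rather than calling the target per token.

```python
from typing import Callable, List, Sequence

def acceptance_rate(prompts: Sequence[List[int]],
                    draft_next: Callable[[List[int]], int],
                    target_next: Callable[[List[int]], int],
                    k: int = 2,
                    rounds: int = 50) -> float:
    """Empirical alpha: fraction of greedy draft tokens the target accepts.

    draft_next / target_next are hypothetical greedy next-token functions
    for the isolated subgraph and the full hybrid model; per-token target
    calls only keep the sketch short.
    """
    accepted, proposed = 0, 0
    for prompt in prompts:
        ctx = list(prompt)
        for _ in range(rounds):
            for _ in range(k):
                d = draft_next(ctx)
                t = target_next(ctx)
                proposed += 1
                if d == t:
                    accepted += 1
                    ctx.append(t)
                else:
                    ctx.append(t)  # keep the target's token and end the round
                    break
    return accepted / max(proposed, 1)
```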
original abstract
Speculative decoding accelerates autoregressive inference by drafting candidate tokens with a fast model and verifying them in parallel with the target. Self-speculative methods avoid the need for an external drafter but have been studied exclusively in homogeneous Transformer architectures. We introduce component-aware self-speculative decoding, the first method to exploit the internal architectural heterogeneity of hybrid language models, isolating the SSM/linear-attention subgraph as a zero-cost internal draft. We evaluate this on two architecturally distinct hybrid families: Falcon-H1 (parallel: Mamba-2 + attention per layer) and Qwen3.5 (sequential: interleaved linear and attention layers), with a pure Transformer control (Qwen2.5). Parallel hybrids achieve acceptance rates of alpha = 0.68 at draft length k=2 under greedy decoding, while sequential hybrids yield only alpha = 0.038 -- an 18x gap attributable to how each architecture integrates its components. The property is scale-invariant: Falcon-H1 at 3B reproduces the rates observed at 0.5B. We further show that perplexity degradation from a companion ablation study predicts speculative viability without running speculative decoding: a 3.15x ratio (Falcon) maps to alpha = 0.37 at k=4, while 81.96x (Qwen) maps to alpha = 0.019. For sequential hybrids, generic LayerSkip achieves 12x higher acceptance rates than the component-aware strategy. The composition pattern of hybrid models -- not merely the presence of alternative components -- determines whether component-level self-speculation is viable.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces component-aware self-speculative decoding for hybrid language models, isolating the SSM or linear-attention subgraph as a zero-cost internal draft model. It evaluates the approach on parallel hybrids (Falcon-H1) and sequential hybrids (Qwen3.5), reporting acceptance rates of alpha=0.68 versus alpha=0.038 at draft length k=2 under greedy decoding (an 18x gap), attributes the difference to architectural composition patterns rather than component presence alone, demonstrates scale-invariance across model sizes, and shows that perplexity degradation ratios from ablations can predict speculative viability without executing the full method. It also compares against generic LayerSkip for sequential cases and includes a pure Transformer baseline.
Significance. If the central empirical claims hold after verification of the zero-cost and distribution-preservation assumptions, the work would be significant for inference acceleration in hybrid architectures, as it offers a self-speculative technique that exploits internal heterogeneity without external drafters. The reported scale-invariance, the large acceptance gap tied to parallel vs. sequential composition, and the perplexity-based predictor are concrete contributions that could guide architecture design for speculative decoding.
major comments (2)
- [Abstract] The central claim that composition pattern (parallel vs. sequential) determines viability rests on the assumption that the isolated SSM/linear-attention subgraph is truly zero-cost and produces drafts from the identical distribution as the target model. No FLOPs, latency, or distribution-divergence (e.g., KL) measurements are provided to support this, leaving open the possibility that the 18x gap (0.68 vs. 0.038 at k=2) arises from unaccounted overhead or mismatch rather than composition alone.
- [Abstract] The perplexity-ratio predictor (3.15x maps to alpha=0.37 at k=4 for Falcon; 81.96x to alpha=0.019 for Qwen) is presented as a practical proxy, but the abstract gives no details on the ablation protocol, error bars, or statistical robustness of the mapping, which is load-bearing for the claim that perplexity degradation forecasts speculative performance without running the method.
minor comments (2)
- [Abstract] The reported alpha values lack error bars, number of evaluation runs, or dataset details, making it difficult to assess the reliability of the 18x gap and scale-invariance claims.
- [Abstract] It is unclear whether the acceptance rates hold under non-greedy decoding (e.g., temperature sampling) or if they are specific to greedy decoding as stated.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment below and agree to make revisions that provide the requested measurements and protocol details to better support the central claims.
point-by-point responses
- Referee: [Abstract] The central claim that composition pattern (parallel vs. sequential) determines viability rests on the assumption that the isolated SSM/linear-attention subgraph is truly zero-cost and produces drafts from the identical distribution as the target model. No FLOPs, latency, or distribution-divergence (e.g., KL) measurements are provided to support this, leaving open the possibility that the 18x gap (0.68 vs. 0.038 at k=2) arises from unaccounted overhead or mismatch rather than composition alone.
Authors: We acknowledge that the manuscript does not report explicit FLOPs, latency, or KL-divergence measurements to verify the zero-cost and identical-distribution assumptions. The zero-cost property follows from reusing existing subgraphs of the hybrid model without new parameters or modules. To address the concern that the acceptance-rate gap may stem from unaccounted overhead or distributional mismatch, we will add in the revision: (i) FLOPs and wall-clock latency comparisons of the isolated subgraph versus the full target model, (ii) KL divergence between the draft and target next-token distributions (a minimal measurement sketch follows these responses), and (iii) a brief discussion of how these quantities relate to the observed 18x gap. These additions will allow readers to evaluate whether architectural composition remains the dominant factor after the assumptions are quantified. revision: yes
- Referee: [Abstract] The perplexity-ratio predictor (3.15x maps to alpha=0.37 at k=4 for Falcon; 81.96x to alpha=0.019 for Qwen) is presented as a practical proxy, but the abstract gives no details on the ablation protocol, error bars, or statistical robustness of the mapping, which is load-bearing for the claim that perplexity degradation forecasts speculative performance without running the method.
Authors: We agree that the abstract omits necessary details on the ablation protocol, error bars, and statistical robustness of the perplexity-ratio predictor. In the revised manuscript we will update the abstract to include a concise description of the ablation protocol (component removal or masking procedure), report error bars or standard deviations for the perplexity ratios and corresponding acceptance rates, and note the consistency of the mapping across the evaluated model families and sizes. The main text will be expanded with the full experimental protocol and any correlation statistics, thereby making the predictor claim transparent and reproducible. revision: yes
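To make the promised distribution-divergence check concrete, here is a minimal sketch of the per-position KL measurement mentioned in the first response; the softmax and KL helpers are standard, but treating this as the authors' planned protocol is an assumption.

```python
import math
from typing import List

def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(p: List[float], q: List[float], eps: float = 1e-12) -> float:
    # KL(p || q) in nats; eps guards against zero probabilities.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def mean_draft_target_kl(draft_logit_seq: List[List[float]],
                         target_logit_seq: List[List[float]]) -> float:
    """Average KL(target || draft) across decoding positions.

    draft_logit_seq / target_logit_seq: one logit vector per position,
    produced by the isolated subgraph and the full hybrid respectively.
    """
    kls = [kl_divergence(softmax(t), softmax(d))
           for d, t in zip(draft_logit_seq, target_logit_seq)]
    return sum(kls) / len(kls)
```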
Circularity Check
No significant circularity; empirical measurements of acceptance rates are self-contained
full rationale
The paper reports direct experimental results: acceptance rates (alpha) measured under greedy decoding on Falcon-H1 (parallel hybrid) and Qwen3.5 (sequential hybrid), yielding the observed 18x gap at k=2. The perplexity-ratio mapping to alpha is presented as an empirical correlation from ablation studies, not a derivation that reduces alpha to a fitted input by the paper's equations. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing premises; the composition-pattern claim is an interpretation of comparative measurements across distinct architectures, with scale-invariance checked by running the same protocol at different model sizes. The findings rest on external benchmarks (actual decoding runs) rather than tautological reductions.
Axiom & Free-Parameter Ledger
free parameters (1)
- draft length k
axioms (1)
- domain assumption: Speculative decoding can verify multiple drafted tokens in parallel without changing the final output distribution
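This assumption is the standard speculative-sampling guarantee from [1] and [2]: accept a drafted token x with probability min(1, p(x)/q(x)) and otherwise resample from the normalized residual max(p - q, 0), which leaves the target distribution unchanged. A minimal sketch of that verification rule for a single token (the generic algorithm, not this paper's code):

```python
import random
from typing import List

def verify_token(x: int, q: List[float], p: List[float]) -> int:
    """Speculative-sampling acceptance rule for one drafted token.

    x -- token proposed by the draft model
    q -- draft model's next-token distribution (q[x] > 0 assumed)
    p -- target model's next-token distribution
    Returns a token whose distribution is exactly p.
    """
    if random.random() < min(1.0, p[x] / q[x]):
        return x  # accept the drafted token
    # Reject: resample from the normalized residual distribution max(p - q, 0).
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    z = sum(residual)
    r = random.random() * z
    acc = 0.0
    for tok, w in enumerate(residual):
        acc += w
        if r <= acc:
            return tok
    return len(residual) - 1  # numerical safety fallback
```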
Reference graph
Works this paper leans on
- [1] Y. Leviathan, M. Kalman, Y. Matias, Fast inference from transformers via speculative decoding, in: Proceedings of the International Conference on Machine Learning (ICML), 2023.
- [2] C. Chen, S. Borgeaud, G. Irving, J.-B. Lespiau, L. Sifre, J. Jumper, Accelerating large language model decoding with speculative sampling, arXiv preprint arXiv:2302.01318, 2023.
- [3] M. Elhoushi, A. Shrivastava, D. Liskovich, B. Hosmer, B. Wasti, L. Lai, A. Mahmoud, B. Acun, S. Agarwal, A. Roman, A. A. Aly, B. Chen, C.-J. Wu, LayerSkip: Enabling early exit inference and self-speculative decoding, arXiv preprint arXiv:2404.16710, 2024.
- [4]
- [5]
- [6]
- [7]
- [8] Y. Hoshino, H. Tachibana, M. Inahara, H. Takegawa, RAD: Redundancy-aware distillation for hybrid models via self-speculative decoding, arXiv preprint arXiv:2505.22135, 2025.
- [9] T. Dao, A. Gu, Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality, in: Proceedings of the International Conference on Machine Learning (ICML), 2024.
- [10] S. Yang, B. Wang, Y. Shen, R. Panda, Y. Kim, Gated linear attention transformers with hardware-efficient training, in: Proceedings of the International Conference on Machine Learning (ICML), 2024.
- [11]
- [12] Qwen Team, Qwen3.5: Towards native multimodal agents, https://qwen.ai/blog?id=qwen3.5, 2026.
- [13] O. Lieber, B. Lenz, H. Bata, G. Cohen, J. Osin, I. Dalmedigos, E. Safahi, et al., Jamba: A hybrid Transformer-Mamba language model, arXiv preprint arXiv:2403.19887, 2024.
- [14]
- [15] D. Wang, R.-J. Zhu, S. Abreu, Y. Shan, T. Kergan, Y. Pan, Y. Chou, et al., A systematic analysis of hybrid linear attention, arXiv preprint arXiv:2507.06457, 2025.
- [16]
- [17] W. Amer, U. Das, F. Kurdahi, ConfLayers: Adaptive confidence-based layer skipping for self-speculative decoding, arXiv preprint arXiv:2604.14612, 2026.
- [18]
- [19] H. Borobia, E. Seguí-Mas, G. Tormo-Carbó, Functional component ablation reveals specialization patterns in hybrid language model architectures, arXiv preprint arXiv:2603.22473, 2026. https://arxiv.org/abs/2603.22473
- [20] Qwen Team, Qwen2.5: A party of foundation models, https://qwenlm.github.io/blog/qwen2.5/, 2024.
- [21] S. Merity, C. Xiong, J. Bradbury, R. Socher, Pointer sentinel mixture models, in: Proceedings of the International Conference on Learning Representations (ICLR), 2017.
- [22] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, J. Steinhardt, Measuring massive multitask language understanding, in: Proceedings of the International Conference on Learning Representations (ICLR), 2021.
- [23] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, J. Schulman, Training verifiers to solve math word problems, arXiv preprint arXiv:2110.14168, 2021.
- [24] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, I. Stoica, Efficient memory management for large language model serving with PagedAttention, in: Proceedings of the Symposium on Operating Systems Principles (SOSP), 2023.