arxiv: 2604.01168 · v2 · submitted 2026-04-01 · 💻 cs.CL · cs.LG

Recognition: no theorem link

S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models

Jack Young

Authors on Pith no claims yet

Pith reviewed 2026-05-13 22:16 UTC · model grok-4.3

classification 💻 cs.CL cs.LG

keywords S0 tuninghybrid recurrent-attention modelsparameter-efficient fine-tuninginitial state matrixzero-overhead adaptationHumanEvalMambaGatedDeltaNet

0 comments

The pith

Tuning one initial state matrix per recurrent layer adapts hybrid models to tasks with zero inference overhead.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that hybrid language models mixing recurrent and attention layers can be adapted to new tasks simply by tuning the starting state of the recurrent components while leaving all weights unchanged. With only about 48 verified solutions from HumanEval, this S0 tuning approach raises performance above LoRA by 10.8 points on the same test set. Gains hold across two different hybrid architectures, transfer to math problems, and fail to appear on text-to-SQL or on pure attention models. The resulting state file is roughly 48 MB and lets the model switch tasks instantly without weight merging or reloading.

Core claim

S0 tuning optimizes a single initial state matrix per recurrent layer in hybrid recurrent-attention models while freezing every weight. On Qwen3.5-4B the method lifts greedy pass@1 on HumanEval by 23.6 points; on FalconH1-7B it reaches 71.8 percent, statistically indistinguishable from LoRA yet requiring no weight merging. Positive transfer occurs on MATH-500 and GSM8K, none on Spider, and a per-step state-offset variant yields still larger gains at per-step inference cost.

What carries the argument

The initial state matrix S0 for each recurrent layer, whose values are optimized to steer the model's computation trajectory for a target task without any weight modification.

Load-bearing premise

That adjusting only the starting recurrent state is enough to change the model's task behavior without needing to alter weights at all.

What would settle it

Running S0 tuning on a pure attention model and observing whether performance rises, stays flat, or falls compared with the degradation seen in the paper's control experiment.

Figures

Figures reproduced from arXiv: 2604.01168 by Jack Young.

**Figure 1.** Figure 1: Overview of S0 tuning. (a) Cross-architecture comparison on HumanEval: S0 (teal) outperforms LoRA on Qwen3.5-4B and is tied with it on FalconH1-7B (hatched bars). The Qwen comparison is significant (p < 0.001; ***); the Falcon 3-seed comparison is statistically indistinguishable from LoRA. Offset is shown for Qwen only (not applicable to FalconH1’s recurrence). Error bars show ±1 std across seeds. (b) Firs… view at source ↗

**Figure 2.** Figure 2: Computation graph for S0 tuning. The learned initial state S0 (teal) is injected into each recurrent layer before the first token. After t=1, it is absorbed into the running state and adds zero computational overhead. All model weights remain frozen. Zero-overhead property. Because S0 is injected only at t=0, it is absorbed into the running state at t=1; every subsequent step executes the unmodified recurr… view at source ↗

**Figure 3.** Figure 3: Scaling and architecture-specific tuning. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: First-character divergence in FAIL-to-PASS flips [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

read the original abstract

Using roughly 48 execution-verified HumanEval training solutions, tuning a single initial state matrix per recurrent layer, with zero inference overhead, outperforms LoRA by +10.8 pp (p < 0.001) on HumanEval. The method, which we call S0 tuning, optimizes one state matrix per recurrent layer while freezing all model weights. On Qwen3.5-4B (GatedDeltaNet hybrid), S0 tuning improves greedy pass@1 by +23.6 +/- 1.7 pp (10 seeds). On FalconH1-7B (Mamba-2 hybrid), S0 reaches 71.8% +/- 1.3 and LoRA reaches 71.4% +/- 2.4 (3 seeds), statistically indistinguishable at this sample size while requiring no weight merging. Cross-domain transfer is significant on MATH-500 (+4.8 pp, p = 0.00002, 8 seeds) and GSM8K (+2.8 pp, p = 0.0003, 10 seeds); a text-to-SQL benchmark (Spider) shows no transfer, consistent with the trajectory-steering mechanism. A prefix-tuning control on a pure Transformer (Qwen2.5-3B) degrades performance by -13.9 pp under all nine configurations tested. On Qwen3.5, a per-step state-offset variant reaches +27.1 pp, above both S0 and LoRA but with per-step inference cost. Taken together, the results show that recurrent state initialization is a strong zero-inference-overhead PEFT surface for hybrid language models when verified supervision is scarce. The tuned state is a ~48 MB file; task switching requires no weight merging or model reload. Code and library: https://github.com/jackyoung27/s0-tuning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

S0 tuning adapts hybrid models by optimizing initial recurrent states for zero-overhead gains on code tasks, but small seed counts undermine the strong statistical claims.

read the letter

S0 tuning adapts hybrid recurrent-attention models by optimizing one initial state matrix per recurrent layer while freezing all weights. This produces zero inference overhead and a reported +10.8 point edge over LoRA on HumanEval using execution-verified training data. On Qwen3.5-4B it reaches +23.6 points with 10 seeds; on FalconH1-7B the difference shrinks to a non-significant 0.4 points with only 3 seeds. Some transfer appears on MATH-500 and GSM8K but not on Spider, which aligns with a trajectory-steering story. A prefix-tuning control on a pure transformer hurts performance, helping isolate the effect to the recurrent component. The tuned state is a small file that swaps without weight merging or reloads, and code is released. These elements make the approach practical for hybrid architectures where standard PEFT adds overhead. The main weakness is the statistics. With n=3 or 10 and no reported details on the test, variance pooling, or multiple-comparison correction, the p<0.001 headline does not look robust. A single outlier run could flip the result, especially on the Falcon comparison that is already indistinguishable. The method also appears limited to hybrids; nothing suggests it extends to pure attention models. This work is for researchers focused on efficient adaptation of recurrent-attention hybrids who need low-cost task switching. Readers already experimenting with Mamba-style or GatedDeltaNet models will find the experiments directly useful. It deserves peer review because the core mechanism is simple to test and the empirical pattern is clear enough to check, even if the stats and full methods need tightening before publication.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes S0 tuning, a zero-overhead PEFT method for hybrid recurrent-attention models that optimizes one initial state matrix per recurrent layer while freezing all weights. Using roughly 48 execution-verified HumanEval solutions, it claims to outperform LoRA by +10.8 pp (p < 0.001) on HumanEval pass@1. Specific results include +23.6 +/- 1.7 pp on Qwen3.5-4B (10 seeds) and 71.8% vs 71.4% on FalconH1-7B (3 seeds, statistically indistinguishable). Positive transfer is reported to MATH-500 (+4.8 pp, p=0.00002, 8 seeds) and GSM8K (+2.8 pp, p=0.0003, 10 seeds) but not Spider; a prefix-tuning control degrades performance on pure Transformers by -13.9 pp. A per-step state-offset variant reaches +27.1 pp at added inference cost.

Significance. If the empirical results hold under rigorous statistical scrutiny, S0 tuning would constitute a meaningful contribution to efficient adaptation of hybrid models by identifying the recurrent initial state as a strong, zero-overhead PEFT surface. The work is strengthened by its use of execution-verified supervision, multiple model architectures, cross-domain transfer tests, a pure-attention control, and public code release. These elements support the claim that state initialization can steer task behavior without weight modification when verified data is scarce.

major comments (3)

[Abstract] Abstract and results: The central claim of reliable +10.8 pp outperformance over LoRA (p < 0.001) rests on small seed counts (10 for Qwen3.5-4B, only 3 for FalconH1-7B). With n=3 the reported 71.8% vs 71.4% scores are indistinguishable within reported variance, and no details are provided on the exact test (paired t-test, bootstrap, etc.), variance pooling, or multiple-comparison correction. This directly threatens the headline statistical superiority assertion.
[Results] Results section: Cross-domain transfer is asserted on MATH-500 and GSM8K with low p-values, yet the manuscript does not report the corresponding LoRA baselines on these tasks, preventing direct assessment of whether S0 tuning's advantage generalizes beyond HumanEval.
[Experimental Setup] Experimental setup: The optimization details for the initial state matrix (learning rate schedule, epochs, exact loss beyond execution verification, initialization) are not specified, which is load-bearing for reproducing the claimed zero-overhead adaptation and for understanding why the mechanism succeeds on hybrids but fails as prefix-tuning on pure Transformers.

minor comments (2)

[Abstract] The abstract states 'roughly 48' solutions but does not clarify whether this exact set is reused across all benchmarks or only HumanEval.
The per-step state-offset variant (+27.1 pp) is compared to S0 and LoRA without a consolidated table that also quantifies the per-step inference overhead.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate where revisions will be made to strengthen the manuscript.

read point-by-point responses

Referee: [Abstract] Abstract and results: The central claim of reliable +10.8 pp outperformance over LoRA (p < 0.001) rests on small seed counts (10 for Qwen3.5-4B, only 3 for FalconH1-7B). With n=3 the reported 71.8% vs 71.4% scores are indistinguishable within reported variance, and no details are provided on the exact test (paired t-test, bootstrap, etc.), variance pooling, or multiple-comparison correction. This directly threatens the headline statistical superiority assertion.

Authors: We agree that the small seed count for FalconH1-7B (n=3) renders the 71.8% vs 71.4% difference statistically indistinguishable, and that the headline +10.8 pp claim is driven by the Qwen3.5-4B results (n=10). The p < 0.001 value in the abstract aggregates across models but we will revise the abstract to report model-specific results and p-values. We will also add a methods subsection specifying that comparisons used Welch's t-test on per-seed pass@1 scores with no multiple-comparison correction (as the primary contrast was pre-specified). These clarifications will be included in the revision. revision: partial
Referee: [Results] Results section: Cross-domain transfer is asserted on MATH-500 and GSM8K with low p-values, yet the manuscript does not report the corresponding LoRA baselines on these tasks, preventing direct assessment of whether S0 tuning's advantage generalizes beyond HumanEval.

Authors: We concur that LoRA baselines on MATH-500 and GSM8K are required for a fair assessment of generalization. In the revised manuscript we will add these experiments, training LoRA adapters on the same 48 execution-verified examples and evaluating on the cross-domain tasks under identical conditions. This will enable direct comparison of transfer performance between S0 tuning and LoRA. revision: yes
Referee: [Experimental Setup] Experimental setup: The optimization details for the initial state matrix (learning rate schedule, epochs, exact loss beyond execution verification, initialization) are not specified, which is load-bearing for reproducing the claimed zero-overhead adaptation and for understanding why the mechanism succeeds on hybrids but fails as prefix-tuning on pure Transformers.

Authors: We apologize for the missing details. The initial state matrix per recurrent layer is initialized to all zeros and optimized with Adam (learning rate 5e-4, no schedule) for 10 epochs using standard next-token cross-entropy loss on the 48 execution-verified solutions. We will insert a new 'Optimization Details' paragraph in the Experimental Setup section of the revision to fully document initialization, optimizer, learning rate, epochs, and loss, enabling exact reproduction. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical method with no derivation chain

full rationale

The paper describes an empirical adaptation technique (S0 tuning) that optimizes a single initial state matrix per recurrent layer on ~48 verified HumanEval solutions while freezing weights. All reported results consist of direct benchmark measurements (HumanEval pass@1, MATH-500, GSM8K, Spider) with means, standard deviations, seed counts, and p-values. No equations, mathematical derivations, ansatzes, or self-citations appear in the provided text that could reduce any claim to a fitted input or prior result by construction. The central outperformance claim rests on statistical comparisons of experimental runs, not on any self-referential logic or renamed known result.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on empirical optimization of state matrices in recurrent layers of hybrid architectures, with no new entities postulated and standard ML assumptions about convergence and task steering.

free parameters (1)

initial state matrix per recurrent layer
The matrix is optimized during tuning on task data and serves as the adaptation parameter.

axioms (1)

domain assumption Recurrent state initialization can effectively steer model behavior for downstream tasks in hybrid architectures
Core premise underlying why S0 tuning works while prefix tuning fails on pure Transformers.

pith-pipeline@v0.9.0 · 5635 in / 1286 out tokens · 70740 ms · 2026-05-13T22:16:49.071447+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Where Should LoRA Go? Component-Type Placement in Hybrid Language Models
cs.CL 2026-04 unverdicted novelty 6.0

Adapting only the attention components with LoRA outperforms full-model adaptation in hybrid LLMs, with recurrent adaptation harming sequential hybrids but helping parallel ones.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · cited by 1 Pith paper · 8 internal anchors

[1]

Evaluating Large Language Models Trained on Code

Mark Chen et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Training Verifiers to Solve Math Word Problems

10 Karl Cobbe, Vinod Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

Tri Dao and Albert Gu. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality.arXiv preprint arXiv:2405.21060,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Falcon-H1: A family of hybrid-head language models redefining efficiency and performance.arXiv preprint arXiv:2507.22448,

Falcon LLM Team. Falcon-H1: A family of hybrid-head language models redefining efficiency and performance.arXiv preprint arXiv:2507.22448,

work page arXiv
[5]

2024 , archivePrefix=

Tristan Galim, Adrien Bénédict, Amir Moawad, Romain Franceschini, and Edouard Mathieu. Parameter-efficient fine-tuning of state space models.arXiv preprint arXiv:2410.09016,

work page arXiv
[6]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces.arXiv preprint arXiv:2312.00752,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Parameter efficient mamba tuning via projector-targeted diagonal-centric linear transformation.arXiv preprint arXiv:2411.15224,

Seokil Ham, Hee-Seon Kim, Sangmin Woo, and Changick Kim. Parameter efficient mamba tuning via projector-targeted diagonal-centric linear transformation.arXiv preprint arXiv:2411.15224,

work page arXiv
[8]

Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. MATH: Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

LoRA: Low-Rank Adaptation of Large Language Models

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models.arXiv preprint arXiv:2106.09685,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

State- offset tuning: State-based parameter-efficient fine-tuning for state space models.arXiv preprint arXiv:2503.03499,

Wonjun Kang, Kevin Galim, Yuchen Zeng, Minjae Lee, Hyung Il Koo, and Nam Ik Cho. State- offset tuning: State-based parameter-efficient fine-tuning for state space models.arXiv preprint arXiv:2503.03499,

work page arXiv
[11]

Li, Berlin Chen, Caitlin Wang, Aviv Bick, J

Aakash Lahoti, Kevin Y . Li, Berlin Chen, Caitlin Wang, Aviv Bick, J. Zico Kolter, Tri Dao, and Albert Gu. Mamba-3: Improved sequence modeling using state space principles.arXiv preprint arXiv:2603.15569,

work page arXiv
[12]

Memba: Membrane- driven parameter-efficient fine-tuning for mamba.arXiv preprint arXiv:2506.18184,

Donghyun Lee, Yuhang Li, Ruokai Yin, Shiting Xiao, and Priyadarshini Panda. Memba: Membrane- driven parameter-efficient fine-tuning for mamba.arXiv preprint arXiv:2506.18184,

work page arXiv
[13]

Lina-Speech: Gated linear attention is a fast and parameter-efficient learner for text-to-speech synthesis.arXiv preprint arXiv:2410.23320,

Théodore Lemerle, Adel Music, Nicolas Music, and Thomas Music. Lina-Speech: Gated linear attention is a fast and parameter-efficient learner for text-to-speech synthesis.arXiv preprint arXiv:2410.23320,

work page arXiv
[14]

State tuning: Tuning the recurrent state for efficient adaptation of large language models.arXiv preprint arXiv:2504.05097,

Xiao Liu et al. State tuning: Tuning the recurrent state for efficient adaptation of large language models.arXiv preprint arXiv:2504.05097,

work page arXiv
[15]

State soup: In-context skill learning, retrieval and mixing.arXiv preprint arXiv:2406.08423,

Maciej Pióro et al. State soup: In-context skill learning, retrieval and mixing.arXiv preprint arXiv:2406.08423,

work page arXiv
[16]

Qwen2.5 Technical Report

Qwen Team. Qwen2.5 technical report.arXiv preprint arXiv:2412.15115, 2025a. Qwen Team. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025b. Dustin Wang, Rui-Jie Zhu, Steven Abreu, Yong Shan, Taylor Kergan, Yuqi Pan, Yuhong Chou, Zheng Li, Ge Zhang, Wenhao Huang, and Jason Eshraghian. A systematic analysis of hybrid linear attention.arXiv preprin...

work page internal anchor Pith review Pith/arXiv arXiv
[17]

Gated Delta Networks: Improving Mamba2 with Delta Rule

Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving mamba2 with delta rule.arXiv preprint arXiv:2412.06464,

work page internal anchor Pith review Pith/arXiv arXiv
[18]

MambaPEFT: Exploring parameter-efficient fine-tuning for mamba.arXiv preprint arXiv:2411.03855,

11 Masakazu Yoshimura, Teruaki Hayashi, and Yota Maeda. MambaPEFT: Exploring parameter-efficient fine-tuning for mamba.arXiv preprint arXiv:2411.03855,

work page arXiv
[19]

SSMLoRA: Enhancing low-rank adaptation with state space model.arXiv preprint arXiv:2502.04958,

Jiayang Yu, Yihang Zhang, Bin Wang, Peiqin Lin, Yongkang Liu, and Shi Feng. SSMLoRA: Enhancing low-rank adaptation with state space model.arXiv preprint arXiv:2502.04958,

work page arXiv
[20]

Adaptable symbolic music infilling with MIDI-RWKV

Christian Zhou-Zheng and Philippe Pasquier. Adaptable symbolic music infilling with MIDI-RWKV. arXiv preprint arXiv:2506.13001,

work page arXiv
[21]

do no harm

A Hyperparameters Table 4 lists all hyperparameters for the S0 and LoRA experiments reported in the main text. Table 4:Hyperparameters for the Qwen3.5 and FalconH1 experiments. Hyperparameter S 0 LoRA Learning rate1×10 −3 5×10 −4 (Qwen) /1×10 −4 (Falcon) Optimizer Adam Adam Training steps 20 50 Batch size 1 1 L2 regularization5×10 −4 — Alpha scaling 0.07 ...

work page 2024