SSM Adapters via Hankel Reduced-order Modeling: Injection Site Determines Task Suitability in Long-Context Fine-Tuning
Pith reviewed 2026-06-26 01:45 UTC · model grok-4.3
The pith
Hankel-reduced SSM adapters in MLP blocks outperform LoRA on long-context tasks with matching parameter count.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
An SSM adapter initialized by balanced truncation of empirical Hankel Grammians and injected at MLP sites supplies a parameter-efficient residual that matches LoRA's compute cost through FFT scanning while delivering higher task performance on long-context sequence modeling benchmarks.
What carries the argument
The HRM adapter: an SSM residual module whose matrices are obtained by balanced truncation of empirical Hankel Grammians, allowing exact FFT-based parallel scan via the preserved time-invariance of the system matrix.
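The reason time-invariance permits an FFT-based scan can be illustrated with a minimal sketch. It assumes a scalar-state linear recurrence with fixed coefficients `a`, `b`, `c` (hypothetical names, not the paper's parameterization, and without the gating the paper describes). The sequential scan and the FFT convolution with kernel `c * a**k * b` should agree up to floating-point rounding.

```python
import cmath

def fft(x, invert=False):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], invert)
    odd = fft(x[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out

def ssm_recurrent(a, b, c, u):
    """Sequential scan: h_t = a*h_{t-1} + b*u_t, y_t = c*h_t (scalar state)."""
    h, ys = 0.0, []
    for u_t in u:
        h = a * h + b * u_t
        ys.append(c * h)
    return ys

def ssm_fft(a, b, c, u):
    """Same outputs via causal convolution with kernel K_k = c * a**k * b."""
    T = len(u)
    n = 1
    while n < 2 * T:  # zero-pad so circular convolution equals linear convolution
        n *= 2
    kernel = [c * (a ** k) * b for k in range(T)] + [0.0] * (n - T)
    signal = list(u) + [0.0] * (n - T)
    prod = [p * q for p, q in zip(fft(kernel), fft(signal))]
    y = fft(prod, invert=True)
    return [(v / n).real for v in y[:T]]  # unnormalized inverse, so divide by n

u = [1.0, -2.0, 0.5, 3.0, 0.0, 1.5]
y_seq = ssm_recurrent(0.9, 0.5, 2.0, u)
y_fft = ssm_fft(0.9, 0.5, 2.0, u)
max_err = max(abs(p - q) for p, q in zip(y_seq, y_fft))
print(max_err)
```

The kernel depends only on the lag `k` because `a` does not change over time. If the transition varied per token, this convolution form would no longer apply.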
If this is right
- HRM shows consistent gains across 18 synthetic configurations of DFA and parity tracking plus enwik8 character modeling.
- Gate analysis indicates the adapter learns to modulate its own recurrence, supplying an architectural alternative to low-rank updates.
- Placing the adapter in MLP blocks rather than attention projectors is required for the observed superiority on state-accumulation tasks.
- Computational cost stays at parity with LoRA across context lengths because the time-invariant scan can be computed exactly with an FFT.
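To make the synthetic state-tracking tasks in the first bullet concrete, here is a minimal sketch of how running-parity and running-DFA-state labels could be generated. The function names and the specific three-state automaton are hypothetical illustrations, not the paper's 18 configurations.

```python
import random

def parity_labels(bits):
    """Running parity: a single bit of state, flipped by every 1."""
    state, out = 0, []
    for b in bits:
        state ^= b
        out.append(state)
    return out

def dfa_labels(symbols, transitions, start=0):
    """Running DFA state, where transitions[state][symbol] gives the next state."""
    state, out = start, []
    for s in symbols:
        state = transitions[state][s]
        out.append(state)
    return out

# Hypothetical 3-state DFA over {0, 1}: symbol 1 advances the state mod 3,
# symbol 0 resets to state 0.
MOD3_RESET = {0: {0: 0, 1: 1}, 1: {0: 0, 1: 2}, 2: {0: 0, 1: 0}}

rng = random.Random(0)
seq = [rng.randint(0, 1) for _ in range(16)]
print(parity_labels(seq))
print(dfa_labels(seq, MOD3_RESET))
```

In both tasks the label at position t depends on the whole prefix, which is what "state accumulation" refers to. Parity needs one bit of state, while a DFA can need several states.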
Where Pith is reading between the lines
- The same Hankel truncation step could be applied to initialize SSM adapters inside other base architectures without retraining the reduction.
- Task suitability may be predictable from whether the target problem rewards explicit state accumulation rather than attention mixing.
- If the reduced-order model is kept fixed across tasks, the approach could lower the engineering cost of adapting new long-context models.
Load-bearing premise
Balanced truncation of empirical Hankel Grammians yields an initialization for the SSM adapter that transfers usefully to downstream fine-tuning without any task-specific re-derivation of the reduced-order model.
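For readers unfamiliar with the term, textbook balanced truncation for a stable discrete-time linear system can be written as follows. This is the standard construction and not necessarily the paper's exact procedure, which uses empirical (data-estimated) Gramians. The symbols $A$, $B$, $C$, $T$ and the reduced order $\hat{d}$ are generic notation assumed here for illustration.

$$
\begin{aligned}
x_{t+1} &= A x_t + B u_t, \qquad y_t = C x_t, \\
W_c &= \sum_{k=0}^{\infty} A^k B B^\top (A^\top)^k, \qquad
W_o = \sum_{k=0}^{\infty} (A^\top)^k C^\top C A^k, \\
\sigma_i &= \sqrt{\lambda_i(W_c W_o)}, \qquad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_n, \\
T^{-1} W_c T^{-\top} &= T^\top W_o T = \operatorname{diag}(\sigma_1, \dots, \sigma_n), \\
A_{\hat d} &= [T^{-1} A T]_{1:\hat d,\,1:\hat d}, \quad
B_{\hat d} = [T^{-1} B]_{1:\hat d,\,:}, \quad
C_{\hat d} = [C T]_{:,\,1:\hat d}.
\end{aligned}
$$

The Hankel singular values $\sigma_i$ rank states by how strongly they are both reachable from the input and visible at the output. Truncation therefore keeps the $\hat{d}$ states that matter most for the input-output map.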
What would settle it
Run the same iso-parametric comparison on Mistral-7B, but replace the LongBench suite with a new long-context task whose state-accumulation demands differ sharply from those of QuALITY or QMSum. If HRM then matches or falls below the LoRA baseline, the claim that the Hankel initialization supplies a generally suitable adapter is falsified.
Original abstract
While parameter-efficient fine-tuning (PEFT) typically targets attention projectors, its efficacy for tasks requiring sequential state accumulation remains under-explored. We examine if PEFT for such tasks can benefit from state space model (SSM) adapters, and if MLP blocks are better injection sites. We introduce the Hankel Reduced-order Model (HRM) adapter, an SSM-based residual module initialized via Balanced Truncation of empirical Hankel Grammians. By leveraging the time-invariance of the system matrix $\bar{A}$, HRM enables an exact FFT-based parallel scan, achieving computational parity with LoRA across all context lengths. In iso-parametric evaluations on Mistral-7B (8.4M trainable parameters), HRM outperforms LoRA variants on LongBench tasks, including QuALITY (+34.8% relative accuracy) and QMSum (+71.6% relative ROUGE-1). HRM further demonstrates consistent superiority across 18 configurations of synthetic state-tracking (DFA, Parity) and character-level language modeling (enwik8). Gate analysis reveals that HRM adapters effectively learn to modulate recurrence, providing a robust architectural alternative to low-rank adaptation for long-context sequence modeling.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Hankel Reduced-order Model (HRM) adapters as an SSM-based PEFT method for long-context fine-tuning. HRM is initialized via balanced truncation of empirical Hankel Grammians, injected into MLP blocks of models like Mistral-7B, and leverages time-invariant system matrices for exact FFT-based parallel scans. In iso-parametric comparisons (8.4M trainable params), it reports outperforming LoRA variants on LongBench (e.g., +34.8% relative accuracy on QuALITY, +71.6% ROUGE-1 on QMSum) and across 18 synthetic state-tracking and language modeling configurations, with gate analysis showing learned modulation of recurrence.
Significance. If the results hold under rigorous verification, the work provides evidence that SSM adapters can outperform standard low-rank methods for tasks involving sequential state accumulation, with injection site mattering and the reduced-order initialization enabling efficient inference. The computational parity with LoRA via FFT scans is a practical strength for long contexts.
major comments (3)
- [Abstract, §4] Abstract and §4 (empirical evaluation): the central claim of consistent outperformance (e.g., +34.8% on QuALITY) is reported without error bars, statistical tests, number of runs, or full baseline hyperparameter details; this makes it impossible to determine whether the gains are robust or could arise from post-hoc selection of injection site or truncation order.
- [§3.2] §3.2 (HRM initialization): the balanced truncation procedure relies on empirical Hankel Grammians, but the manuscript does not specify or ablate the input sequences used for their estimation (random vs. task-specific data); without evidence that performance is insensitive to this choice, the transferability claim without task-specific re-derivation remains unverified and load-bearing for the method's generality.
- [§4.3] §4.3 (ablation studies): no ablation is presented on the reduced model order (the sole free parameter listed in the axiom ledger), which directly controls the initialization quality and computational cost; this omission leaves open whether the reported superiority holds across reasonable orders or is tuned to the evaluated tasks.
minor comments (2)
- [§3] Notation for the reduced system matrix $\bar{A}$ is introduced without an explicit equation linking it to the original SSM parameters; a clarifying equation would improve readability.
- [Tables 1-3, Figures 2-4] Table captions and figure legends should explicitly state the number of random seeds and whether results are averaged; this is standard for empirical PEFT papers.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting issues of statistical robustness, initialization transparency, and ablation completeness. We will revise the manuscript to incorporate error bars, statistical tests, full hyperparameter details, clarification on Hankel input sequences with supporting ablation, and an ablation on reduced model order.
- Referee: [Abstract, §4] Abstract and §4 (empirical evaluation): the central claim of consistent outperformance (e.g., +34.8% on QuALITY) is reported without error bars, statistical tests, number of runs, or full baseline hyperparameter details; this makes it impossible to determine whether the gains are robust or could arise from post-hoc selection of injection site or truncation order.
  Authors: We agree the lack of error bars and statistical tests weakens claims of robustness. In revision we will rerun all LongBench and synthetic experiments with 5 random seeds, report means ± std, add paired statistical tests (e.g., Wilcoxon), and document the full hyperparameter grids searched for LoRA, DoRA, and other baselines in the appendix. This will also document the protocol used to select injection site and truncation order, mitigating post-hoc selection concerns.
  Revision: yes
- Referee: [§3.2] §3.2 (HRM initialization): the balanced truncation procedure relies on empirical Hankel Grammians, but the manuscript does not specify or ablate the input sequences used for their estimation (random vs. task-specific data); without evidence that performance is insensitive to this choice, the transferability claim without task-specific re-derivation remains unverified and load-bearing for the method's generality.
  Authors: Hankel Grammians were estimated from random Gaussian sequences of length 1024; we will state this explicitly in §3.2. We will also add a targeted ablation comparing random inputs against task-specific sequences drawn from LongBench and synthetic data, showing that random inputs produce comparable downstream performance. This supports the transferability claim while addressing the referee's concern.
  Revision: partial
- Referee: [§4.3] §4.3 (ablation studies): no ablation is presented on the reduced model order (the sole free parameter listed in the axiom ledger), which directly controls the initialization quality and computational cost; this omission leaves open whether the reported superiority holds across reasonable orders or is tuned to the evaluated tasks.
  Authors: We agree an ablation on reduced order is necessary. The order was fixed at 16 to enforce iso-parametric comparison (8.4M parameters). In revision we will add results for orders 8, 16, 24, and 32 on QuALITY, QMSum, DFA, Parity, and enwik8, together with corresponding inference-time measurements, demonstrating that superiority over LoRA holds across this range.
  Revision: yes
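The seed-level reporting promised in the first response (means ± std over 5 seeds plus a paired Wilcoxon test) could look roughly like the sketch below. The scores are made-up placeholders, not results from the paper, and the helper `wilcoxon_signed_rank` is a hypothetical stdlib-only implementation of the exact two-sided test that assumes no ties and no zero differences.

```python
from itertools import product
from statistics import mean, stdev

def wilcoxon_signed_rank(x, y):
    """Exact two-sided paired Wilcoxon signed-rank test.

    Assumes no zero differences and no tied |differences|.
    """
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    total = n * (n + 1) // 2
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    stat = min(w_plus, total - w_plus)
    # Under the null, each rank carries a + or - sign with equal probability,
    # so enumerate all 2**n sign patterns to get the exact p-value.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        plus = sum(r for r, s in zip(range(1, n + 1), signs) if s)
        if min(plus, total - plus) <= stat:
            extreme += 1
    return stat, extreme / 2 ** n

# Hypothetical per-seed scores (placeholders, NOT results from the paper).
hrm_scores = [0.52, 0.55, 0.51, 0.56, 0.54]
lora_scores = [0.50, 0.49, 0.505, 0.47, 0.51]

print(f"HRM  {mean(hrm_scores):.3f} +/- {stdev(hrm_scores):.3f}")
print(f"LoRA {mean(lora_scores):.3f} +/- {stdev(lora_scores):.3f}")
stat, p_value = wilcoxon_signed_rank(hrm_scores, lora_scores)
print(f"Wilcoxon W={stat}, exact two-sided p={p_value:.4f}")
```

With only five paired seeds, the smallest two-sided p-value this exact test can produce is 2/32 = 0.0625, even when one method wins on every seed. Pairing across seeds and tasks together, or running more seeds, would be needed to reach conventional significance thresholds.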
Circularity Check
No circularity: empirical claims rest on benchmark results, not self-referential derivations
full rationale
The paper introduces an SSM adapter initialized by balanced truncation of empirical Hankel Grammians and reports iso-parametric gains versus LoRA on LongBench and synthetic tasks. No equations, predictions, or uniqueness theorems are presented that reduce by construction to fitted inputs or prior self-citations. All load-bearing statements are experimental comparisons (e.g., +34.8% on QuALITY), which are externally falsifiable and do not form a closed loop with the initialization procedure itself.
Axiom & Free-Parameter Ledger
free parameters (1)
- reduced model order
axioms (1)
- domain assumption: Balanced truncation of the empirical Hankel Gramian yields a faithful low-order approximation suitable for neural adapter initialization
Reference graph
Works this paper leans on
- [1] De, S., Smith, S. L., Fernando, A., Botev, A., Cristian-Muraru, G., Gu, A., Haroun, R., Berrada, L., Chen, Y., Srinivasan, S., et al. Griffin: Mixing gated linear recurrences with local attention for efficient language models. arXiv preprint arXiv:2402.19427.
- [2] Geva, M., Schuster, R., Berant, J., and Levy, O. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 5484–5495, 2021.
- [3] Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752.
- [4] Gu, A., Goel, K., and Ré, C. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396.
- [5] Hao, Y., Cao, Y., and Mou, L. Flora: Low-rank adapters are secretly gradient compressors. arXiv preprint arXiv:2402.03293.
- [6] Hawthorne, C., Stasyuk, A., Roberts, A., Simon, I., Huang, C.-Z. A., Dieleman, S., Elsen, E., Engel, J., and Eck, D. Enabling factorized piano music modeling and generation with the MAESTRO dataset. arXiv preprint arXiv:1810.12247.
- [7] Hayou, S., Ghosh, N., and Yu, B. LoRA+: Efficient low rank adaptation of large models. arXiv preprint arXiv:2402.12354.
- [9] URL https://arxiv.org/abs/2310.06825. Kočiský, T., Schwarz, J., Blunsom, P., Dyer, C., Hermann, K. M., Melis, G., and Grefenstette, E. The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317–328.
- [10] Lester, B., Al-Rfou, R., and Constant, N. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 3045–3059, 2021.
- [11] Lieber, O., Lenz, B., Bata, H., Cohen, G., Osin, J., Dalmedigos, I., Safahi, E., Meirom, S., Belinkov, Y., Shalev-Shwartz, S., et al. Jamba: A hybrid Transformer-Mamba language model. arXiv preprint arXiv:2403.19887.
- [12] Liu, H., Tam, D., Muqeeth, M., Mohta, J., Huang, T., Bansal, M., and Raffel, C. A. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural Information Processing Systems, 35:1950–1965.
- [13] Pang, R. Y., Parrish, A., Joshi, N., Nangia, N., Phang, J., Chen, A., Padmakumar, V., Ma, J., Thompson, J., He, H., et al. QuALITY: Question answering with long input texts, yes! In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 5336–5358, 2022.
- [14] Park, J., Park, J., Xiong, Z., Lee, N., Cho, J., Oymak, S., Lee, K., and Papailiopoulos, D. Can Mamba learn how to learn? A comparative study on in-context learning tasks. arXiv preprint arXiv:2402.04248.
- [15] Pfeiffer, J., Rücklé, A., Poth, C., Kamath, A., Vulić, I., Ruder, S., Cho, K., and Gurevych, I. AdapterHub: A framework for adapting transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 46–54, 2020.
- [16] Zhang, M., Chen, H., Shen, C., Yang, Z., Ou, L., Yu, X., and Zhuang, B. LoRAPrune: Pruning meets low-rank parameter-efficient fine-tuning. 2023a. Zhang, Q., Chen, M., Bukharin, A., Karampatziakis, N., He, P., Cheng, Y., Chen, W., and Zhao, T. AdaLoRA: Adaptive budget allocation for parameter-efficient fine-tuning. arXiv preprint arXiv:2303.10512, 2023b. ...
- [17] (Paper excerpt, Appendix A: Iso-parameter Comparison) In order to compare HRM and LoRA on an equal footing, we ensure that $r$ and $\hat{d}$ are chosen such that $|P_{\text{LoRA}} - P_{\text{HRM}}| \le 0.1\%$. Such an iso-parametric table to choose $r$ and $\hat{d}$ is shown below. All experiments in the paper use all three tiers to demonstrate consistency, and conclusions...
- [18] (Paper excerpt) The region between curves represents the BPC advantage of HRM over LoRA. The HRM adapter's dominant state mode has a learned eigenvalue $\bar{a}_{\max}$ ($\bar{a}_{\max} \approx 0.97$–$0.99$ after training). The fraction of signal retained from a token $k$ steps ago is $\bar{a}_{\max}^{k}$. At $T=512$, the adapter retains $\bar{a}_{\max}^{256} \approx 0.97^{256} \approx 0.0006$ of signal from the midpoint of the context window. Thi...
- [19] (Paper excerpt) ... show that DFA exhibits a large advantage with $T$ while parity shows essentially zero HRM benefit. This contrast validates the memory hypothesis: HRM helps when multi-dimensional state is required, not when the task can be solved by single-bit counting. F. LongBench Tasks. Table 4. Comparison of HRM against baselines on LongBench: QuALITY, QMSum, NarrativeQ...
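The retention formula in excerpt [18] (signal from $k$ steps ago scaled by $\bar{a}_{\max}^{k}$) is easy to evaluate numerically. The sketch below does so for the quoted eigenvalue range. The function names are hypothetical, and the half-life helper is an added illustration rather than something the excerpt defines.

```python
import math

def retained_fraction(a_max, k):
    """Fraction of a token's contribution remaining after k steps of decay by a_max."""
    return a_max ** k

def half_life(a_max):
    """Number of steps after which the retained fraction drops to one half."""
    return math.log(0.5) / math.log(a_max)

for a_max in (0.97, 0.98, 0.99):
    frac = retained_fraction(a_max, 256)  # midpoint of a T=512 window
    print(f"a_max={a_max}: retained at k=256 = {frac:.2e}, half-life = {half_life(a_max):.1f} steps")
```

Evaluating exactly 0.97 gives roughly 4e-4 at k=256, the same order of magnitude as the rounded figure in the excerpt, while 0.99 gives roughly 0.08. The result is sensitive to the third decimal of the eigenvalue, so small differences in the learned value change the retained fraction noticeably.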