Where Should LoRA Go? Component-Type Placement in Hybrid Language Models
Pith reviewed 2026-05-08 12:13 UTC · model grok-4.3
The pith
Adapting only the attention components with LoRA outperforms full-model adaptation in hybrid LLMs, with recurrent adaptation harming sequential hybrids but helping parallel ones.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We find that the attention pathway -- despite being the minority component -- consistently outperforms full-model adaptation with 5-10x fewer trainable parameters. Crucially, adapting the recurrent backbone is destructive in sequential hybrids (-14.8 pp on GSM8K) but constructive in parallel ones (+8.6 pp).
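To make "attention-only placement" concrete, here is a minimal sketch using the Hugging Face PEFT library. The checkpoint name and the `target_modules` list are illustrative assumptions, not the paper's configuration; hybrid checkpoints name their projection layers differently, so inspect `model.named_modules()` to find the real ones.

```python
# Minimal sketch: attach LoRA adapters only to the attention pathway of a
# hybrid model, leaving the recurrent (SSM / linear-attention) blocks frozen.
# Uses standard PEFT + transformers APIs; module names below are illustrative.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("your-hybrid-checkpoint")  # hypothetical name

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    # Target only the attention projections; recurrent-block parameters
    # (e.g., SSM state matrices, gating) receive no adapters.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # reflects the 5-10x reduction vs full-model LoRA
```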
Load-bearing premise
That the performance differences arise primarily from hybrid topology (sequential versus parallel) rather than from model-specific details such as exact recurrent implementation, training data, or hyperparameter choices.
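The sequential/parallel distinction the premise turns on can be stated in a few lines. A minimal PyTorch sketch, assuming simplified stand-in blocks and residual wiring rather than the actual GatedDeltaNet/Mamba-2 implementations:

```python
import torch
import torch.nn as nn

class SequentialHybridLayer(nn.Module):
    """Simplified stand-in for a Qwen3.5-style sequential hybrid: attention
    consumes the recurrent block's output, so adapting the recurrent weights
    shifts the distribution the downstream attention was pretrained on."""
    def __init__(self, recurrent: nn.Module, attention: nn.Module):
        super().__init__()
        self.recurrent, self.attention = recurrent, attention

    def forward(self, x):
        x = x + self.recurrent(x)      # recurrent first ...
        return x + self.attention(x)   # ... attention sees its output

class ParallelHybridLayer(nn.Module):
    """Simplified stand-in for a Falcon-H1-style parallel hybrid: both
    blocks read the same input, so adapting one pathway leaves the other's
    input distribution untouched."""
    def __init__(self, recurrent: nn.Module, attention: nn.Module):
        super().__init__()
        self.recurrent, self.attention = recurrent, attention

    def forward(self, x):
        return x + self.recurrent(x) + self.attention(x)

# Placeholder sub-blocks; the real models use GatedDeltaNet / Mamba-2 and
# multi-head attention, with norms and residuals arranged per architecture.
x = torch.randn(2, 16, 64)
seq = SequentialHybridLayer(nn.Linear(64, 64), nn.Linear(64, 64))
par = ParallelHybridLayer(nn.Linear(64, 64), nn.Linear(64, 64))
print(seq(x).shape, par(x).shape)  # torch.Size([2, 16, 64]) for both
```

The docstrings state one plausible reading of the reported asymmetry, consistent with, though not established by, the measured deltas: in the sequential layout the recurrent block feeds attention directly, while in the parallel layout the two pathways are independent up to the merge.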
Original abstract
Hybrid language models that interleave attention with recurrent components are increasingly competitive with pure Transformers, yet standard LoRA practice applies adapters uniformly without considering the distinct functional roles of each component type. We systematically study component-type LoRA placement across two hybrid architectures -- Qwen3.5-0.8B (sequential, GatedDeltaNet + softmax attention) and Falcon-H1-0.5B (parallel, Mamba-2 SSM + attention) -- fine-tuned on three domains and evaluated on five benchmarks. We find that the attention pathway -- despite being the minority component -- consistently outperforms full-model adaptation with 5-10x fewer trainable parameters. Crucially, adapting the recurrent backbone is destructive in sequential hybrids (-14.8 pp on GSM8K) but constructive in parallel ones (+8.6 pp). We further document a transfer asymmetry: parallel hybrids exhibit positive cross-task transfer while sequential hybrids suffer catastrophic forgetting. These results establish that hybrid topology fundamentally determines adaptation response, and that component-aware LoRA placement is a necessary design dimension for hybrid architectures.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents an empirical study on LoRA adapter placement in hybrid language models that combine attention and recurrent components. Using two specific hybrids—one sequential (Qwen3.5-0.8B with GatedDeltaNet) and one parallel (Falcon-H1-0.5B with Mamba-2)—fine-tuned on three domains and tested on five benchmarks, the authors claim that attention-only adaptation is superior to full-model adaptation, that recurrent adaptation harms performance in sequential hybrids but helps in parallel ones, and that parallel hybrids show better cross-task transfer while sequential ones suffer forgetting. They conclude that component-type placement is a key design choice determined by hybrid topology.
Significance. Should the findings prove robust after addressing confounds, this would contribute to the field by demonstrating that standard uniform LoRA is not optimal for hybrids and that topology influences adaptation dynamics. It provides practical insights for parameter-efficient fine-tuning of competitive hybrid models, potentially leading to more efficient training practices. The use of multiple domains and benchmarks is a strength, though the small number of models limits broader applicability.
Major comments (3)
- [Abstract] The assertion that the results 'establish that hybrid topology fundamentally determines adaptation response' overstates the evidence. The study compares only two models that differ in multiple respects beyond topology: recurrent component type (GatedDeltaNet vs Mamba-2 SSM), model size (0.8B vs 0.5B), and pretraining details. Without additional experiments using the same recurrent module in both sequential and parallel configurations or matched scales, the causal role of topology cannot be isolated from these confounds.
- [Results] The reported performance changes, such as -14.8 percentage points on GSM8K for recurrent adaptation in the sequential hybrid and +8.6 pp in the parallel hybrid, are presented without visible error bars, statistical tests, or full tables showing all benchmarks and conditions. This undermines confidence in the 'destructive' versus 'constructive' characterization.
- [Experiments] The experimental setup does not appear to include controls for hyperparameter differences or training data variations between the two models, which could explain the observed differences in adaptation behavior rather than the sequential vs. parallel topology.
Minor comments (2)
- [Abstract] Specify the three domains and five benchmarks explicitly in the abstract for better reader orientation.
- Ensure that all figures and tables include clear legends and that any abbreviations like LoRA are defined on first use in the main text.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work. We address each of the major comments in detail below, indicating where revisions will be made to the manuscript.
Point-by-point responses
- Referee: [Abstract] The assertion that the results 'establish that hybrid topology fundamentally determines adaptation response' overstates the evidence. The study compares only two models that differ in multiple respects beyond topology: recurrent component type (GatedDeltaNet vs Mamba-2 SSM), model size (0.8B vs 0.5B), and pretraining details. Without additional experiments using the same recurrent module in both sequential and parallel configurations or matched scales, the causal role of topology cannot be isolated from these confounds.
Authors: We agree that our claim in the abstract overstates the causal isolation of topology given the differences in recurrent modules, model sizes, and pretraining between the two hybrids. While the observed patterns align with topology (sequential vs. parallel), we cannot rule out contributions from other factors. In the revised version, we will moderate the language in the abstract to 'indicate that hybrid topology influences adaptation dynamics' and expand the discussion section to explicitly acknowledge these confounds and call for future work with controlled experiments. This constitutes a partial revision as we are unable to conduct additional large-scale experiments at this time.
Revision: partial
- Referee: [Results] The reported performance changes, such as -14.8 percentage points on GSM8K for recurrent adaptation in the sequential hybrid and +8.6 pp in the parallel hybrid, are presented without visible error bars, statistical tests, or full tables showing all benchmarks and conditions. This undermines confidence in the 'destructive' versus 'constructive' characterization.
Authors: The referee is correct that the main results lack error bars, statistical tests, and comprehensive tables. We will revise the results section to include error bars from multiple random seeds, perform and report paired statistical tests (e.g., t-tests) for the key performance differences, and add full benchmark tables to the appendix. These additions will provide stronger support for the destructive and constructive characterizations of recurrent adaptation (a sketch of such an analysis follows these responses).
Revision: yes
- Referee: [Experiments] The experimental setup does not appear to include controls for hyperparameter differences or training data variations between the two models, which could explain the observed differences in adaptation behavior rather than the sequential vs. parallel topology.
Authors: We used the default or recommended hyperparameters for each base model and applied the same fine-tuning data domains and schedules where possible. However, we did not explicitly match all hyperparameters or isolate data variations as controls. We will add a dedicated subsection to the experimental setup detailing all hyperparameter choices and data preprocessing steps for transparency. Additionally, we will include a limitations paragraph noting that while the patterns hold across domains, unaccounted variations may contribute. We maintain that the topology difference is the primary variable of interest, but acknowledge the need for this clarification.
Revision: partial
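A minimal sketch of the seed-level analysis promised in the second response, assuming per-seed accuracies are available for two placement conditions on the same seeds; the numbers below are placeholders, not results from the paper:

```python
# Mean +/- std across seeds, plus a paired t-test between two LoRA placement
# conditions evaluated under identical seeds. Accuracy values are placeholders.
import numpy as np
from scipy import stats

attention_only = np.array([0.62, 0.61, 0.63, 0.60, 0.62])  # hypothetical GSM8K acc per seed
full_model     = np.array([0.55, 0.56, 0.54, 0.57, 0.55])

for name, acc in [("attention-only", attention_only), ("full-model", full_model)]:
    print(f"{name}: {acc.mean():.3f} +/- {acc.std(ddof=1):.3f}")

# Paired test is appropriate because both conditions share the same seeds.
t, p = stats.ttest_rel(attention_only, full_model)
print(f"paired t-test: t={t:.2f}, p={p:.4f}")
```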
Circularity Check
No circularity: purely empirical observations from fine-tuning experiments
Full rationale
The paper reports direct experimental results from applying LoRA to specific components in two fixed hybrid models (Qwen3.5-0.8B and Falcon-H1-0.5B) across domains and benchmarks. Claims such as attention outperforming full adaptation or recurrent adaptation being destructive/constructive are stated as measured performance deltas (e.g., -14.8 pp, +8.6 pp) rather than any derivation, prediction, or first-principles result. No equations, ansatzes, uniqueness theorems, or self-citations appear as load-bearing steps in the provided text; the central findings rest on observable outcomes from the fine-tuning runs themselves and do not reduce to their inputs by construction.
Reference graph
Works this paper leans on
[1] E.J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, LoRA: Low-Rank Adaptation of Large Language Models, in: ICLR, 2022. arXiv:2106.09685
[2] K. Galim, W. Kang, Y. Zeng, H.I. Koo, K. Lee, Parameter-Efficient Fine-Tuning of State Space Models, in: ICML, 2025. arXiv:2410.09016
[3] J. Young, S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models, arXiv preprint arXiv:2604.01168, 2026
[4] H. Borobia, E. Seguí-Mas, G. Tormo-Carbó, Functional Component Ablation Reveals Specialization Patterns in Hybrid Language Model Architectures, arXiv preprint arXiv:2603.22473, 2026
[5] J. Zuo, et al., Falcon-H1: A Family of Hybrid-Head Language Models, Technical Report, Technology Innovation Institute, 2025
[6] Qwen Team, Qwen3 Technical Report, arXiv preprint, 2025
[7] S. Yang, B. Wang, Y. Shen, R. Panda, Y. Kim, Gated Linear Attention Transformers with Hardware-Efficient Training, in: ICML, 2024
[8] T. Dao, A. Gu, Transformers are SSMs: Generalized Models and Efficient Algorithms through Structured State Space Duality, in: ICML, 2024
[9] S. Wang, et al., A Systematic Analysis of Hybrid Linear Attention, arXiv preprint, 2025
[10] A. Gu, T. Dao, Mamba: Linear-Time Sequence Modeling with Selective State Spaces, arXiv preprint arXiv:2312.00752, 2023
[11] O. Lieber, et al., Jamba: A Hybrid Transformer-Mamba Language Model, arXiv preprint, 2024
[12] P. Glorioso, et al., Zamba: A Compact 7B SSM Hybrid Model, arXiv preprint, 2024
[13] Q. Zhang, M. Chen, A. Bukharin, P. He, Y. Cheng, W. Chen, T. Zhao, AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning, in: ICLR, 2023
[14] S.-Y. Liu, C.-Y. Wang, H. Yin, P. Molchanov, Y.-C.F. Wang, K.-T. Cheng, M.-H. Chen, DoRA: Weight-Decomposed Low-Rank Adaptation, in: ICML, 2024
[15] M. Yoshimura, T. Hayashi, Y. Maeda, MambaPEFT: Exploring Parameter-Efficient Fine-Tuning for Mamba, in: ICLR, 2025. arXiv:2411.03855
[16] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, J. Schulman, Training Verifiers to Solve Math Word Problems, arXiv preprint arXiv:2110.14168, 2021
[17] S. Chaudhary, Code Alpaca: An Instruction-Following LLaMA Model for Code Generation, GitHub repository, 2023
[18] N. Ding, Y. Chen, B. Xu, Y. Qin, Z. Zheng, S. Hu, Z. Liu, M. Sun, B. Zhou, Enhancing Chat Language Models by Scaling High-quality Instructional Conversations, in: EMNLP, 2023
[19] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, J. Steinhardt, Measuring Massive Multitask Language Understanding, in: ICLR, 2021
[20] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, O. Tafjord, Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge, arXiv preprint arXiv:1803.05457, 2018
[21] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, Y. Choi, HellaSwag: Can a Machine Really Finish Your Sentence?, in: ACL, 2019
[22] M. Chen, J. Tworek, H. Jun, Q. Yuan, H.P. de Oliveira Pinto, J. Kaplan, et al., Evaluating Large Language Models Trained on Code, arXiv preprint arXiv:2107.03374, 2021