Component Ablation for Efficient Hybrid Language Model Architectures: Performance, Resilience, and Compression Implications
read the original abstract
Hybrid language models combine softmax attention with linear-time sequence mechanisms such as state-space or linear-attention layers, but the functional contribution of each component type remains insufficiently characterized. We study component-level ablation in two sub-1B hybrid language models, Qwen3.5-0.8B and Falcon-H1-0.5B, using likelihood-based evaluation, downstream benchmarks, layer-wise interventions, random controls, and representation-level diagnostics. Across the tested models, removing either attention or the alternative sequence-processing pathway substantially degrades performance, indicating that both component types contribute to model behavior. Likelihood metrics are especially sensitive to the linear-attention or state-space pathway, while downstream benchmark degradation depends on task and architecture. Layer-wise ablations show that component importance is position-dependent, with the strongest effects concentrated in early or mid-network components rather than uniformly across depth. Random-removal controls further show that hybrid architectures and same-family Transformer baselines degrade differently under structural perturbation. These results suggest that component ablation is a useful diagnostic for understanding hybrid language model architectures. The findings provide evidence relevant to efficient model design, compression, robustness analysis, and deployment decisions in architectures that combine attention with alternative sequence-processing mechanisms.
This paper has not been read by Pith yet.
Forward citations
Cited by 3 Pith papers
-
Component-Aware Self-Speculative Decoding in Hybrid Language Models
Component-aware self-speculative decoding achieves high acceptance rates in parallel hybrid models like Falcon-H1 but fails in sequential ones like Qwen3.5, with the gap tied to how components are integrated.
-
Agentic Discovery of Neural Architectures: AIRA-Compose and AIRA-Design
Multi-agent LLM systems discover new Transformer and hybrid architectures that outperform Llama 3.2 at 1B scale and approach human SOTA on long-range benchmarks.
-
Where Should LoRA Go? Component-Type Placement in Hybrid Language Models
Adapting only the attention components with LoRA outperforms full-model adaptation in hybrid LLMs, with recurrent adaptation harming sequential hybrids but helping parallel ones.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.