Recognition: no theorem link
Olmo Hybrid: From Theory to Practice and Back
Pith reviewed 2026-05-13 19:35 UTC · model grok-4.3
The pith
Hybrid models mixing attention and recurrence can express tasks, such as code execution, that neither transformers nor linear RNNs can alone, and they scale more efficiently at the 7B scale.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Hybrid models do not merely inherit the expressivity of transformers and linear RNNs, but can express tasks beyond both, such as code execution. A 7B hybrid model with Gated DeltaNet layers replacing sliding-window attention outperforms a comparable pure transformer on pretraining and mid-training evaluations and scales more efficiently, because the greater expressivity translates into better scaling efficiency during pretraining.
What carries the argument
Hybrid architecture that interleaves attention layers with Gated DeltaNet recurrent layers, enabling formal tasks like code execution and driving better scaling efficiency.
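The recurrent half of that recipe can be sketched concretely. Below is a minimal, self-contained illustration of the gated delta-rule state update that Gated DeltaNet builds on; the dimensions, gate values, and helper functions are invented for illustration, and real layers add learned projections, normalization, and hardware-efficient chunked training on top of this recurrence:

```python
# Illustrative gated delta-rule recurrence (not the paper's implementation).
# A d x d state matrix S is updated per token with a scalar forget gate a_t
# and a rank-1, error-correcting write along key k_t:
#   S_t = a_t * S_{t-1} + b_t * (v_t - a_t * S_{t-1} k_t) k_t^T

def outer(u, v):
    return [[ui * vj for vj in v] for ui in u]

def matvec(S, x):
    return [sum(sij * xj for sij, xj in zip(row, x)) for row in S]

def gated_delta_step(S, k, v, a, b):
    """One recurrence step: forget with gate a, then write (v - S k) along k."""
    Sf = [[a * x for x in row] for row in S]        # scalar forget gate
    pred = matvec(Sf, k)                            # current read-out S k
    err = [vi - pi for vi, pi in zip(v, pred)]      # delta-rule error v - S k
    upd = outer(err, k)                             # rank-1 write (v - S k) k^T
    return [[sij + b * uij for sij, uij in zip(sr, ur)]
            for sr, ur in zip(Sf, upd)]

d = 2
S = [[0.0] * d for _ in range(d)]
k, v = [1.0, 0.0], [0.5, -0.5]
S = gated_delta_step(S, k, v, a=1.0, b=1.0)
readout = matvec(S, k)          # recovers the stored value v

# Writing a new value along the same key overwrites the old association
# rather than summing into it, unlike plain linear attention.
S = gated_delta_step(S, k, [1.0, 1.0], a=1.0, b=1.0)
readout2 = matvec(S, k)         # recovers the new value
```

With full write strength (b = 1), the delta-rule term erases whatever was previously stored along k before writing the new value, which is the error-correcting behavior that separates this family from additive linear-attention updates.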
If this is right
- Hybrid models can perform code execution and other formal tasks impossible for pure transformers or linear RNNs.
- The 7B Olmo Hybrid achieves lower pretraining loss and stronger downstream results than the matched Olmo 3 transformer.
- Hybrid models exhibit significantly better scaling efficiency than transformers during pretraining.
- Increased expressivity from the hybrid design translates into more efficient scaling on general language-modeling tasks.
Where Pith is reading between the lines
- Future architecture search could prioritize expressivity analysis for specific computational patterns rather than uniform scaling.
- If the expressivity-to-scaling link generalizes, targeted recurrent insertions could lower the compute required for reasoning-intensive language tasks.
- The theory-to-practice loop shown here offers a template for using formal expressivity results to guide large-scale training experiments.
Load-bearing premise
The measured gains in pretraining loss and downstream metrics come from the hybrid architecture's added expressivity rather than from differences in optimization dynamics or layer-replacement details.
What would settle it
The claim would be falsified if a pure transformer, trained with identical hyperparameters, optimizer settings, and layer counts but without the recurrent components, matched or exceeded the hybrid on the same pretraining and downstream benchmarks.
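The matching protocol that such a test presumes can be checked mechanically. A small sketch (field names and values are hypothetical, not the paper's actual configuration) that verifies two training configs differ only in the intended component:

```python
# Sketch: verify that two training configurations differ ONLY in the intended
# component, so a performance gap can be attributed to the architecture change.
# Field names and values are hypothetical, not the paper's actual configs.

transformer_cfg = {
    "n_layers": 32,
    "d_model": 4096,
    "optimizer": "adamw",
    "lr_schedule": "cosine",
    "train_tokens": 6_000_000_000_000,
    "local_mixer": "sliding_window_attention",
}
# Identical in every field except the layer type being swapped.
hybrid_cfg = dict(transformer_cfg, local_mixer="gated_deltanet")

def config_diff(a, b):
    """Return the keys on which two flat configs disagree."""
    return {k for k in a.keys() | b.keys() if a.get(k) != b.get(k)}

# A controlled comparison requires exactly the one intended difference.
diff = config_diff(transformer_cfg, hybrid_cfg)
```

Anything that survives such a diff (data order, initialization seeds, parallelism layout) is a candidate confounder for the falsification test.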
read the original abstract
Recent work has demonstrated the potential of non-transformer language models, especially linear recurrent neural networks (RNNs) and hybrid models that mix recurrence and attention. Yet there is no consensus on whether the potential benefits of these new architectures justify the risk and effort of scaling them up. To address this, we provide evidence for the advantages of hybrid models over pure transformers on several fronts. First, theoretically, we show that hybrid models do not merely inherit the expressivity of transformers and linear RNNs, but can express tasks beyond both, such as code execution. Putting this theory to practice, we train Olmo Hybrid, a 7B-parameter model largely comparable to Olmo 3 7B but with the sliding window layers replaced by Gated DeltaNet layers. We show that Olmo Hybrid outperforms Olmo 3 across standard pretraining and mid-training evaluations, demonstrating the benefit of hybrid models in a controlled, large-scale setting. We find that the hybrid model scales significantly more efficiently than the transformer, explaining its higher performance. However, it's unclear why greater expressivity on specific formal problems should result in better scaling or superior performance on downstream tasks unrelated to those problems. To explain this apparent gap, we return to theory and argue why increased expressivity should translate to better scaling efficiency, completing the loop. Overall, our results suggest that hybrid models mixing attention and recurrent layers are a powerful extension to the language modeling paradigm: not merely to reduce memory during inference, but as a fundamental way to obtain more expressive models that scale better during pretraining.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that hybrid architectures mixing attention and linear RNN layers (via Gated DeltaNet) possess strictly greater expressivity than either pure transformers or linear RNNs alone, as shown by their ability to solve tasks such as code execution. It then trains a 7B-parameter Olmo Hybrid model by replacing sliding-window layers in Olmo 3 with Gated DeltaNet layers, reports superior pretraining loss scaling and downstream metrics, and supplies a post-hoc theoretical argument that the added expressivity explains the observed efficiency gains.
Significance. If the causal link between the demonstrated task-specific expressivity and the measured scaling improvements holds, the result would be significant: it would supply concrete evidence that hybrid models are not merely inference-efficient but can be fundamentally more expressive and scale better during pretraining, with direct implications for architecture search at the 7B+ scale.
major comments (2)
- [theoretical argument section (post-empirical)] The section returning to theory after the empirical results: the argument that expressivity on formal problems (e.g., code execution) should produce better general scaling laws and downstream performance on unrelated tasks is presented without a quantitative derivation such as a capacity bound, sample-complexity result, or explicit scaling-law exponent relating the hybrid state update to reduced loss. This leaves the central causal claim open to alternative explanations.
- [empirical results and scaling analysis] Empirical comparison section: the claim of a 'controlled, large-scale setting' is weakened by the absence of ablations that isolate Gated DeltaNet from incidental changes in optimization dynamics or layer-replacement details; without these, it is unclear whether the reported pretraining and downstream gains are attributable to expressivity rather than other factors.
minor comments (2)
- [methods] Clarify the precise definition and gating mechanism of Gated DeltaNet early in the methods section to avoid ambiguity when comparing to prior linear RNN variants.
- [training details] Add explicit statements of the hyperparameter matching protocol between Olmo Hybrid and Olmo 3 to strengthen the controlled-comparison claim.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive review. We address each major comment below and have incorporated revisions to strengthen the clarity and rigor of the theoretical and empirical sections.
read point-by-point responses
-
Referee: The section returning to theory after the empirical results: the argument that expressivity on formal problems (e.g., code execution) should produce better general scaling laws and downstream performance on unrelated tasks is presented without a quantitative derivation such as a capacity bound, sample-complexity result, or explicit scaling-law exponent relating the hybrid state update to reduced loss. This leaves the central causal claim open to alternative explanations.
Authors: We agree that the post-empirical theoretical argument remains conceptual and does not supply a quantitative derivation such as a capacity bound or explicit scaling-law exponent. Our claim is that the demonstrated strict increase in expressivity (ability to solve tasks outside the reach of either pure transformer or linear RNN) implies the hybrid can achieve strictly lower loss for a given capacity on a broader function class; this in turn produces the observed improvement in scaling efficiency. We have revised the section to state the assumptions more explicitly, acknowledge alternative explanations (e.g., optimization dynamics), and clarify that a full quantitative link would require additional distributional assumptions beyond the paper's scope. revision: partial
-
Referee: Empirical comparison section: the claim of a 'controlled, large-scale setting' is weakened by the absence of ablations that isolate Gated DeltaNet from incidental changes in optimization dynamics or layer-replacement details; without these, it is unclear whether the reported pretraining and downstream gains are attributable to expressivity rather than other factors.
Authors: We acknowledge that the absence of exhaustive ablations at 7B scale limits the strength of the causal attribution. We maintained identical data, optimizer, learning-rate schedule, and total compute, with the only change being the substitution of sliding-window layers by Gated DeltaNet. We have added a new subsection that discusses potential confounding factors and reports supporting smaller-scale (1B) controlled ablations that isolate the layer replacement while holding all other variables fixed; these show consistent gains attributable to the hybrid state update. revision: partial
Circularity Check
Theory-practice-theory loop leaves causal link from expressivity to scaling unproven
specific steps
-
other
[Abstract]
"However, it's unclear why greater expressivity on specific formal problems should result in better scaling or superior performance on downstream tasks unrelated to those problems. To explain this apparent gap, we return to theory and argue why increased expressivity should translate to better scaling efficiency, completing the loop."
Empirical scaling gains are first observed, then a theoretical argument is supplied to explain them; the argument is presented as closing the loop rather than as an a-priori derivation that independently predicts the observed exponent improvement. This renders the central causal attribution dependent on the data it purports to explain.
full rationale
The paper establishes hybrid expressivity beyond transformers/RNNs on formal tasks (e.g., code execution), trains a 7B model replacing layers with Gated DeltaNet, observes superior pretraining scaling and downstream metrics, then returns to theory to argue that the added expressivity explains the scaling gains. No quantitative bridge (capacity bound, sample-complexity result, or scaling-law derivation) is supplied showing how the hybrid update reduces the loss exponent. The explanatory step therefore depends on the empirical outcome it is invoked to justify, producing moderate circularity in the justification loop even though the initial expressivity claim itself is not self-referential.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Hybrid models mixing attention and linear recurrence can express tasks (e.g., code execution) that neither pure transformers nor pure linear RNNs can express.
invented entities (1)
-
Gated DeltaNet
no independent evidence
Forward citations
Cited by 3 Pith papers
-
VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?
VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual ...
-
Sonata: A Hybrid World Model for Inertial Kinematics under Clinical Data Scarcity
Sonata is a small hybrid world model pre-trained to predict future IMU states that outperforms autoregressive baselines on clinical discrimination, fall-risk prediction, and cross-cohort transfer while fitting on-devi...
-
Reasoning Primitives in Hybrid and Non-Hybrid LLMs
Reasoning augmentation extends the difficulty range for both architectures, but hybrid models stay robust longer than transformers as sequential dependence increases in state-based recall tasks.
Reference graph
Works this paper leans on
-
[9]
with a few small tweaks to the architecture, learning rate schedule and training data:
• We removed two heads from the model to make Olmo 3 and Olmo Hybrid more comparable in parameter count and training throughput (see below).
• Rather than using the ad-hoc piecewise learning rate schedule from Olmo 3 7B (Olmo Team, 2025), we use a standard cosine decay ...
work page 2025
-
[10]
through both the attention and GDN layers. Ulysses distributes the sequence across devices and uses all-to-all communication to transpose from a sequence-parallel layout to a head-parallel layout before each layer. After the all-to-all, each device holds the full sequence for a subset of heads, which suffices for both attention (where each head attends in...
work page 2025
-
[11]
for the non-attention layers. The Mamba2 sub-layer uses an expansion factor of 2 (intermediate size = 2d), state size n = 128, n_groups = 1, and a depthwise convolution with kernel size 4. In our hybrid Mamba2 configuration, the Mamba2 layers retain the full MLP from the attention blocks. D.4 Parameter Count and FLOP Computations: We report total (non-embedd...
work page 2026
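The configuration quoted in [11] pins down a rough per-layer parameter budget. The breakdown below is our simplified assumption, not the paper's accounting: it counts only the main projections, the depthwise convolution, and B/C-style state parameters, omitting gates, norms, and SSM dynamics terms, and the model width d_model = 4096 is hypothetical:

```python
# Rough per-layer parameter sketch for a Mamba2 sub-layer with the quoted
# settings (expansion factor 2, state size n = 128, conv kernel 4).
# Simplified assumption, not the paper's exact count.

def mamba2_sublayer_params(d_model, expand=2, d_state=128, conv_kernel=4):
    d_inner = expand * d_model            # intermediate size = 2d
    in_proj = d_model * 2 * d_inner       # project input to x and gate z
    conv = d_inner * conv_kernel          # depthwise conv weights
    ssm = 2 * d_inner * d_state           # B- and C-style state interactions
    out_proj = d_inner * d_model          # project back to d_model
    return in_proj + conv + ssm + out_proj

# Hypothetical d_model = 4096 (the excerpt does not state the model width):
total = mamba2_sublayer_params(4096)
```

Under these assumptions the input and output projections dominate; the state-size term contributes only a few percent, which is why linear-RNN layers can carry a large state at modest parameter cost.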
-
[12]
Phase 1 ≤ k ≤ k_i: all tasks are learned, with expressible tasks obtaining loss L_0 − ∆ and inexpressible tasks obtaining loss L_0 − ∆′. Thus, the expected loss is L^ϵ_∞ = L_0 − (1−ϵ)∆ − ϵ∆′.
-
[13]
Phase k_i + 1 ≤ k ≤ k_e: only expressible tasks are learned, obtaining loss L_0 − ∆. Thus, the expected loss is L_0 − (1−ϵ)∆.
-
[14]
Phase k_e + 1 ≤ k < ∞: no tasks are learned. Thus, the expected loss is L_0. Invoking Lemma 6 with these three phases, the loss L̃(D) is closely approximated by:
L̃(D) ≈ L^ϵ_∞ + (1/(α ζ(α+1))) · [(1−ϵ)∆ k_e^(−α) + ϵ∆′ k_i^(−α)]
     ≈ L^ϵ_∞ + [(1−ϵ)∆ (D/T)^(−α/(α+1)) + ϵ∆′ (D/T′)^(−α/(α+1))] / (α ζ(α+1))^(1/(α+1))
     = L^ϵ_∞ + [(1−ϵ)∆ T^(α/(α+1)) + ϵ∆′ T′^(α/(α+1))] / (α ζ(α+1))^(1/(α+1)) · D^(−α/(α+1)) ...
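A numerical sketch can make the claimed mechanism concrete: turning inexpressible tasks (gain ∆′ = 0) into expressible ones (∆′ = ∆) lowers the achievable loss floor. All constants below are invented for illustration, and the zeta function is approximated by a truncated sum:

```python
# Sketch of the three-phase loss approximation: a fraction (1 - e) of tasks is
# expressible (loss gain d_expr), a fraction e is inexpressible for the weaker
# model (gain d_inexpr). Constants are illustrative, not fitted values.

def zeta(s, terms=100_000):
    """Truncated-sum approximation of the Riemann zeta function."""
    return sum(n ** -s for n in range(1, terms + 1))

def approx_loss(D, L0, e, d_expr, d_inexpr, T, Tp, a):
    """Approximate loss after D tokens under the three learning phases."""
    L_inf = L0 - (1 - e) * d_expr - e * d_inexpr      # asymptotic loss floor
    c = (a * zeta(a + 1)) ** (1 / (a + 1))
    amortized = ((1 - e) * d_expr * T ** (a / (a + 1))
                 + e * d_inexpr * Tp ** (a / (a + 1))) / c
    return L_inf + amortized * D ** (-a / (a + 1))

# Less expressive model: the hard task fraction yields no gain (d_inexpr = 0).
base = approx_loss(1e12, L0=3.0, e=0.1, d_expr=1.0, d_inexpr=0.0,
                   T=1e3, Tp=1e3, a=0.5)
# More expressive model: the same tasks become learnable (d_inexpr = 1.0).
hybrid = approx_loss(1e12, L0=3.0, e=0.1, d_expr=1.0, d_inexpr=1.0,
                     T=1e3, Tp=1e3, a=0.5)
```

Under these toy constants the more expressive model reaches a strictly lower loss at the same token budget, and both curves decay with the same D^(−α/(α+1)) exponent, matching the review's point that the argument concerns the loss floor and amortized terms rather than a derived change in exponent.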
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.