Recognition: no theorem link
Olmo Hybrid: From Theory to Practice and Back
Pith reviewed 2026-05-13 19:35 UTC · model grok-4.3
The pith
Hybrid models mixing attention and recurrence can express tasks, such as code execution, that neither transformers nor linear RNNs can alone, and they scale more efficiently at the 7B scale.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Hybrid models do not merely inherit the expressivity of transformers and linear RNNs, but can express tasks beyond both, such as code execution. A 7B hybrid model with Gated DeltaNet layers replacing sliding-window attention outperforms a comparable pure transformer on pretraining and mid-training evaluations and scales more efficiently, because the greater expressivity translates into better scaling efficiency during pretraining.
What carries the argument
Hybrid architecture that interleaves attention layers with Gated DeltaNet recurrent layers, enabling formal tasks like code execution and driving better scaling efficiency.
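The recurrent half of that recipe can be sketched concretely. Below is a minimal, self-contained illustration of the gated delta-rule state update that Gated DeltaNet builds on; the dimensions, gate values, and helper functions are invented for illustration, and real layers add learned projections, normalization, and hardware-efficient chunked training on top of this recurrence:

```python
# Illustrative gated delta-rule recurrence (not the paper's implementation).
# A d x d state matrix S is updated per token with a scalar forget gate a_t
# and a rank-1, error-correcting write along key k_t:
#   S_t = a_t * S_{t-1} + b_t * (v_t - a_t * S_{t-1} k_t) k_t^T

def outer(u, v):
    return [[ui * vj for vj in v] for ui in u]

def matvec(S, x):
    return [sum(sij * xj for sij, xj in zip(row, x)) for row in S]

def gated_delta_step(S, k, v, a, b):
    """One recurrence step: forget with gate a, then write (v - S k) along k."""
    Sf = [[a * x for x in row] for row in S]        # scalar forget gate
    pred = matvec(Sf, k)                            # current read-out S k
    err = [vi - pi for vi, pi in zip(v, pred)]      # delta-rule error v - S k
    upd = outer(err, k)                             # rank-1 write (v - S k) k^T
    return [[sij + b * uij for sij, uij in zip(sr, ur)]
            for sr, ur in zip(Sf, upd)]

d = 2
S = [[0.0] * d for _ in range(d)]
k, v = [1.0, 0.0], [0.5, -0.5]
S = gated_delta_step(S, k, v, a=1.0, b=1.0)
readout = matvec(S, k)          # recovers the stored value v

# Writing a new value along the same key overwrites the old association
# rather than summing into it, unlike plain linear attention.
S = gated_delta_step(S, k, [1.0, 1.0], a=1.0, b=1.0)
readout2 = matvec(S, k)         # recovers the new value
```

With full write strength (b = 1), the delta-rule term erases whatever was previously stored along k before writing the new value, which is the error-correcting behavior that separates this family from additive linear-attention updates.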
If this is right
- Hybrid models can perform code execution and other formal tasks impossible for pure transformers or linear RNNs.
- The 7B Olmo Hybrid achieves lower pretraining loss and stronger downstream results than the matched Olmo 3 transformer.
- Hybrid models exhibit significantly better scaling efficiency than transformers during pretraining.
- Increased expressivity from the hybrid design translates into more efficient scaling on general language-modeling tasks.
Where Pith is reading between the lines
- Future architecture search could prioritize expressivity analysis for specific computational patterns rather than uniform scaling.
- If the expressivity-to-scaling link generalizes, targeted recurrent insertions could lower the compute required for reasoning-intensive language tasks.
- The theory-to-practice loop shown here offers a template for using formal expressivity results to guide large-scale training experiments.
Load-bearing premise
The measured gains in pretraining loss and downstream metrics come from the hybrid architecture's added expressivity rather than from differences in optimization dynamics or layer-replacement details.
What would settle it
The claim would be falsified if a pure transformer, trained with identical hyperparameters, optimizer settings, and layer counts but without the recurrent components, matched or exceeded the hybrid on the same pretraining and downstream benchmarks.
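The matching protocol that such a test presumes can be checked mechanically. A small sketch (field names and values are hypothetical, not the paper's actual configuration) that verifies two training configs differ only in the intended component:

```python
# Sketch: verify that two training configurations differ ONLY in the intended
# component, so a performance gap can be attributed to the architecture change.
# Field names and values are hypothetical, not the paper's actual configs.

transformer_cfg = {
    "n_layers": 32,
    "d_model": 4096,
    "optimizer": "adamw",
    "lr_schedule": "cosine",
    "train_tokens": 6_000_000_000_000,
    "local_mixer": "sliding_window_attention",
}
# Identical in every field except the layer type being swapped.
hybrid_cfg = dict(transformer_cfg, local_mixer="gated_deltanet")

def config_diff(a, b):
    """Return the keys on which two flat configs disagree."""
    return {k for k in a.keys() | b.keys() if a.get(k) != b.get(k)}

# A controlled comparison requires exactly the one intended difference.
diff = config_diff(transformer_cfg, hybrid_cfg)
```

Anything that survives such a diff (data order, initialization seeds, parallelism layout) is a candidate confounder for the falsification test.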
read the original abstract
Recent work has demonstrated the potential of non-transformer language models, especially linear recurrent neural networks (RNNs) and hybrid models that mix recurrence and attention. Yet there is no consensus on whether the potential benefits of these new architectures justify the risk and effort of scaling them up. To address this, we provide evidence for the advantages of hybrid models over pure transformers on several fronts. First, theoretically, we show that hybrid models do not merely inherit the expressivity of transformers and linear RNNs, but can express tasks beyond both, such as code execution. Putting this theory to practice, we train Olmo Hybrid, a 7B-parameter model largely comparable to Olmo 3 7B but with the sliding window layers replaced by Gated DeltaNet layers. We show that Olmo Hybrid outperforms Olmo 3 across standard pretraining and mid-training evaluations, demonstrating the benefit of hybrid models in a controlled, large-scale setting. We find that the hybrid model scales significantly more efficiently than the transformer, explaining its higher performance. However, it's unclear why greater expressivity on specific formal problems should result in better scaling or superior performance on downstream tasks unrelated to those problems. To explain this apparent gap, we return to theory and argue why increased expressivity should translate to better scaling efficiency, completing the loop. Overall, our results suggest that hybrid models mixing attention and recurrent layers are a powerful extension to the language modeling paradigm: not merely to reduce memory during inference, but as a fundamental way to obtain more expressive models that scale better during pretraining.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that hybrid architectures mixing attention and linear RNN layers (via Gated DeltaNet) possess strictly greater expressivity than either pure transformers or linear RNNs alone, as shown by their ability to solve tasks such as code execution. It then trains a 7B-parameter Olmo Hybrid model by replacing sliding-window layers in Olmo 3 with Gated DeltaNet layers, reports superior pretraining loss scaling and downstream metrics, and supplies a post-hoc theoretical argument that the added expressivity explains the observed efficiency gains.
Significance. If the causal link between the demonstrated task-specific expressivity and the measured scaling improvements holds, the result would be significant: it would supply concrete evidence that hybrid models are not merely inference-efficient but can be fundamentally more expressive and scale better during pretraining, with direct implications for architecture search at the 7B+ scale.
major comments (2)
- [theoretical argument section (post-empirical)] The section returning to theory after the empirical results: the argument that expressivity on formal problems (e.g., code execution) should produce better general scaling laws and downstream performance on unrelated tasks is presented without a quantitative derivation such as a capacity bound, sample-complexity result, or explicit scaling-law exponent relating the hybrid state update to reduced loss. This leaves the central causal claim open to alternative explanations.
- [empirical results and scaling analysis] Empirical comparison section: the claim of a 'controlled, large-scale setting' is weakened by the absence of ablations that isolate Gated DeltaNet from incidental changes in optimization dynamics or layer-replacement details; without these, it is unclear whether the reported pretraining and downstream gains are attributable to expressivity rather than other factors.
minor comments (2)
- [methods] Clarify the precise definition and gating mechanism of Gated DeltaNet early in the methods section to avoid ambiguity when comparing to prior linear RNN variants.
- [training details] Add explicit statements of the hyperparameter matching protocol between Olmo Hybrid and Olmo 3 to strengthen the controlled-comparison claim.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive review. We address each major comment below and have incorporated revisions to strengthen the clarity and rigor of the theoretical and empirical sections.
read point-by-point responses
-
Referee: The section returning to theory after the empirical results: the argument that expressivity on formal problems (e.g., code execution) should produce better general scaling laws and downstream performance on unrelated tasks is presented without a quantitative derivation such as a capacity bound, sample-complexity result, or explicit scaling-law exponent relating the hybrid state update to reduced loss. This leaves the central causal claim open to alternative explanations.
Authors: We agree that the post-empirical theoretical argument remains conceptual and does not supply a quantitative derivation such as a capacity bound or explicit scaling-law exponent. Our claim is that the demonstrated strict increase in expressivity (ability to solve tasks outside the reach of either pure transformer or linear RNN) implies the hybrid can achieve strictly lower loss for a given capacity on a broader function class; this in turn produces the observed improvement in scaling efficiency. We have revised the section to state the assumptions more explicitly, acknowledge alternative explanations (e.g., optimization dynamics), and clarify that a full quantitative link would require additional distributional assumptions beyond the paper's scope. revision: partial
-
Referee: Empirical comparison section: the claim of a 'controlled, large-scale setting' is weakened by the absence of ablations that isolate Gated DeltaNet from incidental changes in optimization dynamics or layer-replacement details; without these, it is unclear whether the reported pretraining and downstream gains are attributable to expressivity rather than other factors.
Authors: We acknowledge that the absence of exhaustive ablations at 7B scale limits the strength of the causal attribution. We maintained identical data, optimizer, learning-rate schedule, and total compute, with the only change being the substitution of sliding-window layers by Gated DeltaNet. We have added a new subsection that discusses potential confounding factors and reports supporting smaller-scale (1B) controlled ablations that isolate the layer replacement while holding all other variables fixed; these show consistent gains attributable to the hybrid state update. revision: partial
Circularity Check
Theory-practice-theory loop leaves causal link from expressivity to scaling unproven
specific steps
-
other
[Abstract]
"However, it's unclear why greater expressivity on specific formal problems should result in better scaling or superior performance on downstream tasks unrelated to those problems. To explain this apparent gap, we return to theory and argue why increased expressivity should translate to better scaling efficiency, completing the loop."
Empirical scaling gains are first observed, then a theoretical argument is supplied to explain them; the argument is presented as closing the loop rather than as an a-priori derivation that independently predicts the observed exponent improvement. This renders the central causal attribution dependent on the data it purports to explain.
full rationale
The paper establishes hybrid expressivity beyond transformers/RNNs on formal tasks (e.g., code execution), trains a 7B model replacing layers with Gated DeltaNet, observes superior pretraining scaling and downstream metrics, then returns to theory to argue that the added expressivity explains the scaling gains. No quantitative bridge (capacity bound, sample-complexity result, or scaling-law derivation) is supplied showing how the hybrid update reduces the loss exponent. The explanatory step therefore depends on the empirical outcome it is invoked to justify, producing moderate circularity in the justification loop even though the initial expressivity claim itself is not self-referential.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Hybrid models mixing attention and linear recurrence can express tasks (e.g., code execution) that neither pure transformers nor pure linear RNNs can express.
invented entities (1)
-
Gated DeltaNet
no independent evidence
Forward citations
Cited by 3 Pith papers
-
VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?
VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual ...
-
Sonata: A Hybrid World Model for Inertial Kinematics under Clinical Data Scarcity
Sonata is a small hybrid world model pre-trained to predict future IMU states that outperforms autoregressive baselines on clinical discrimination, fall-risk prediction, and cross-cohort transfer while fitting on-devi...
-
Reasoning Primitives in Hybrid and Non-Hybrid LLMs
Reasoning augmentation extends the difficulty range for both architectures, but hybrid models stay robust longer than transformers as sequential dependence increases in state-based recall tasks.
Reference graph
Works this paper leans on
-
[9]
with a few small tweaks to the architecture, learning rate schedule and training data:
• We removed two heads from the model to make Olmo 3 and Olmo Hybrid more comparable in parameter count and training throughput (see below).
• Rather than using the ad-hoc piecewise learning rate schedule from Olmo 3 7B (Olmo Team, 2025), we use a standard cosine decay ...
work page 2025
-
[10]
through both the attention and GDN layers. Ulysses distributes the sequence across devices and uses all-to-all communication to transpose from a sequence-parallel layout to a head-parallel layout before each layer. After the all-to-all, each device holds the full sequence for a subset of heads, which suffices for both attention (where each head attends in...
work page 2025
-
[11]
for the non-attention layers. The Mamba2 sub-layer uses an expansion factor of 2 (intermediate size = 2d), state size n = 128, n_groups = 1, and a depthwise convolution with kernel size 4. In our hybrid Mamba2 configuration, the Mamba2 layers retain the full MLP from the attention blocks. D.4 Parameter Count and FLOP Computations: We report total (non-embedd...
work page 2026
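The configuration quoted in [11] pins down a rough per-layer parameter budget. The breakdown below is our simplified assumption, not the paper's accounting: it counts only the main projections, the depthwise convolution, and B/C-style state parameters, omitting gates, norms, and SSM dynamics terms, and the model width d_model = 4096 is hypothetical:

```python
# Rough per-layer parameter sketch for a Mamba2 sub-layer with the quoted
# settings (expansion factor 2, state size n = 128, conv kernel 4).
# Simplified assumption, not the paper's exact count.

def mamba2_sublayer_params(d_model, expand=2, d_state=128, conv_kernel=4):
    d_inner = expand * d_model            # intermediate size = 2d
    in_proj = d_model * 2 * d_inner       # project input to x and gate z
    conv = d_inner * conv_kernel          # depthwise conv weights
    ssm = 2 * d_inner * d_state           # B- and C-style state interactions
    out_proj = d_inner * d_model          # project back to d_model
    return in_proj + conv + ssm + out_proj

# Hypothetical d_model = 4096 (the excerpt does not state the model width):
total = mamba2_sublayer_params(4096)
```

Under these assumptions the input and output projections dominate; the state-size term contributes only a few percent, which is why linear-RNN layers can carry a large state at modest parameter cost.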
-
[12]
Phase 1 ≤ k ≤ k_i: all tasks are learned, with expressible tasks obtaining loss L_0 − ∆ and inexpressible tasks obtaining loss L_0 − ∆′. Thus, the expected loss is L^ϵ_∞ = L_0 − (1−ϵ)∆ − ϵ∆′.
-
[13]
Phase k_i + 1 ≤ k ≤ k_e: only expressible tasks are learned, obtaining loss L_0 − ∆. Thus, the expected loss is L_0 − (1−ϵ)∆.
-
[14]
Phase k_e + 1 ≤ k < ∞: no tasks are learned. Thus, the expected loss is L_0. Invoking Lemma 6 with these three phases, the loss L̃(D) is closely approximated by:
L̃(D) ≈ L^ϵ_∞ + (1/(α ζ(α+1))) · [(1−ϵ)∆ k_e^(−α) + ϵ∆′ k_i^(−α)]
     ≈ L^ϵ_∞ + [(1−ϵ)∆ (D/T)^(−α/(α+1)) + ϵ∆′ (D/T′)^(−α/(α+1))] / (α ζ(α+1))^(1/(α+1))
     = L^ϵ_∞ + [(1−ϵ)∆ T^(α/(α+1)) + ϵ∆′ T′^(α/(α+1))] / (α ζ(α+1))^(1/(α+1)) · D^(−α/(α+1)) ...
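A numerical sketch can make the claimed mechanism concrete: turning inexpressible tasks (gain ∆′ = 0) into expressible ones (∆′ = ∆) lowers the achievable loss floor. All constants below are invented for illustration, and the zeta function is approximated by a truncated sum:

```python
# Sketch of the three-phase loss approximation: a fraction (1 - e) of tasks is
# expressible (loss gain d_expr), a fraction e is inexpressible for the weaker
# model (gain d_inexpr). Constants are illustrative, not fitted values.

def zeta(s, terms=100_000):
    """Truncated-sum approximation of the Riemann zeta function."""
    return sum(n ** -s for n in range(1, terms + 1))

def approx_loss(D, L0, e, d_expr, d_inexpr, T, Tp, a):
    """Approximate loss after D tokens under the three learning phases."""
    L_inf = L0 - (1 - e) * d_expr - e * d_inexpr      # asymptotic loss floor
    c = (a * zeta(a + 1)) ** (1 / (a + 1))
    amortized = ((1 - e) * d_expr * T ** (a / (a + 1))
                 + e * d_inexpr * Tp ** (a / (a + 1))) / c
    return L_inf + amortized * D ** (-a / (a + 1))

# Less expressive model: the hard task fraction yields no gain (d_inexpr = 0).
base = approx_loss(1e12, L0=3.0, e=0.1, d_expr=1.0, d_inexpr=0.0,
                   T=1e3, Tp=1e3, a=0.5)
# More expressive model: the same tasks become learnable (d_inexpr = 1.0).
hybrid = approx_loss(1e12, L0=3.0, e=0.1, d_expr=1.0, d_inexpr=1.0,
                     T=1e3, Tp=1e3, a=0.5)
```

Under these toy constants the more expressive model reaches a strictly lower loss at the same token budget, and both curves decay with the same D^(−α/(α+1)) exponent, matching the review's point that the argument concerns the loss floor and amortized terms rather than a derived change in exponent.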
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.