Comparing Transformers and Hybrid Models at the Token Level

William Merrill; Yanhong Li

arxiv: 2606.20936 · v1 · pith:PJBYY7AJnew · submitted 2026-06-18 · 💻 cs.CL · cs.AI

Pith reviewed 2026-06-26 16:59 UTC · model grok-4.3

keywords: hybrid language models, token-level loss, recurrent layers, attention mechanisms, state tracking, bracket matching, n-gram copying, pretraining diagnostics

The pith

Hybrid models lower loss on semantic state tokens while transformers excel on n-grams and bracket matching.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares per-token losses of a matched transformer and hybrid model on the same prefixes, breaking results down by token tags, copy features, delimiters, and synthetic probes. The hybrid shows lower loss overall, especially on open-class content words, opening delimiters, and tasks that track pronouns or entities, but the advantage shrinks or reverses on closed-class words, closing delimiters, repeated n-grams, and bracket-matching probes. These patterns indicate that recurrent layers help maintain and use document semantic state, while attention layers handle local copying and syntactic completion. A reader would care because the split clarifies which capabilities explain hybrid gains and points to architecture choices that target specific prediction types.
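The paired comparison described in this summary can be written compactly. The notation below is illustrative shorthand introduced here, not taken from the paper; the sign convention (a positive gap means the hybrid has lower loss) follows the Figure 2 caption.

```latex
% Per-token negative log-likelihood of model M at target x_t given prefix x_{<t}
\ell_M(t) = -\log p_M(x_t \mid x_{<t})

% Paired gap at position t: transformer (T) minus hybrid (H);
% positive values mean the hybrid has lower NLL
\Delta_t = \ell_{\mathrm{T}}(t) - \ell_{\mathrm{H}}(t)

% Mean gap over a stratum S of positions (for example, one tag family)
\bar{\Delta}_S = \frac{1}{|S|} \sum_{t \in S} \Delta_t
```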

Core claim

Using the open weights of matched Olmo transformer and Olmo Hybrid models, the authors measure loss at identical target tokens under identical prefixes and stratify results by natural token tags, copy features, delimiter structure, and controlled synthetic probes. The hybrid exhibits lower loss on most tag families, with the largest gains for open-class content words and smaller gains for many closed-class function words. Across prose, code, and markup the hybrid advantage is larger on opening delimiters than on closing delimiters and nearly vanishes on repeated n-grams. Synthetic probes replicate the split: the hybrid is favored on pronoun-memory and entity-tracking tasks, whereas the transformer is favored on bracket-matching tasks that require choosing closing delimiters.

What carries the argument

Token-level loss comparison stratified by tag families, delimiter opening versus closing, copy features, and synthetic probes for state tracking versus bracket matching.
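A minimal sketch of the stratified comparison, assuming per-token NLL values for both models have already been computed on the same targets and prefixes. The function name `mean_gap_by_tag`, the tag labels, and the toy numbers are hypothetical and only illustrate the bookkeeping, not the paper's pipeline.

```python
from collections import defaultdict

def mean_gap_by_tag(nll_transformer, nll_hybrid, tags):
    """Mean paired gap (transformer NLL minus hybrid NLL) for each tag.

    Positive values mean the hybrid has lower loss on that tag.
    """
    if not (len(nll_transformer) == len(nll_hybrid) == len(tags)):
        raise ValueError("inputs must be aligned token-for-token")
    sums = defaultdict(float)
    counts = defaultdict(int)
    for loss_t, loss_h, tag in zip(nll_transformer, nll_hybrid, tags):
        sums[tag] += loss_t - loss_h
        counts[tag] += 1
    return {tag: sums[tag] / counts[tag] for tag in sums}

# Toy, made-up losses purely for illustration (not results from the paper).
toy_t = [2.0, 1.0, 3.0, 0.5]
toy_h = [1.5, 1.0, 2.0, 0.75]
toy_tags = ["NOUN", "DET", "NOUN", "DET"]
gaps = mean_gap_by_tag(toy_t, toy_h, toy_tags)
```

Pairing at the token level before averaging keeps the comparison within each stratum, so a tag that is simply harder for both models does not masquerade as an architecture effect.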

If this is right

Token-level decompositions sharpen pretraining diagnostics by revealing which data types favor each architecture.
Hybrid models gain most where long-range semantic state must be maintained.
Pure transformers remain competitive or superior on local syntactic and copying patterns.
Filtered evaluations on specific token categories can guide architecture selection.
The non-uniform pattern across delimiters and n-grams suggests targeted use of recurrence rather than uniform replacement of attention.
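The filtered evaluation mentioned in this list can be sketched as a masked mean. The Figure 7 caption describes the filter as removing copy positions and restricting to open-class targets; the helper name `filtered_mean_loss`, the boolean `is_copy` flags, and the open-class tag set (taken from the Universal Dependencies convention) are assumptions made for illustration.

```python
# Open-class tags per the Universal Dependencies convention (an assumption here;
# the paper's own tag inventory may differ).
OPEN_CLASS = frozenset({"ADJ", "ADV", "INTJ", "NOUN", "PROPN", "VERB"})

def filtered_mean_loss(nll, tags, is_copy, open_class=OPEN_CLASS):
    """Mean NLL over positions that are open-class targets and not copy events.

    Returns None when the filter removes every position.
    """
    if not (len(nll) == len(tags) == len(is_copy)):
        raise ValueError("inputs must be aligned token-for-token")
    kept = [
        loss
        for loss, tag, copied in zip(nll, tags, is_copy)
        if tag in open_class and not copied
    ]
    if not kept:
        return None
    return sum(kept) / len(kept)

# Toy, made-up values purely for illustration.
toy_nll = [2.0, 1.0, 4.0, 0.5]
toy_tags = ["NOUN", "DET", "VERB", "NOUN"]
toy_copy = [False, False, False, True]
filtered = filtered_mean_loss(toy_nll, toy_tags, toy_copy)
```

Computing this quantity for each model on the same positions would give a filtered counterpart to the aggregate loss curve.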

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future designs could route tokens dynamically to recurrent or attention layers based on predicted tag or delimiter type.
The same stratification method could be applied to larger-scale hybrids to test whether the semantic-versus-syntactic split scales.
This view raises whether mixed architectures could be trained with loss weighting that emphasizes the token types each component handles best.
Similar token-level breakdowns might clarify performance gaps in other hybrid variants that combine recurrence with attention.

Load-bearing premise

The transformer and hybrid models are matched closely enough in training data, size, and optimization that observed token-level loss differences can be attributed to the presence of recurrent layers rather than other uncontrolled factors.

What would settle it

Retraining both models from identical random seeds on the exact same data order and observing whether the token-type loss differences remain or disappear.

Figures

Figures reproduced from arXiv: 2606.20936 by William Merrill, Yanhong Li.

**Figure 1.** Synthetic probe examples. We vary the distance d between an antecedent/opener and a scored target token. Analysis III: Controlled synthetic probes. The natural-token analyses are observational: they reveal where the gap appears in real text, but a surface tag does not directly specify the decision the model must make at the scored position. We therefore add three controlled synthetic probe families (…)

**Figure 2.** Raw and adjusted tag effects in prose. Left panels show raw paired loss gaps, where positive values mean the hybrid has lower NLL on that token family. Right panels show regression-adjusted effects relative to the global mean from the regressions. Top panels use the full coarse POS taxonomy; bottom panels use the separately fitted three-way aggregate model (content/function/other).

**Figure 3.** Tag vocabulary size is associated with the prose tag effects. Each point is a coarse prose tag from …

**Figure 4.** Brackets and repeated n-grams. A: Open and close bracket raw gaps across domains; line segments show the open–close difference. B: Raw paired loss gaps for repeated-token events from repeated 1-grams through 16-grams. Positive raw gaps mean lower hybrid NLL. C: Implied repetition effects from an extension of Equation 1 with COPY1–COPY16. Negative adjusted effects mean that repetition reduces the hybrid adv…

**Figure 5.** Controlled probes at the final matched checkpoint. Pronoun memory and entity tracking are scored by contrastive accuracy and log-probability margin; structural closure is scored by NLL on the closing token. Higher is better for accuracy and margin; lower is better for NLL. Hybrid advantage nearly disappears on repeated n-grams. Figure 4B–C isolates repeated spans, an observable proxy for prefix-based copyi…

**Figure 6.** Discourse tracking as richer state tracking. Left: predicting the masked content phrase requires tracking that the book has ended up with Mary and using that discourse state to make reading-related objects likely. Right: closed-world A5 permutation updates apply ordered, deterministic state changes over a fixed set of slots, followed by a deterministic readout such as an assertion about one slot.

**Figure 7.** Filtered token losses surface architecture differences during 1B pretraining. Token-loss curves at WSD-annealed checkpoints for a Transformer, a Hybrid, and a Pure RNN. Transformer–Hybrid separation is roughly 0.12 nats, about twice as large, and the ordering becomes HYBRID < PURE RNN < TRANSFORMER. This supports the intended use of the filter: removing copy positions and restricting to open-class targets…

**Figure 8.** Full control-feature coefficient diagnostic. Left: raw paired loss gaps for repeated 1–4-gram events. Right: frequency-adjusted coefficients from the coarse-tag regression, including difficulty, position, previous-token distance, target-token frequency, and repeated-token controls. Raw gaps show absolute hybrid advantage on repeated-token subsets, whereas coefficients show adjusted shifts in the paired hyb…
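The Figure 5 caption says the pronoun-memory and entity-tracking probes are scored by contrastive accuracy and log-probability margin. One plausible reading of those two metrics is sketched below, assuming each probe item yields a log-probability for the correct continuation and one for a single distractor. The function name `contrastive_scores` and the toy values are hypothetical, and the paper's exact definitions may differ.

```python
def contrastive_scores(logp_correct, logp_distractor):
    """Return (accuracy, mean_margin) over paired probe items.

    accuracy: fraction of items where the correct option gets higher log-probability.
    mean_margin: average of (correct log-prob minus distractor log-prob).
    """
    if len(logp_correct) != len(logp_distractor) or not logp_correct:
        raise ValueError("need two non-empty, equally long lists")
    margins = [c - d for c, d in zip(logp_correct, logp_distractor)]
    accuracy = sum(1 for m in margins if m > 0) / len(margins)
    mean_margin = sum(margins) / len(margins)
    return accuracy, mean_margin

# Toy, made-up log-probabilities purely for illustration.
toy_correct = [-1.0, -2.0, -0.5, -3.0]
toy_distractor = [-2.0, -1.5, -1.5, -4.0]
acc, margin = contrastive_scores(toy_correct, toy_distractor)
```

Accuracy only records which option wins, while the margin also reflects how decisively, which is why reporting both can be informative.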

Original abstract

Hybrid language models that mix attention and recurrent layers have shown promise: theoretically, recurrent layers ameliorate the limitations of pure transformers on state tracking, and empirically, hybrids can outperform pure transformers in loss and downstream evaluations \citep{waleffe2024empirical,merrill2026olmohybrid}. Yet it remains unclear which data or capabilities drive these gains, and to what degree they reflect the theoretical advantages motivating hybrid models. We address this question using the open weights from Olmo 3 \citep{olmo2025olmo3} and Olmo Hybrid \citep{merrill2026olmohybrid}: we compare the loss of a matched transformer and hybrid at the same target tokens under the same prefixes, stratifying the results by natural token tags, copy features, delimiter structure, and controlled synthetic probes. The hybrid has lower loss on most tag families, but the gains are not uniform: they are largest for open-class content words and smaller for many closed-class function words. Across prose, code, and markup, the hybrid's loss advantage is larger on opening delimiters than on the corresponding closing delimiters, and nearly vanishes on repeated $n$-grams. Synthetic probes show the same split: the hybrid is favored on pronoun-memory and entity-tracking tasks, whereas the transformer is favored on bracket-matching tasks that require choosing closing delimiters. These patterns suggest that the recurrent layers in hybrids improve predictions that leverage the semantic state of a document, whereas attention helps on tokens predictable by $n$-gram copying or syntactic bracket matching. We conclude with proof-of-concept filtered evaluations showing how token-level decompositions can sharpen pretraining diagnostics for hybrid architectures.
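One way to make the repeated n-gram stratum concrete is to record, at every position, the longest n-gram ending at the target token that has already appeared earlier in the same sequence. The sketch below works under that assumed definition. The function name `longest_repeat`, the cap `max_n=16` (chosen to mirror the 1-gram through 16-gram range in Figure 4), and the exact matching rule are hypothetical and may not match the paper's COPY features.

```python
def longest_repeat(tokens, max_n=16):
    """For each position t, return the largest n <= max_n such that the n-gram
    ending at t also ends at some earlier position; 0 if the token is new."""
    # seen[n] holds every n-gram that ends strictly before the current position.
    seen = [set() for _ in range(max_n + 1)]
    result = []
    for t in range(len(tokens)):
        limit = min(max_n, t + 1)
        best = 0
        for n in range(1, limit + 1):
            gram = tuple(tokens[t - n + 1 : t + 1])
            if gram in seen[n]:
                best = n
            else:
                # If this n-gram is new, every longer n-gram ending here is new too.
                break
        result.append(best)
        # Register the n-grams ending at t so later positions can match them.
        for n in range(1, limit + 1):
            seen[n].add(tuple(tokens[t - n + 1 : t + 1]))
    return result

repeat_lengths = longest_repeat(["a", "b", "c", "a", "b", "c"])
```

A position with value n or greater would then count as a repeated n-gram event, so thresholding the output gives a family of nested copy indicators.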

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Token-level split shows hybrids better on semantic state and content words but transformers stronger on n-grams and brackets, with model matching still needing tighter proof.

The paper's main value is the stratified token-level comparison between the matched Olmo transformer and Olmo Hybrid. It breaks losses down by tag families, opening versus closing delimiters, repeated n-grams, and synthetic probes for pronoun memory, entity tracking, and bracket matching. The patterns line up with the stated motivation: recurrent layers help where semantic state matters, while attention handles local copying and syntactic closure. Using open weights and doing the same-prefix loss comparison is a straightforward way to get at this without aggregate metrics hiding the differences.

The setup is reproducible enough to be useful. The synthetic probes give some control that natural data alone lacks, and the finding that gains nearly vanish on repeated n-grams is a clear, falsifiable observation.
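To illustrate what a controlled structural-closure probe might look like, the sketch below builds a prefix with an opener followed by `distance` filler tokens and derives the closing token the model would be scored on. The probe format, the filler token, and the names `expected_closer` and `make_closure_probe` are assumptions for illustration; the paper's actual probe templates are not reproduced here.

```python
PAIRS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(PAIRS.values())

def expected_closer(prefix):
    """Return the closer for the innermost unclosed opener, or None if balanced."""
    stack = []
    for tok in prefix:
        if tok in PAIRS:
            stack.append(tok)
        elif tok in CLOSERS:
            if not stack or PAIRS[stack[-1]] != tok:
                raise ValueError("prefix is not well nested")
            stack.pop()
    return PAIRS[stack[-1]] if stack else None

def make_closure_probe(opener, distance, filler="x"):
    """Build (prefix_tokens, target_token) with `distance` fillers after the opener."""
    if opener not in PAIRS:
        raise ValueError("unknown opener")
    if distance < 0:
        raise ValueError("distance must be non-negative")
    prefix = [opener] + [filler] * distance
    return prefix, expected_closer(prefix)

probe_prefix, probe_target = make_closure_probe("[", 3)
```

Varying `distance` while holding the template fixed isolates how far back the model must look to choose the right closer, which natural text cannot guarantee.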

The soft spot is the model matching. The abstract calls the models matched, but skepticism is reasonable until the methods document identical parameter counts, data mixtures, optimizer steps, and initialization. Any uncontrolled difference could produce the observed tag and delimiter patterns without the recurrent-versus-attention distinction doing the work. If the full paper has those controls documented, the attribution holds; otherwise it stays suggestive.

This is for people already working on hybrid architectures or fine-grained pretraining diagnostics. A reader who wants to know where state-tracking advantages actually appear will get something concrete from it. It deserves peer review because the decomposition method is worth referee time even if the causal claims need more support on the matching side.

Referee Report

2 major / 1 minor

Summary. The manuscript compares the token-level cross-entropy loss of a pure transformer (Olmo 3) and a hybrid model (Olmo Hybrid) that interleaves recurrent and attention layers. Using open weights, it stratifies loss differences by token tags, copy features, delimiter structure, and synthetic probes, concluding that recurrent layers aid predictions based on semantic document state while attention aids n-gram copying and syntactic bracket matching.

Significance. If the models are sufficiently matched, this work offers a detailed empirical map of architectural strengths at the token level, directly testing theoretical claims about state tracking in hybrids versus transformers. It could inform future hybrid designs and pretraining diagnostics.

major comments (2)

Abstract: The central attribution of loss differences to recurrent layers versus attention requires that the Olmo transformer and Olmo Hybrid differ only in layer type. The abstract states the models are 'matched,' but without explicit confirmation of identical parameter counts, training data, optimizer schedules, and initialization, alternative explanations for the observed patterns in tag families, delimiters, and probes cannot be ruled out.
Abstract: The manuscript reports loss differences across strata but does not mention error bars, variance estimates, or statistical tests. This makes it difficult to determine whether the reported advantages (e.g., larger gains on opening delimiters or pronoun-memory tasks) are statistically reliable.

minor comments (1)

Abstract: The citation for Olmo Hybrid is given as merrill2026olmohybrid; ensure the year and arXiv details are accurate in the reference list.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for these constructive comments on model matching and statistical reporting. We address both points below and will revise the manuscript accordingly to strengthen the claims.

Referee: Abstract: The central attribution of loss differences to recurrent layers versus attention requires that the Olmo transformer and Olmo Hybrid differ only in layer type. The abstract states the models are 'matched,' but without explicit confirmation of identical parameter counts, training data, optimizer schedules, and initialization, alternative explanations for the observed patterns in tag families, delimiters, and probes cannot be ruled out.

Authors: The Olmo 3 transformer and Olmo Hybrid models were released as matched configurations (same parameter count, training corpus, optimizer, and schedule) except for the substitution of recurrent layers for some attention layers, as documented in their source papers. We will revise the abstract and add an explicit methods paragraph confirming these details from the original releases to rule out alternative explanations and make the attribution to layer type fully transparent. revision: yes
Referee: Abstract: The manuscript reports loss differences across strata but does not mention error bars, variance estimates, or statistical tests. This makes it difficult to determine whether the reported advantages (e.g., larger gains on opening delimiters or pronoun-memory tasks) are statistically reliable.

Authors: We agree that variance estimates and statistical tests would improve interpretability. In revision we will report standard errors computed over document segments or multiple evaluation shards for the stratified losses, and add brief statistical comparisons (e.g., paired t-tests or bootstrap intervals) for the key reported advantages such as opening vs. closing delimiters and the synthetic probes. revision: yes
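A percentile bootstrap over paired gaps is one simple way to attach an interval to a stratum's mean advantage. The sketch below resamples individual tokens, which is a simplification: the response above mentions document segments or shards, and resampling at that coarser level would better respect dependence between tokens in the same document. The function name `bootstrap_ci` and its defaults are hypothetical.

```python
import random

def bootstrap_ci(gaps, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of paired per-token gaps.

    Returns (mean, lower, upper) for a (1 - alpha) interval.
    """
    if not gaps:
        raise ValueError("gaps must be non-empty")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(gaps)
    means = []
    for _ in range(n_resamples):
        sample = rng.choices(gaps, k=n)  # sample n gaps with replacement
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(gaps) / n, lower, upper

# Toy, made-up gaps purely for illustration.
toy_gaps = [0.1, 0.2, 0.3, 0.4, 0.5] * 20
mean_gap, ci_low, ci_high = bootstrap_ci(toy_gaps)
```

An interval that excludes zero would indicate that the sign of the advantage is stable under resampling, though it says nothing about whether the two models are matched.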

Circularity Check

0 steps flagged

No circularity: purely empirical model comparison

The paper conducts a direct empirical comparison of token-level losses between two open-weight models (Olmo transformer and Olmo Hybrid) stratified by tags, copy features, delimiters, and synthetic probes. No equations, derivations, or predictions are present that reduce by construction to fitted parameters, self-definitions, or self-citation chains. The central claims rest on observable loss differences from existing models rather than any internal fitting or uniqueness theorem imported from the authors' prior work. Self-citations to model sources are factual references to released weights, not load-bearing justifications for the results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical comparison study with no mathematical derivations or new postulated entities; all claims rest on model outputs and standard evaluation practices.

pith-pipeline@v0.9.1-grok · 5823 in / 947 out tokens · 19434 ms · 2026-06-26T16:59:53.321340+00:00 · methodology

Reference graph

Works this paper leans on

12 extracted references · 2 canonical work pages · 2 internal anchors

[1] Simran Arora, Sabri Eyuboglu, Aman Timalsina, Isys Johnson, Michael Poli, James Zou, Atri Rudra, and Christopher Ré. Zoology: Measuring and improving recall in efficient language models. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024.

[2] David Chiang. Transformers in uniform TC^0. Trans. Mach. Learn. Res., 2025, 2025.

[3] Riccardo Grazzi, Julien Siems, Arber Zela, Jörg K. H. Franke, Frank Hutter, and Massimiliano Pontil. Unlocking state-tracking in linear RNNs through negative eigenvalues. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025.

[4] Samy Jelassi, David Brandfonbrener, Sham M. Kakade, and Eran Malach. Repeat after me: Transformers are better than state space models at copying. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Forty-first International Conference on Machine Learning, ICML 2024, Vi...

[5] URL https://aclanthology.org/2022.tacl-1.66/. doi: 10.1162/TACL_A_00562. URL https://doi.org/10.1162/tacl_a_00562. William Merrill, Jackson Petty, and Ashish Sabharwal. The illusion of state in state-space models. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Forty-first International Conference on Machine Le...

[6] URL https://arxiv.org/abs/2604.03444. Team Olmo: Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, Jacob Morrison, Jake Poznanski, Kyle Lo, Luca Soldaini, Matt Jordan, Mayee Chen, Michael Noukhovitch, Nathan Lambert, Pete Walsh, Pradeep Dasigi, Robert Berry, Sa...

[7] URL https://arxiv.org/abs/2512.13961. Bo Peng, Ruichong Zhang, Daniel Goldstein, Eric Alcaide, Xingjian Du, Haowen Hou, Jiaju Lin, Jiaxing Liu, Janna Lu, William Merrill, Guangyu Song, Kaifeng Tan, Saiteja Utpala, Nathan Wilce, Johan S. Wind, Tianyi Wu, Daniel Wuttke, and Christian Zhou-Zheng. RWKV-7 "goose" with expressive dynamic state evolution. CoRR, ab...

[9] URL https://openreview.net/forum?id=RDbuSCWhad. Roger Waleffe, Wonmin Byeon, Duncan Riach, Brandon Norick, Vijay Korthikanti, Tri Dao, Albert Gu, Ali Hatamizadeh, Sudhakar Singh, Deepak Narayanan, Garvit Kulshreshtha, Vartika Singh, Jared Casper, Jan Kautz, Mohammad Shoeybi, and Bryan Catanzaro. An empirical study of Mamba-based language models. CoRR, abs...

[10] doi: 10.48550/ARXIV.2406.07887. URL https://doi.org/10.48550/arXiv.2406.07887. Gail Weiss, Yoav Goldberg, and Eran Yahav. Thinking like transformers, ...

[11] URL https://arxiv.org/abs/2106.06981. Songlin Yang, Yikang Shen, Kaiyue Wen, Shawn Tan, Mayank Mishra, Liliang Ren, Rameswar Panda, and Yoon Kim. PaTH attention: Position encoding via accumulating Householder transformations. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, ...

[12] URL https://arxiv.org/abs/2105.11115. A Empirical methodology details. A.1 Domains, packing, and pairing protocol. We evaluate both models on the same evaluation sequences and compute NLL at every position. Text is packed into contiguous sequences of length T = 8192. All comparisons are paired at the level of a single next-token decision: same checkpoint, s...

[13] The raw gaps ask whether the hybrid has lower NLL on a subset in absolute terms; the coefficients ask how each feature shifts the paired hybrid–transformer gap after controlling for domain, tag, word position, sequence position, difficulty, previous-token distance, and target-token frequency. The sign convention is the same as in the main text. Positive r...
