pith. machine review for the scientific record.

arxiv: 2605.08301 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Priming: Hybrid State Space Models From Pre-trained Transformers

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:08 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords hybrid state space models · priming · knowledge transfer · long-context reasoning · state space models · transformers · Mamba · model efficiency

The pith

Priming initializes hybrid attention-SSM models from pre-trained transformers and recovers performance with under 0.5% of the original training tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Priming as a method that initializes hybrid models combining attention and state-space model layers directly from existing pre-trained transformers. Short alignment and post-training phases then restore downstream task quality without needing to restart pre-training from random weights. A reader would care because this approach makes large-scale exploration of hybrid architectures practical, yielding models with smaller key-value caches, faster decoding, and native support for long contexts while remaining agnostic to the source transformer's family or scale. It also permits the first apples-to-apples comparison of different SSM layer designs under identical training conditions.
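
To make the mechanics concrete, here is a minimal sketch of what the initialization step could look like. The module layout (`.layers`, `.mixer`), the factory interface, and the reuse of projection weights are illustrative assumptions, not the authors' code; the choice of which layers to replace follows the selective, importance-based pattern described in Figures 4 and 5 below.

    import copy
    from torch import nn

    def prime_hybrid(transformer: nn.Module, make_ssm, replace_idx: set) -> nn.Module:
        # Hypothetical Stage 0: keep every pre-trained weight, then swap a
        # chosen subset of attention blocks for freshly initialized SSM
        # layers (GKA, GDN, or Mamba-2). Everything else transfers verbatim.
        hybrid = copy.deepcopy(transformer)
        for i, block in enumerate(hybrid.layers):      # assumed block layout
            if i in replace_idx:
                block.mixer = make_ssm(block.mixer)    # may reuse projections
        return hybrid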

Core claim

Priming converts hybrid architecture design from a full pre-training problem into a knowledge transfer task. It initializes a hybrid attention-plus-SSM model from a pre-trained transformer, then applies short alignment and post-training phases that recover downstream quality using less than 0.5% of the source model's original token budget. The procedure works across dense and mixture-of-experts transformers of varying families and sizes, and it enables controlled scaling experiments that reveal a consistent expressiveness ordering among SSM variants: Gated KalmaNet outperforms Gated DeltaNet, which in turn outperforms Mamba-2, with the ordering directly forecasting performance on long-context reasoning tasks.

What carries the argument

Priming, the procedure of copying transformer weights into a hybrid attention-SSM architecture followed by alignment and post-training phases that transfer knowledge without full retraining.
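
Figures 6 and 14 describe the alignment phase as a fused teacher-student setup: each student SSM layer is regressed onto its frozen teacher attention counterpart, either layerwise (the paper's Equation (5)) or end to end (Equation (4)). A hedged sketch of the layerwise variant, with the interface and tensor shapes assumed:

    import torch
    import torch.nn.functional as F

    def layerwise_alignment_loss(teacher_outs, student_outs):
        # One MSE term per replaced layer: the student SSM output is pushed
        # toward the frozen teacher attention output on the same input
        # (cf. the parallel pathways of Figure 14).
        losses = [F.mse_loss(s, t.detach()) for s, t in zip(student_outs, teacher_outs)]
        return torch.stack(losses).mean()

The end-to-end variant would instead compare final hidden states only; Figure 6 reports it as a further improvement over layerwise supervision.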

If this is right

  • Hybrid models produced by Priming deliver up to 2.3x higher decode throughput than the source transformer while remaining within 1% of its quality; the cache arithmetic sketched after this list shows where part of that headroom comes from.
  • The expressiveness ranking GKA > GDN > Mamba-2 directly predicts which hybrid variant performs best on long-context reasoning tasks.
  • At 32B scale the primed GKA hybrid improves average reasoning scores by 3.8 points over its source Qwen3-32B model.
  • The released model zoo and training code allow other researchers to repeat or extend the same controlled SSM comparisons at scale.
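
The memory half of the throughput claim is back-of-envelope arithmetic. A sketch with illustrative dimensions (not the paper's reported configurations):

    def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, seq_len, dtype_bytes=2):
        # K and V tensors for each retained attention layer, per sequence.
        return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

    # A 50% hybrid keeps half the attention layers, so at 128K context its
    # cache is roughly halved; the replaced layers carry only fixed-size SSM
    # states that do not grow with sequence length.
    full   = kv_cache_bytes(64, 8, 128, 128 * 1024)   # hypothetical dense model
    hybrid = kv_cache_bytes(32, 8, 128, 128 * 1024)   # same model, 50% hybrid
    print(full / 2**30, hybrid / 2**30)               # -> 32.0 vs 16.0 GiB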

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Transformer representations appear general enough that recurrent SSM components can be substituted in with only modest additional training.
  • The same priming approach might be tested for initializing hybrids that combine attention with other recurrent or memory-efficient blocks beyond the three SSMs compared here.
  • Widespread adoption of primed hybrids could shift inference cost curves for long-context applications such as multi-step reasoning and retrieval-augmented generation.

Load-bearing premise

That a hybrid model started from transformer weights can be aligned and post-trained to recover quality without catastrophic forgetting or loss of pre-trained capabilities.

What would settle it

If a primed hybrid model trained on the same post-training data as its source transformer fails to reach within 1% of the transformer's downstream performance on the reported long-context reasoning benchmarks, the recovery claim would be falsified.
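
Stated mechanically, with placeholder benchmark names and scores rather than reported values:

    def recovery_holds(hybrid_scores, transformer_scores, tol=0.01):
        # True iff the primed hybrid stays within `tol` (relative) of the
        # source transformer on every reported long-context benchmark.
        return all(hybrid_scores[b] >= (1 - tol) * transformer_scores[b]
                   for b in transformer_scores)

    # e.g. recovery_holds({"ruler_128k": 71.2}, {"ruler_128k": 71.8}) -> True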

Figures

Figures reproduced from arXiv: 2605.08301 by Aditya Chattopadhyay, Benjamin Bowman, David Thomas, Elvis Nunez, Evan Becker, Luca Zancato, Prannay Kaul, Stefano Soatto, Wei Xia.

Figure 1
Figure 1: Primed Hybrid models match Transformer quality at half the memory and decode faster. (Left) Reasoning accuracy of Primed Hybrid reasoning models vs. the source Qwen3 Transformer across five benchmarks at 8B and 32B scale. At 8B, the GKA Hybrid uniformly outperforms its GDN counterpart, consistent with the expressiveness hierarchy GKA > GDN (Section 4). (Top right) Decode throughput speedup of GKA-Primed-HQ… view at source ↗
Figure 2
Figure 2 view at source ↗
Figure 3
Figure 3: Long-context evaluation at 32B scale. Performance of our Primed Hybrid IT models (GKA and GDN) against the Qwen3-32B [Long] Transformer baseline fine-tuned with the same Stage 2 IT recipe. Results are reported as a weighted average across context lengths from 8K to 128K, with geometrically increasing weights that double at each successive context length. The bottom-right panel (Aggregate) aggregates perfor… view at source ↗ (the weighting is worked out after the figure list)
Figure 4
Figure 4: Per-layer importance scores for Qwen3-8B and Qwen3-32B. Each bar is the relative drop in mean HELMET performance (across five sub-tasks: Synthetic Recall, RAG, Many-shot ICL, Generation with Citations, and Passage Re-ranking) when the corresponding Attention layer is individually replaced with SWA of window size w = 2048. Importance is concentrated in a small set of middle-to-late layers in both models; th… view at source ↗
Figure 5
Figure 5: Effect of layer pattern strategy on long-context performance for Hybrid models with a 50% Hybrid ratio sourced from Qwen3-8B. We compare our Selective Pattern against the Uniform Pattern baseline. Each line averages across three Hybrid models, one per SSM layer type (Mamba2, GDN, GKA); scores are reported as a fraction of the Transformer baseline’s score at the same context length. All models, including th… view at source ↗
Figure 6
Figure 6: Effect of Stage 1 supervision granularity on long-context performance for Hybrid models with a 50% Hybrid ratio sourced from Qwen3-8B. We compare end-to-end MSE (Equation (4)) against layerwise MSE (Equation (5)) for three SSM layer types (GKA, GDN, Mamba2). For each SSM type, the solid bar shows performance under layerwise supervision and the hatched overlay shows the improvement from end-to-end supervis… view at source ↗
Figure 7
Figure 7: Effect of State Expansion via AGQA on long-context performance for Hybrid models with a 50% Hybrid ratio sourced from Qwen3-32B. We compare Adaptive GQA (AGQA) against the standard GQA baseline for two SSM layer types (GKA, GDN). For each SSM type, the solid bar shows performance under standard GQA and the hatched overlay shows the improvement from AGQA. Qwen3-32B [Long] is the Transformer baseline. Sc… view at source ↗
Figure 8
Figure 8: Reasoning output token length distributions. Output token length distributions for GKA-Primed-HQwen3-Reasoner (r=30) and Qwen3-32B [Reasoner-SFT] on AIME 2025 (n=3600 rollouts per model). Left: full benchmark (30 problems). Right: hard subset (15 hardest problems). Both models were trained with the same reasoning SFT recipe and produce similar-length thinking traces, confirming that the Hybrid’s speedups a… view at source ↗
Figure 9
Figure 9: State composition for hybrid models. Long inputs are partitioned into chunks at the native context length, each processed independently. KV caches are concatenated while SSM states are merged (via averaging or an alternative method as in Section E), yielding training-free context extension. Let L denote the model’s native context length and suppose we wish to process an input of length 2L… view at source ↗ (a sketch of this merge follows the figure list)
Figure 10
Figure 10: State composition extends effective context well beyond the 128K native training window. Performance of Primed Hybrid IT models (GKA-8B, GDN-8B, Mamba2-8B, GKA-32B) on RULER NIAH and BABILong at 128K (native), 256K (2×), and 512K (4×) contexts. Solid bars show standard prefill (running the model on the full input in a single pass); hatched bars show the additional gain from state composition with 128K chu… view at source ↗
Figure 11
Figure 11: Input selectivity through βt improves long context capabilities of GKA-Primed models. Long-context performance, averaged over the tasks described in Section 5.1.1 for 8B and 32B GKA-Primed-HQwen3-IT models with and without input selectivity (βt), across sequence lengths from 8k to 128k. Adding βt generally improves performance, with the gains becoming more pronounced at longer context lengths. view at source ↗
Figure 12
Figure 12: GKA: Trading compute for speed at inference time. (a): GKA-Primed-HQwen3-IT-8B model. (b): GKA-Primed-HQwen3-IT-32B model. Each curve represents a different context length, and marker shapes denote the number of Chebyshev iterations r. Reducing r from its training value (30) increases throughput with only a modest drop in long-context performance. In both cases, a single trained model supports variable te… view at source ↗
Figure 13
Figure 13: Speedup of symmetric tiled kernels over the non-tiled baseline on H200 and A100 GPUs. The x-axis shows batch size normalized by the number of SMs on each GPU, so that the periodic pattern, caused by successive waves of Triton program instances filling and then exceeding SM capacity, aligns across GPU models. The tiled_small_batch variant provides a consistent speedup at all batch sizes by exploiting Ht sy… view at source ↗
Figure 14
Figure 14: Fused teacher-student architecture used during Stage 1 of Priming. For every SSM layer in the student Hybrid, the fused architecture maintains two parallel layers, corresponding to two parallel pathways for processing the input: the teacher’s Attention layer (dashed, frozen) and the student’s SSM layer (solid, learnable). These layers are highlighted in yellow in the figure. All other layers (highlighted … view at source ↗
Figure 15
Figure 15: SP sharding patterns for NSP = 4. (a) Simple SP: the sequence is split into NSP contiguous chunks, one per GPU. (b) Zig-zag SP: the sequence is split into 2NSP chunks and each GPU receives two discontiguous chunks (e.g., GPU 1 gets chunks 1 and 8), balancing the causal sequence mixing layers workload across ranks. view at source ↗ (the pairing rule is sketched after the figure list)
Figure 16
Figure 16: Stage 1 alignment recipe. Data composition by domain (left) and training configuration (right). Stage 1 models are trained on 40B tokens consisting of a mix of instruction-following and web text. view at source ↗
Figure 17
Figure 17: Instruction model training data distribution. Domain composition by token count for long-context continued pre-training (left) and instruction-tuning SFT (right). “Safety” refers to samples that train appropriate refusal behavior and robustness to adversarial prompts. view at source ↗
Figure 18
Figure 18: Our multi-stage long-context reasoning pipeline. Stages 0+1 of the Priming procedure allow our Hybrid architectures to sidestep pre-training costs by leveraging a pre-trained Transformer model. Stage 2 for reasoning consists of three phases. (i) Short-context SFT consists of SFT on packed reasoning examples at 32K context. Next, (ii) a context-extension phase where we train on a mix of reasoning samples a… view at source ↗
Figure 19
Figure 19: We depict the document and token ratios of our reasoning SFT data mixture by domain. view at source ↗
Figure 20
Figure 20: Effect of the primed gate initialization on long-context performance. Per-context-length scores of Mamba2-Primed-HQwen3-IT at 8B scale, with (green) and without (red) the gate initialization from Equation (26). The “w/o primed gate” variant follows Wang et al. [2024] and initializes WG randomly. Both variants use the same Stage 0/1/2 recipe and differ only in WG initialization. Scores are the average over… view at source ↗
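
Three of the mechanisms in the captions above are compact enough to sketch. First, the aggregate metric of Figure 3: assuming the evaluated lengths double from 8K to 128K (five lengths, consistent with the caption), the doubling weights give

    \bar{s} \;=\; \frac{\sum_{k=0}^{4} 2^{k} s_{k}}{\sum_{k=0}^{4} 2^{k}}
            \;=\; \frac{1}{31}\sum_{k=0}^{4} 2^{k} s_{k},
    \qquad s_{k} = \text{score at context length } 8\mathrm{K}\cdot 2^{k},

so the 128K score alone carries 16/31, roughly half of the aggregate. Second, the state composition of Figure 9; the tensor layout and the averaging merge are assumptions read off the caption, not the released implementation:

    import torch

    def compose_states(chunk_states):
        # Each element is (kv_cache, ssm_state) from one native-length chunk
        # processed independently. The eidetic part (KV) concatenates along
        # the sequence axis; the fading-memory part (SSM state) stays fixed
        # size and is merged by averaging (Section E gives an alternative).
        kv_caches, ssm_states = zip(*chunk_states)
        kv = torch.cat(kv_caches, dim=1)   # assumed [batch, seq, heads, dim]
        ssm = torch.stack(ssm_states).mean(dim=0)
        return kv, ssm

Third, the zig-zag sharding of Figure 15. The pairing below reproduces the caption's example (GPU 1 holding chunks 1 and 8 at NSP = 4), though the exact indexing in the released code is an assumption:

    def zigzag_shard(tokens, n_sp):
        # Split into 2*n_sp contiguous chunks; 0-indexed rank r takes chunks
        # r and 2*n_sp - 1 - r, pairing an early chunk with a late one so the
        # causal sequence-mixing workload is balanced across ranks.
        size = len(tokens) // (2 * n_sp)
        chunks = [tokens[i * size:(i + 1) * size] for i in range(2 * n_sp)]
        return [chunks[r] + chunks[2 * n_sp - 1 - r] for r in range(n_sp)]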
read the original abstract

Hybrid State-Space models combine Attention with recurrent State-Space Model (SSM) layers, balancing eidetic memory from Attention with compressed fading memory from SSMs. This yields smaller Key-Value caches and faster decoding than Transformers, along with a richer architectural design space. Exploring that design space at scale has so far required training from scratch, a barrier that has kept most large-model Hybrid research within a narrow range of architectures. We introduce Priming, a method that turns Hybrid architecture design from a pre-training problem into a knowledge transfer one. Priming initializes a Hybrid model from a pre-trained Transformer and, through short alignment and post-training phases, recovers downstream quality using less than 0.5% of the source model's pre-training token budget. Priming is agnostic to the source Transformer family (e.g., Qwen, Llama, Mistral), model class (dense or Mixture-of-Experts), and model scale. Priming enables us to run the first controlled comparison of SSM layer types at scale under identical conditions. We evaluate Gated KalmaNet (GKA), Gated DeltaNet (GDN), and Mamba-2, and show that their expressiveness hierarchy, GKA > GDN > Mamba-2, directly predicts downstream performance on long-context reasoning tasks. We scale Priming to 8B/32B reasoning models with native 128K contexts. Our Hybrid GKA 32B improves over its source Qwen3-32B by +3.8 average reasoning points, while staying within 1% of a Transformer post-trained on the same data and enabling up to 2.3x higher decode throughput. To foster research on Hybrid architectures, we release a model zoo of primed Hybrid models for long-context reasoning and instruction following, together with the Priming training and inference code (Sequence Parallelism algorithms for long-context training, optimized GKA kernels, and vLLM serving plugin), all under Apache 2.0 License.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Priming, a knowledge-transfer method that initializes hybrid models (combining attention with SSM layers such as Gated KalmaNet, Gated DeltaNet, or Mamba-2) from pre-trained Transformers. Short alignment and post-training phases (<0.5% of the source pre-training token budget) recover downstream quality. This enables the first controlled, scale-matched comparison of SSM layer types, revealing an expressiveness hierarchy GKA > GDN > Mamba-2 that predicts long-context reasoning performance. The method is shown to be agnostic to source family, scale, and density (dense/MoE); scaled to 8B and 32B reasoning models with native 128K contexts, the 32B GKA hybrid improves +3.8 average reasoning points over its source Qwen3-32B while remaining within 1% of a same-data Transformer baseline and delivering up to 2.3x decode throughput. A model zoo, training/inference code (including Sequence Parallelism and optimized kernels), and vLLM plugin are released under Apache 2.0.

Significance. If the empirical claims hold, Priming materially lowers the barrier to large-scale hybrid architecture exploration by converting it from a full pre-training problem into a transfer problem. The controlled SSM-layer comparison and released artifacts (models, code, kernels) provide immediate value for the community studying efficient long-context models. The reported hierarchy offers a falsifiable prediction that can be tested by others using the released zoo.

major comments (2)
  1. The abstract states concrete gains (+3.8 reasoning points, 2.3x throughput, within 1% of Transformer baseline) but supplies no information on data splits, statistical significance, number of runs, or post-hoc selection criteria. The full manuscript must include these controls (e.g., in the experimental section) for the central claim of successful transfer without catastrophic forgetting to be verifiable.
  2. The hierarchy claim (GKA > GDN > Mamba-2 directly predicts downstream performance) is load-bearing for the paper's contribution on controlled comparison. The manuscript should report the exact metrics, context lengths, and task suite used to establish this ordering, together with any ablation that isolates layer type from other architectural differences.
minor comments (2)
  1. Clarify the precise definition of 'alignment phase' versus 'post-training phase' and the token budgets allocated to each, ideally with a table or equation.
  2. The claim of agnosticism to source family/scale is strong; a brief table summarizing results across at least two additional source models (beyond Qwen) would strengthen it.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. We are pleased that the significance of Priming for lowering the barrier to hybrid architecture exploration is recognized. We address each major comment below and will update the manuscript to incorporate the requested details.

read point-by-point responses
  1. Referee: The abstract states concrete gains (+3.8 reasoning points, 2.3x throughput, within 1% of Transformer baseline) but supplies no information on data splits, statistical significance, number of runs, or post-hoc selection criteria. The full manuscript must include these controls (e.g., in the experimental section) for the central claim of successful transfer without catastrophic forgetting to be verifiable.

    Authors: We acknowledge that the abstract is necessarily concise and does not include these details. The full manuscript's experimental section and appendices already describe the post-training data (a curated mix of long-context and reasoning corpora totaling less than 0.5% of the source pre-training budget), the fixed evaluation suite, and results from multiple independent runs. To directly address verifiability, we will add an explicit subsection on 'Reproducibility and Statistical Controls' that states the data splits, confirms three independent runs for the 32B-scale results with standard deviations, notes the absence of post-hoc selection, and reiterates that the same data was used for the matched Transformer baseline. This will make the transfer claims fully verifiable without altering any reported numbers. revision: yes

  2. Referee: The hierarchy claim (GKA > GDN > Mamba-2 directly predicts downstream performance) is load-bearing for the paper's contribution on controlled comparison. The manuscript should report the exact metrics, context lengths, and task suite used to establish this ordering, together with any ablation that isolates layer type from other architectural differences.

    Authors: We agree that the hierarchy is central and must be documented with precision. The controlled comparison already fixes the source Transformer, attention layers, training recipe, and hyperparameters across GKA, GDN, and Mamba-2 variants, isolating the SSM layer. The ordering is established on long-context reasoning performance at 128K context using average accuracy across the Needle-in-Haystack, multi-hop QA, and long-document reasoning tasks that match the 32B evaluation suite. We will expand Section 5 to explicitly enumerate the metrics, context lengths, and task suite, and add a dedicated ablation table that varies only the SSM layer while holding all other factors constant. This will confirm the hierarchy's predictive power for downstream performance. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper describes an empirical knowledge-transfer procedure (Priming) that initializes hybrid SSM-Attention models from existing pre-trained Transformers, followed by short alignment and post-training phases. No equations, derivations, or first-principles results are presented that reduce to their own inputs by construction. The central claims rest on reported downstream performance numbers, model releases, and controlled empirical comparisons of SSM layer types; these are falsifiable experimental outcomes rather than self-referential definitions or fitted parameters renamed as predictions. No load-bearing self-citations, uniqueness theorems, or smuggled ansatzes appear in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The method relies on standard transfer-learning assumptions (that Transformer weights provide a useful initialization for hybrid layers and that short alignment suffices) but introduces no new free parameters, axioms, or invented entities beyond the named SSM variants.

pith-pipeline@v0.9.0 · 5692 in / 1233 out tokens · 51259 ms · 2026-05-12T01:08:34.813553+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

103 extracted references · 103 canonical work pages · 14 internal anchors

  1. [1]

    GQA : Training generalized multi-query transformer models from multi-head checkpoints

    Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, and Sumit Sanghai. GQA : Training generalized multi-query transformer models from multi-head checkpoints. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4895--4901, Singa...

  2. [2]

    Training-free long-context scaling of large language models

    Chenxin An, Fei Huang, Jun Zhang, Shansan Gong, Xipeng Qiu, Chang Zhou, and Lingpeng Kong. Training-free long-context scaling of large language models. In International Conference on Machine Learning, pages 1493--1510. PMLR, 2024

  3. [3]

    Zoology: Measuring and improving recall in efficient language models

    Simran Arora, Sabri Eyuboglu, Aman Timalsina, Isys Johnson, Michael Poli, James Zou, Atri Rudra, and Christopher Re. Zoology: Measuring and improving recall in efficient language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=LY3ukUANko

  4. [4]

    Hybrid Architectures for Language Models: Systematic Analysis and Design Insights

    Sangmin Bae, Bilge Acun, Haroun Habeeb, Seungyeon Kim, Chien-Yu Lin, Liang Luo, Junjie Wang, and Carole-Jean Wu. Hybrid architectures for language models: Systematic analysis and design insights. arXiv preprint arXiv:2510.04800, 2025

  5. [5]

Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods

    Richard Barrett, Michael Berry, Tony F. Chan, James Demmel, June Donato, Jack Dongarra, Victor Eijkhout, Roldan Pozo, Charles Romine, and Henk van der Vorst. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods. Society for Industrial and Applied Mathematics, 1994. doi:10.1137/1.9781611971538. URL https://epubs.siam.org/doi/...

  6. [6]

Nvidia Nemotron Nano 2: An accurate and efficient hybrid Mamba-Transformer reasoning model

    Aarti Basant, Abhijit Khairnar, Abhijit Paithankar, Abhinav Khattar, Adithya Renduchintala, Aditya Malte, Akhiad Bercovich, Akshay Hazare, Alejandra Rico, Aleksander Ficek, et al. Nvidia nemotron nano 2: An accurate and efficient hybrid mamba-transformer reasoning model. arXiv preprint arXiv:2508.14444, 2025

  7. [7]

xLSTM: Extended Long Short-Term Memory

    Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. xLSTM: Extended long short-term memory. Advances in Neural Information Processing Systems, 37:107547--107603, 2024

  8. [8]

    Transformers to ssms: Distilling quadratic knowledge to subquadratic models

Aviv Bick, Kevin Li, Eric Xing, J Zico Kolter, and Albert Gu. Transformers to ssms: Distilling quadratic knowledge to subquadratic models. Advances in neural information processing systems, 37:31788--31812, 2024

  9. [9]

Nvidia Nemotron 3: Efficient and open intelligence

    Aaron Blakeman, Aaron Grattafiori, Aarti Basant, Abhibha Gupta, Abhinav Khattar, Adi Renduchintala, Aditya Vavre, Akanksha Shukla, Akhiad Bercovich, Aleksander Ficek, et al. Nvidia nemotron 3: Efficient and open intelligence. arXiv preprint arXiv:2512.20856, 2025a

  10. [10]

Nemotron 3 Nano: Open, efficient mixture-of-experts hybrid Mamba-Transformer model for agentic reasoning

    Aaron Blakeman, Aaron Grattafiori, Aarti Basant, Abhibha Gupta, Abhinav Khattar, Adi Renduchintala, Aditya Vavre, Akanksha Shukla, Akhiad Bercovich, Aleksander Ficek, et al. Nemotron 3 nano: Open, efficient mixture-of-experts hybrid mamba-transformer model for agentic reasoning. arXiv preprint arXiv:2512.20848, 2025b

  11. [11]

Qwen3-Coder-Next technical report

    Ruisheng Cao, Mouxiang Chen, Jiawei Chen, Zeyu Cui, Yunlong Feng, Binyuan Hui, Yuheng Jing, Kaixin Li, Mingze Li, Junyang Lin, et al. Qwen3-coder-next technical report. arXiv preprint arXiv:2603.00729, 2026

  12. [12]

    MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, et al. Minimax-m1: Scaling test-time compute efficiently with lightning attention. arXiv preprint arXiv:2506.13585, 2025a

  13. [13]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

  14. [14]

ScienceAgentBench: Toward rigorous assessment of language agents for data-driven scientific discovery

    Ziru Chen, Shijie Chen, Yuting Ning, Qianheng Zhang, Boshi Wang, Botao Yu, Yifei Li, Zeyi Liao, Chen Wei, Zitong Lu, Vishal Dey, Mingyi Xue, Frazier N. Baker, Benjamin Burns, Daniel Adu-Ampratwum, Xuhui Huang, Xia Ning, Song Gao, Yu Su, and Huan Sun. Scienceagentbench: Toward rigorous assessment of language agents for data-driven scientific discovery. In ...

  15. [15]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  16. [16]

CURIE: Evaluating LLMs on multitask scientific long-context understanding and reasoning

    Hao Cui, Zahra Shamsi, Gowoon Cheon, Xuejian Ma, Shutong Li, Maria Tikhanovskaya, Peter Christian Norgaard, Nayantara Mudur, Martyna Beata Plomecka, Paul Raccuglia, Yasaman Bahri, Victor V. Albert, Pranesh Srinivasan, Haining Pan, Philippe Faist, Brian A Rohr, Michael J. Statt, Dan Morris, Drew Purves, Elise Kleeman, Ruth Alcantara, Matthew Abraham, Muqth...

  17. [17]

    Transformer-xl: Attentive language models beyond a fixed-length context

    Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context. In Proceedings of the 57th annual meeting of the association for computational linguistics, pages 2978--2988, 2019

  18. [18]

    Transformers are ssms: Generalized models and efficient algorithms through structured state space duality

    Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality. In International Conference on Machine Learning, pages 10041--10071. PMLR, 2024

  19. [19]

    Flashattention: Fast and memory-efficient exact attention with io-awareness

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in neural information processing systems, 35:16344--16359, 2022

  20. [20]

    Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models

    Soham De, Samuel L Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, et al. Griffin: Mixing gated linear recurrences with local attention for efficient language models. arXiv preprint arXiv:2402.19427, 2024

  21. [21]

    Fewer truncations improve language modeling

    Hantian Ding, Zijian Wang, Giovanni Paolini, Varun Kumar, Anoop Deoras, Dan Roth, and Stefano Soatto. Fewer truncations improve language modeling. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=kRxCDDFNpp

  22. [22]

    Hymba: A hybrid-head architecture for small language models

Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van keirsbilck, Min-Hung Chen, Yoshi Suhara, Yingyan Celine Lin, Jan Kautz, and Pavlo Molchanov. Hymba: A hybrid-head architecture for small language models. In The Thirteenth International Conference on Learning Representations, 2025. URL ht...

  23. [23]

    A mathematical framework for transformer circuits

Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 1(1):12, 2021

  24. [24]

AREAL: A large-scale asynchronous reinforcement learning system for language reasoning

    Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, Jiashu Wang, Tongkai Yang, Binhang Yuan, and Yi Wu. AREAL: A large-scale asynchronous reinforcement learning system for language reasoning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id...

  25. [25]

    Extending the context of pretrained llms by dropping their positional embeddings

    Yoav Gelberg, Koshi Eguchi, Takuya Akiba, and Edoardo Cetin. Extending the context of pretrained llms by dropping their positional embeddings. arXiv preprint arXiv:2512.12167, 2025

  26. [26]

Zamba: A compact 7B SSM hybrid model

    Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, Adam Ibrahim, and Beren Millidge. Zamba: A compact 7b ssm hybrid model, 2024. URL https://arxiv.org/abs/2405.16712

  27. [27]

RADLADS: Rapid attention distillation to linear attention decoders at scale

    Daniel Goldstein, Eric Alcaide, Janna Lu, and Eugene Cheah. RADLADS: Rapid attention distillation to linear attention decoders at scale. In Second Conference on Language Modeling, 2025. URL https://openreview.net/forum?id=38GehGepDd

  28. [28]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  29. [29]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023

  30. [30]

    Combining recurrent, convolutional, and continuous-time models with linear state space layers

Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, and Christopher Ré. Combining recurrent, convolutional, and continuous-time models with linear state space layers. Advances in neural information processing systems, 34:572--585, 2021

  31. [31]

    Efficiently modeling long sequences with structured state spaces

    Albert Gu, Karan Goel, and Christopher Re. Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=uYLFoz1vlAC

  32. [32]

    Jet-nemotron: Efficient language model with post neural architecture search

    Yuxian Gu, Qinghao Hu, Haocheng Xi, Junyu Chen, Shang Yang, Song Han, and Han Cai. Jet-nemotron: Efficient language model with post neural architecture search. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  33. [33]

    A survey of model reduction by balanced truncation and some new results

Serkan Gugercin and Athanasios C Antoulas. A survey of model reduction by balanced truncation and some new results. International Journal of Control, 77(8):748--766, 2004

  34. [34]

    Measuring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=d7KBjmI3GmQ

  35. [35]

RULER: What's the real context size of your long-context language models?

    Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, and Boris Ginsburg. RULER: What's the real context size of your long-context language models? In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=kIoBbc76Sy

  36. [36]

    DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

    Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models. arXiv preprint arXiv:2309.14509, 2023

  37. [37]

    Livecodebench: Holistic and contamination free evaluation of large language models for code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=chfJJYC3iL

  38. [38]

Repeat after me: Transformers are better than state space models at copying

    Samy Jelassi, David Brandfonbrener, Sham M. Kakade, and Eran Malach. Repeat after me: Transformers are better than state space models at copying. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=duRRoGeoQT

  39. [39]

SWE-bench: Can language models resolve real-world GitHub issues?

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. SWE-bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=VTF8yNQM66

  40. [40]

A new approach to linear filtering and prediction problems

    R. E. Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35--45, 1960

  41. [41]

    Transformers are rnns: Fast autoregressive transformers with linear attention

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pages 5156--5165. PMLR, 2020

  42. [42]

    Reformer: The efficient transformer

    Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rkgNKkHtvB

  43. [43]

BABILong: Testing the limits of LLMs with long context reasoning-in-a-haystack

    Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Ivan Rodkin, Dmitry Igorevich Sorokin, Artyom Sorokin, and Mikhail Burtsev. BABILong: Testing the limits of LLMs with long context reasoning-in-a-haystack. In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://openreview.net/forum?id=u7m2CG84BQ

  44. [44]

Efficient memory management for large language model serving with PagedAttention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

  45. [45]

Tulu 3: Pushing frontiers in open language model post-training

    Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James Validad Miranda, Alisa Liu, Nouha Dziri, Xinxi Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Christopher Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, and Hannaneh...

  46. [46]

    Liger: Linearizing large language models to gated recurrent structures

    Disen Lan, Weigao Sun, Jiaxi Hu, Jusen Du, and Yu Cheng. Liger: Linearizing large language models to gated recurrent structures. In International Conference on Machine Learning, pages 32452--32466. PMLR, 2025

  47. [47]

    Distilling to hybrid attention models via kl-guided layer selection

    Yanhong Li, Songlin Yang, Shawn Tan, Mayank Mishra, Rameswar Panda, Jiawei Zhou, and Yoon Kim. Distilling to hybrid attention models via kl-guided layer selection. arXiv preprint arXiv:2512.20569, 2025

  48. [48]

    Jamba: A Hybrid Transformer-Mamba Language Model

    Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, et al. Jamba: A hybrid transformer-mamba language model. arXiv preprint arXiv:2403.19887, 2024

  49. [49]

    Truthfulqa: Measuring how models mimic human falsehoods

    Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. In Proceedings of the 60th annual meeting of the association for computational linguistics (volume 1: long papers), pages 3214--3252, 2022

  50. [50]

    On the stochastic realization problem

Anders Lindquist and Giorgio Picci. On the stochastic realization problem. SIAM Journal on Control and Optimization, 17(3):365--389, 1979

  51. [51]

    Ringattention with blockwise transformers for near-infinite context

    Hao Liu, Matei Zaharia, and Pieter Abbeel. Ringattention with blockwise transformers for near-infinite context. In International Conference on Learning Representations, volume 2024, pages 3992--4008, 2024

  52. [52]

    Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. Advances in neural information processing systems, 36:21558--21572, 2023

  53. [53]

PICASO: Permutation-invariant context composition with state space models

    Tian Yu Liu, Alessandro Achille, Matthew Trager, Aditya Golatkar, Luca Zancato, and Stefano Soatto. PICASO: Permutation-invariant context composition with state space models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=88TC1AWV27

  54. [54]

System Identification: Theory for the User

    L. Ljung. System Identification: Theory for the User. Prentice Hall information and system sciences series. Prentice Hall PTR, 1999. ISBN 9780136566953. URL https://books.google.com/books?id=nHFoQgAACAAJ

  55. [55]

    Error propagation properties of recursive least-squares adaptation algorithms

Stefan Ljung and Lennart Ljung. Error propagation properties of recursive least-squares adaptation algorithms. Automatica, 21(2):157--167, 1985. ISSN 0005-1098. doi:https://doi.org/10.1016/0005-1098(85)90110-4. URL https://www.sciencedirect.com/science/article/pii/0005109885901104

  56. [56]

When not to trust language models: Investigating effectiveness and limitations of parametric and non-parametric memories

    Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Hannaneh Hajishirzi, and Daniel Khashabi. When not to trust language models: Investigating effectiveness and limitations of parametric and non-parametric memories. arXiv preprint arXiv:2212.10511, 2022

  57. [57]

AMC/AIME: MAA invitational competitions, 2025

    Mathematical Association of America. AMC/AIME: MAA invitational competitions, 2025. URL https://maa.org/maa-invitational-competitions/. Accessed: 2025

  58. [58]

    Linearizing large language models

    Jean Mercat, Igor Vasiljevic, Sedrick Scott Keh, Kushal Arora, Achal Dave, Adrien Gaidon, and Thomas Kollar. Linearizing large language models. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=soGxskHGox

  59. [59]

    Landmark attention: Random-access infinite context length for transformers

    Amirkeivan Mohtashami and Martin Jaggi. Landmark attention: Random-access infinite context length for transformers. In Workshop on Efficient Systems for Foundation Models @ ICML2023, 2023. URL https://openreview.net/forum?id=PkoGERXS1B

  60. [60]

Leave no context behind: Efficient infinite context transformers with infini-attention

    Tsendsuren Munkhdalai, Manaal Faruqui, and Siddharth Gopal. Leave no context behind: Efficient infinite context transformers with infini-attention. arXiv preprint arXiv:2404.07143, 101, 2024

  61. [61]

Decentralized Estimation and Control for Multisensor Systems

    Arthur G. O. Mutambara. Decentralized Estimation and Control for Multisensor Systems. CRC Press, 1998

  62. [62]

    Expansion span: Combining fading memory and retrieval in hybrid state space models

    Elvis Nunez, Luca Zancato, Benjamin Bowman, Aditya Golatkar, Wei Xia, and Stefano Soatto. Expansion span: Combining fading memory and retrieval in hybrid state space models. In International Conference on Neuro-symbolic Systems, pages 570--596. PMLR, 2025

  63. [63]

    In-context Learning and Induction Heads

    Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022

  64. [64]

    Resurrecting recurrent neural networks for long sequences

    Antonio Orvieto, Samuel L Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, and Soham De. Resurrecting recurrent neural networks for long sequences. In International Conference on Machine Learning, pages 26670--26698. PMLR, 2023

  65. [65]

Marconi: Prefix caching for the era of hybrid LLMs

    Rui Pan, Zhuang Wang, Zhen Jia, Can Karakus, Luca Zancato, Tri Dao, Yida Wang, and Ravi Netravali. Marconi: Prefix caching for the era of hybrid LLMs. In Eighth Conference on Machine Learning and Systems, 2025. URL https://openreview.net/forum?id=RUaMUu7vMX

  66. [66]

The Berkeley Function Calling Leaderboard (BFCL): From tool use to agentic evaluation of large language models

    Shishir G. Patil, Huanzhi Mao, Charlie Cheng-Jie Ji, Fanjia Yan, Vishnu Suresh, Ion Stoica, and Joseph E. Gonzalez. The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models. In Advances in Neural Information Processing Systems, 2024

  67. [67]

    Time-Varying Systems and Computations

Patrick Dewilde and Alle-Jan van der Veen. Time-Varying Systems and Computations. Springer New York, NY, 1998. doi:10.1007/978-1-4757-2817-0

  68. [68]

YaRN: Efficient context window extension of large language models

    Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. YaRN: Efficient context window extension of large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=wHBfxhZu1u

  69. [69]

Gated KalmaNet: A fading memory layer through test-time ridge regression

    Liangzu Peng, Aditya Chattopadhyay, Luca Zancato, Elvis Nunez, Wei Xia, and Stefano Soatto. Gated kalmanet: A fading memory layer through test-time ridge regression. arXiv preprint arXiv:2511.21016, 2025

  70. [70]

    Generalizing verifiable instruction following, 2025

    Valentina Pyatkin, Saumya Malik, Victoria Graf, Hamish Ivison, Shengyi Huang, Pradeep Dasigi, Nathan Lambert, and Hannaneh Hajishirzi. Generalizing verifiable instruction following, 2025

  71. [71]

    Ulysses sequence parallelism in the hugging face ecosystem, 2025

    Kashif Rasul and Bekman Stas. Ulysses sequence parallelism in the hugging face ecosystem, 2025. URL https://huggingface.co/blog/ulysses-sp. Blog post

  72. [72]

GPQA: A graduate-level Google-proof Q&A benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=Ti67584b98

  73. [73]

    Samba: Simple hybrid state space models for efficient unlimited context language modeling

Liliang Ren, Yang Liu, Yadong Lu, Yelong Shen, Chen Liang, and Weizhu Chen. Samba: Simple hybrid state space models for efficient unlimited context language modeling. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=bIlnpVM4bc

  74. [74]

    Understanding and improving length generalization in recurrent models, 2025

    Ricardo Buitrago Ruiz and Albert Gu. Understanding and improving length generalization in recurrent models, 2025. URL https://arxiv.org/abs/2507.02782

  75. [75]

Balanced truncation of linear time-varying systems

    H. Sandberg and A. Rantzer. Balanced truncation of linear time-varying systems. IEEE Transactions on Automatic Control, 49(2):217--229, 2004. doi:10.1109/TAC.2003.822862

  76. [76]

Fundamentals of Adaptive Filtering

    A.H. Sayed. Fundamentals of Adaptive Filtering. IEEE Press. Wiley, 2003. ISBN 9780471461265. URL https://books.google.com/books?id=VaAV4uqMuKYC

  77. [77]

Adaptive Filters

    A.H. Sayed. Adaptive Filters. IEEE Press. Wiley, 2011. ISBN 9781118210840. URL https://books.google.com/books?id=VBaenqIVftUC

  78. [78]

    Flashattention-3: Fast and accurate attention with asynchrony and low-precision

    Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. Flashattention-3: Fast and accurate attention with asynchrony and low-precision. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=tVConYid20

  79. [79]

HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, EuroSys '25, page 1279–1297, New York, NY, USA, 2025. Association for Computing Machinery. ISBN 9798400711961. doi:1...

  80. [80]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019

Showing first 80 references.