pith. machine review for the scientific record.

arxiv: 2605.08301 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Priming: Hybrid State Space Models From Pre-trained Transformers

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:08 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords hybrid state space models · priming · knowledge transfer · long-context reasoning · state space models · transformers · Mamba · model efficiency

The pith

Priming initializes hybrid attention-SSM models from pre-trained transformers and recovers performance with under 0.5% of the original training tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Priming as a method that initializes hybrid models combining attention and state-space model layers directly from existing pre-trained transformers. Short alignment and post-training phases then restore downstream task quality without needing to restart pre-training from random weights. A reader would care because this approach makes large-scale exploration of hybrid architectures practical, yielding models with smaller key-value caches, faster decoding, and native support for long contexts while remaining agnostic to the source transformer's family or scale. It also permits the first apples-to-apples comparison of different SSM layer designs under identical training conditions.
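
To make the mechanics concrete, here is a minimal sketch of what the initialization step could look like. The module layout (`.layers`, `.mixer`), the factory interface, and the reuse of projection weights are illustrative assumptions, not the authors' code; the choice of which layers to replace follows the selective, importance-based pattern described in Figures 4 and 5 below.

    import copy
    from torch import nn

    def prime_hybrid(transformer: nn.Module, make_ssm, replace_idx: set) -> nn.Module:
        # Hypothetical Stage 0: keep every pre-trained weight, then swap a
        # chosen subset of attention blocks for freshly initialized SSM
        # layers (GKA, GDN, or Mamba-2). Everything else transfers verbatim.
        hybrid = copy.deepcopy(transformer)
        for i, block in enumerate(hybrid.layers):      # assumed block layout
            if i in replace_idx:
                block.mixer = make_ssm(block.mixer)    # may reuse projections
        return hybrid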

Core claim

Priming converts hybrid architecture design from a full pre-training problem into a knowledge transfer task. It initializes a hybrid attention-plus-SSM model from a pre-trained transformer, then applies short alignment and post-training phases that recover downstream quality using less than 0.5% of the source model's original token budget. The procedure works across dense and mixture-of-experts transformers of varying families and sizes, and it enables controlled scaling experiments that reveal a consistent expressiveness ordering among SSM variants: Gated KalmaNet outperforms Gated DeltaNet, which in turn outperforms Mamba-2, with the ordering directly forecasting performance on long-context reasoning tasks.

What carries the argument

Priming, the procedure of copying transformer weights into a hybrid attention-SSM architecture followed by alignment and post-training phases that transfer knowledge without full retraining.
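
Figures 6 and 14 describe the alignment phase as a fused teacher-student setup: each student SSM layer is regressed onto its frozen teacher attention counterpart, either layerwise (the paper's Equation (5)) or end to end (Equation (4)). A hedged sketch of the layerwise variant, with the interface and tensor shapes assumed:

    import torch
    import torch.nn.functional as F

    def layerwise_alignment_loss(teacher_outs, student_outs):
        # One MSE term per replaced layer: the student SSM output is pushed
        # toward the frozen teacher attention output on the same input
        # (cf. the parallel pathways of Figure 14).
        losses = [F.mse_loss(s, t.detach()) for s, t in zip(student_outs, teacher_outs)]
        return torch.stack(losses).mean()

The end-to-end variant would instead compare final hidden states only; Figure 6 reports it as a further improvement over layerwise supervision.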

If this is right

  • Hybrid models produced by Priming deliver up to 2.3x higher decode throughput than the source transformer while remaining within 1% of its quality; the cache arithmetic sketched after this list shows where part of that headroom comes from.
  • The expressiveness ranking GKA > GDN > Mamba-2 directly predicts which hybrid variant performs best on long-context reasoning tasks.
  • At 32B scale the primed GKA hybrid improves average reasoning scores by 3.8 points over its source Qwen3-32B model.
  • The released model zoo and training code allow other researchers to repeat or extend the same controlled SSM comparisons at scale.
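
The memory half of the throughput claim is back-of-envelope arithmetic. A sketch with illustrative dimensions (not the paper's reported configurations):

    def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, seq_len, dtype_bytes=2):
        # K and V tensors for each retained attention layer, per sequence.
        return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

    # A 50% hybrid keeps half the attention layers, so at 128K context its
    # cache is roughly halved; the replaced layers carry only fixed-size SSM
    # states that do not grow with sequence length.
    full   = kv_cache_bytes(64, 8, 128, 128 * 1024)   # hypothetical dense model
    hybrid = kv_cache_bytes(32, 8, 128, 128 * 1024)   # same model, 50% hybrid
    print(full / 2**30, hybrid / 2**30)               # -> 32.0 vs 16.0 GiB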

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Transformer representations appear general enough that recurrent SSM components can be substituted in with only modest additional training.
  • The same priming approach might be tested for initializing hybrids that combine attention with other recurrent or memory-efficient blocks beyond the three SSMs compared here.
  • Widespread adoption of primed hybrids could shift inference cost curves for long-context applications such as multi-step reasoning and retrieval-augmented generation.

Load-bearing premise

That a hybrid model started from transformer weights can be aligned and post-trained to recover quality without catastrophic forgetting or loss of pre-trained capabilities.

What would settle it

If a primed hybrid model trained on the same post-training data as its source transformer fails to reach within 1% of the transformer's downstream performance on the reported long-context reasoning benchmarks, the recovery claim would be falsified.
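
Stated mechanically, with placeholder benchmark names and scores rather than reported values:

    def recovery_holds(hybrid_scores, transformer_scores, tol=0.01):
        # True iff the primed hybrid stays within `tol` (relative) of the
        # source transformer on every reported long-context benchmark.
        return all(hybrid_scores[b] >= (1 - tol) * transformer_scores[b]
                   for b in transformer_scores)

    # e.g. recovery_holds({"ruler_128k": 71.2}, {"ruler_128k": 71.8}) -> True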

Figures

Figures reproduced from arXiv: 2605.08301 by Aditya Chattopadhyay, Benjamin Bowman, David Thomas, Elvis Nunez, Evan Becker, Luca Zancato, Prannay Kaul, Stefano Soatto, Wei Xia.

Figure 1
Figure 1: Primed Hybrid models match Transformer quality at half the memory and decode faster. (Left) Reasoning accuracy of Primed Hybrid reasoning models vs. the source Qwen3 Transformer across five benchmarks at 8B and 32B scale. At 8B, the GKA Hybrid uniformly outperforms its GDN counterpart, consistent with the expressiveness hierarchy GKA > GDN (Section 4). (Top right) Decode throughput speedup of GKA-Primed-HQ… view at source ↗
Figure 2
Figure 2 view at source ↗
Figure 3
Figure 3: Long-context evaluation at 32B scale. Performance of our Primed Hybrid IT models (GKA and GDN) against the Qwen3-32B [Long] Transformer baseline fine-tuned with the same Stage 2 IT recipe. Results are reported as a weighted average across context lengths from 8K to 128K, with geometrically increasing weights that double at each successive context length. The bottom-right panel (Aggregate) aggregates perfor… view at source ↗ (the weighting is worked out after the figure list)
Figure 4
Figure 4: Per-layer importance scores for Qwen3-8B and Qwen3-32B. Each bar is the relative drop in mean HELMET performance (across five sub-tasks: Synthetic Recall, RAG, Many-shot ICL, Generation with Citations, and Passage Re-ranking) when the corresponding Attention layer is individually replaced with SWA of window size w = 2048. Importance is concentrated in a small set of middle-to-late layers in both models; th… view at source ↗
Figure 5
Figure 5: Effect of layer pattern strategy on long-context performance for Hybrid models with a 50% Hybrid ratio sourced from Qwen3-8B. We compare our Selective Pattern against the Uniform Pattern baseline. Each line averages across three Hybrid models, one per SSM layer type (Mamba2, GDN, GKA); scores are reported as a fraction of the Transformer baseline’s score at the same context length. All models, including th… view at source ↗
Figure 6
Figure 6: Effect of Stage 1 supervision granularity on long-context performance for Hybrid models with a 50% Hybrid ratio sourced from Qwen3-8B. We compare end-to-end MSE (Equation (4)) against layerwise MSE (Equation (5)) for three SSM layer types (GKA, GDN, Mamba2). For each SSM type, the solid bar shows performance under layerwise supervision and the hatched overlay shows the improvement from end-to-end supervis… view at source ↗
Figure 7
Figure 7: Effect of State Expansion via AGQA on long-context performance for Hybrid models with a 50% Hybrid ratio sourced from Qwen3-32B. We compare Adaptive GQA (AGQA) against the standard GQA baseline for two SSM layer types (GKA, GDN). For each SSM type, the solid bar shows performance under standard GQA and the hatched overlay shows the improvement from AGQA. Qwen3-32B [Long] is the Transformer baseline. Sc… view at source ↗
Figure 8
Figure 8: Reasoning output token length distributions. Output token length distributions for GKA-Primed-HQwen3-Reasoner (r=30) and Qwen3-32B [Reasoner-SFT] on AIME 2025 (n=3600 rollouts per model). Left: full benchmark (30 problems). Right: hard subset (15 hardest problems). Both models were trained with the same reasoning SFT recipe and produce similar-length thinking traces, confirming that the Hybrid’s speedups a… view at source ↗
Figure 9
Figure 9: State composition for hybrid models. Long inputs are partitioned into chunks at the native context length, each processed independently. KV caches are concatenated while SSM states are merged (via averaging or an alternative method as in Section E), yielding training-free context extension. Let L denote the model’s native context length and suppose we wish to process an input of length 2L… view at source ↗ (a sketch of this merge follows the figure list)
Figure 10
Figure 10: State composition extends effective context well beyond the 128K native training window. Performance of Primed Hybrid IT models (GKA-8B, GDN-8B, Mamba2-8B, GKA-32B) on RULER NIAH and BABILong at 128K (native), 256K (2×), and 512K (4×) contexts. Solid bars show standard prefill (running the model on the full input in a single pass); hatched bars show the additional gain from state composition with 128K chu… view at source ↗
Figure 11
Figure 11: Input selectivity through βt improves long context capabilities of GKA-Primed models. Long-context performance, averaged over the tasks described in Section 5.1.1 for 8B and 32B GKA-Primed-HQwen3-IT models with and without input selectivity (βt), across sequence lengths from 8k to 128k. Adding βt generally improves performance, with the gains becoming more pronounced at longer context lengths. view at source ↗
Figure 12
Figure 12: GKA: Trading compute for speed at inference time. (a): GKA-Primed-HQwen3-IT-8B model. (b): GKA-Primed-HQwen3-IT-32B model. Each curve represents a different context length, and marker shapes denote the number of Chebyshev iterations r. Reducing r from its training value (30) increases throughput with only a modest drop in long-context performance. In both cases, a single trained model supports variable te… view at source ↗
Figure 13
Figure 13: Speedup of symmetric tiled kernels over the non-tiled baseline on H200 and A100 GPUs. The x-axis shows batch size normalized by the number of SMs on each GPU, so that the periodic pattern, caused by successive waves of Triton program instances filling and then exceeding SM capacity, aligns across GPU models. The tiled_small_batch variant provides a consistent speedup at all batch sizes by exploiting Ht sy… view at source ↗
Figure 14
Figure 14: Fused teacher-student architecture used during Stage 1 of Priming. For every SSM layer in the student Hybrid, the fused architecture maintains two parallel layers, corresponding to two parallel pathways for processing the input: the teacher’s Attention layer (dashed, frozen) and the student’s SSM layer (solid, learnable). These layers are highlighted in yellow in the figure. All other layers (highlighted … view at source ↗
Figure 15
Figure 15: SP sharding patterns for NSP = 4. (a) Simple SP: the sequence is split into NSP contiguous chunks, one per GPU. (b) Zig-zag SP: the sequence is split into 2NSP chunks and each GPU receives two discontiguous chunks (e.g., GPU 1 gets chunks 1 and 8), balancing the causal sequence mixing layers workload across ranks. view at source ↗ (the pairing rule is sketched after the figure list)
Figure 16
Figure 16: Stage 1 alignment recipe. Data composition by domain (left) and training configuration (right). Stage 1 models are trained on 40B tokens consisting of a mix of instruction-following and web text. view at source ↗
Figure 17
Figure 17: Instruction model training data distribution. Domain composition by token count for long-context continued pre-training (left) and instruction-tuning SFT (right). “Safety” refers to samples that train appropriate refusal behavior and robustness to adversarial prompts. view at source ↗
Figure 18
Figure 18: Our multi-stage long-context reasoning pipeline. Stages 0+1 of the Priming procedure allow our Hybrid architectures to sidestep pre-training costs by leveraging a pre-trained Transformer model. Stage 2 for reasoning consists of three phases. (i) Short-context SFT consists of SFT on packed reasoning examples at 32K context. Next, (ii) a context-extension phase where we train on a mix of reasoning samples a… view at source ↗
Figure 19
Figure 19: We depict the document and token ratios of our reasoning SFT data mixture by domain. view at source ↗
Figure 20
Figure 20: Effect of the primed gate initialization on long-context performance. Per-context-length scores of Mamba2-Primed-HQwen3-IT at 8B scale, with (green) and without (red) the gate initialization from Equation (26). The “w/o primed gate” variant follows Wang et al. [2024] and initializes WG randomly. Both variants use the same Stage 0/1/2 recipe and differ only in WG initialization. Scores are the average over… view at source ↗
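
Three of the mechanisms in the captions above are compact enough to sketch. First, the aggregate metric of Figure 3: assuming the evaluated lengths double from 8K to 128K (five lengths, consistent with the caption), the doubling weights give

    \bar{s} \;=\; \frac{\sum_{k=0}^{4} 2^{k} s_{k}}{\sum_{k=0}^{4} 2^{k}}
            \;=\; \frac{1}{31}\sum_{k=0}^{4} 2^{k} s_{k},
    \qquad s_{k} = \text{score at context length } 8\mathrm{K}\cdot 2^{k},

so the 128K score alone carries 16/31, roughly half of the aggregate. Second, the state composition of Figure 9; the tensor layout and the averaging merge are assumptions read off the caption, not the released implementation:

    import torch

    def compose_states(chunk_states):
        # Each element is (kv_cache, ssm_state) from one native-length chunk
        # processed independently. The eidetic part (KV) concatenates along
        # the sequence axis; the fading-memory part (SSM state) stays fixed
        # size and is merged by averaging (Section E gives an alternative).
        kv_caches, ssm_states = zip(*chunk_states)
        kv = torch.cat(kv_caches, dim=1)   # assumed [batch, seq, heads, dim]
        ssm = torch.stack(ssm_states).mean(dim=0)
        return kv, ssm

Third, the zig-zag sharding of Figure 15. The pairing below reproduces the caption's example (GPU 1 holding chunks 1 and 8 at NSP = 4), though the exact indexing in the released code is an assumption:

    def zigzag_shard(tokens, n_sp):
        # Split into 2*n_sp contiguous chunks; 0-indexed rank r takes chunks
        # r and 2*n_sp - 1 - r, pairing an early chunk with a late one so the
        # causal sequence-mixing workload is balanced across ranks.
        size = len(tokens) // (2 * n_sp)
        chunks = [tokens[i * size:(i + 1) * size] for i in range(2 * n_sp)]
        return [chunks[r] + chunks[2 * n_sp - 1 - r] for r in range(n_sp)]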
read the original abstract

Hybrid State-Space models combine Attention with recurrent State-Space Model (SSM) layers, balancing eidetic memory from Attention with compressed fading memory from SSMs. This yields smaller Key-Value caches and faster decoding than Transformers, along with a richer architectural design space. Exploring that design space at scale has so far required training from scratch, a barrier that has kept most large-model Hybrid research within a narrow range of architectures. We introduce Priming, a method that turns Hybrid architecture design from a pre-training problem into a knowledge transfer one. Priming initializes a Hybrid model from a pre-trained Transformer and, through short alignment and post-training phases, recovers downstream quality using less than 0.5% of the source model's pre-training token budget. Priming is agnostic to the source Transformer family (e.g., Qwen, Llama, Mistral), model class (dense or Mixture-of-Experts), and model scale. Priming enables us to run the first controlled comparison of SSM layer types at scale under identical conditions. We evaluate Gated KalmaNet (GKA), Gated DeltaNet (GDN), and Mamba-2, and show that their expressiveness hierarchy, GKA > GDN > Mamba-2, directly predicts downstream performance on long-context reasoning tasks. We scale Priming to 8B/32B reasoning models with native 128K contexts. Our Hybrid GKA 32B improves over its source Qwen3-32B by +3.8 average reasoning points, while staying within 1% of a Transformer post-trained on the same data and enabling up to 2.3x higher decode throughput. To foster research on Hybrid architectures, we release a model zoo of primed Hybrid models for long-context reasoning and instruction following, together with the Priming training and inference code (Sequence Parallelism algorithms for long-context training, optimized GKA kernels, and vLLM serving plugin), all under Apache 2.0 License.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Priming, a knowledge-transfer method that initializes hybrid models (combining attention with SSM layers such as Gated KalmaNet, Gated DeltaNet, or Mamba-2) from pre-trained Transformers. Short alignment and post-training phases (<0.5% of the source pre-training token budget) recover downstream quality. This enables the first controlled, scale-matched comparison of SSM layer types, revealing an expressiveness hierarchy GKA > GDN > Mamba-2 that predicts long-context reasoning performance. The method is shown to be agnostic to source family, scale, and density (dense/MoE); scaled to 8B and 32B reasoning models with native 128K contexts, the 32B GKA hybrid improves +3.8 average reasoning points over its source Qwen3-32B while remaining within 1% of a same-data Transformer baseline and delivering up to 2.3x decode throughput. A model zoo, training/inference code (including Sequence Parallelism and optimized kernels), and vLLM plugin are released under Apache 2.0.

Significance. If the empirical claims hold, Priming materially lowers the barrier to large-scale hybrid architecture exploration by converting it from a full pre-training problem into a transfer problem. The controlled SSM-layer comparison and released artifacts (models, code, kernels) provide immediate value for the community studying efficient long-context models. The reported hierarchy offers a falsifiable prediction that can be tested by others using the released zoo.

major comments (2)
  1. The abstract states concrete gains (+3.8 reasoning points, 2.3x throughput, within 1% of Transformer baseline) but supplies no information on data splits, statistical significance, number of runs, or post-hoc selection criteria. The full manuscript must include these controls (e.g., in the experimental section) for the central claim of successful transfer without catastrophic forgetting to be verifiable.
  2. The hierarchy claim (GKA > GDN > Mamba-2 directly predicts downstream performance) is load-bearing for the paper's contribution on controlled comparison. The manuscript should report the exact metrics, context lengths, and task suite used to establish this ordering, together with any ablation that isolates layer type from other architectural differences.
minor comments (2)
  1. Clarify the precise definition of 'alignment phase' versus 'post-training phase' and the token budgets allocated to each, ideally with a table or equation.
  2. The claim of agnosticism to source family/scale is strong; a brief table summarizing results across at least two additional source models (beyond Qwen) would strengthen it.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. We are pleased that the significance of Priming for lowering the barrier to hybrid architecture exploration is recognized. We address each major comment below and will update the manuscript to incorporate the requested details.

read point-by-point responses
  1. Referee: The abstract states concrete gains (+3.8 reasoning points, 2.3x throughput, within 1% of Transformer baseline) but supplies no information on data splits, statistical significance, number of runs, or post-hoc selection criteria. The full manuscript must include these controls (e.g., in the experimental section) for the central claim of successful transfer without catastrophic forgetting to be verifiable.

    Authors: We acknowledge that the abstract is necessarily concise and does not include these details. The full manuscript's experimental section and appendices already describe the post-training data (a curated mix of long-context and reasoning corpora totaling less than 0.5% of the source pre-training budget), the fixed evaluation suite, and results from multiple independent runs. To directly address verifiability, we will add an explicit subsection on 'Reproducibility and Statistical Controls' that states the data splits, confirms three independent runs for the 32B-scale results with standard deviations, notes the absence of post-hoc selection, and reiterates that the same data was used for the matched Transformer baseline. This will make the transfer claims fully verifiable without altering any reported numbers. revision: yes

  2. Referee: The hierarchy claim (GKA > GDN > Mamba-2 directly predicts downstream performance) is load-bearing for the paper's contribution on controlled comparison. The manuscript should report the exact metrics, context lengths, and task suite used to establish this ordering, together with any ablation that isolates layer type from other architectural differences.

    Authors: We agree that the hierarchy is central and must be documented with precision. The controlled comparison already fixes the source Transformer, attention layers, training recipe, and hyperparameters across GKA, GDN, and Mamba-2 variants, isolating the SSM layer. The ordering is established on long-context reasoning performance at 128K context using average accuracy across the Needle-in-Haystack, multi-hop QA, and long-document reasoning tasks that match the 32B evaluation suite. We will expand Section 5 to explicitly enumerate the metrics, context lengths, and task suite, and add a dedicated ablation table that varies only the SSM layer while holding all other factors constant. This will confirm the hierarchy's predictive power for downstream performance. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper describes an empirical knowledge-transfer procedure (Priming) that initializes hybrid SSM-Attention models from existing pre-trained Transformers, followed by short alignment and post-training phases. No equations, derivations, or first-principles results are presented that reduce to their own inputs by construction. The central claims rest on reported downstream performance numbers, model releases, and controlled empirical comparisons of SSM layer types; these are falsifiable experimental outcomes rather than self-referential definitions or fitted parameters renamed as predictions. No load-bearing self-citations, uniqueness theorems, or smuggled ansatzes appear in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The method relies on standard transfer-learning assumptions (that Transformer weights provide a useful initialization for hybrid layers and that short alignment suffices) but introduces no new free parameters, axioms, or invented entities beyond the named SSM variants.

pith-pipeline@v0.9.0 · 5692 in / 1233 out tokens · 51259 ms · 2026-05-12T01:08:34.813553+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

103 extracted references · 103 canonical work pages · 14 internal anchors

  1. [1]

    GQA : Training generalized multi-query transformer models from multi-head checkpoints

    Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, and Sumit Sanghai. GQA : Training generalized multi-query transformer models from multi-head checkpoints. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4895--4901, Singa...

  2. [2]

    Training-free long-context scaling of large language models

    Chenxin An, Fei Huang, Jun Zhang, Shansan Gong, Xipeng Qiu, Chang Zhou, and Lingpeng Kong. Training-free long-context scaling of large language models. In International Conference on Machine Learning, pages 1493--1510. PMLR, 2024

  3. [3]

    Zoology: Measuring and improving recall in efficient language models

    Simran Arora, Sabri Eyuboglu, Aman Timalsina, Isys Johnson, Michael Poli, James Zou, Atri Rudra, and Christopher Re. Zoology: Measuring and improving recall in efficient language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=LY3ukUANko

  4. [4]

    Hybrid Architectures for Language Models: Systematic Analysis and Design Insights

    Sangmin Bae, Bilge Acun, Haroun Habeeb, Seungyeon Kim, Chien-Yu Lin, Liang Luo, Junjie Wang, and Carole-Jean Wu. Hybrid architectures for language models: Systematic analysis and design insights. arXiv preprint arXiv:2510.04800, 2025

  5. [5]

Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods

    Richard Barrett, Michael Berry, Tony F. Chan, James Demmel, June Donato, Jack Dongarra, Victor Eijkhout, Roldan Pozo, Charles Romine, and Henk van der Vorst. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods. Society for Industrial and Applied Mathematics, 1994. doi:10.1137/1.9781611971538. URL https://epubs.siam.org/doi/...

  6. [6]

Nvidia Nemotron Nano 2: An accurate and efficient hybrid Mamba-Transformer reasoning model

    Aarti Basant, Abhijit Khairnar, Abhijit Paithankar, Abhinav Khattar, Adithya Renduchintala, Aditya Malte, Akhiad Bercovich, Akshay Hazare, Alejandra Rico, Aleksander Ficek, et al. Nvidia nemotron nano 2: An accurate and efficient hybrid mamba-transformer reasoning model. arXiv preprint arXiv:2508.14444, 2025

  7. [7]

xLSTM: Extended Long Short-Term Memory

    Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. xLSTM: Extended long short-term memory. Advances in Neural Information Processing Systems, 37:107547--107603, 2024

  8. [8]

    Transformers to ssms: Distilling quadratic knowledge to subquadratic models

Aviv Bick, Kevin Li, Eric Xing, J Zico Kolter, and Albert Gu. Transformers to ssms: Distilling quadratic knowledge to subquadratic models. Advances in neural information processing systems, 37:31788--31812, 2024

  9. [9]

Nvidia Nemotron 3: Efficient and open intelligence

    Aaron Blakeman, Aaron Grattafiori, Aarti Basant, Abhibha Gupta, Abhinav Khattar, Adi Renduchintala, Aditya Vavre, Akanksha Shukla, Akhiad Bercovich, Aleksander Ficek, et al. Nvidia nemotron 3: Efficient and open intelligence. arXiv preprint arXiv:2512.20856, 2025a

  10. [10]

Nemotron 3 Nano: Open, efficient mixture-of-experts hybrid Mamba-Transformer model for agentic reasoning

    Aaron Blakeman, Aaron Grattafiori, Aarti Basant, Abhibha Gupta, Abhinav Khattar, Adi Renduchintala, Aditya Vavre, Akanksha Shukla, Akhiad Bercovich, Aleksander Ficek, et al. Nemotron 3 nano: Open, efficient mixture-of-experts hybrid mamba-transformer model for agentic reasoning. arXiv preprint arXiv:2512.20848, 2025b

  11. [11]

Qwen3-Coder-Next technical report

    Ruisheng Cao, Mouxiang Chen, Jiawei Chen, Zeyu Cui, Yunlong Feng, Binyuan Hui, Yuheng Jing, Kaixin Li, Mingze Li, Junyang Lin, et al. Qwen3-coder-next technical report. arXiv preprint arXiv:2603.00729, 2026

  12. [12]

    MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, et al. Minimax-m1: Scaling test-time compute efficiently with lightning attention. arXiv preprint arXiv:2506.13585, 2025a

  13. [13]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

  14. [14]

ScienceAgentBench: Toward rigorous assessment of language agents for data-driven scientific discovery

    Ziru Chen, Shijie Chen, Yuting Ning, Qianheng Zhang, Boshi Wang, Botao Yu, Yifei Li, Zeyi Liao, Chen Wei, Zitong Lu, Vishal Dey, Mingyi Xue, Frazier N. Baker, Benjamin Burns, Daniel Adu-Ampratwum, Xuhui Huang, Xia Ning, Song Gao, Yu Su, and Huan Sun. Scienceagentbench: Toward rigorous assessment of language agents for data-driven scientific discovery. In ...

  15. [15]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  16. [16]

CURIE: Evaluating LLMs on multitask scientific long-context understanding and reasoning

    Hao Cui, Zahra Shamsi, Gowoon Cheon, Xuejian Ma, Shutong Li, Maria Tikhanovskaya, Peter Christian Norgaard, Nayantara Mudur, Martyna Beata Plomecka, Paul Raccuglia, Yasaman Bahri, Victor V. Albert, Pranesh Srinivasan, Haining Pan, Philippe Faist, Brian A Rohr, Michael J. Statt, Dan Morris, Drew Purves, Elise Kleeman, Ruth Alcantara, Matthew Abraham, Muqth...

  17. [17]

    Transformer-xl: Attentive language models beyond a fixed-length context

    Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context. In Proceedings of the 57th annual meeting of the association for computational linguistics, pages 2978--2988, 2019

  18. [18]

    Transformers are ssms: Generalized models and efficient algorithms through structured state space duality

    Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality. In International Conference on Machine Learning, pages 10041--10071. PMLR, 2024

  19. [19]

    Flashattention: Fast and memory-efficient exact attention with io-awareness

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in neural information processing systems, 35:16344--16359, 2022

  20. [20]

    Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models

    Soham De, Samuel L Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, et al. Griffin: Mixing gated linear recurrences with local attention for efficient language models. arXiv preprint arXiv:2402.19427, 2024

  21. [21]

    Fewer truncations improve language modeling

    Hantian Ding, Zijian Wang, Giovanni Paolini, Varun Kumar, Anoop Deoras, Dan Roth, and Stefano Soatto. Fewer truncations improve language modeling. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=kRxCDDFNpp

  22. [22]

    Hymba: A hybrid-head architecture for small language models

Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van keirsbilck, Min-Hung Chen, Yoshi Suhara, Yingyan Celine Lin, Jan Kautz, and Pavlo Molchanov. Hymba: A hybrid-head architecture for small language models. In The Thirteenth International Conference on Learning Representations, 2025. URL ht...

  23. [23]

    A mathematical framework for transformer circuits

Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 1(1):12, 2021

  24. [24]

AREAL: A large-scale asynchronous reinforcement learning system for language reasoning

    Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, Jiashu Wang, Tongkai Yang, Binhang Yuan, and Yi Wu. AREAL: A large-scale asynchronous reinforcement learning system for language reasoning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id...

  25. [25]

    Extending the context of pretrained llms by dropping their positional embeddings

    Yoav Gelberg, Koshi Eguchi, Takuya Akiba, and Edoardo Cetin. Extending the context of pretrained llms by dropping their positional embeddings. arXiv preprint arXiv:2512.12167, 2025

  26. [26]

Zamba: A compact 7B SSM hybrid model

    Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, Adam Ibrahim, and Beren Millidge. Zamba: A compact 7b ssm hybrid model, 2024. URL https://arxiv.org/abs/2405.16712

  27. [27]

RADLADS: Rapid attention distillation to linear attention decoders at scale

    Daniel Goldstein, Eric Alcaide, Janna Lu, and Eugene Cheah. RADLADS: Rapid attention distillation to linear attention decoders at scale. In Second Conference on Language Modeling, 2025. URL https://openreview.net/forum?id=38GehGepDd

  28. [28]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  29. [29]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023

  30. [30]

    Combining recurrent, convolutional, and continuous-time models with linear state space layers

Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, and Christopher Ré. Combining recurrent, convolutional, and continuous-time models with linear state space layers. Advances in neural information processing systems, 34:572--585, 2021

  31. [31]

    Efficiently modeling long sequences with structured state spaces

    Albert Gu, Karan Goel, and Christopher Re. Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=uYLFoz1vlAC

  32. [32]

    Jet-nemotron: Efficient language model with post neural architecture search

    Yuxian Gu, Qinghao Hu, Haocheng Xi, Junyu Chen, Shang Yang, Song Han, and Han Cai. Jet-nemotron: Efficient language model with post neural architecture search. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  33. [33]

    A survey of model reduction by balanced truncation and some new results

Serkan Gugercin and Athanasios C Antoulas. A survey of model reduction by balanced truncation and some new results. International Journal of Control, 77(8):748--766, 2004

  34. [34]

    Measuring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=d7KBjmI3GmQ

  35. [35]

RULER: What's the real context size of your long-context language models?

    Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, and Boris Ginsburg. RULER: What's the real context size of your long-context language models? In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=kIoBbc76Sy

  36. [36]

    DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

    Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models. arXiv preprint arXiv:2309.14509, 2023

  37. [37]

    Livecodebench: Holistic and contamination free evaluation of large language models for code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=chfJJYC3iL

  38. [38]

Repeat after me: Transformers are better than state space models at copying

    Samy Jelassi, David Brandfonbrener, Sham M. Kakade, and Eran Malach. Repeat after me: Transformers are better than state space models at copying. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=duRRoGeoQT

  39. [39]

SWE-bench: Can language models resolve real-world GitHub issues?

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. SWE-bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=VTF8yNQM66

  40. [40]

A new approach to linear filtering and prediction problems

    R. E. Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35--45, 1960

  41. [41]

    Transformers are rnns: Fast autoregressive transformers with linear attention

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pages 5156--5165. PMLR, 2020

  42. [42]

    Reformer: The efficient transformer

    Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rkgNKkHtvB

  43. [43]

BABILong: Testing the limits of LLMs with long context reasoning-in-a-haystack

    Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Ivan Rodkin, Dmitry Igorevich Sorokin, Artyom Sorokin, and Mikhail Burtsev. BABILong: Testing the limits of LLMs with long context reasoning-in-a-haystack. In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://openreview.net/forum?id=u7m2CG84BQ

  44. [44]

Efficient memory management for large language model serving with PagedAttention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

  45. [45]

Tulu 3: Pushing frontiers in open language model post-training

    Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James Validad Miranda, Alisa Liu, Nouha Dziri, Xinxi Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Christopher Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, and Hannaneh...

  46. [46]

    Liger: Linearizing large language models to gated recurrent structures

    Disen Lan, Weigao Sun, Jiaxi Hu, Jusen Du, and Yu Cheng. Liger: Linearizing large language models to gated recurrent structures. In International Conference on Machine Learning, pages 32452--32466. PMLR, 2025

  47. [47]

    Distilling to hybrid attention models via kl-guided layer selection

    Yanhong Li, Songlin Yang, Shawn Tan, Mayank Mishra, Rameswar Panda, Jiawei Zhou, and Yoon Kim. Distilling to hybrid attention models via kl-guided layer selection. arXiv preprint arXiv:2512.20569, 2025

  48. [48]

    Jamba: A Hybrid Transformer-Mamba Language Model

    Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, et al. Jamba: A hybrid transformer-mamba language model. arXiv preprint arXiv:2403.19887, 2024

  49. [49]

    Truthfulqa: Measuring how models mimic human falsehoods

    Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. In Proceedings of the 60th annual meeting of the association for computational linguistics (volume 1: long papers), pages 3214--3252, 2022

  50. [50]

    On the stochastic realization problem

Anders Lindquist and Giorgio Picci. On the stochastic realization problem. SIAM Journal on Control and Optimization, 17(3):365--389, 1979

  51. [51]

    Ringattention with blockwise transformers for near-infinite context

    Hao Liu, Matei Zaharia, and Pieter Abbeel. Ringattention with blockwise transformers for near-infinite context. In International Conference on Learning Representations, volume 2024, pages 3992--4008, 2024

  52. [52]

    Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. Advances in neural information processing systems, 36:21558--21572, 2023

  53. [53]

PICASO: Permutation-invariant context composition with state space models

    Tian Yu Liu, Alessandro Achille, Matthew Trager, Aditya Golatkar, Luca Zancato, and Stefano Soatto. PICASO: Permutation-invariant context composition with state space models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=88TC1AWV27

  54. [54]

System Identification: Theory for the User

    L. Ljung. System Identification: Theory for the User. Prentice Hall information and system sciences series. Prentice Hall PTR, 1999. ISBN 9780136566953. URL https://books.google.com/books?id=nHFoQgAACAAJ

  55. [55]

    Error propagation properties of recursive least-squares adaptation algorithms

Stefan Ljung and Lennart Ljung. Error propagation properties of recursive least-squares adaptation algorithms. Automatica, 21(2):157--167, 1985. ISSN 0005-1098. doi:https://doi.org/10.1016/0005-1098(85)90110-4. URL https://www.sciencedirect.com/science/article/pii/0005109885901104

  56. [56]

When not to trust language models: Investigating effectiveness and limitations of parametric and non-parametric memories

    Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Hannaneh Hajishirzi, and Daniel Khashabi. When not to trust language models: Investigating effectiveness and limitations of parametric and non-parametric memories. arXiv preprint arXiv:2212.10511, 2022

  57. [57]

AMC/AIME: MAA invitational competitions, 2025

    Mathematical Association of America. AMC/AIME: MAA invitational competitions, 2025. URL https://maa.org/maa-invitational-competitions/. Accessed: 2025

  58. [58]

    Linearizing large language models

    Jean Mercat, Igor Vasiljevic, Sedrick Scott Keh, Kushal Arora, Achal Dave, Adrien Gaidon, and Thomas Kollar. Linearizing large language models. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=soGxskHGox

  59. [59]

    Landmark attention: Random-access infinite context length for transformers

    Amirkeivan Mohtashami and Martin Jaggi. Landmark attention: Random-access infinite context length for transformers. In Workshop on Efficient Systems for Foundation Models @ ICML2023, 2023. URL https://openreview.net/forum?id=PkoGERXS1B

  60. [60]

Leave no context behind: Efficient infinite context transformers with infini-attention

    Tsendsuren Munkhdalai, Manaal Faruqui, and Siddharth Gopal. Leave no context behind: Efficient infinite context transformers with infini-attention. arXiv preprint arXiv:2404.07143, 101, 2024

  61. [61]

Decentralized Estimation and Control for Multisensor Systems

    Arthur G. O. Mutambara. Decentralized Estimation and Control for Multisensor Systems. CRC Press, 1998

  62. [62]

    Expansion span: Combining fading memory and retrieval in hybrid state space models

    Elvis Nunez, Luca Zancato, Benjamin Bowman, Aditya Golatkar, Wei Xia, and Stefano Soatto. Expansion span: Combining fading memory and retrieval in hybrid state space models. In International Conference on Neuro-symbolic Systems, pages 570--596. PMLR, 2025

  63. [63]

    In-context Learning and Induction Heads

    Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022

  64. [64]

    Resurrecting recurrent neural networks for long sequences

    Antonio Orvieto, Samuel L Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, and Soham De. Resurrecting recurrent neural networks for long sequences. In International Conference on Machine Learning, pages 26670--26698. PMLR, 2023

  65. [65]

Marconi: Prefix caching for the era of hybrid LLMs

    Rui Pan, Zhuang Wang, Zhen Jia, Can Karakus, Luca Zancato, Tri Dao, Yida Wang, and Ravi Netravali. Marconi: Prefix caching for the era of hybrid LLMs. In Eighth Conference on Machine Learning and Systems, 2025. URL https://openreview.net/forum?id=RUaMUu7vMX

  66. [66]

The Berkeley Function Calling Leaderboard (BFCL): From tool use to agentic evaluation of large language models

    Shishir G. Patil, Huanzhi Mao, Charlie Cheng-Jie Ji, Fanjia Yan, Vishnu Suresh, Ion Stoica, and Joseph E. Gonzalez. The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models. In Advances in Neural Information Processing Systems, 2024

  67. [67]

    Time-Varying Systems and Computations

Patrick Dewilde and Alle-Jan van der Veen. Time-Varying Systems and Computations. Springer New York, NY, 1998. doi:10.1007/978-1-4757-2817-0

  68. [68]

YaRN: Efficient context window extension of large language models

    Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. YaRN: Efficient context window extension of large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=wHBfxhZu1u

  69. [69]

Gated KalmaNet: A fading memory layer through test-time ridge regression

    Liangzu Peng, Aditya Chattopadhyay, Luca Zancato, Elvis Nunez, Wei Xia, and Stefano Soatto. Gated kalmanet: A fading memory layer through test-time ridge regression. arXiv preprint arXiv:2511.21016, 2025

  70. [70]

    Generalizing verifiable instruction following, 2025

    Valentina Pyatkin, Saumya Malik, Victoria Graf, Hamish Ivison, Shengyi Huang, Pradeep Dasigi, Nathan Lambert, and Hannaneh Hajishirzi. Generalizing verifiable instruction following, 2025

  71. [71]

    Ulysses sequence parallelism in the hugging face ecosystem, 2025

    Kashif Rasul and Bekman Stas. Ulysses sequence parallelism in the hugging face ecosystem, 2025. URL https://huggingface.co/blog/ulysses-sp. Blog post

  72. [72]

GPQA: A graduate-level Google-proof Q&A benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=Ti67584b98

  73. [73]

    Samba: Simple hybrid state space models for efficient unlimited context language modeling

Liliang Ren, Yang Liu, Yadong Lu, Yelong Shen, Chen Liang, and Weizhu Chen. Samba: Simple hybrid state space models for efficient unlimited context language modeling. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=bIlnpVM4bc

  74. [74]

    Understanding and improving length generalization in recurrent models, 2025

    Ricardo Buitrago Ruiz and Albert Gu. Understanding and improving length generalization in recurrent models, 2025. URL https://arxiv.org/abs/2507.02782

  75. [75]

Balanced truncation of linear time-varying systems

    H. Sandberg and A. Rantzer. Balanced truncation of linear time-varying systems. IEEE Transactions on Automatic Control, 49(2):217--229, 2004. doi:10.1109/TAC.2003.822862

  76. [76]

Fundamentals of Adaptive Filtering

    A.H. Sayed. Fundamentals of Adaptive Filtering. IEEE Press. Wiley, 2003. ISBN 9780471461265. URL https://books.google.com/books?id=VaAV4uqMuKYC

  77. [77]

Adaptive Filters

    A.H. Sayed. Adaptive Filters. IEEE Press. Wiley, 2011. ISBN 9781118210840. URL https://books.google.com/books?id=VBaenqIVftUC

  78. [78]

    Flashattention-3: Fast and accurate attention with asynchrony and low-precision

    Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. Flashattention-3: Fast and accurate attention with asynchrony and low-precision. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=tVConYid20

  79. [79]

HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, EuroSys '25, page 1279–1297, New York, NY, USA, 2025. Association for Computing Machinery. ISBN 9798400711961. doi:1...

  80. [80]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019

Showing first 80 references.