pith. machine review for the scientific record.

arxiv: 2604.19877 · v1 · submitted 2026-04-21 · 💻 cs.LG

Recognition: unknown

Super Apriel: One Checkpoint, Many Speeds

Alireza Mousavi-Hosseini, Aman Tiwari, Denis Kocetkov, Joel Lamy Poirier, Kelechi Ogueji, Nanda H Krishna, Rafael Pardinas, Raymond Li, Sathwik Tejaswi Madhusudhan, Shruthan Radhakrishna, SLAM Labs: Oleksiy Ostapenko, Srinivas Sunkara, Torsten Scholak, Valerie Becaert

Pith reviewed 2026-05-10 02:22 UTC · model grok-4.3

classification 💻 cs.LG
keywords supernet · attention mixers · inference serving · speed-quality tradeoff · model distillation · transformer layers · speculative decoding

The pith

A single 15B supernet checkpoint supports multiple inference speed presets by switching per-layer mixer choices at serving time without reloading weights.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that training one supernet with four mixer options available in every decoder layer creates a shared set of weights from which one mixer can be selected per layer to form a placement. These placements can be changed between requests during serving because no weights need to be swapped, producing a range of speed-quality operating points from a single checkpoint. The full-attention placement performs on par with the original teacher model across benchmarks, while hybrid placements deliver 2.9× to 10.7× higher decode throughput at 96% down to 77% quality retention, with larger gains at longer contexts. A surrogate predictor navigates the enormous space of possible placements and locates strong tradeoffs at each target speed.

Core claim

Super Apriel is a 15B-parameter supernet in which every decoder layer provides four trained mixer choices—Full Attention, Sliding Window Attention, Kimi Delta Attention, and Gated DeltaNet—so that a placement formed by selecting one mixer per layer can be activated at inference time without any weight reloading, thereby enabling multiple speed presets from one checkpoint; the all-Full-Attention placement matches the Apriel 1.6 teacher on all reported benchmarks, and recommended hybrid placements span 2.9× to 10.7× decode throughput at 96% to 77% quality retention.

What carries the argument

A supernet placement: the assignment of one mixer out of the four trained options to each decoder layer, which can be switched at serving time because all mixers share the same underlying weights.
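
To make the switching mechanism concrete, here is a minimal sketch of what a placement and its runtime activation could look like. The names (Placement, set_placement, active_mixer) are illustrative assumptions, not the paper's released Fast-LLM or vLLM code.

```python
# Illustrative only: a placement is one mixer name per decoder layer,
# and "switching" is pure routing over weights that are already loaded.
from dataclasses import dataclass

N_LAYERS = 48
MIXERS = ("FA", "SWA", "KDA", "GDN")  # the four trained options per layer

@dataclass(frozen=True)
class Placement:
    mixers: tuple  # length N_LAYERS, e.g. ("FA", "GDN", ..., "SWA")

    def __post_init__(self):
        assert len(self.mixers) == N_LAYERS
        assert all(m in MIXERS for m in self.mixers)

def set_placement(model, placement):
    # No weights are read from disk or copied between devices: every
    # branch already lives in the one loaded checkpoint, so activation
    # is an O(N_LAYERS) routing update done between requests.
    for layer, mixer in zip(model.decoder_layers, placement.mixers):
        layer.active_mixer = mixer

all_fa = Placement(("FA",) * N_LAYERS)  # the preset reported to match the teacher
```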

If this is right

  • The all-full-attention placement matches the teacher model on every benchmark reported.
  • Hybrid placements achieve 2.9× to 10.7× decode throughput gains while retaining 77% to 96% of quality.
  • The throughput advantage over the all-full-attention baseline grows as context length increases.
  • The shared checkpoint enables speculative decoding without requiring a separate draft model.
  • The surrogate predictor reduces the search over the large space of possible layer assignments to a tractable problem.
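
The last point is easy to make concrete. Below is a hedged sketch of a surrogate of this kind, assuming a one-hot per-layer encoding and a ridge regressor; the paper's actual surrogate, features, and cost model may differ.

```python
# Sketch of a surrogate-guided search over placements; encoding,
# regressor, and cost function here are assumptions, not the paper's
# placement optimization toolkit.
import numpy as np
from sklearn.linear_model import Ridge

N_LAYERS, MIXERS = 48, ("FA", "SWA", "KDA", "GDN")

def encode(placement):
    """One-hot encode a length-48 mixer assignment into a 192-dim vector."""
    x = np.zeros(N_LAYERS * len(MIXERS))
    for layer, mixer in enumerate(placement):
        x[layer * len(MIXERS) + MIXERS.index(mixer)] = 1.0
    return x

def fit_surrogate(evaluated_placements, measured_quality):
    """Fit on the small set of placements that were actually benchmarked."""
    X = np.stack([encode(p) for p in evaluated_placements])
    return Ridge(alpha=1.0).fit(X, np.asarray(measured_quality))

def best_under_budget(surrogate, candidates, cost_fn, budget):
    """Screen many candidates cheaply: highest predicted quality whose
    estimated decode cost fits the target speed preset."""
    feasible = [p for p in candidates if cost_fn(p) <= budget]
    scores = surrogate.predict(np.stack([encode(p) for p in feasible]))
    return feasible[int(np.argmax(scores))]
```

Because prediction is one matrix-vector product per candidate, large samples of the 4^48 configuration space can be screened without running the 15B model at all.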

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • A single deployed checkpoint could replace several separately trained models for different latency targets, lowering memory usage in serving clusters.
  • The observed instability of efficient placements at 15B scale suggests that supernet training may need explicit stabilization methods at larger model sizes.
  • The same per-layer choice mechanism could be applied to other model components such as feed-forward networks to create even more operating points.
  • Combining the runtime mixer switching with KV-cache optimizations or quantization would likely compound the speed benefits further.

Load-bearing premise

That placements selected by the surrogate predictor at 15B scale will deliver the stated quality retention and throughput numbers when used with real user inputs and production serving hardware.

What would settle it

Measure decode throughput and benchmark scores for one of the paper's recommended hybrid placements on a long-context task outside the original evaluation set and check whether the reported multipliers and retention percentages still hold.
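
A hedged sketch of that measurement follows; switch_fn and generate_fn stand in for whatever serving interface is available and are not the paper's released API.

```python
# Time decode throughput for two placements on the same long-context
# prompt; the interfaces are assumed, the arithmetic is the check itself.
import time

def decode_tokens_per_second(switch_fn, generate_fn, placement,
                             prompt_ids, max_new_tokens=512):
    switch_fn(placement)                     # runtime switch, no weight reload
    start = time.perf_counter()
    n_decoded = generate_fn(prompt_ids, max_new_tokens)  # returns token count
    return n_decoded / (time.perf_counter() - start)

# speedup = decode_tokens_per_second(sw, gen, hybrid, ids) \
#         / decode_tokens_per_second(sw, gen, all_fa, ids)
# Compare `speedup` and the accompanying task scores against the reported
# 2.9x-10.7x multipliers and 96%-77% retention range.
```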

read the original abstract

We release Super Apriel, a 15B-parameter supernet in which every decoder layer provides four trained mixer choices -- Full Attention (FA), Sliding Window Attention (SWA), Kimi Delta Attention (KDA), and Gated DeltaNet (GDN). A placement selects one mixer per layer; placements can be switched between requests at serving time without reloading weights, enabling multiple speed presets from a single checkpoint. The shared checkpoint also enables speculative decoding without a separate draft model. The all-FA preset matches the Apriel 1.6 teacher on all reported benchmarks; recommended hybrid presets span $2.9\times$ to $10.7\times$ decode throughput at 96% to 77% quality retention, with throughput advantages that compound at longer context lengths. With four mixer types across 48 layers, the configuration space is vast. A surrogate that predicts placement quality from the per-layer mixer assignment makes the speed-quality landscape tractable and identifies the best tradeoffs at each speed level. We investigate whether the best configurations at each speed level can be identified early in training or only after convergence. Rankings stabilize quickly at 0.5B scale, but the most efficient configurations exhibit higher instability at 15B, cautioning against extrapolation from smaller models. Super Apriel is trained by stochastic distillation from a frozen Apriel 1.6 teacher, followed by supervised fine-tuning. We release the supernet weights, Fast-LLM training code, vLLM serving code, and a placement optimization toolkit.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Super Apriel, a 15B-parameter supernet in which each of 48 decoder layers offers four trained mixer choices (Full Attention, Sliding Window Attention, Kimi Delta Attention, Gated DeltaNet). A placement assigns one mixer per layer; these placements can be switched between requests at serving time without reloading weights, yielding multiple speed presets from a single checkpoint. The all-FA preset matches the Apriel 1.6 teacher on reported benchmarks. Recommended hybrid placements deliver 2.9×–10.7× decode throughput at 96–77% quality retention, with gains compounding at longer contexts. A surrogate model predicts placement quality from the per-layer assignment to navigate the configuration space. The supernet is obtained via stochastic distillation from the frozen teacher followed by supervised fine-tuning. The authors release the weights, Fast-LLM training code, vLLM serving code, and a placement optimization toolkit.

Significance. If the central claims hold, the work offers a practical mechanism for serving a single large-model checkpoint under multiple latency/quality operating points by runtime mixer switching, which is valuable for production inference pipelines with heterogeneous request profiles. The release of weights, training/serving code, and the optimization toolkit is a clear strength that supports reproducibility and extension. The approach also demonstrates that speculative decoding can reuse the same checkpoint without a separate draft model. These contributions would be of interest to the efficient-inference community provided the surrogate-driven speed-quality points are shown to be reliable at the target scale.

major comments (3)
  1. [Abstract and surrogate-experiments section] The headline hybrid-preset claims (2.9×–10.7× throughput at 96–77% quality retention) rest on the surrogate identifying high-quality placements at 15B scale. The manuscript itself reports that “the most efficient configurations exhibit higher instability at 15B,” yet provides no direct benchmark results or error bars for the specific recommended hybrids; only surrogate predictions are cited. This gap directly undermines confidence in the operating points.
  2. [Investigation of early identification of configurations] The claim that rankings “stabilize quickly at 0.5B scale” is used to justify early identification of good placements, but the same paragraph notes elevated instability precisely for the efficient (high-speed) configurations at 15B. No ablation or correlation study is presented showing that the surrogate’s 15B predictions remain accurate for the low-quality-retention hybrids that deliver the largest speedups.
  3. [Benchmark and serving-results section] Quality-retention percentages are reported relative to the teacher on “all reported benchmarks,” but the manuscript supplies neither the exact task suite, number of evaluation runs, nor variance estimates. Without these, it is impossible to judge whether the 77% floor for the fastest preset is statistically distinguishable from the surrogate’s prediction error.
minor comments (2)
  1. [Mixer definitions] The definitions and architectural details of the newly introduced Kimi Delta Attention (KDA) and Gated DeltaNet (GDN) mixers are referenced but not fully specified in the main text; a concise appendix equation or diagram would improve clarity.
  2. [Serving implementation] The description of how placements are switched inside the vLLM serving path (kernel selection, KV-cache handling) is high-level; pseudocode or a short code snippet would help readers reproduce the “no weight reload” property.
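
For illustration of the property raised in the second minor comment, one hedged sketch of what a per-layer dispatch could look like: the stub module and flag names are assumptions, not the paper's vLLM implementation, and real mixers differ in kernels and in whether they keep a KV cache or a recurrent state.

```python
# Illustrative per-layer dispatch: all branches are instantiated from
# the one loaded checkpoint, so switching is a flag, never a reload.
import torch
import torch.nn as nn

class StubMixer(nn.Module):
    """Stand-in for a real mixer (FA / SWA / KDA / GDN)."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        return self.proj(x)

class SupernetDecoderLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.mixers = nn.ModuleDict(
            {m: StubMixer(dim) for m in ("FA", "SWA", "KDA", "GDN")})
        self.active_mixer = "FA"

    def forward(self, hidden_states):
        # Per-request routing: select the active branch; a production
        # path must also hand the matching cache/state to that branch.
        return self.mixers[self.active_mixer](hidden_states)

layer = SupernetDecoderLayer(dim=64)
x = torch.randn(1, 8, 64)
layer.active_mixer = "GDN"   # switch between requests, O(1) per layer
y = layer(x)
```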

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications based on the experiments reported in the manuscript. Where revisions are warranted, we indicate them explicitly.

read point-by-point responses
  1. Referee: [Abstract and surrogate-experiments section] The headline hybrid-preset claims (2.9×–10.7× throughput at 96–77% quality retention) rest on the surrogate identifying high-quality placements at 15B scale. The manuscript itself reports that “the most efficient configurations exhibit higher instability at 15B,” yet provides no direct benchmark results or error bars for the specific recommended hybrids; only surrogate predictions are cited. This gap directly undermines confidence in the operating points.

    Authors: We acknowledge the concern regarding the reliance on surrogate predictions for the headline hybrid claims at 15B scale. The manuscript explicitly notes the higher instability observed for the most efficient configurations at this scale, which is why we present the surrogate as a practical navigation tool rather than a perfect substitute. Direct evaluation of every recommended hybrid at full 15B scale was not performed due to computational cost; the surrogate was validated through smaller-scale ablations where direct comparisons were feasible. We will revise the abstract and surrogate-experiments section to more prominently discuss the surrogate's limitations, the observed instability, and the fact that the reported operating points are surrogate-derived predictions. This will include clearer caveats and, where available, confidence intervals. revision: partial

  2. Referee: [Investigation of early identification of configurations] The claim that rankings “stabilize quickly at 0.5B scale” is used to justify early identification of good placements, but the same paragraph notes elevated instability precisely for the efficient (high-speed) configurations at 15B. No ablation or correlation study is presented showing that the surrogate’s 15B predictions remain accurate for the low-quality-retention hybrids that deliver the largest speedups.

    Authors: The stabilization of rankings at 0.5B scale is an empirical observation from our scaling experiments, while the elevated instability at 15B for efficient placements is separately reported as a caution against naive extrapolation. We agree that an explicit correlation or ablation study focused on the low-quality-retention (high-speed) hybrids would strengthen the justification for using the surrogate at 15B. We will add a targeted discussion and, where data permits, a small-scale correlation analysis in the revised manuscript to address this gap directly. revision: yes

  3. Referee: [Benchmark and serving-results section] Quality-retention percentages are reported relative to the teacher on “all reported benchmarks,” but the manuscript supplies neither the exact task suite, number of evaluation runs, nor variance estimates. Without these, it is impossible to judge whether the 77% floor for the fastest preset is statistically distinguishable from the surrogate’s prediction error.

    Authors: We will revise the benchmark and serving-results section to explicitly enumerate the task suite (the same benchmarks used to validate the Apriel 1.6 teacher), the number of evaluation runs performed, and any available variance estimates. This will allow readers to assess the statistical reliability of the quality-retention figures relative to both the teacher and the surrogate's prediction uncertainty. revision: yes

standing simulated objections not resolved
  • Direct 15B-scale benchmark results and error bars for the specific recommended hybrid placements are unavailable, as these configurations were evaluated exclusively via the surrogate model owing to computational constraints.

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical training and evaluation

full rationale

The paper presents an empirical supernet trained via stochastic distillation from a frozen teacher followed by supervised fine-tuning. Placements are selected using a surrogate model trained on per-layer assignments, but the headline speed-quality numbers (2.9×–10.7× throughput at 96–77% retention) and benchmark matches are obtained from direct vLLM serving measurements and evaluations on reported tasks, not derived from the surrogate by construction. No equations reduce fitted parameters to predictions, no uniqueness theorems are invoked via self-citation, and no ansatz or renaming of known results occurs. Self-citation to the Apriel 1.6 teacher provides context but is not load-bearing for the new measurements, which remain independently verifiable via released weights and code. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The central claims depend on the empirical success of stochastic distillation into a shared-weight supernet and the predictive accuracy of the surrogate; the design assumes mixer compatibility within one set of weights.

free parameters (1)
  • surrogate model parameters
    The surrogate that predicts placement quality from mixer assignments is trained on data and therefore contains fitted parameters.
axioms (1)
  • domain assumption Different mixer choices can be activated at inference time within the same shared weights without additional training or reloading.
    This is required for the runtime switching benefit described in the abstract.
invented entities (2)
  • Kimi Delta Attention (KDA) no independent evidence
    purpose: One of the four mixer options per layer.
    Presented as a distinct attention mechanism available for placement selection.
  • Gated DeltaNet (GDN) no independent evidence
    purpose: One of the four mixer options per layer.
    Presented as a distinct attention mechanism available for placement selection.

pith-pipeline@v0.9.0 · 5650 in / 1542 out tokens · 60947 ms · 2026-05-10T02:22:59.173869+00:00 · methodology

