pith. machine review for the scientific record.

arxiv: 2604.19877 · v1 · submitted 2026-04-21 · 💻 cs.LG

Recognition: unknown

Super Apriel: One Checkpoint, Many Speeds

Alireza Mousavi-Hosseini, Aman Tiwari, Denis Kocetkov, Joel Lamy Poirier, Kelechi Ogueji, Nanda H Krishna, Rafael Pardinas, Raymond Li, Sathwik Tejaswi Madhusudhan, Shruthan Radhakrishna, SLAM Labs: Oleksiy Ostapenko, Srinivas Sunkara, Torsten Scholak, Valerie Becaert

Pith reviewed 2026-05-10 02:22 UTC · model grok-4.3

classification 💻 cs.LG
keywords supernet · attention mixers · inference serving · speed-quality tradeoff · model distillation · transformer layers · speculative decoding

The pith

A single 15B supernet checkpoint supports multiple inference speed presets by switching per-layer mixer choices at serving time without reloading weights.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that training one supernet with four mixer options available in every decoder layer creates a shared set of weights from which one mixer can be selected per layer to form a placement. These placements can be changed between requests during serving because no weights need to be swapped, producing a range of speed-quality operating points from a single checkpoint. The full-attention placement performs on par with the original teacher model across benchmarks, while hybrid placements deliver 2.9× to 10.7× higher decode throughput at 96% down to 77% quality retention, with larger gains at longer contexts. A surrogate predictor navigates the enormous space of possible placements and locates strong tradeoffs at each target speed.

Core claim

Super Apriel is a 15B-parameter supernet in which every decoder layer provides four trained mixer choices—Full Attention, Sliding Window Attention, Kimi Delta Attention, and Gated DeltaNet—so that a placement formed by selecting one mixer per layer can be activated at inference time without any weight reloading, thereby enabling multiple speed presets from one checkpoint; the all-Full-Attention placement matches the Apriel 1.6 teacher on all reported benchmarks, and recommended hybrid placements span 2.9× to 10.7× decode throughput at 96% to 77% quality retention.

What carries the argument

A supernet placement: the assignment of one mixer out of the four trained options to each decoder layer, which can be switched at serving time because all mixers share the same underlying weights.
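
To make the switching mechanism concrete, here is a minimal sketch of what a placement and its runtime activation could look like. The names (Placement, set_placement, active_mixer) are illustrative assumptions, not the paper's released Fast-LLM or vLLM code.

```python
# Illustrative only: a placement is one mixer name per decoder layer,
# and "switching" is pure routing over weights that are already loaded.
from dataclasses import dataclass

N_LAYERS = 48
MIXERS = ("FA", "SWA", "KDA", "GDN")  # the four trained options per layer

@dataclass(frozen=True)
class Placement:
    mixers: tuple  # length N_LAYERS, e.g. ("FA", "GDN", ..., "SWA")

    def __post_init__(self):
        assert len(self.mixers) == N_LAYERS
        assert all(m in MIXERS for m in self.mixers)

def set_placement(model, placement):
    # No weights are read from disk or copied between devices: every
    # branch already lives in the one loaded checkpoint, so activation
    # is an O(N_LAYERS) routing update done between requests.
    for layer, mixer in zip(model.decoder_layers, placement.mixers):
        layer.active_mixer = mixer

all_fa = Placement(("FA",) * N_LAYERS)  # the preset reported to match the teacher
```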

If this is right

  • The all-full-attention placement matches the teacher model on every benchmark reported.
  • Hybrid placements achieve 2.9× to 10.7× decode throughput gains while retaining 77% to 96% of quality.
  • The throughput advantage over the all-full-attention baseline grows as context length increases.
  • The shared checkpoint enables speculative decoding without requiring a separate draft model.
  • The surrogate predictor reduces the search over the large space of possible layer assignments to a tractable problem.
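
The last point is easy to make concrete. Below is a hedged sketch of a surrogate of this kind, assuming a one-hot per-layer encoding and a ridge regressor; the paper's actual surrogate, features, and cost model may differ.

```python
# Sketch of a surrogate-guided search over placements; encoding,
# regressor, and cost function here are assumptions, not the paper's
# placement optimization toolkit.
import numpy as np
from sklearn.linear_model import Ridge

N_LAYERS, MIXERS = 48, ("FA", "SWA", "KDA", "GDN")

def encode(placement):
    """One-hot encode a length-48 mixer assignment into a 192-dim vector."""
    x = np.zeros(N_LAYERS * len(MIXERS))
    for layer, mixer in enumerate(placement):
        x[layer * len(MIXERS) + MIXERS.index(mixer)] = 1.0
    return x

def fit_surrogate(evaluated_placements, measured_quality):
    """Fit on the small set of placements that were actually benchmarked."""
    X = np.stack([encode(p) for p in evaluated_placements])
    return Ridge(alpha=1.0).fit(X, np.asarray(measured_quality))

def best_under_budget(surrogate, candidates, cost_fn, budget):
    """Screen many candidates cheaply: highest predicted quality whose
    estimated decode cost fits the target speed preset."""
    feasible = [p for p in candidates if cost_fn(p) <= budget]
    scores = surrogate.predict(np.stack([encode(p) for p in feasible]))
    return feasible[int(np.argmax(scores))]
```

Because prediction is one matrix-vector product per candidate, large samples of the 4^48 configuration space can be screened without running the 15B model at all.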

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • A single deployed checkpoint could replace several separately trained models for different latency targets, lowering memory usage in serving clusters.
  • The observed instability of efficient placements at 15B scale suggests that supernet training may need explicit stabilization methods at larger model sizes.
  • The same per-layer choice mechanism could be applied to other model components such as feed-forward networks to create even more operating points.
  • Combining the runtime mixer switching with KV-cache optimizations or quantization would likely compound the speed benefits further.

Load-bearing premise

That placements selected by the surrogate predictor at 15B scale will deliver the stated quality retention and throughput numbers when used with real user inputs and production serving hardware.

What would settle it

Measure decode throughput and benchmark scores for one of the paper's recommended hybrid placements on a long-context task outside the original evaluation set and check whether the reported multipliers and retention percentages still hold.
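
A hedged sketch of that measurement follows; switch_fn and generate_fn stand in for whatever serving interface is available and are not the paper's released API.

```python
# Time decode throughput for two placements on the same long-context
# prompt; the interfaces are assumed, the arithmetic is the check itself.
import time

def decode_tokens_per_second(switch_fn, generate_fn, placement,
                             prompt_ids, max_new_tokens=512):
    switch_fn(placement)                     # runtime switch, no weight reload
    start = time.perf_counter()
    n_decoded = generate_fn(prompt_ids, max_new_tokens)  # returns token count
    return n_decoded / (time.perf_counter() - start)

# speedup = decode_tokens_per_second(sw, gen, hybrid, ids) \
#         / decode_tokens_per_second(sw, gen, all_fa, ids)
# Compare `speedup` and the accompanying task scores against the reported
# 2.9x-10.7x multipliers and 96%-77% retention range.
```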

read the original abstract

We release Super Apriel, a 15B-parameter supernet in which every decoder layer provides four trained mixer choices -- Full Attention (FA), Sliding Window Attention (SWA), Kimi Delta Attention (KDA), and Gated DeltaNet (GDN). A placement selects one mixer per layer; placements can be switched between requests at serving time without reloading weights, enabling multiple speed presets from a single checkpoint. The shared checkpoint also enables speculative decoding without a separate draft model. The all-FA preset matches the Apriel 1.6 teacher on all reported benchmarks; recommended hybrid presets span $2.9\times$ to $10.7\times$ decode throughput at 96% to 77% quality retention, with throughput advantages that compound at longer context lengths. With four mixer types across 48 layers, the configuration space is vast. A surrogate that predicts placement quality from the per-layer mixer assignment makes the speed-quality landscape tractable and identifies the best tradeoffs at each speed level. We investigate whether the best configurations at each speed level can be identified early in training or only after convergence. Rankings stabilize quickly at 0.5B scale, but the most efficient configurations exhibit higher instability at 15B, cautioning against extrapolation from smaller models. Super Apriel is trained by stochastic distillation from a frozen Apriel 1.6 teacher, followed by supervised fine-tuning. We release the supernet weights, Fast-LLM training code, vLLM serving code, and a placement optimization toolkit.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Super Apriel, a 15B-parameter supernet in which each of 48 decoder layers offers four trained mixer choices (Full Attention, Sliding Window Attention, Kimi Delta Attention, Gated DeltaNet). A placement assigns one mixer per layer; these placements can be switched between requests at serving time without reloading weights, yielding multiple speed presets from a single checkpoint. The all-FA preset matches the Apriel 1.6 teacher on reported benchmarks. Recommended hybrid placements deliver 2.9×–10.7× decode throughput at 96–77% quality retention, with gains compounding at longer contexts. A surrogate model predicts placement quality from the per-layer assignment to navigate the configuration space. The supernet is obtained via stochastic distillation from the frozen teacher followed by supervised fine-tuning. The authors release the weights, Fast-LLM training code, vLLM serving code, and a placement optimization toolkit.

Significance. If the central claims hold, the work offers a practical mechanism for serving a single large-model checkpoint under multiple latency/quality operating points by runtime mixer switching, which is valuable for production inference pipelines with heterogeneous request profiles. The release of weights, training/serving code, and the optimization toolkit is a clear strength that supports reproducibility and extension. The approach also demonstrates that speculative decoding can reuse the same checkpoint without a separate draft model. These contributions would be of interest to the efficient-inference community provided the surrogate-driven speed-quality points are shown to be reliable at the target scale.

major comments (3)
  1. [Abstract and surrogate-experiments section] The headline hybrid-preset claims (2.9×–10.7× throughput at 96–77% quality retention) rest on the surrogate identifying high-quality placements at 15B scale. The manuscript itself reports that “the most efficient configurations exhibit higher instability at 15B,” yet provides no direct benchmark results or error bars for the specific recommended hybrids; only surrogate predictions are cited. This gap directly undermines confidence in the operating points.
  2. [Investigation of early identification of configurations] The claim that rankings “stabilize quickly at 0.5B scale” is used to justify early identification of good placements, but the same paragraph notes elevated instability precisely for the efficient (high-speed) configurations at 15B. No ablation or correlation study is presented showing that the surrogate’s 15B predictions remain accurate for the low-quality-retention hybrids that deliver the largest speedups.
  3. [Benchmark and serving-results section] Quality-retention percentages are reported relative to the teacher on “all reported benchmarks,” but the manuscript supplies neither the exact task suite, number of evaluation runs, nor variance estimates. Without these, it is impossible to judge whether the 77% floor for the fastest preset is statistically distinguishable from the surrogate’s prediction error.
minor comments (2)
  1. [Mixer definitions] The definitions and architectural details of the newly introduced Kimi Delta Attention (KDA) and Gated DeltaNet (GDN) mixers are referenced but not fully specified in the main text; a concise appendix equation or diagram would improve clarity.
  2. [Serving implementation] The description of how placements are switched inside the vLLM serving path (kernel selection, KV-cache handling) is high-level; pseudocode or a short code snippet would help readers reproduce the “no weight reload” property.
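
For illustration of the property raised in the second minor comment, one hedged sketch of what a per-layer dispatch could look like: the stub module and flag names are assumptions, not the paper's vLLM implementation, and real mixers differ in kernels and in whether they keep a KV cache or a recurrent state.

```python
# Illustrative per-layer dispatch: all branches are instantiated from
# the one loaded checkpoint, so switching is a flag, never a reload.
import torch
import torch.nn as nn

class StubMixer(nn.Module):
    """Stand-in for a real mixer (FA / SWA / KDA / GDN)."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        return self.proj(x)

class SupernetDecoderLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.mixers = nn.ModuleDict(
            {m: StubMixer(dim) for m in ("FA", "SWA", "KDA", "GDN")})
        self.active_mixer = "FA"

    def forward(self, hidden_states):
        # Per-request routing: select the active branch; a production
        # path must also hand the matching cache/state to that branch.
        return self.mixers[self.active_mixer](hidden_states)

layer = SupernetDecoderLayer(dim=64)
x = torch.randn(1, 8, 64)
layer.active_mixer = "GDN"   # switch between requests, O(1) per layer
y = layer(x)
```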

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications based on the experiments reported in the manuscript. Where revisions are warranted, we indicate them explicitly.

read point-by-point responses
  1. Referee: [Abstract and surrogate-experiments section] The headline hybrid-preset claims (2.9×–10.7× throughput at 96–77% quality retention) rest on the surrogate identifying high-quality placements at 15B scale. The manuscript itself reports that “the most efficient configurations exhibit higher instability at 15B,” yet provides no direct benchmark results or error bars for the specific recommended hybrids; only surrogate predictions are cited. This gap directly undermines confidence in the operating points.

    Authors: We acknowledge the concern regarding the reliance on surrogate predictions for the headline hybrid claims at 15B scale. The manuscript explicitly notes the higher instability observed for the most efficient configurations at this scale, which is why we present the surrogate as a practical navigation tool rather than a perfect substitute. Direct evaluation of every recommended hybrid at full 15B scale was not performed due to computational cost; the surrogate was validated through smaller-scale ablations where direct comparisons were feasible. We will revise the abstract and surrogate-experiments section to more prominently discuss the surrogate's limitations, the observed instability, and the fact that the reported operating points are surrogate-derived predictions. This will include clearer caveats and, where available, confidence intervals. revision: partial

  2. Referee: [Investigation of early identification of configurations] The claim that rankings “stabilize quickly at 0.5B scale” is used to justify early identification of good placements, but the same paragraph notes elevated instability precisely for the efficient (high-speed) configurations at 15B. No ablation or correlation study is presented showing that the surrogate’s 15B predictions remain accurate for the low-quality-retention hybrids that deliver the largest speedups.

    Authors: The stabilization of rankings at 0.5B scale is an empirical observation from our scaling experiments, while the elevated instability at 15B for efficient placements is separately reported as a caution against naive extrapolation. We agree that an explicit correlation or ablation study focused on the low-quality-retention (high-speed) hybrids would strengthen the justification for using the surrogate at 15B. We will add a targeted discussion and, where data permits, a small-scale correlation analysis in the revised manuscript to address this gap directly. revision: yes

  3. Referee: [Benchmark and serving-results section] Quality-retention percentages are reported relative to the teacher on “all reported benchmarks,” but the manuscript supplies neither the exact task suite, number of evaluation runs, nor variance estimates. Without these, it is impossible to judge whether the 77% floor for the fastest preset is statistically distinguishable from the surrogate’s prediction error.

    Authors: We will revise the benchmark and serving-results section to explicitly enumerate the task suite (the same benchmarks used to validate the Apriel 1.6 teacher), the number of evaluation runs performed, and any available variance estimates. This will allow readers to assess the statistical reliability of the quality-retention figures relative to both the teacher and the surrogate's prediction uncertainty. revision: yes

standing simulated objections not resolved
  • Direct 15B-scale benchmark results and error bars for the specific recommended hybrid placements are unavailable, as these configurations were evaluated exclusively via the surrogate model owing to computational constraints.

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical training and evaluation

full rationale

The paper presents an empirical supernet trained via stochastic distillation from a frozen teacher followed by supervised fine-tuning. Placements are selected using a surrogate model trained on per-layer assignments, but the headline speed-quality numbers (2.9×–10.7× throughput at 96–77% retention) and benchmark matches are obtained from direct vLLM serving measurements and evaluations on reported tasks, not derived from the surrogate by construction. No equations reduce fitted parameters to predictions, no uniqueness theorems are invoked via self-citation, and no ansatz or renaming of known results occurs. Self-citation to the Apriel 1.6 teacher provides context but is not load-bearing for the new measurements, which remain independently verifiable via released weights and code. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The central claims depend on the empirical success of stochastic distillation into a shared-weight supernet and the predictive accuracy of the surrogate; the design assumes mixer compatibility within one set of weights.

free parameters (1)
  • surrogate model parameters
    The surrogate that predicts placement quality from mixer assignments is trained on data and therefore contains fitted parameters.
axioms (1)
  • domain assumption Different mixer choices can be activated at inference time within the same shared weights without additional training or reloading.
    This is required for the runtime switching benefit described in the abstract.
invented entities (2)
  • Kimi Delta Attention (KDA) no independent evidence
    purpose: One of the four mixer options per layer.
    Presented as a distinct attention mechanism available for placement selection.
  • Gated DeltaNet (GDN) no independent evidence
    purpose: One of the four mixer options per layer.
    Presented as a distinct attention mechanism available for placement selection.

pith-pipeline@v0.9.0 · 5650 in / 1542 out tokens · 60947 ms · 2026-05-10T02:22:59.173869+00:00 · methodology

