Recognition: unknown
Super Apriel: One Checkpoint, Many Speeds
Pith reviewed 2026-05-10 02:22 UTC · model grok-4.3
The pith
A single 15B supernet checkpoint supports multiple inference speed presets by switching per-layer mixer choices at serving time without reloading weights.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Super Apriel is a 15B-parameter supernet in which every decoder layer provides four trained mixer choices (Full Attention, Sliding Window Attention, Kimi Delta Attention, and Gated DeltaNet), so a placement formed by selecting one mixer per layer can be activated at inference time without any weight reloading, enabling multiple speed presets from one checkpoint. The all-Full-Attention placement matches the Apriel 1.6 teacher on all reported benchmarks, and recommended hybrid placements span 2.9× to 10.7× decode throughput at 96% to 77% quality retention.
What carries the argument
A supernet placement: the assignment of one mixer, out of the four trained options, to each decoder layer. Placements can be switched at serving time without reloading anything because every option's weights are already resident in the single shared checkpoint.
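To make the placement mechanism concrete, here is a minimal sketch of how a per-layer mixer assignment and request-time preset switching could be represented. It is illustrative only: the four mixer names come from the abstract, while Placement, PRESETS, set_placement, and generate are hypothetical stand-ins rather than the released serving API.

```python
from enum import Enum
from typing import Dict, List

class Mixer(Enum):
    FA = "full_attention"
    SWA = "sliding_window_attention"
    KDA = "kimi_delta_attention"
    GDN = "gated_deltanet"

# A placement assigns one mixer to each of the 48 decoder layers.
Placement = List[Mixer]

# Hypothetical presets: the all-FA baseline and a faster hybrid that keeps
# full attention in only every eighth layer.
PRESETS: Dict[str, Placement] = {
    "quality": [Mixer.FA] * 48,
    "fast": [Mixer.FA if i % 8 == 0 else Mixer.GDN for i in range(48)],
}

def serve_request(model, prompt: str, preset: str) -> str:
    # Switching presets only changes which already-loaded mixer each layer
    # dispatches to; no weights are reloaded between requests.
    model.set_placement(PRESETS[preset])  # hypothetical API
    return model.generate(prompt)         # hypothetical API
```

The point of the sketch is that a preset is just data: changing it touches a per-layer flag, not the checkpoint.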
If this is right
- The all-full-attention placement matches the teacher model on every benchmark reported.
- Hybrid placements achieve 2.9× to 10.7× decode throughput gains while retaining 77% to 96% of quality.
- The throughput advantage over the all-full-attention baseline grows as context length increases.
- The shared checkpoint enables speculative decoding without requiring a separate draft model.
- The surrogate predictor reduces the search over the large space of possible layer assignments to a tractable problem (a minimal sketch of such a predictor follows this list).
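The paper navigates the configuration space with a surrogate that predicts placement quality from the per-layer assignment; its exact form is not reproduced here, so the following is a hedged sketch of one simple way such a predictor could work, using one-hot layer features and closed-form ridge regression. The encoding, the ridge form, and the training pairs are illustrative assumptions, not the authors' method.

```python
import numpy as np

MIXERS = ["FA", "SWA", "KDA", "GDN"]
NUM_LAYERS = 48

def encode(placement):
    """One-hot encode a placement (list of 48 mixer names) into a flat vector."""
    x = np.zeros(NUM_LAYERS * len(MIXERS))
    for layer, mixer in enumerate(placement):
        x[layer * len(MIXERS) + MIXERS.index(mixer)] = 1.0
    return x

def fit_surrogate(placements, scores, ridge=1e-2):
    """Fit a ridge regression mapping placements to measured quality scores."""
    X = np.stack([encode(p) for p in placements])
    y = np.asarray(scores, dtype=float)
    d = X.shape[1]
    # Closed-form ridge solution: w = (X^T X + lambda I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + ridge * np.eye(d), X.T @ y)

def predict_quality(weights, placement):
    return float(encode(placement) @ weights)
```

A predictor of this shape makes it cheap to score candidate placements at every speed level and evaluate only the most promising ones directly.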
Where Pith is reading between the lines
- A single deployed checkpoint could replace several separately trained models for different latency targets, lowering memory usage in serving clusters.
- The observed instability of efficient placements at 15B scale suggests that scaling laws for supernet training may need explicit stabilization methods at larger sizes.
- The same per-layer choice mechanism could be applied to other model components such as feed-forward networks to create even more operating points.
- Combining the runtime mixer switching with KV-cache optimizations or quantization would likely compound the speed benefits further.
Load-bearing premise
That placements selected by the surrogate predictor at 15B scale will deliver the stated quality retention and throughput numbers when used with real user inputs and production serving hardware.
What would settle it
Measure decode throughput and benchmark scores for one of the paper's recommended hybrid placements on a long-context task outside the original evaluation set and check whether the reported multipliers and retention percentages still hold.
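As a sketch of what that check might look like in practice, the snippet below times decode throughput for a hybrid placement against the all-full-attention baseline and reports the multiplier and retention. The model methods (set_placement, generate) and the score_benchmark evaluator are hypothetical placeholders for whatever serving and evaluation stack is actually used.

```python
import time

def decode_throughput(model, prompts, max_new_tokens=512):
    """Decoded tokens per second over a batch of prompts (wall clock)."""
    start = time.perf_counter()
    generated = 0
    for prompt in prompts:
        tokens = model.generate(prompt, max_new_tokens=max_new_tokens)  # hypothetical API
        generated += len(tokens)
    return generated / (time.perf_counter() - start)

def compare_placements(model, prompts, score_benchmark, baseline, hybrid):
    """Report the throughput multiplier and quality retention of hybrid vs. baseline."""
    model.set_placement(baseline)                  # hypothetical API
    base_tps = decode_throughput(model, prompts)
    base_score = score_benchmark(model)            # caller-supplied evaluator

    model.set_placement(hybrid)
    hybrid_tps = decode_throughput(model, prompts)
    hybrid_score = score_benchmark(model)

    print(f"decode throughput multiplier: {hybrid_tps / base_tps:.2f}x")
    print(f"quality retention:            {100 * hybrid_score / base_score:.1f}%")
```

Running this on long-context prompts outside the original evaluation set would directly test whether the reported multipliers and retention percentages transfer.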
Original abstract
We release Super Apriel, a 15B-parameter supernet in which every decoder layer provides four trained mixer choices -- Full Attention (FA), Sliding Window Attention (SWA), Kimi Delta Attention (KDA), and Gated DeltaNet (GDN). A placement selects one mixer per layer; placements can be switched between requests at serving time without reloading weights, enabling multiple speed presets from a single checkpoint. The shared checkpoint also enables speculative decoding without a separate draft model. The all-FA preset matches the Apriel 1.6 teacher on all reported benchmarks; recommended hybrid presets span $2.9\times$ to $10.7\times$ decode throughput at 96% to 77% quality retention, with throughput advantages that compound at longer context lengths. With four mixer types across 48 layers, the configuration space is vast. A surrogate that predicts placement quality from the per-layer mixer assignment makes the speed-quality landscape tractable and identifies the best tradeoffs at each speed level. We investigate whether the best configurations at each speed level can be identified early in training or only after convergence. Rankings stabilize quickly at 0.5B scale, but the most efficient configurations exhibit higher instability at 15B, cautioning against extrapolation from smaller models. Super Apriel is trained by stochastic distillation from a frozen Apriel 1.6 teacher, followed by supervised fine-tuning. We release the supernet weights, Fast-LLM training code, vLLM serving code, and a placement optimization toolkit.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Super Apriel, a 15B-parameter supernet in which each of 48 decoder layers offers four trained mixer choices (Full Attention, Sliding Window Attention, Kimi Delta Attention, Gated DeltaNet). A placement assigns one mixer per layer; these placements can be switched between requests at serving time without reloading weights, yielding multiple speed presets from a single checkpoint. The all-FA preset matches the Apriel 1.6 teacher on reported benchmarks. Recommended hybrid placements deliver 2.9×–10.7× decode throughput at 96–77% quality retention, with gains compounding at longer contexts. A surrogate model predicts placement quality from the per-layer assignment to navigate the configuration space. The supernet is obtained via stochastic distillation from the frozen teacher followed by supervised fine-tuning. The authors release the weights, Fast-LLM training code, vLLM serving code, and a placement optimization toolkit.
Significance. If the central claims hold, the work offers a practical mechanism for serving a single large-model checkpoint under multiple latency/quality operating points by runtime mixer switching, which is valuable for production inference pipelines with heterogeneous request profiles. The release of weights, training/serving code, and the optimization toolkit is a clear strength that supports reproducibility and extension. The approach also demonstrates that speculative decoding can reuse the same checkpoint without a separate draft model. These contributions would be of interest to the efficient-inference community provided the surrogate-driven speed-quality points are shown to be reliable at the target scale.
major comments (3)
- [Abstract and surrogate-experiments section] The headline hybrid-preset claims (2.9×–10.7× throughput at 96–77% quality retention) rest on the surrogate identifying high-quality placements at 15B scale. The manuscript itself reports that “the most efficient configurations exhibit higher instability at 15B,” yet provides no direct benchmark results or error bars for the specific recommended hybrids; only surrogate predictions are cited. This gap directly undermines confidence in the stated operating points.
- [Investigation of early identification of configurations] The claim that rankings “stabilize quickly at 0.5B scale” is used to justify early identification of good placements, but the same paragraph notes elevated instability precisely for the efficient (high-speed) configurations at 15B. No ablation or correlation study is presented showing that the surrogate’s 15B predictions remain accurate for the low-quality-retention hybrids that deliver the largest speedups.
- [Benchmark and serving-results section] Quality-retention percentages are reported relative to the teacher on “all reported benchmarks,” but the manuscript supplies neither the exact task suite, number of evaluation runs, nor variance estimates. Without these, it is impossible to judge whether the 77% floor for the fastest preset is statistically distinguishable from the surrogate’s prediction error.
minor comments (2)
- [Mixer definitions] The definitions and architectural details of the newly introduced Kimi Delta Attention (KDA) and Gated DeltaNet (GDN) mixers are referenced but not fully specified in the main text; a concise appendix equation or diagram would improve clarity.
- [Serving implementation] The description of how placements are switched inside the vLLM serving path (kernel selection, KV-cache handling) is high-level; pseudocode or a short code snippet would help readers reproduce the “no weight reload” property.
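To make the “no weight reload” property concrete ahead of consulting the released vLLM serving code, here is a purely hypothetical sketch of per-layer dispatch: every layer keeps all four mixer modules resident, and applying a placement only flips which one runs. This is not the paper's implementation; kernel selection and per-mixer cache handling are only hinted at in comments.

```python
from typing import Dict, List

import torch
import torch.nn as nn

class SwitchableMixerLayer(nn.Module):
    """One decoder layer holding all four trained mixer options (hypothetical sketch)."""

    def __init__(self, mixers: Dict[str, nn.Module]):
        super().__init__()
        # All options (e.g. "FA", "SWA", "KDA", "GDN") stay resident in memory,
        # which is what makes switching free of weight reloads.
        self.mixers = nn.ModuleDict(mixers)
        self.active = "FA"

    def set_mixer(self, name: str) -> None:
        self.active = name  # flipped between requests by the active placement

    def forward(self, hidden_states: torch.Tensor, cache=None) -> torch.Tensor:
        # In a real serving path each mixer type would manage its own cache
        # (KV cache for attention variants, recurrent state for delta-rule
        # mixers) and select the matching kernel; omitted here.
        return self.mixers[self.active](hidden_states)

def apply_placement(layers: List[SwitchableMixerLayer], placement: List[str]) -> None:
    """Activate one mixer per layer without touching any weights."""
    for layer, mixer_name in zip(layers, placement):
        layer.set_mixer(mixer_name)
```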
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications based on the experiments reported in the manuscript. Where revisions are warranted, we indicate them explicitly.
Point-by-point responses
- Referee: [Abstract and surrogate-experiments section] The headline hybrid-preset claims (2.9×–10.7× throughput at 96–77% quality retention) rest on the surrogate identifying high-quality placements at 15B scale. The manuscript itself reports that “the most efficient configurations exhibit higher instability at 15B,” yet provides no direct benchmark results or error bars for the specific recommended hybrids; only surrogate predictions are cited. This gap directly undermines confidence in the stated operating points.
Authors: We acknowledge the concern regarding the reliance on surrogate predictions for the headline hybrid claims at 15B scale. The manuscript explicitly notes the higher instability observed for the most efficient configurations at this scale, which is why we present the surrogate as a practical navigation tool rather than a perfect substitute. Direct evaluation of every recommended hybrid at full 15B scale was not performed due to computational cost; the surrogate was validated through smaller-scale ablations where direct comparisons were feasible. We will revise the abstract and surrogate-experiments section to more prominently discuss the surrogate's limitations, the observed instability, and the fact that the reported operating points are surrogate-derived predictions. This will include clearer caveats on confidence intervals. revision: partial
- Referee: [Investigation of early identification of configurations] The claim that rankings “stabilize quickly at 0.5B scale” is used to justify early identification of good placements, but the same paragraph notes elevated instability precisely for the efficient (high-speed) configurations at 15B. No ablation or correlation study is presented showing that the surrogate’s 15B predictions remain accurate for the low-quality-retention hybrids that deliver the largest speedups.
Authors: The stabilization of rankings at 0.5B scale is an empirical observation from our scaling experiments, while the elevated instability at 15B for efficient placements is separately reported as a caution against naive extrapolation. We agree that an explicit correlation or ablation study focused on the low-quality-retention (high-speed) hybrids would strengthen the justification for using the surrogate at 15B. We will add a targeted discussion and, where data permits, a small-scale correlation analysis in the revised manuscript to address this gap directly. revision: yes
- Referee: [Benchmark and serving-results section] Quality-retention percentages are reported relative to the teacher on “all reported benchmarks,” but the manuscript supplies neither the exact task suite, number of evaluation runs, nor variance estimates. Without these, it is impossible to judge whether the 77% floor for the fastest preset is statistically distinguishable from the surrogate’s prediction error.
Authors: We will revise the benchmark and serving-results section to explicitly enumerate the task suite (the same benchmarks used to validate the Apriel 1.6 teacher), the number of evaluation runs performed, and any available variance estimates. This will allow readers to assess the statistical reliability of the quality-retention figures relative to both the teacher and the surrogate's prediction uncertainty. revision: yes
- Direct 15B-scale benchmark results and error bars for the specific recommended hybrid placements are unavailable, as these configurations were evaluated exclusively via the surrogate model owing to computational constraints.
Circularity Check
No significant circularity; claims rest on empirical training and evaluation
Full rationale
The paper presents an empirical supernet trained via stochastic distillation from a frozen teacher followed by supervised fine-tuning. Placements are selected using a surrogate model trained on per-layer assignments, but the headline speed-quality numbers (2.9×–10.7× throughput at 96–77% retention) and benchmark matches are obtained from direct vLLM serving measurements and evaluations on reported tasks, not derived from the surrogate by construction. No equations reduce fitted parameters to predictions, no uniqueness theorems are invoked via self-citation, and no ansatz or renaming of known results occurs. Self-citation to the Apriel 1.6 teacher provides context but is not load-bearing for the new measurements, which remain independently verifiable via released weights and code. The derivation chain is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- surrogate model parameters
axioms (1)
- domain assumption: Different mixer choices can be activated at inference time within the same shared weights without additional training or reloading.
invented entities (2)
- Kimi Delta Attention (KDA): no independent evidence
- Gated DeltaNet (GDN): no independent evidence