arXiv: 2605.27358 · v1 · pith:QO7HYVEZnew · submitted 2026-05-26 · 💻 cs.LG · cs.AI · cs.CL

MobileMoE: Scaling On-Device Mixture of Experts

Pith reviewed 2026-06-29 18:46 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords MobileMoE · Mixture of Experts · on-device LLMs · scaling law · efficient inference · smartphone deployment · sub-billion parameters · Pareto frontier

The pith

MobileMoE models match leading dense LLMs on benchmarks while using 2-4 times fewer inference FLOPs via a mobile-optimized MoE scaling law.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to demonstrate that Mixture-of-Experts models can be scaled effectively to sub-billion active parameters for on-device deployment, delivering better efficiency than dense alternatives under mobile constraints. It derives an on-device scaling law that identifies moderate sparsity with fine-grained and shared experts as the joint memory and compute optimum, then trains models through a four-stage process on open datasets. These models achieve comparable or superior results to dense LLMs and other MoEs while supporting practical inference on smartphones. A reader would care because current on-device LLMs are constrained by compute and memory, and this approach could expand accessible AI capabilities without relying on servers.

Core claim

MobileMoE establishes a new Pareto frontier for on-device LLMs with models having 0.3-0.9B active parameters and 1.3-5.3B total parameters that match or exceed leading dense models with 2-4× fewer inference FLOPs and match or surpass OLMoE-1B-7B with up to 60% fewer parameters; this is achieved by formulating an on-device MoE scaling law that jointly optimizes architecture under mobile memory and compute constraints to identify moderate sparsity with fine-grained and shared experts as the sweet spot, followed by a four-stage training recipe of pre-training, mid-training, instruction fine-tuning, and quantization-aware training on open-source datasets, culminating in the first efficient MoE i

What carries the argument

The on-device MoE scaling law, which jointly optimizes MoE architecture under mobile memory and compute constraints to identify moderate sparsity with fine-grained and shared experts as the memory- and compute-optimal configuration.
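
To make the terms concrete, here is a minimal sketch of how total and active feed-forward parameters diverge in an MoE layer that combines fine-grained routed experts with always-on shared experts. The layer shape (a gated FFN with three weight matrices per expert), the names `MoEFFNConfig` and `ffn_params`, and every numeric value are illustrative assumptions, not the paper's configuration or its scaling law.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEFFNConfig:
    """Hypothetical MoE feed-forward layer shape (not taken from the paper)."""
    d_model: int   # hidden size
    d_expert: int  # intermediate size of ONE expert (small means fine-grained)
    n_routed: int  # routed experts that must be stored in memory
    top_k: int     # routed experts activated per token
    n_shared: int  # shared experts applied to every token


def ffn_params(cfg: MoEFFNConfig) -> tuple[int, int]:
    """Return (total, active) FFN weights for one layer.

    Assumes three d_model x d_expert matrices per expert and ignores the
    router, attention, and embeddings.
    """
    per_expert = 3 * cfg.d_model * cfg.d_expert
    total = (cfg.n_routed + cfg.n_shared) * per_expert   # drives memory
    active = (cfg.top_k + cfg.n_shared) * per_expert     # drives per-token compute
    return total, active


# Illustrative values only.
cfg = MoEFFNConfig(d_model=1024, d_expert=512, n_routed=16, top_k=3, n_shared=1)
total, active = ffn_params(cfg)
sparsity = total / active  # stored-to-active ratio
print(f"total={total:,} active={active:,} ratio={sparsity:.2f}x")
```

Under these assumptions, memory cost tracks `total` while per-token compute tracks `active`, which is the tension a joint memory and compute optimum has to resolve.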

If this is right

  • MobileMoE models match or exceed leading dense on-device LLMs across 14 benchmarks with 2-4× fewer inference FLOPs.
  • They match or surpass the state-of-the-art MoE OLMoE-1B-7B while using up to 60% fewer parameters.
  • At comparable INT4 weight memory, MobileMoE-S delivers 1.8-3.8× faster prefill and 2.2-3.4× faster decode than the dense baseline on smartphones.
  • The four-stage training recipe enables efficient deployment of sub-billion active parameter MoEs on commodity mobile devices.
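
As a rough back-of-the-envelope sketch of the quantities in the bullets above, the snippet below estimates INT4 weight storage from total parameters and decode compute from active parameters. It assumes exactly 4 bits per weight (no quantization scales or zero-points) and the common rule of thumb of about 2 FLOPs per active weight per generated token; pairing the low ends and the high ends of the reported parameter ranges is also an assumption, since this page does not list per-model sizes.

```python
def int4_weight_gib(total_params: float) -> float:
    """Weight storage in GiB at 4 bits per parameter, ignoring quantization metadata."""
    return total_params * 0.5 / 2**30


def decode_flops_per_token(active_params: float) -> float:
    """Rule-of-thumb estimate: about 2 FLOPs per active weight per generated token."""
    return 2.0 * active_params


# (active, total) pairs built from the range endpoints quoted on this page.
# Which active size goes with which total size is an assumption.
endpoints = {
    "low end": (0.3e9, 1.3e9),
    "high end": (0.9e9, 5.3e9),
}

for name, (active, total) in endpoints.items():
    gib = int4_weight_gib(total)
    gflops = decode_flops_per_token(active) / 1e9
    print(f"{name}: ~{gib:.2f} GiB INT4 weights, ~{gflops:.1f} GFLOPs per decoded token")
```

The estimate deliberately leaves out the KV cache, activations, and attention cost over long contexts, so it is a lower bound on real device usage rather than a profiling substitute.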

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The scaling law could be applied to derive architectures for other edge devices such as tablets or wearables with different memory hierarchies.
  • The moderate-sparsity design might reduce peak power draw in battery-constrained settings compared to dense models of similar accuracy.
  • Quantization-aware training combined with MoE routing could be extended to support dynamic expert selection based on real-time device load.

Load-bearing premise

The identified sweet spot of moderate sparsity with fine-grained and shared experts in the scaling law remains optimal and generalizable beyond the specific model sizes and datasets tested.

What would settle it

Train a set of on-device MoE variants with varying sparsity levels on identical mobile hardware and datasets. The claim would be refuted if a high-sparsity or low-sparsity configuration achieved strictly better benchmark accuracy per inference FLOP or per watt than the moderate-sparsity models.
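
One way to operationalize this test is a Pareto-dominance check over measured (compute, accuracy) pairs. The sketch below is an assumption about how such a comparison could be coded; the `Run` type, the helper names, and all numbers are placeholders for illustration and are not measurements from the paper.

```python
from typing import NamedTuple


class Run(NamedTuple):
    name: str
    gflops_per_token: float  # lower is better
    accuracy: float          # higher is better


def dominates(a: Run, b: Run) -> bool:
    """True if `a` is no worse than `b` on both axes and strictly better on at least one."""
    no_worse = a.gflops_per_token <= b.gflops_per_token and a.accuracy >= b.accuracy
    strictly = a.gflops_per_token < b.gflops_per_token or a.accuracy > b.accuracy
    return no_worse and strictly


def pareto_front(runs: list[Run]) -> list[Run]:
    """Keep every run that no other run dominates."""
    return [r for r in runs if not any(dominates(o, r) for o in runs)]


# Placeholder numbers for illustration only.
runs = [
    Run("low-sparsity", 1.8, 0.52),
    Run("moderate-sparsity", 1.2, 0.55),
    Run("high-sparsity", 0.6, 0.49),
]
front = pareto_front(runs)
moderate = runs[1]
refuted = any(dominates(o, moderate) for o in runs)
print([r.name for r in front], "sweet-spot claim refuted:", refuted)
```

With real measurements, the same check could use energy per token in place of FLOPs to cover the per-watt variant of the test.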

Original abstract

Mixture-of-Experts (MoE) has become the de facto architecture for hundred-billion-parameter language models, yet its advantages at sub-billion scales for on-device deployment remain largely unexplored. To close this gap, we present MobileMoE, a family of on-device MoE language models with sub-billion active parameters (0.3-0.9B active and 1.3-5.3B total) that establish a new Pareto frontier for on-device LLMs. We first formulate an on-device MoE scaling law that jointly optimizes MoE architecture under mobile memory and compute constraints, identifying an on-device sweet spot - moderate sparsity with fine-grained and shared experts - that is simultaneously memory and compute-optimal. Building on the derived architectures, we train MobileMoE with a four-stage recipe covering pre-training, mid-training, instruction fine-tuning, and quantization-aware training, all on open-source datasets. Across 14 benchmarks, MobileMoE matches or exceeds leading on-device dense LLMs with 2-4× fewer inference FLOPs, and matches or surpasses the state-of-the-art MoE OLMoE-1B-7B with up to 60% fewer parameters. To bridge the last mile to mobile deployment, we provide the first efficient MoE inference on commodity smartphones with comprehensive on-device profiling. At comparable INT4 weight memory, MobileMoE-S delivers 1.8-3.8× faster prefill and 2.2-3.4× faster decode than the dense baseline MobileLLM-Pro.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces MobileMoE, a family of sub-billion active-parameter (0.3-0.9B active, 1.3-5.3B total) Mixture-of-Experts language models for on-device deployment. It formulates an on-device MoE scaling law that jointly optimizes architecture under mobile memory/compute constraints, identifies moderate sparsity with fine-grained and shared experts as the sweet spot, trains the resulting models via a four-stage recipe on open datasets, and reports that the models match or exceed leading dense on-device LLMs with 2-4× fewer inference FLOPs while surpassing OLMoE-1B-7B with up to 60% fewer parameters; it further demonstrates the first efficient MoE inference on commodity smartphones with 1.8-3.8× faster prefill and 2.2-3.4× faster decode than a dense baseline at comparable INT4 memory.

Significance. If the scaling law and empirical results hold under broader validation, the work would be significant for on-device LLM design by providing a principled route to sparse architectures that improve the memory-compute Pareto frontier. The explicit four-stage training recipe on open data and the smartphone profiling results are practical strengths that could be directly useful to practitioners.

major comments (3)
  1. [scaling-law section] Scaling-law section (near the start of the technical development, referenced in the abstract): the claim that the scaling law 'jointly optimizes MoE architecture under mobile memory and compute constraints' and 'identifies' the moderate-sparsity sweet spot is load-bearing for the central Pareto-frontier claim, yet the manuscript provides no functional form, no count or diversity of architectures swept, and no held-out prediction test; without these, it is impossible to determine whether the identified optimum is general or an artifact of the particular search band.
  2. [results section] Results section (the paragraph reporting 'across 14 benchmarks'): the statement that MobileMoE 'matches or exceeds leading on-device dense LLMs with 2-4× fewer inference FLOPs' and 'matches or surpasses OLMoE-1B-7B with up to 60% fewer parameters' is presented without per-benchmark tables, error bars, or statistical tests; this weakens the ability to verify the claimed frontier and is directly tied to the optimality conclusion.
  3. [inference-profiling paragraph] Inference-profiling paragraph: the reported 1.8-3.8× prefill and 2.2-3.4× decode speedups for MobileMoE-S versus MobileLLM-Pro at comparable INT4 weight memory rest on a single dense baseline; a broader set of dense and MoE comparators at matched memory/compute envelopes would be needed to substantiate the 'first efficient MoE inference on commodity smartphones' claim.
minor comments (2)
  1. [abstract and scaling-law section] The abstract and scaling-law description use 'parameter-free' or 'jointly optimizes' phrasing that should be qualified once the exact functional form and search scope are stated.
  2. [figures and tables] Figure captions and table headers should explicitly list the exact sparsity ratios, expert granularity, and shared-expert counts for each MobileMoE variant to allow direct reproduction of the claimed sweet spot.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will incorporate revisions to improve clarity and verifiability of the claims.

  1. Referee: [scaling-law section] Scaling-law section (near the start of the technical development, referenced in the abstract): the claim that the scaling law 'jointly optimizes MoE architecture under mobile memory and compute constraints' and 'identifies' the moderate-sparsity sweet spot is load-bearing for the central Pareto-frontier claim, yet the manuscript provides no functional form, no count or diversity of architectures swept, and no held-out prediction test; without these, it is impossible to determine whether the identified optimum is general or an artifact of the particular search band.

    Authors: We agree that additional methodological transparency is warranted. The scaling law was obtained via an empirical sweep over mobile-constrained architectures. In the revision we will add an appendix with the explicit functional form (a sparsity-adjusted extension of compute-optimal scaling), the exact count and diversity of the >40 architectures evaluated (sparsity ratios 2-8x, expert granularities, shared-expert ratios), and held-out prediction accuracy on a disjoint set of configurations to demonstrate that the moderate-sparsity sweet spot generalizes beyond the search band. revision: yes

  2. Referee: [results section] Results section (the paragraph reporting 'across 14 benchmarks'): the statement that MobileMoE 'matches or exceeds leading on-device dense LLMs with 2-4× fewer inference FLOPs' and 'matches or surpasses OLMoE-1B-7B with up to 60% fewer parameters' is presented without per-benchmark tables, error bars, or statistical tests; this weakens the ability to verify the claimed frontier and is directly tied to the optimality conclusion.

    Authors: We concur that aggregate claims benefit from granular support. The revised results section will include a full per-benchmark table for all 14 tasks, with standard deviations across three random seeds and paired statistical tests (e.g., Wilcoxon) against the dense and OLMoE baselines to substantiate the reported FLOPs and parameter advantages. revision: yes

  3. Referee: [inference-profiling paragraph] Inference-profiling paragraph: the reported 1.8-3.8× prefill and 2.2-3.4× decode speedups for MobileMoE-S versus MobileLLM-Pro at comparable INT4 weight memory rest on a single dense baseline; a broader set of dense and MoE comparators at matched memory/compute envelopes would be needed to substantiate the 'first efficient MoE inference on commodity smartphones' claim.

    Authors: The profiling was performed against the strongest publicly documented dense baseline at matched INT4 memory. We will expand the section with additional dense models (e.g., Phi-2, Gemma-2B) and any accessible MoE variants at equivalent memory/compute envelopes, while retaining the original comparison; this will provide a more complete validation of the smartphone speedups. revision: partial
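
The paired test mentioned in the second simulated response (a Wilcoxon test across benchmarks) can be sketched with the standard library alone. The function below is an illustrative exact two-sided signed-rank test that enumerates all sign patterns, which is feasible for 14 paired scores; the function name and the placeholder score lists are assumptions and do not come from the paper.

```python
from itertools import product


def wilcoxon_signed_rank(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Drops zero differences, gives tied magnitudes their average rank, and
    enumerates all 2^n sign assignments. Returns (W_plus, p_value).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    total = sum(ranks)
    observed = min(w_plus, total - w_plus)
    hits = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= observed:
            hits += 1
    return w_plus, hits / 2**n


# Placeholder integer scores for 14 hypothetical benchmarks (illustration only).
placeholder_a = [62, 48, 71, 55, 39, 66, 58, 44, 73, 51, 60, 47, 69, 53]
placeholder_b = [60, 49, 68, 55, 36, 61, 59, 40, 70, 50, 56, 47, 63, 52]
w_plus, p_value = wilcoxon_signed_rank(placeholder_a, placeholder_b)
print(f"W+ = {w_plus}, exact two-sided p = {p_value:.4f}")
```

Integer scores are used here so that tied magnitudes compare exactly; with floating-point scores, rounding before differencing would avoid spurious tie breaks.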

Circularity Check

0 steps flagged

No significant circularity in the on-device MoE scaling law or Pareto claims.

The paper formulates the scaling law via joint empirical optimization of architecture under stated mobile constraints, then trains and evaluates the resulting models on open datasets across benchmarks. No quoted equations or steps reduce a prediction to a fitted input by construction, invoke self-citation as the sole justification for a uniqueness claim, or rename a known result as a derivation. The central claims rest on external benchmarks and hardware profiling rather than tautological reparameterization.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper introduces a new scaling law, but the abstract reports no explicitly fitted free parameters; the paper relies on standard ML assumptions.

axioms (1)
  • domain assumption: Standard assumptions in LLM training, such as the validity of the scaling-law form.
    The paper relies on formulating a scaling law, which assumes certain relationships hold for MoE under constraints.

Reference graph

Works this paper leans on

69 extracted references · 40 canonical work pages · 29 internal anchors

  1. [1]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021

  2. [2]

    Llemma: An Open Language Model For Mathematics

    Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen McAleer, Albert Q Jiang, Jia Deng, Stella Biderman, and Sean Welleck. Llemma: An open language model for mathematics.arXiv preprint arXiv:2310.10631, 2023

  3. [3]

    Llama-nemotron: Efficient reasoning models.arXiv preprint arXiv:2505.00949, 2025

    Akhiad Bercovich, Itay Levy, Izik Golan, Mohammad Dabbah, Ran El-Yaniv, Omri Puny, Ido Galil, Zach Moshe, Tomer Ronen, Najeeb Nabwani, et al. Llama-nemotron: Efficient reasoning models.arXiv preprint arXiv:2505.00949, 2025

  4. [4]

    DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

    Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, et al. Deepseek llm: Scaling open-source language models with longtermism.arXiv preprint arXiv:2401.02954, 2024

  5. [5]

    Piqa: Reasoning about physical commonsense in natural language

    Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical commonsense in natural language. InProceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432–7439, 2020

  6. [6]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 2020

  7. [7]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

  8. [8]

    Unified scaling laws for routed language models

    Aidan Clark, Diego de Las Casas, Aurelia Guy, Arthur Mensch, Michela Paganini, Jordan Hoffmann, Bogdan Damoc, Blake Hechtman, Trevor Cai, Sebastian Borgeaud, et al. Unified scaling laws for routed language models. InInternational conference on machine learning, 2022

  9. [9]

    Boolq: Exploring the surprising difficulty of natural yes/no questions

    Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. InProceedings of the 2019 conference of the north American chapter of the association for computational linguistics: Human language technologies, volume 1 (long and short papers), ...

  10. [10]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge.arXiv preprint arXiv:1803.05457, 2018

  11. [11]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

  12. [12]

    Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models

    Damai Dai, Chengqi Deng, Chenggang Zhao, RX Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Yu Wu, et al. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024

  13. [13]

    Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs

    Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)...

  14. [14]

    Llama guard 3-1b-int4: Compact and efficient safeguard for human-ai conversations.arXiv, 2024

    Igor Fedorov, Kate Plawiak, Lemeng Wu, Tarek Elgamal, Naveen Suda, Eric Smith, Hongyuan Zhan, Jianfeng Chi, Yuriy Hulovatyy, Kimish Patel, Zechun Liu, Changsheng Zhao, Yangyang Shi, Tijmen Blankevoort, Mahesh Pasupuleti, Bilge Soran, Zacharie Delpierre Coudert, Rachad Alao, Raghuraman Krishnamoorthi, and Vikas Chandra. Llama guard 3-1b-int4: Compact and e...

  15. [15]

    Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 2022

    William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 2022. 19

  16. [16]

    Nemotron-flash: Towards latency-optimal hybrid small language models.arXiv preprint arXiv:2511.18890, 2025

    Yonggan Fu, Xin Dong, Shizhe Diao, Hanrong Ye, Wonmin Byeon, Yashaswi Karnati, Lucas Liebenwein, Hannah Zhang, Nikolaus Binder, Maksim Khadkevich, et al. Nemotron-flash: Towards latency-optimal hybrid small language models.arXiv preprint arXiv:2511.18890, 2025

  17. [17]

    The language model evaluation harness, 07 2024.https://zenodo.org/records/12608602

    Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The languag...

  18. [18]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  19. [19]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300, 2020

  20. [20]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset.arXiv preprint arXiv:2103.03874, 2021

  21. [21]

    Training compute-optimal large language models.Advances in Neural Information Processing Systems, 2022

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models.Advances in Neural Information Processing Systems, 2022

  22. [22]

    MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

    Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, et al. Minicpm: Unveiling the potential of small language models with scalable training strategies. arXiv preprint arXiv:2404.06395, 2024

  23. [23]

    MobileLLM-Flash: Latency-Guided On-Device LLM Design for Industry Scale Deployment

    Hanxian Huang, Igor Fedorov, Andrey Gromov, Bernard Beckerman, Naveen Suda, David Eriksson, Maximilian Balandat, Rylan Conway, Patrick Huber, Chinnadhurai Sankar, et al. Mobilellm-flash: Latency-guided on-device llm design for industry scale.arXiv preprint arXiv:2603.15954, 2026

  24. [24]

    Mobilellm-pro technical report.arXiv preprint arXiv:2511.06719, 2025

    Patrick Huber, Ernie Chang, Wei Wen, Igor Fedorov, Tarek Elgamal, Hanxian Huang, Naveen Suda, Chinnadhurai Sankar, Vish Vogeti, Yanghan Wang, et al. Mobilellm-pro technical report.arXiv preprint arXiv:2511.06719, 2025

  25. [25]

    Quantization and training of neural networks for efficient integer-arithmetic-only inference

    Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2704–2713, 2018

  26. [26]

    Adaptive mixtures of local experts

    Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. Adaptive mixtures of local experts. Neural computation, 1991

  27. [27]

    Mixtral of Experts

    Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Deven- dra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts.arXiv preprint arXiv:2401.04088, 2024

  28. [28]

    Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension

    Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. InProceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, 2017

  29. [29]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

  30. [30]

    Scaling laws for fine-grained mixture of experts.arXiv preprint arXiv:2402.07871, 2024

    Jakub Krajewski, Jan Ludziejewski, Kamil Adamczewski, Maciej Pióro, Michał Krutul, Szymon Antoniak, Kamil Ciebiera, Krystian Król, Tomasz Odrzygóźdź, Piotr Sankowski, et al. Scaling laws for fine-grained mixture of experts.arXiv preprint arXiv:2402.07871, 2024

  31. [31]

    Natural questions: a benchmark for question answering research.Transactions of the Association for Computational Linguistics, 7:453–466, 2019

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research.Transactions of the Association for Computational Linguistics, 7:453–466, 2019

  32. [32]

    Aria: An open multimodal native mixture-of-experts model.arXiv preprint arXiv:2410.05993, 2024

    Dongxu Li, Yudong Liu, Haoning Wu, Yue Wang, Zhiqi Shen, Bowen Qu, Xinyao Niu, Fan Zhou, Chengen Huang, Yanpeng Li, et al. Aria: An open multimodal native mixture-of-experts model.arXiv preprint arXiv:2410.05993, 2024. 20

  33. [33]

    Datacomp-lm: In search of the next generation of training sets for language models.Advances in Neural Information Processing Systems, 37:14200–14282, 2024

    Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, et al. Datacomp-lm: In search of the next generation of training sets for language models.Advances in Neural Information Processing Systems, 37:14200–14282, 2024

  34. [34]

    Gsm-plus: A comprehensive benchmark for evaluating the robustness of llms as mathematical problem solvers

    Qintong Li, Leyang Cui, Xueliang Zhao, Lingpeng Kong, and Wei Bi. Gsm-plus: A comprehensive benchmark for evaluating the robustness of llms as mathematical problem solvers. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2961–2984, 2024

  35. [35]

    StarCoder: may the source be with you!

    Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. Starcoder: may the source be with you!arXiv preprint arXiv:2305.06161, 2023

  36. [36]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

  37. [37]

    Yan Liu, Renren Jin, Ling Shi, Zheng Yao, and Deyi Xiong. Finemath: A fine-grained mathematical evaluation benchmarkforchineselargelanguagemodels.ACM Transactions on Asian and Low-Resource Language Information Processing, 24(12):1–15, 2025

  38. [38]

    Mobilellm: Optimizing sub-billion parameter language models for on-device use cases

    Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, et al. Mobilellm: Optimizing sub-billion parameter language models for on-device use cases. InForty-first International Conference on Machine Learning, 2024

  39. [39]

    The flan collection: Designing data and methods for effective instruction tuning

    Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V Le, Barret Zoph, Jason Wei, et al. The flan collection: Designing data and methods for effective instruction tuning. In International conference on machine learning, pages 22631–22648. PMLR, 2023

  40. [40]

    Smollm2: When smol goes big—data-centric training of a fully open small language model

    Anton Lozhkov, Elie Bakouch, Gabriel Martin Blazquez, Guilherme Penedo, Lewis Tunstall, Andrés Marafioti, Agustín Piqueres Lajarín, Hynek Kydlíček, Vaibhav Srivastav, Joshua Lochner, et al. Smollm2: When smol goes big—data-centric training of a fully open small language model. InSecond Conference on Language Modeling

  41. [41]

    Joint moe scaling laws: Mixture of experts can be memory efficient.arXiv preprint arXiv:2502.05172, 2025

    Jan Ludziejewski, Maciej Pióro, Jakub Krajewski, Maciej Stefaniak, Michał Krutul, Jan Małaśnicki, Marek Cygan, Piotr Sankowski, Kamil Adamczewski, Piotr Miłoś, et al. Joint moe scaling laws: Mixture of experts can be memory efficient.arXiv preprint arXiv:2502.05172, 2025

  42. [42]

    Can a suit of armor conduct electricity? a new dataset for open book question answering

    Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. InProceedings of the 2018 conference on empirical methods in natural language processing, pages 2381–2391, 2018

  43. [43]

    OLMoE: Open Mixture-of-Experts Language Models

    Niklas Muennighoff, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Jacob Morrison, Sewon Min, Weijia Shi, Pete Walsh, Oyvind Tafjord, Nathan Lambert, et al. Olmoe: Open mixture-of-experts language models.arXiv preprint arXiv:2409.02060, 2024

  44. [44]

    Olmo 3

    Team Olmo, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, et al. Olmo 3.arXiv preprint arXiv:2512.13961, 2025

  45. [45]

    Openwebmath: An open dataset of high-quality mathematical web text

    Keiran Paster, Marco Dos Santos, Zhangir Azerbayev, and Jimmy Ba. Openwebmath: An open dataset of high-quality mathematical web text. InThe Twelfth International Conference on Learning Representations, 2023

  46. [46]

    Fineweb: decanting the web for the finest text data at scale.HuggingFace

    Guilherme Penedo, Hynek Kydlícek, Loubna Ben Allal, and Thomas Wolf. Fineweb: decanting the web for the finest text data at scale.HuggingFace. Accessed: Jul, 12, 2024

  47. [47]

    olmocr: Unlocking trillions of tokens in pdfs with vision language models

    Jake Poznanski, Aman Rangapur, Jon Borchardt, Jason Dunkelberger, Regan Huff, Daniel Lin, Christopher Wilhelm, Kyle Lo, and Luca Soldaini. olmocr: Unlocking trillions of tokens in pdfs with vision language models. arXiv preprint arXiv:2502.18443, 2025

  48. [48]

    Generalizing Verifiable Instruction Following

    Valentina Pyatkin, Saumya Malik, Victoria Graf, Hamish Ivison, Shengyi Huang, Pradeep Dasigi, Nathan Lambert, and Hannaneh Hajishirzi. Generalizing verifiable instruction following.arXiv preprint arXiv:2507.02833, 2025

  49. [49]

    GPQA: A Graduate-Level Google-Proof Q&A Benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark.arXiv preprint arXiv:2311.12022, 2023

  50. [50]

    Winogrande: An adversarial winograd schema challenge at scale.Communications of the ACM, 64(9):99–106, 2021

    Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale.Communications of the ACM, 64(9):99–106, 2021

  51. [51]

    Social iqa: Commonsense reasoning about social interactions

    Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. Social iqa: Commonsense reasoning about social interactions. InProceedings of the 2019 conference on empirical methods in natural language processing 21 and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 4463–4473, 2019

  52. [52]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer.arXiv preprint arXiv:1701.06538, 2017

  53. [53]

    Scaling laws for optimal data mixtures.arXiv preprint arXiv:2507.09404, 2025

    Mustafa Shukor, Louis Bethune, Dan Busbridge, David Grangier, Enrico Fini, Alaaeldin El-Nouby, and Pierre Ablin. Scaling laws for optimal data mixtures.arXiv preprint arXiv:2507.09404, 2025

  54. [54]

    Challenging big-bench tasks and whether chain-of-thought can solve them

    Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, et al. Challenging big-bench tasks and whether chain-of-thought can solve them. InFindings of the Association for Computational Linguistics: ACL 2023, pages 13003–13051, 2023

  55. [55]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024

  56. [56]

    Gemma 2: Improving Open Language Models at a Practical Size

    Gemma Team. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024

  57. [57]

    Gemma Team. Gemma 3. 2025. https://arxiv.org/abs/2503.19786

  58. [58]

    Qwen3.5-Omni Technical Report

    Qwen Team. Qwen3.5-Omni technical report. arXiv preprint arXiv:2604.15804, 2026

  59. [59]

    Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts

    Lean Wang, Huazuo Gao, Chenggang Zhao, Xu Sun, and Damai Dai. Auxiliary-loss-free load balancing strategy for mixture-of-experts. arXiv preprint arXiv:2408.15664, 2024

  60. [60]

    MMLU-Pro: A more robust and challenging multi-task language understanding benchmark

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. Advances in Neural Information Processing Systems, 37:95266–95290, 2024

  61. [61]

    Open release of Grok-1

    xAI. Open release of Grok-1. https://x.ai/news/grok-os, 2024

  62. [62]

    Qwen2 Technical Report

    An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng ...

  63. [63]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  64. [64]

    NaturalReasoning: Reasoning in the wild with 2.8M challenging questions

    Weizhe Yuan, Jane Yu, Song Jiang, Karthik Padthe, Yang Li, Ilia Kulikov, Kyunghyun Cho, Dong Wang, Yuandong Tian, Jason E Weston, et al. NaturalReasoning: Reasoning in the wild with 2.8M challenging questions. arXiv preprint arXiv:2502.13124, 2025

  65. [65]

    HellaSwag: Can a machine really finish your sentence?

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019

  66. [66]

    MobileLLM-R1: Exploring the limits of sub-billion language model reasoners with open training recipes

    Changsheng Zhao, Ernie Chang, Zechun Liu, Chia-Jung Chang, Wei Wen, Chen Lai, Sheng Cao, Yuandong Tian, Raghuraman Krishnamoorthi, Yangyang Shi, et al. MobileLLM-R1: Exploring the limits of sub-billion language model reasoners with open training recipes. arXiv preprint arXiv:2509.24945, 2025

  67. [67]

    Instruction-Following Evaluation for Large Language Models

    Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023

  68. [68]

    Mixture-of-experts with expert choice routing

    Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M Dai, Quoc V Le, James Laudon, et al. Mixture-of-experts with expert choice routing. Advances in Neural Information Processing Systems, 2022

  69. [69]

    ST-MoE: Designing Stable and Transferable Sparse Expert Models

    Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. ST-MoE: Designing stable and transferable sparse expert models. arXiv preprint arXiv:2202.08906, 2022

Appendix A Scaling Law Ablation Details

This appendix provides the detailed configurations, parametric fitting procedure, and training efficiency...