MobileMoE: Scaling On-Device Mixture of Experts
Pith reviewed 2026-06-29 18:46 UTC · model grok-4.3
The pith
MobileMoE models match leading dense LLMs on benchmarks while using 2-4 times fewer inference FLOPs via a mobile-optimized MoE scaling law.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MobileMoE establishes a new Pareto frontier for on-device LLMs with models having 0.3-0.9B active parameters and 1.3-5.3B total parameters that match or exceed leading dense models with 2-4× fewer inference FLOPs and match or surpass OLMoE-1B-7B with up to 60% fewer parameters; this is achieved by formulating an on-device MoE scaling law that jointly optimizes architecture under mobile memory and compute constraints to identify moderate sparsity with fine-grained and shared experts as the sweet spot, followed by a four-stage training recipe of pre-training, mid-training, instruction fine-tuning, and quantization-aware training on open-source datasets, culminating in the first efficient MoE i
What carries the argument
The on-device MoE scaling law, which jointly optimizes MoE architecture under mobile memory and compute constraints to identify moderate sparsity with fine-grained and shared experts as the memory- and compute-optimal configuration.
If this is right
- MobileMoE models match or exceed leading dense on-device LLMs across 14 benchmarks with 2-4× fewer inference FLOPs.
- They match or surpass the state-of-the-art MoE OLMoE-1B-7B while using up to 60% fewer parameters.
- At comparable INT4 weight memory, MobileMoE-S delivers 1.8-3.8× faster prefill and 2.2-3.4× faster decode than the dense baseline on smartphones.
- The four-stage training recipe enables efficient deployment of sub-billion active parameter MoEs on commodity mobile devices.
Where Pith is reading between the lines
- The scaling law could be applied to derive architectures for other edge devices such as tablets or wearables with different memory hierarchies.
- The moderate-sparsity design might reduce peak power draw in battery-constrained settings compared to dense models of similar accuracy.
- Quantization-aware training combined with MoE routing could be extended to support dynamic expert selection based on real-time device load.
Load-bearing premise
The identified sweet spot of moderate sparsity with fine-grained and shared experts in the scaling law remains optimal and generalizable beyond the specific model sizes and datasets tested.
What would settle it
Training a set of on-device MoE variants with varying sparsity levels on identical mobile hardware and datasets, then measuring that a high-sparsity or low-sparsity configuration achieves strictly better benchmark accuracy per inference FLOP or per watt than the moderate-sparsity models.
read the original abstract
Mixture-of-Experts (MoE) has become the de facto architecture for hundred-billion-parameter language models, yet its advantages at sub-billion scales for on-device deployment remain largely unexplored. To close this gap, we present MobileMoE, a family of on-device MoE language models with sub-billion active parameters (0.3-0.9B active and 1.3-5.3B total) that establish a new Pareto frontier for on-device LLMs. We first formulate an on-device MoE scaling law that jointly optimizes MoE architecture under mobile memory and compute constraints, identifying an on-device sweet spot - moderate sparsity with fine-grained and shared experts - that is simultaneously memory and compute-optimal. Building on the derived architectures, we train MobileMoE with a four-stage recipe covering pre-training, mid-training, instruction fine-tuning, and quantization-aware training, all on open-source datasets. Across 14 benchmarks, MobileMoE matches or exceeds leading on-device dense LLMs with 2-4$\times$ fewer inference FLOPs, and matches or surpasses the state-of-the-art MoE OLMoE-1B-7B with up to 60% fewer parameters. To bridge the last mile to mobile deployment, we provide the first efficient MoE inference on commodity smartphones with comprehensive on-device profiling. At comparable INT4 weight memory, MobileMoE-S delivers $1.8$-$3.8\times$ faster prefill and $2.2$-$3.4\times$ faster decode than the dense baseline MobileLLM-Pro.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MobileMoE, a family of sub-billion active-parameter (0.3-0.9B active, 1.3-5.3B total) Mixture-of-Experts language models for on-device deployment. It formulates an on-device MoE scaling law that jointly optimizes architecture under mobile memory/compute constraints, identifies moderate sparsity with fine-grained and shared experts as the sweet spot, trains the resulting models via a four-stage recipe on open datasets, and reports that the models match or exceed leading dense on-device LLMs with 2-4× fewer inference FLOPs while surpassing OLMoE-1B-7B with up to 60% fewer parameters; it further demonstrates the first efficient MoE inference on commodity smartphones with 1.8-3.8× faster prefill and 2.2-3.4× faster decode than a dense baseline at comparable INT4 memory.
Significance. If the scaling law and empirical results hold under broader validation, the work would be significant for on-device LLM design by providing a principled route to sparse architectures that improve the memory-compute Pareto frontier. The explicit four-stage training recipe on open data and the smartphone profiling results are practical strengths that could be directly useful to practitioners.
major comments (3)
- [scaling-law section] Scaling-law section (near the start of the technical development, referenced in the abstract): the claim that the scaling law 'jointly optimizes MoE architecture under mobile memory and compute constraints' and 'identifies' the moderate-sparsity sweet spot is load-bearing for the central Pareto-frontier claim, yet the manuscript provides no functional form, no count or diversity of architectures swept, and no held-out prediction test; without these, it is impossible to determine whether the identified optimum is general or an artifact of the particular search band.
- [results section] Results section (the paragraph reporting 'across 14 benchmarks'): the statement that MobileMoE 'matches or exceeds leading on-device dense LLMs with 2-4× fewer inference FLOPs' and 'matches or surpasses OLMoE-1B-7B with up to 60% fewer parameters' is presented without per-benchmark tables, error bars, or statistical tests; this weakens the ability to verify the claimed frontier and is directly tied to the optimality conclusion.
- [inference-profiling paragraph] Inference-profiling paragraph: the reported 1.8-3.8× prefill and 2.2-3.4× decode speedups for MobileMoE-S versus MobileLLM-Pro at comparable INT4 weight memory rest on a single dense baseline; a broader set of dense and MoE comparators at matched memory/compute envelopes would be needed to substantiate the 'first efficient MoE inference on commodity smartphones' claim.
minor comments (2)
- [abstract and scaling-law section] The abstract and scaling-law description use 'parameter-free' or 'jointly optimizes' phrasing that should be qualified once the exact functional form and search scope are stated.
- [figures and tables] Figure captions and table headers should explicitly list the exact sparsity ratios, expert granularity, and shared-expert counts for each MobileMoE variant to allow direct reproduction of the claimed sweet spot.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will incorporate revisions to improve clarity and verifiability of the claims.
read point-by-point responses
-
Referee: [scaling-law section] Scaling-law section (near the start of the technical development, referenced in the abstract): the claim that the scaling law 'jointly optimizes MoE architecture under mobile memory and compute constraints' and 'identifies' the moderate-sparsity sweet spot is load-bearing for the central Pareto-frontier claim, yet the manuscript provides no functional form, no count or diversity of architectures swept, and no held-out prediction test; without these, it is impossible to determine whether the identified optimum is general or an artifact of the particular search band.
Authors: We agree that additional methodological transparency is warranted. The scaling law was obtained via an empirical sweep over mobile-constrained architectures. In the revision we will add an appendix with the explicit functional form (a sparsity-adjusted extension of compute-optimal scaling), the exact count and diversity of the >40 architectures evaluated (sparsity ratios 2-8x, expert granularities, shared-expert ratios), and held-out prediction accuracy on a disjoint set of configurations to demonstrate that the moderate-sparsity sweet spot generalizes beyond the search band. revision: yes
-
Referee: [results section] Results section (the paragraph reporting 'across 14 benchmarks'): the statement that MobileMoE 'matches or exceeds leading on-device dense LLMs with 2-4× fewer inference FLOPs' and 'matches or surpasses OLMoE-1B-7B with up to 60% fewer parameters' is presented without per-benchmark tables, error bars, or statistical tests; this weakens the ability to verify the claimed frontier and is directly tied to the optimality conclusion.
Authors: We concur that aggregate claims benefit from granular support. The revised results section will include a full per-benchmark table for all 14 tasks, with standard deviations across three random seeds and paired statistical tests (e.g., Wilcoxon) against the dense and OLMoE baselines to substantiate the reported FLOPs and parameter advantages. revision: yes
-
Referee: [inference-profiling paragraph] Inference-profiling paragraph: the reported 1.8-3.8× prefill and 2.2-3.4× decode speedups for MobileMoE-S versus MobileLLM-Pro at comparable INT4 weight memory rest on a single dense baseline; a broader set of dense and MoE comparators at matched memory/compute envelopes would be needed to substantiate the 'first efficient MoE inference on commodity smartphones' claim.
Authors: The profiling was performed against the strongest publicly documented dense baseline at matched INT4 memory. We will expand the section with additional dense models (e.g., Phi-2, Gemma-2B) and any accessible MoE variants at equivalent memory/compute envelopes, while retaining the original comparison; this will provide a more complete validation of the smartphone speedups. revision: partial
Circularity Check
No significant circularity in the on-device MoE scaling law or Pareto claims.
full rationale
The paper formulates the scaling law via joint empirical optimization of architecture under stated mobile constraints, then trains and evaluates the resulting models on open datasets across benchmarks. No quoted equations or steps reduce a prediction to a fitted input by construction, invoke self-citation as the sole justification for a uniqueness claim, or rename a known result as a derivation. The central claims rest on external benchmarks and hardware profiling rather than tautological reparameterization.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Standard assumptions in LLM training such as the validity of the scaling law form.
Reference graph
Works this paper leans on
-
[1]
Program Synthesis with Large Language Models
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[2]
Llemma: An Open Language Model For Mathematics
Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen McAleer, Albert Q Jiang, Jia Deng, Stella Biderman, and Sean Welleck. Llemma: An open language model for mathematics.arXiv preprint arXiv:2310.10631, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[3]
Llama-nemotron: Efficient reasoning models.arXiv preprint arXiv:2505.00949, 2025
Akhiad Bercovich, Itay Levy, Izik Golan, Mohammad Dabbah, Ran El-Yaniv, Omri Puny, Ido Galil, Zach Moshe, Tomer Ronen, Najeeb Nabwani, et al. Llama-nemotron: Efficient reasoning models.arXiv preprint arXiv:2505.00949, 2025
-
[4]
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, et al. Deepseek llm: Scaling open-source language models with longtermism.arXiv preprint arXiv:2401.02954, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[5]
Piqa: Reasoning about physical commonsense in natural language
Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical commonsense in natural language. InProceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432–7439, 2020
2020
-
[6]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 2020
2020
-
[7]
Evaluating Large Language Models Trained on Code
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[8]
Unified scaling laws for routed language models
Aidan Clark, Diego de Las Casas, Aurelia Guy, Arthur Mensch, Michela Paganini, Jordan Hoffmann, Bogdan Damoc, Blake Hechtman, Trevor Cai, Sebastian Borgeaud, et al. Unified scaling laws for routed language models. InInternational conference on machine learning, 2022
2022
-
[9]
Boolq: Exploring the surprising difficulty of natural yes/no questions
Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. InProceedings of the 2019 conference of the north American chapter of the association for computational linguistics: Human language technologies, volume 1 (long and short papers), ...
2019
-
[10]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge.arXiv preprint arXiv:1803.05457, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[11]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[12]
Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models
Damai Dai, Chengqi Deng, Chenggang Zhao, RX Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Yu Wu, et al. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024
2024
-
[13]
Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs
Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)...
2019
-
[14]
Llama guard 3-1b-int4: Compact and efficient safeguard for human-ai conversations.arXiv, 2024
Igor Fedorov, Kate Plawiak, Lemeng Wu, Tarek Elgamal, Naveen Suda, Eric Smith, Hongyuan Zhan, Jianfeng Chi, Yuriy Hulovatyy, Kimish Patel, Zechun Liu, Changsheng Zhao, Yangyang Shi, Tijmen Blankevoort, Mahesh Pasupuleti, Bilge Soran, Zacharie Delpierre Coudert, Rachad Alao, Raghuraman Krishnamoorthi, and Vikas Chandra. Llama guard 3-1b-int4: Compact and e...
2024
-
[15]
Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 2022
William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 2022. 19
2022
-
[16]
Yonggan Fu, Xin Dong, Shizhe Diao, Hanrong Ye, Wonmin Byeon, Yashaswi Karnati, Lucas Liebenwein, Hannah Zhang, Nikolaus Binder, Maksim Khadkevich, et al. Nemotron-flash: Towards latency-optimal hybrid small language models.arXiv preprint arXiv:2511.18890, 2025
-
[17]
The language model evaluation harness, 07 2024.https://zenodo.org/records/12608602
Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The languag...
-
[18]
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[19]
Measuring Massive Multitask Language Understanding
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2009
-
[20]
Measuring Mathematical Problem Solving With the MATH Dataset
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset.arXiv preprint arXiv:2103.03874, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[21]
Training compute-optimal large language models.Advances in Neural Information Processing Systems, 2022
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models.Advances in Neural Information Processing Systems, 2022
2022
-
[22]
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, et al. Minicpm: Unveiling the potential of small language models with scalable training strategies. arXiv preprint arXiv:2404.06395, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[23]
MobileLLM-Flash: Latency-Guided On-Device LLM Design for Industry Scale Deployment
Hanxian Huang, Igor Fedorov, Andrey Gromov, Bernard Beckerman, Naveen Suda, David Eriksson, Maximilian Balandat, Rylan Conway, Patrick Huber, Chinnadhurai Sankar, et al. Mobilellm-flash: Latency-guided on-device llm design for industry scale.arXiv preprint arXiv:2603.15954, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[24]
Mobilellm-pro technical report.arXiv preprint arXiv:2511.06719, 2025
Patrick Huber, Ernie Chang, Wei Wen, Igor Fedorov, Tarek Elgamal, Hanxian Huang, Naveen Suda, Chinnadhurai Sankar, Vish Vogeti, Yanghan Wang, et al. Mobilellm-pro technical report.arXiv preprint arXiv:2511.06719, 2025
-
[25]
Quantization and training of neural networks for efficient integer-arithmetic-only inference
Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2704–2713, 2018
2018
-
[26]
Adaptive mixtures of local experts
Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. Adaptive mixtures of local experts. Neural computation, 1991
1991
-
[27]
Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Deven- dra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts.arXiv preprint arXiv:2401.04088, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[28]
Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension
Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. InProceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, 2017
2017
-
[29]
Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2001
-
[30]
Scaling laws for fine-grained mixture of experts.arXiv preprint arXiv:2402.07871, 2024
Jakub Krajewski, Jan Ludziejewski, Kamil Adamczewski, Maciej Pióro, Michał Krutul, Szymon Antoniak, Kamil Ciebiera, Krystian Król, Tomasz Odrzygóźdź, Piotr Sankowski, et al. Scaling laws for fine-grained mixture of experts.arXiv preprint arXiv:2402.07871, 2024
-
[31]
Natural questions: a benchmark for question answering research.Transactions of the Association for Computational Linguistics, 7:453–466, 2019
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research.Transactions of the Association for Computational Linguistics, 7:453–466, 2019
2019
-
[32]
Aria: An open multimodal native mixture-of-experts model.arXiv preprint arXiv:2410.05993, 2024
Dongxu Li, Yudong Liu, Haoning Wu, Yue Wang, Zhiqi Shen, Bowen Qu, Xinyao Niu, Fan Zhou, Chengen Huang, Yanpeng Li, et al. Aria: An open multimodal native mixture-of-experts model.arXiv preprint arXiv:2410.05993, 2024. 20
-
[33]
Datacomp-lm: In search of the next generation of training sets for language models.Advances in Neural Information Processing Systems, 37:14200–14282, 2024
Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, et al. Datacomp-lm: In search of the next generation of training sets for language models.Advances in Neural Information Processing Systems, 37:14200–14282, 2024
2024
-
[34]
Gsm-plus: A comprehensive benchmark for evaluating the robustness of llms as mathematical problem solvers
Qintong Li, Leyang Cui, Xueliang Zhao, Lingpeng Kong, and Wei Bi. Gsm-plus: A comprehensive benchmark for evaluating the robustness of llms as mathematical problem solvers. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2961–2984, 2024
2024
-
[35]
StarCoder: may the source be with you!
Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. Starcoder: may the source be with you!arXiv preprint arXiv:2305.06161, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[36]
Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[37]
Yan Liu, Renren Jin, Ling Shi, Zheng Yao, and Deyi Xiong. Finemath: A fine-grained mathematical evaluation benchmarkforchineselargelanguagemodels.ACM Transactions on Asian and Low-Resource Language Information Processing, 24(12):1–15, 2025
2025
-
[38]
Mobilellm: Optimizing sub-billion parameter language models for on-device use cases
Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, et al. Mobilellm: Optimizing sub-billion parameter language models for on-device use cases. InForty-first International Conference on Machine Learning, 2024
2024
-
[39]
The flan collection: Designing data and methods for effective instruction tuning
Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V Le, Barret Zoph, Jason Wei, et al. The flan collection: Designing data and methods for effective instruction tuning. In International conference on machine learning, pages 22631–22648. PMLR, 2023
2023
-
[40]
Smollm2: When smol goes big—data-centric training of a fully open small language model
Anton Lozhkov, Elie Bakouch, Gabriel Martin Blazquez, Guilherme Penedo, Lewis Tunstall, Andrés Marafioti, Agustín Piqueres Lajarín, Hynek Kydlíček, Vaibhav Srivastav, Joshua Lochner, et al. Smollm2: When smol goes big—data-centric training of a fully open small language model. InSecond Conference on Language Modeling
-
[41]
Jan Ludziejewski, Maciej Pióro, Jakub Krajewski, Maciej Stefaniak, Michał Krutul, Jan Małaśnicki, Marek Cygan, Piotr Sankowski, Kamil Adamczewski, Piotr Miłoś, et al. Joint moe scaling laws: Mixture of experts can be memory efficient.arXiv preprint arXiv:2502.05172, 2025
-
[42]
Can a suit of armor conduct electricity? a new dataset for open book question answering
Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. InProceedings of the 2018 conference on empirical methods in natural language processing, pages 2381–2391, 2018
2018
-
[43]
OLMoE: Open Mixture-of-Experts Language Models
Niklas Muennighoff, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Jacob Morrison, Sewon Min, Weijia Shi, Pete Walsh, Oyvind Tafjord, Nathan Lambert, et al. Olmoe: Open mixture-of-experts language models.arXiv preprint arXiv:2409.02060, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[44]
Team Olmo, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, et al. Olmo 3.arXiv preprint arXiv:2512.13961, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[45]
Openwebmath: An open dataset of high-quality mathematical web text
Keiran Paster, Marco Dos Santos, Zhangir Azerbayev, and Jimmy Ba. Openwebmath: An open dataset of high-quality mathematical web text. InThe Twelfth International Conference on Learning Representations, 2023
2023
-
[46]
Fineweb: decanting the web for the finest text data at scale.HuggingFace
Guilherme Penedo, Hynek Kydlícek, Loubna Ben Allal, and Thomas Wolf. Fineweb: decanting the web for the finest text data at scale.HuggingFace. Accessed: Jul, 12, 2024
2024
-
[47]
olmocr: Unlocking trillions of tokens in pdfs with vision language models
Jake Poznanski, Aman Rangapur, Jon Borchardt, Jason Dunkelberger, Regan Huff, Daniel Lin, Christopher Wilhelm, Kyle Lo, and Luca Soldaini. olmocr: Unlocking trillions of tokens in pdfs with vision language models. arXiv preprint arXiv:2502.18443, 2025
-
[48]
Generalizing Verifiable Instruction Following
Valentina Pyatkin, Saumya Malik, Victoria Graf, Hamish Ivison, Shengyi Huang, Pradeep Dasigi, Nathan Lambert, and Hannaneh Hajishirzi. Generalizing verifiable instruction following.arXiv preprint arXiv:2507.02833, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[49]
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark.arXiv preprint arXiv:2311.12022, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[50]
Winogrande: An adversarial winograd schema challenge at scale.Communications of the ACM, 64(9):99–106, 2021
Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale.Communications of the ACM, 64(9):99–106, 2021
2021
-
[51]
Social iqa: Commonsense reasoning about social interactions
Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. Social iqa: Commonsense reasoning about social interactions. InProceedings of the 2019 conference on empirical methods in natural language processing 21 and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 4463–4473, 2019
2019
-
[52]
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer.arXiv preprint arXiv:1701.06538, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[53]
Scaling laws for optimal data mixtures.arXiv preprint arXiv:2507.09404, 2025
Mustafa Shukor, Louis Bethune, Dan Busbridge, David Grangier, Enrico Fini, Alaaeldin El-Nouby, and Pierre Ablin. Scaling laws for optimal data mixtures.arXiv preprint arXiv:2507.09404, 2025
-
[54]
Challenging big-bench tasks and whether chain-of-thought can solve them
Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, et al. Challenging big-bench tasks and whether chain-of-thought can solve them. InFindings of the Association for Computational Linguistics: ACL 2023, pages 13003–13051, 2023
2023
-
[55]
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.arXiv preprint arXiv:2403.05530, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[56]
Gemma 2: Improving Open Language Models at a Practical Size
Gemma Team. Gemma 2: Improving open language models at a practical size.arXiv preprint arXiv:2408.00118, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[57]
Gemma Team. Gemma 3. 2025.https://arxiv.org/abs/2503.19786
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[58]
Qwen Team. Qwen3.5-omni technical report.arXiv preprint arXiv:2604.15804, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[59]
Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts
Lean Wang, Huazuo Gao, Chenggang Zhao, Xu Sun, and Damai Dai. Auxiliary-loss-free load balancing strategy for mixture-of-experts.arXiv preprint arXiv:2408.15664, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[60]
Mmlu-pro: A more robust and challenging multi-task language understanding benchmark.Advances in Neural Information Processing Systems, 37:95266–95290, 2024
Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark.Advances in Neural Information Processing Systems, 37:95266–95290, 2024
2024
-
[61]
Open release of grok-1.https://x.ai/news/grok-os, 2024
xAI. Open release of grok-1.https://x.ai/news/grok-os, 2024
2024
-
[62]
An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng ...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[63]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[64]
Naturalreasoning: Reasoning in the wild with 2.8 m challenging questions
Weizhe Yuan, Jane Yu, Song Jiang, Karthik Padthe, Yang Li, Ilia Kulikov, Kyunghyun Cho, Dong Wang, Yuandong Tian, Jason E Weston, et al. Naturalreasoning: Reasoning in the wild with 2.8 m challenging questions. arXiv preprint arXiv:2502.13124, 2025
-
[65]
HellaSwag: Can a machine really finish your sentence? InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019
2019
-
[66]
Changsheng Zhao, Ernie Chang, Zechun Liu, Chia-Jung Chang, Wei Wen, Chen Lai, Sheng Cao, Yuandong Tian, Raghuraman Krishnamoorthi, Yangyang Shi, et al. Mobilellm-r1: Exploring the limits of sub-billion language model reasoners with open training recipes.arXiv preprint arXiv:2509.24945, 2025
-
[67]
Instruction-Following Evaluation for Large Language Models
Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models.arXiv preprint arXiv:2311.07911, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[68]
Mixture-of-experts with expert choice routing.Advances in Neural Information Processing Systems, 2022
Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M Dai, Quoc V Le, James Laudon, et al. Mixture-of-experts with expert choice routing.Advances in Neural Information Processing Systems, 2022
2022
-
[69]
ST-MoE: Designing Stable and Transferable Sparse Expert Models
Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. St-moe: Designing stable and transferable sparse expert models.arXiv preprint arXiv:2202.08906, 2022. 22 Appendix A Scaling Law Ablation Details This appendix provides the detailed configurations, parametric fitting procedure, and training efficiency...
work page internal anchor Pith review Pith/arXiv arXiv 2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.