pith. machine review for the scientific record. sign in

arxiv: 2604.25578 · v1 · submitted 2026-04-28 · 💻 cs.CL · cs.AI

Recognition: unknown

Marco-MoE: Open Multilingual Mixture-of-Expert Language Models with Efficient Upcycling

Chenyang Lyu, Fan Jiang, Feihu Jiang, Longyue Wang, Tianqi Shi, Weihua Luo, Yichao Du, Yu Zhao

Pith reviewed 2026-05-07 16:04 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords multilingual language modelsmixture of expertssparse modelsmodel upcyclingefficient pretraininginstruction tuningopen modelslanguage expansion
0
0 comments X

The pith

Sparse multilingual MoE models upcycled from dense checkpoints outperform similar-sized and larger competitors while activating only 5 percent of parameters per token.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors establish that extreme sparsity in Mixture-of-Experts architectures, combined with upcycling from dense models, permits efficient pre-training on 5 trillion tokens while delivering strong English and multilingual results. Their models exceed similarly sized competitors on standard benchmarks and, after instruction tuning, surpass models that activate three to fourteen times more parameters. The work further shows that these models develop shared expert activation patterns across related languages and support adding new languages without the interference that dense models typically exhibit.

Core claim

Marco-MoE models employ a highly sparse MoE design in which only around 5% of total parameters activate per input token. Upcycling from dense models enables efficient pre-training on 5T tokens. The resulting models surpass similarly sized competitors on English and multilingual benchmarks and achieve a leading performance-to-compute ratio. Their post-trained Instruct variants exceed the performance of competing models that possess 3--14 times more activated parameters. Analysis indicates structured expert activation shared across related languages, highly specialized use for isolated languages, and scalable language expansion without typical dense-model interference.

What carries the argument

Upcycling dense models into a sparse MoE architecture with extreme sparsity (approximately 5% parameter activation per token) that preserves or improves capability while enabling efficient training and language expansion.

If this is right

  • Training compute can be reduced substantially while still reaching competitive multilingual performance.
  • Instruction-tuned versions can exceed the results of much larger dense models in activated parameters.
  • New languages can be added to the model with less cross-language interference than dense architectures allow.
  • Expert utilization becomes structured by linguistic relatedness, with sharing among related languages and specialization for isolated ones.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The pattern of shared versus specialized experts suggests a natural way to organize capacity across language families that dense models cannot match.
  • Open release of the full training datasets and recipes makes it possible to test whether the performance-to-compute advantage holds under different data mixtures.
  • The upcycling route may lower the barrier to experimenting with larger sparse models for communities without massive compute budgets.
  • If the low-interference language expansion holds, the same recipe could be applied to add domain-specific or low-resource capabilities without retraining from scratch.

Load-bearing premise

The benchmark scores reflect genuine generalization across languages rather than overlap with training data or particular evaluation choices.

What would settle it

Evaluating the released models on a fresh set of multilingual benchmarks constructed entirely from sources absent from the 5T training corpus would determine whether the reported gains persist.

read the original abstract

We present Marco-MoE, a suite of fully open multilingual sparse Mixture-of-Experts (MoE) models. Marco-MoE features a highly sparse design in which only around 5\% of the total parameters are activated per input token. This extreme sparsity, combined with upcycling from dense models, enables efficient pre-training on 5T tokens. Our models surpass similarly-sized competitors on English and multilingual benchmarks, achieving a best-in-class performance-to-compute ratio. We further post-train these models to create Marco-MoE-\textsc{Instruct} variants, which surpass the performance of competing models possessing $3$--$14\times$ more activated parameters. Our analysis reveals that Marco-MoE learns structured expert activation patterns shared across related languages, while maintaining highly specialized utilization for linguistically isolated ones. We further show that Marco-MoE allows for scalable language expansion without the interference typical of dense models. To support the community, we disclose our full training datasets, recipes, and model weights.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript introduces Marco-MoE, a suite of open multilingual sparse Mixture-of-Experts language models with ~5% parameter activation per token, obtained via upcycling from dense checkpoints and pretrained on 5T tokens. The central claims are that these models achieve superior performance on English and multilingual benchmarks relative to similarly sized competitors (best-in-class performance-to-compute ratio), that the post-trained Instruct variants outperform models with 3–14× more activated parameters, and that expert routing exhibits structured cross-lingual patterns with no interference during language expansion. Full training data, recipes, and weights are released.

Significance. If the benchmark results hold under standardized protocols, the work would advance efficient multilingual scaling via extreme sparsity and upcycling, while the open release of datasets, recipes, and weights constitutes a clear community benefit for reproducibility. The expert-pattern analysis offers potentially useful insights into MoE behavior across languages.

major comments (3)
  1. [§5.1, Table 2] §5.1 and Table 2: The performance comparisons to similarly sized baselines do not specify the exact few-shot counts, prompting templates, or normalization methods applied to each benchmark (e.g., MMLU, XNLI). Without these details the claim that Marco-MoE 'surpasses similarly-sized competitors' cannot be verified and is load-bearing for the performance-to-compute assertion.
  2. [§3.1] §3.1 (Training Data): No quantitative decontamination analysis or overlap statistics between the 5T-token corpus and the evaluation benchmarks are reported. This omission directly affects the validity of the generalization and 'best-in-class' claims.
  3. [§5.3] §5.3 (Instruct Variants): The statement that Instruct models surpass competitors possessing 3–14× more activated parameters lacks an explicit definition of how activated parameters are counted for the baselines and confirmation that identical post-training data and evaluation protocols were used.
minor comments (3)
  1. [Figure 4] Figure 4 (expert activation heatmaps): The color scale and axis labels are difficult to read at printed size; adding a legend and higher-resolution rendering would improve clarity.
  2. [Eq. (2)] The sparsity ratio is introduced in Eq. (2) but the precise definition of 'activated parameters' used throughout the paper is only clarified in a footnote; moving this definition to the main text would aid readability.
  3. [§2] A small number of citations in §2 are missing DOIs or have inconsistent formatting.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important aspects of reproducibility and rigor in our evaluations and data reporting. We address each major comment below and will revise the manuscript to incorporate the suggested clarifications.

read point-by-point responses
  1. Referee: [§5.1, Table 2] §5.1 and Table 2: The performance comparisons to similarly sized baselines do not specify the exact few-shot counts, prompting templates, or normalization methods applied to each benchmark (e.g., MMLU, XNLI). Without these details the claim that Marco-MoE 'surpasses similarly-sized competitors' cannot be verified and is load-bearing for the performance-to-compute assertion.

    Authors: We agree that explicit evaluation details are necessary to allow independent verification of the benchmark results. In the revised manuscript, we will expand §5.1 (and add an appendix if needed) to specify the exact few-shot counts, prompting templates, and normalization methods used for each benchmark, including MMLU and XNLI. This will directly support the performance comparisons and the performance-to-compute claims. revision: yes

  2. Referee: [§3.1] §3.1 (Training Data): No quantitative decontamination analysis or overlap statistics between the 5T-token corpus and the evaluation benchmarks are reported. This omission directly affects the validity of the generalization and 'best-in-class' claims.

    Authors: This is a valid concern for assessing potential contamination and generalization. Although the full training datasets are released for community inspection, we will add quantitative decontamination analysis and overlap statistics (e.g., n-gram overlap metrics) to the revised §3.1 to strengthen the validity of our claims. revision: yes

  3. Referee: [§5.3] §5.3 (Instruct Variants): The statement that Instruct models surpass competitors possessing 3–14× more activated parameters lacks an explicit definition of how activated parameters are counted for the baselines and confirmation that identical post-training data and evaluation protocols were used.

    Authors: We will revise §5.3 to provide an explicit definition: activated parameters refer to the parameters engaged per token (∼5% for our MoE models versus the full count for dense baselines). We will also confirm that post-training data and evaluation protocols were aligned across all models in the comparisons to ensure fairness. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical model training and benchmarking

full rationale

The paper reports training of sparse MoE models via upcycling, followed by direct benchmark evaluations on English and multilingual tasks. No mathematical derivation, first-principles prediction, or theorem is claimed; performance superiority is presented as an observed outcome of the training recipe and evaluation protocol rather than a quantity derived from fitted parameters or self-referential definitions. Self-citations (if present) concern prior MoE or upcycling techniques but do not serve as load-bearing justification for the headline results. The analysis of expert activation patterns is likewise post-hoc empirical observation, not a reduction to inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

No theoretical derivation; the contribution rests on empirical training and evaluation choices rather than axioms or invented entities.

free parameters (1)
  • activation sparsity ratio
    Design choice of approximately 5% parameters activated per token.
axioms (1)
  • domain assumption Upcycling dense models into MoE preserves or improves downstream performance
    Invoked as the basis for efficient pre-training.

pith-pipeline@v0.9.0 · 5495 in / 1155 out tokens · 46417 ms · 2026-05-07T16:04:30.665626+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

11 extracted references · 9 canonical work pages · 1 internal anchor

  1. [1]

    URL https://arxiv.org/abs/2511.23404. P. Apertus, A. Hernández-Cano, A. Hägele, A. H. Huang, A. Romanou, A.-J. Solergibert, B. Pasztor, B. Messmer, D. Garbaya, E. F. Ďurech, I. Hakimi, J. G. Giraldo, M. Ismayilzada, N. Foroutan, S. Moalla, T. Chen, V. Sabolčec, Y. Xu, M. Aerni, B. AlKhamissi, I. A. Mariñas, M. H. Amani, M. Ansaripour, I. Badanin, H. Benoi...

  2. [2]

    URL https://arxiv.org/abs/2210.11416. C. Clark, K. Lee, M.-W. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short...

  3. [3]

    BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

    Association for Computational Linguistics. doi: 10.18653/v1/N19-1300. URL https:// aclanthology.org/N19-1300. P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018. URL https://arxiv.org/abs/ 1803.05457. K. Cobbe, V. Kosaraju, M. Bavarian, ...

  4. [4]

    Joshi, S

    URL https://openreview.net/forum?id=fOrm2rGX2r. IBM Research. Granite 4.0 language models. https://github.com/ibm-granite/granite-4. 0-language-models, 2025. Accessed: 2025-10-01. A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. de las Casas, E. B. Hanna, F. Bressand, G. Lengyel, G. Bour, G. Lample, L. R. Lavaud, ...

  5. [5]

    Okapi: Instruction-tuned Large Language Models in Multiple Languages with Reinforcement Learning from Human Feedback

    Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-demo.28. URL https://aclanthology.org/2023.emnlp-demo.28/. H. Li, Y. Zhang, F. Koto, Y. Yang, H. Zhao, Y. Gong, N. Duan, and T. Baldwin. CMMLU: Measuring massive multitask language understanding in Chinese. In L.-W. Ku, A. Martins, and V. Srikumar, editors,Findings of the Association f...

  6. [6]

    URL https://openreview.net/forum?id=xXTkbTBmqq. T. Nakamura, T. Akiba, K. Fujii, Y. Oda, R. Yokota, and J. Suzuki. Drop-upcycling: Training sparse mixture of experts with partial re-initialization. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=gx1wHnf5Vp. D. Nathawani, I. Gitman, S. Majumd...

  7. [7]

    Social IQa: Commonsense Reasoning about Social Interactions

    URL https://openreview.net/forum?id=k3gCieTXeY. 32 Marco-MoE : Open Multilingual Mixture-of-Expert Language Models with Efficient Upcycling K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi. Winogrande: An adversarial winograd schema challenge at scale.arXiv preprint arXiv:1907.10641, 2019. A. R. Salamanca, D. Abagyan, D. D’souza, A. Khairi, D. Mora, ...

  8. [8]

    URL https://aclanthology.org/2023.findings-acl.824. A. Talmor, J. Herzig, N. Lourie, and J. Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)...

  9. [10]

    Haoran Luo, Haihong E, Zichen Tang, Shiyao Peng, Yikai Guo, Wentai Zhang, Chenghao Ma, Guanting Dong, Meina Song, Wei Lin, Yifan Zhu, and Anh Tuan Luu

    URL https://arxiv.org/abs/2506.05176. Y. Zhang, M. Konomi, C. Xypolopoulos, K. Divriotis, K. Skianis, G. Nikolentzos, G. Stamou, G. Shang, and M. Vazirgiannis. Greekmmlu: A native-sourced multitask benchmark for evaluating language models in greek, 2026. URL https://arxiv.org/abs/2602.05150. W. Zhong, R. Cui, Y. Guo, Y. Liang, S. Lu, Y. Wang, A. Saied, W....

  10. [11]

    Text Quality Score (1-10):

  11. [12]

    intermediate

    Country/Region: Only generate the final result without additional descriptions. 41 Marco-MoE : Open Multilingual Mixture-of-Expert Language Models with Efficient Upcycling Regional MCQs Generation Prompt Write a question on [country], more specifically on [sub_topic], with up to four choices. Desired difficulty level: [difficulty]. The required language: ...