
arxiv: 2605.31268 · v1 · pith:IQ2BHWN2new · submitted 2026-05-29 · 💻 cs.CL

Mellum2 Technical Report

Pith reviewed 2026-06-28 22:57 UTC · model grok-4.3

classification 💻 cs.CL
keywords Mixture of Experts, language model, code generation, software engineering, inference efficiency, open-weight model, tool use, multi-token prediction

The pith

Mellum 2 is a 12B MoE model with 2.5B active parameters per token that the authors report as competitive with open-weight baselines in the 4B-14B range on code, math, tool-use and safety tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Mellum 2 as a general-purpose language model specialized in software engineering tasks such as code generation, editing, debugging, multi-step reasoning, tool use and agentic coding. It is a 12B-parameter Mixture-of-Experts model that routes each token to 8 of 64 experts, so only 2.5B parameters are active per token. The architecture also uses Grouped-Query Attention with 4 KV heads, Sliding Window Attention on three of every four layers, and a single Multi-Token Prediction head. Pre-training runs on approximately 10.6 trillion tokens via a three-phase curriculum that shifts from diverse web data toward code and math, followed by context extension to 128K and two-stage post-training that produces both direct-answer and explicit-reasoning variants. According to the authors, the design choices were selected through ablations that treat inference efficiency on commodity GPUs as a design constraint. The reported result is benchmark competitiveness at the per-token compute of a much smaller dense model, with all checkpoints released under Apache 2.0.

Core claim

Mellum 2 is a 12B-parameter MoE model with 2.5B active parameters per token whose architecture, three-phase data curriculum, and post-training produce performance competitive with open-weight models in the 4B-14B range across code generation, math and reasoning, tool use, knowledge, and safety benchmarks while operating at the per-token compute cost of a 2.5B dense model.
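The cost side of this claim can be made concrete with back-of-the-envelope arithmetic. The sketch below uses only the two parameter counts stated in the report. The bytes-per-parameter value is an assumption for illustration (16-bit weights), and the estimate ignores KV cache, activations, and any quantization.

```python
# Parameter counts as stated in the report.
TOTAL_PARAMS = 12e9
ACTIVE_PARAMS = 2.5e9

# Assumption for illustration only: weights stored in a 16-bit format.
BYTES_PER_PARAM = 2


def active_fraction(active: float, total: float) -> float:
    """Share of parameters that take part in one token's forward pass."""
    return active / total


def weight_memory_gb(params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the given number of weights, in GB (10^9 bytes)."""
    return params * bytes_per_param / 1e9


if __name__ == "__main__":
    frac = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
    print(f"active fraction per token: {frac:.1%}")
    print(f"weights held in memory:    {weight_memory_gb(TOTAL_PARAMS, BYTES_PER_PARAM):.0f} GB")
    print(f"weights read per token:    {weight_memory_gb(ACTIVE_PARAMS, BYTES_PER_PARAM):.0f} GB")
```

Under these assumptions roughly one fifth of the weights are used per token, while the full set of weights still has to be resident. The compute saving and the memory footprint are therefore separate questions.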

What carries the argument

Mixture-of-Experts with 64 experts and 8 active per token, combined with GQA (4 KV heads), Sliding Window Attention on three of every four layers, and a single Multi-Token Prediction head that also serves as a draft model.
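To illustrate what "64 experts and 8 active per token" means mechanically, here is a minimal top-k routing sketch in plain Python. Only the expert count and the number of active experts come from the report. The gating function (a softmax over the selected logits) and the scalar toy expert outputs are assumptions, and the actual router in Mellum 2 may differ.

```python
from __future__ import annotations

import math
import random

NUM_EXPERTS = 64  # stated in the report
TOP_K = 8         # stated in the report


def route(router_logits: list[float], top_k: int = TOP_K) -> dict[int, float]:
    """Select the top_k experts for one token and return {expert_id: weight}.

    Assumption: weights are a softmax over the selected logits only.
    """
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:top_k]
    peak = max(router_logits[i] for i in chosen)  # subtract max for stability
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    norm = sum(exps.values())
    return {i: e / norm for i, e in exps.items()}


def moe_output(expert_outputs: dict[int, float], weights: dict[int, float]) -> float:
    """Weighted sum of the selected experts' outputs (scalars in this toy)."""
    return sum(weights[i] * expert_outputs[i] for i in weights)


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
    weights = route(logits)
    print("experts used:", sorted(weights))
    print("weights sum to:", round(sum(weights.values()), 6))
```

Only the experts returned by `route` run for that token, which is why the per-token compute tracks the active parameter count rather than the total.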

If this is right

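  • Per-token compute should be comparable to that of a 2.5B dense model, so decoding throughput can approach that of a much smaller model while still handling complex coding and reasoning workloads; all 12B parameters must still be held in memory (or offloaded), so the memory footprint remains that of a 12B model.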
  • The same MoE plus MTP design supplies built-in speculative decoding without an external draft model.
  • The three-phase curriculum demonstrates a practical way to specialize a general pre-training run toward code and math without restarting from scratch.
  • Release of both instruct and thinking variants shows that a single base can support both direct answers and explicit reasoning traces after the same post-training stages.
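The built-in speculative decoding mentioned above can be sketched as a draft-then-verify loop. The code below shows the greedy special case with toy stand-ins for the two predictors. The function names, the toy models, and the draft length are all hypothetical, and the supplied text does not say how Mellum 2 schedules its Multi-Token Prediction head at inference time.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft_next: NextToken,
                     target_next: NextToken, num_draft: int) -> List[int]:
    """One round of greedy speculative decoding; returns the tokens to append."""
    # 1. The cheap draft proposes num_draft tokens autoregressively.
    proposed: List[int] = []
    ctx = list(prefix)
    for _ in range(num_draft):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target checks each proposal in order. A real system would score
    #    all positions in one batched forward pass; this loop is for clarity.
    accepted: List[int] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # first mismatch: keep the target's token
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every proposal matched, so the target adds one further token.
    accepted.append(target_next(ctx))
    return accepted


def toy_target(ctx: List[int]) -> int:
    """Hypothetical 'full model': counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    """Hypothetical draft: agrees with the target except right after a 3."""
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_step([0], toy_draft, toy_target, num_draft=3))
    print(speculative_step([2], toy_draft, toy_target, num_draft=3))
```

In this greedy form the output is identical to decoding with the target alone, and the draft only changes how many tokens are confirmed per round. Sampling-based decoding needs the rejection-sampling variant described in the speculative decoding literature.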

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The dual-use MTP head suggests that auxiliary prediction objectives can be engineered to serve inference acceleration as well, reducing the need for separate draft models in other architectures.
  • If the efficiency pattern holds at larger scales, similar MoE ratios could make high-performance coding agents accessible on consumer hardware.
  • The curriculum shift from broad web data to curated code and math may generalize to other narrow domains where high-quality data is limited.

Load-bearing premise

Ablations that prioritized inference efficiency on commodity GPUs correctly identified architecture and curriculum choices that deliver the claimed benchmark competitiveness.

What would settle it

Head-to-head evaluation on the reported benchmarks where Mellum 2 falls materially behind the 4B-14B open-weight baselines when both are measured at equivalent per-token compute.

Original abstract

We present Mellum 2, an open-weight 12B-parameter Mixture-of-Experts (MoE) language model with 2.5B active parameters per token. Mellum 2 is a general-purpose language model specialized in software engineering, spanning code generation and editing, debugging, multi-step reasoning, tool use and function calling, agentic coding, and conversational programming assistance, and it is the successor to the completion-focused 4B dense Mellum model. The architecture builds on the Mixture-of-Experts (64 experts, 8 active) and combines Grouped-Query Attention with 4 KV heads, Sliding Window Attention on three of every four layers, and a single Multi-Token Prediction head that doubles as both an auxiliary pre-training objective and a built-in draft model for speculative decoding; each choice was validated by ablation with inference efficiency on commodity GPUs as a design constraint. Pre-training spans approximately 10.6 trillion tokens through a three-phase curriculum that progressively shifts the mixture from diverse web data toward curated code and mathematical content, optimized with Muon under FP8 hybrid precision and a Warmup-Hold-Decay schedule with linear decay to zero. The pre-trained base is extended to a 128K context window via a layer-selective YaRN and then post-trained in two stages (supervised fine-tuning followed by RLVR), yielding two released variants: an Instruct model that answers directly and a Thinking model that emits an explicit reasoning trace before its final answer. Across code generation, math and reasoning, tool use, knowledge, and safety benchmarks, Mellum 2 is competitive with open-weight baselines in the 4B-14B range while running at the per-token compute of a 2.5B dense model. We release the base, instruct, and thinking checkpoints, together with this report on the architecture decisions, data pipeline, and training recipe behind them, under the Apache 2.0 license.
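The "Warmup-Hold-Decay schedule with linear decay to zero" named in the abstract can be sketched as a piecewise function of the training step. The abstract specifies only the linear decay to zero. The linear warmup shape, the function name, and every numeric value below are assumptions chosen for illustration, not the report's hyperparameters.

```python
def whd_lr(step: int, total_steps: int, peak_lr: float,
           warmup_steps: int, decay_steps: int) -> float:
    """Learning rate at `step` under a warmup-hold-decay schedule.

    Assumptions: linear warmup from 0 to peak_lr, constant hold at peak_lr,
    then linear decay that reaches exactly 0 at total_steps.
    """
    if warmup_steps + decay_steps > total_steps:
        raise ValueError("warmup and decay phases exceed total_steps")
    if not 0 <= step <= total_steps:
        raise ValueError("step out of range")

    decay_start = total_steps - decay_steps
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * step / warmup_steps            # warmup
    if step < decay_start or decay_steps == 0:
        return peak_lr                                  # hold
    return peak_lr * (total_steps - step) / decay_steps  # linear decay to zero


if __name__ == "__main__":
    # Hypothetical values, not taken from the report.
    for s in (0, 50, 100, 500, 800, 900, 1000):
        print(s, whd_lr(s, total_steps=1000, peak_lr=1.0,
                        warmup_steps=100, decay_steps=200))
```

A practical attraction of this shape is that the hold phase does not depend on the total run length, so a run can be extended or branched before the decay phase begins.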

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript presents Mellum 2, an open-weight 12B-parameter Mixture-of-Experts language model with 2.5B active parameters per token, specialized in software engineering tasks including code generation, editing, debugging, reasoning, tool use, and agentic coding. It details an architecture using 64 experts (8 active), GQA with 4 KV heads, sliding-window attention on three of every four layers, and a single multi-token prediction head; pre-training on ~10.6T tokens via a three-phase curriculum shifting toward code/math data, optimized with Muon in FP8; context extension to 128K via layer-selective YaRN; and post-training with SFT followed by RLVR to produce Instruct and Thinking variants. The central claim is that the model is competitive with open-weight 4B-14B baselines on code, math, reasoning, tool-use, knowledge, and safety benchmarks at the per-token compute cost of a 2.5B dense model. Checkpoints and the report are released under Apache 2.0.

Significance. If the benchmark competitiveness and ablation results hold, the work would supply a practical open-weight model optimized for coding workflows at reduced inference cost on commodity GPUs, with explicit architecture and curriculum choices that could inform efficient MoE design. The release of multiple variants plus the training recipe supports reproducibility. The absence of any supporting numerical data, however, prevents evaluation of whether these contributions are realized.

major comments (2)
  1. [Abstract] Abstract: The claim that 'Mellum 2 is competitive with open-weight baselines in the 4B-14B range' across code generation, math/reasoning, tool use, knowledge, and safety benchmarks is stated without any scores, tables, baseline citations, error bars, or evaluation details, rendering the central empirical assertion unverifiable from the supplied text.
  2. [Architecture and Training sections] Architecture section: The statement that 'each choice was validated by ablation with inference efficiency on commodity GPUs as a design constraint' for the 64-expert/8-active, GQA-4, SWA-3/4, and single-MTP configuration is unsupported by any ablation tables, metrics, or comparisons; likewise, the three-phase data curriculum is described but not accompanied by validation results.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and for highlighting the need for explicit empirical support. We agree that the submitted manuscript text lacks the necessary benchmark scores, citations, and ablation results to substantiate the central claims, and we will revise accordingly.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The claim that 'Mellum 2 is competitive with open-weight baselines in the 4B-14B range' across code generation, math/reasoning, tool use, knowledge, and safety benchmarks is stated without any scores, tables, baseline citations, error bars, or evaluation details, rendering the central empirical assertion unverifiable from the supplied text.

    Authors: We agree that the abstract as submitted contains no numerical results, table references, or baseline citations, making the competitiveness claim impossible to verify from the text alone. In the revised version we will insert a concise summary of key benchmark scores (with citations to the full evaluation tables and baseline papers) directly into the abstract so that the central empirical assertion is immediately verifiable.

    Revision: yes

  2. Referee: [Architecture and Training sections] Architecture section: The statement that 'each choice was validated by ablation with inference efficiency on commodity GPUs as a design constraint' for the 64-expert/8-active, GQA-4, SWA-3/4, and single-MTP configuration is unsupported by any ablation tables, metrics, or comparisons; likewise, the three-phase data curriculum is described but not accompanied by validation results.

    Authors: The referee is correct that the submitted manuscript describes the architectural decisions and three-phase curriculum without presenting the supporting ablation tables, performance metrics, or validation curves. We will add a dedicated ablation subsection (or appendix) that reports the relevant metrics for the expert count/active experts, GQA head count, sliding-window pattern, multi-token prediction head, and the data-mixture curriculum, including the commodity-GPU inference measurements that informed the final design.

    Revision: yes

Circularity Check

0 steps flagged

The paper is a technical report on model training; it presents no mathematical derivation or fitted prediction as a result.

Full rationale

The paper is a technical report describing an MoE model architecture, data curriculum, and training recipe. All load-bearing claims (architecture choices validated by ablation, benchmark competitiveness) rest on empirical training runs and unreported ablation results rather than any equations, first-principles derivations, or predictions that could reduce to their own inputs by construction. No self-definitional loops, fitted inputs renamed as predictions, or self-citation chains appear in the supplied text. The absence of benchmark tables or ablation numbers is a separate evidentiary gap, not a circularity issue.

Axiom & Free-Parameter Ledger

3 free parameters · 2 axioms · 0 invented entities

This is an engineering report on model training and release; the central claim rests on empirical benchmark results and internal ablations rather than derivations. Design choices such as expert count and token volume are stated, but they are not fitted parameters in the statistical sense. No new physical or mathematical entities are postulated.

free parameters (3)
  • Expert count
    64 experts chosen as part of MoE design for the target active-parameter budget.
  • Active experts per token
    8 active experts selected to achieve 2.5B active parameters.
  • Pre-training token count
    Approximately 10.6 trillion tokens used in the three-phase curriculum.
axioms (2)
  • domain assumption: Mixture-of-Experts with sparse activation preserves modeling capacity while reducing compute
    Invoked as the basis for the 64-expert, 8-active design.
  • domain assumption: Sliding window attention and grouped-query attention maintain quality at lower cost
    Used to justify the attention configuration on three of four layers.
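The attention configuration behind the second assumption can be illustrated with a small sketch: which layers use a sliding window, what the resulting causal mask looks like, and how query heads share the 4 KV heads. The "three of every four layers" pattern and the 4 KV heads come from the report. The position of the full-attention layer within each group of four, the window size, and the query-head count used in the example are hypothetical.

```python
from typing import List, Optional


def layer_uses_sliding_window(layer_idx: int, period: int = 4) -> bool:
    """True for three of every `period` layers.

    Assumption: the last layer in each group of `period` uses full attention.
    """
    return (layer_idx + 1) % period != 0


def attention_mask(seq_len: int, window: Optional[int] = None) -> List[List[bool]]:
    """mask[q][k] is True where query position q may attend to key position k.

    Always causal (k <= q). With `window` set, a query sees only the most
    recent `window` positions, itself included.
    """
    return [[k <= q and (window is None or q - k < window)
             for k in range(seq_len)]
            for q in range(seq_len)]


def kv_head_for_query_head(q_head: int, num_q_heads: int,
                           num_kv_heads: int = 4) -> int:
    """Grouped-query attention: consecutive query heads share one KV head."""
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size


if __name__ == "__main__":
    print([layer_uses_sliding_window(i) for i in range(8)])
    for row in attention_mask(5, window=2):
        print("".join("x" if allowed else "." for allowed in row))
    # 32 query heads is a hypothetical value for the example.
    print([kv_head_for_query_head(h, num_q_heads=32) for h in (0, 7, 8, 31)])
```

Both mechanisms shrink the key-value cache: a sliding-window layer only needs to keep the most recent `window` positions, and grouped-query attention stores 4 sets of keys and values per layer instead of one per query head.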


