Mellum2 Technical Report

Aral de Moor; Ivan Bondyrev; Ivan Dolgov; Joseph Shtok; Kseniia Lysaniuk; Madeeswaran Kannan; Marko Kojic; Nikita Pavlichenko; Petr Borovlev

arxiv: 2605.31268 · v1 · pith:IQ2BHWN2new · submitted 2026-05-29 · 💻 cs.CL


Pith reviewed 2026-06-28 22:57 UTC · model grok-4.3

classification 💻 cs.CL

keywords Mixture of Experts · language model · code generation · software engineering · inference efficiency · open-weight model · tool use · multi-token prediction


The pith

Mellum 2 is a 12B MoE model with 2.5B active parameters per token that is reported to be competitive with open-weight baselines in the 4B-14B range on code, math, tool-use and safety tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Mellum 2 as a general-purpose language model specialized in software engineering tasks such as code generation, editing, debugging, multi-step reasoning, tool use and agentic coding. It builds a 12B-parameter Mixture-of-Experts architecture that activates only 2.5B parameters per token through 64 experts with 8 active, Grouped-Query Attention, Sliding Window Attention on most layers, and a single Multi-Token Prediction head. Pre-training runs on 10.6 trillion tokens via a three-phase curriculum that shifts toward code and math data, followed by context extension and two-stage post-training that produces both direct-answer and explicit-reasoning variants. The design choices were selected through ablations that treat inference efficiency on commodity GPUs as a primary constraint. The result is benchmark competitiveness at the per-token cost of a much smaller dense model, with all checkpoints released under Apache 2.0.

Core claim

Mellum 2 is a 12B-parameter MoE model with 2.5B active parameters per token whose architecture, three-phase data curriculum, and post-training produce performance competitive with open-weight models in the 4B-14B range across code generation, math and reasoning, tool use, knowledge, and safety benchmarks while operating at the compute cost of a 2.5B dense model.
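To make the "64 experts, 8 active" part of the claim concrete, here is a minimal sketch of top-k expert routing for a single token. Only the expert count (64), the active count (8), and the 12B/2.5B parameter totals come from the report. The gating convention (softmax over the selected logits), the toy scalar experts, and all function names are assumptions made for illustration, not the authors' implementation.

```python
import math
import random

NUM_EXPERTS = 64  # from the report
TOP_K = 8         # from the report


def route_token(router_logits, top_k=TOP_K):
    """Return (expert_index, gate_weight) pairs for the top_k experts."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:top_k]
    # Assumption: gate weights are a softmax over the selected logits only.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, router_logits, experts, top_k=TOP_K):
    """Weighted sum of the selected experts' outputs; other experts never run."""
    return sum(gate * experts[idx](x)
               for idx, gate in route_token(router_logits, top_k))


# Toy scalar "experts" (hypothetical stand-ins for feed-forward blocks).
experts = [lambda x, scale=i + 1: x * scale for i in range(NUM_EXPERTS)]
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

chosen = route_token(logits)
y = moe_forward(1.0, logits, experts)

expert_fraction = TOP_K / NUM_EXPERTS  # share of experts run per token
active_fraction = 2.5 / 12             # share of all parameters active per token
```

The share of experts run per token (8/64) is smaller than the share of parameters reported as active (2.5B of 12B). That gap is what one would expect if components such as attention and embeddings are used for every token, although the summary above does not give a parameter breakdown.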

What carries the argument

Mixture-of-Experts with 64 experts and 8 active per token, combined with GQA (4 KV heads), Sliding Window Attention on three of every four layers, and a single Multi-Token Prediction head that also serves as a draft model.
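A rough sketch of why this attention configuration reduces inference memory: sliding-window layers cache at most a window of past tokens, while full-attention layers cache the whole sequence. The 3-of-4 pattern, the 4 KV heads, and the 128K context come from the report. The layer count, window size, head dimension, and the position of the full-attention layer inside each group of four are hypothetical values chosen only to make the arithmetic runnable.

```python
def attention_layout(num_layers, period=4, full_position=3):
    """Mark one layer per group of `period` as 'full', the rest as 'sliding'."""
    return ["full" if i % period == full_position else "sliding"
            for i in range(num_layers)]


def kv_cache_positions(seq_len, layout, window):
    """Total cached token positions, summed over layers."""
    return sum(seq_len if kind == "full" else min(seq_len, window)
               for kind in layout)


def kv_cache_bytes(seq_len, layout, window,
                   kv_heads=4, head_dim=128, bytes_per_value=2):
    """Cache size in bytes; the factor 2 covers keys and values."""
    per_position = 2 * kv_heads * head_dim * bytes_per_value
    return per_position * kv_cache_positions(seq_len, layout, window)


NUM_LAYERS = 32   # hypothetical depth
WINDOW = 4096     # hypothetical sliding-window size
CONTEXT = 131072  # 128K tokens

layout = attention_layout(NUM_LAYERS)
hybrid = kv_cache_positions(CONTEXT, layout, WINDOW)
all_full = kv_cache_positions(CONTEXT, ["full"] * NUM_LAYERS, WINDOW)
savings_ratio = hybrid / all_full
```

Under these assumed numbers the hybrid layout caches roughly 27% of the positions that an all-full-attention stack would at 128K context. The real saving depends on the model's actual depth and window size.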

If this is right

The model can be deployed on hardware that would normally support only 2.5B dense models while still handling complex coding and reasoning workloads.
The same MoE plus MTP design supplies built-in speculative decoding without an external draft model.
The three-phase curriculum demonstrates a practical way to specialize a general pre-training run toward code and math without restarting from scratch.
Release of both instruct and thinking variants shows that a single base can support both direct answers and explicit reasoning traces after the same post-training stages.
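The speculative-decoding point above can be illustrated with a generic greedy draft-and-verify loop. This is a sketch of the general technique, not of Mellum 2's MTP head: `draft_next` and `target_next` are hypothetical toy functions standing in for the draft head and the main model, and a real system would verify all drafted tokens in one batched forward pass rather than in a Python loop.

```python
def speculative_step(prefix, draft_next, target_next, k=1):
    """One greedy draft-and-verify step; returns the tokens to append."""
    # 1. The draft proposes k tokens autoregressively.
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)
    # 2. The target checks each proposed token against its own greedy choice.
    accepted, ctx = [], list(prefix)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    # 3. Everything matched, so the target's pass also yields one bonus token.
    accepted.append(target_next(ctx))
    return accepted


def generate(prefix, draft_next, target_next, n_tokens, k=1):
    """Generate n_tokens; also report how many verify steps were needed."""
    out, steps = list(prefix), 0
    while len(out) - len(prefix) < n_tokens:
        out.extend(speculative_step(out, draft_next, target_next, k))
        steps += 1
    return out[:len(prefix) + n_tokens], steps


def target_next(ctx):
    """Toy target model: always counts up by one."""
    return ctx[-1] + 1


def draft_next(ctx):
    """Toy draft: agrees with the target unless the last token is a multiple of 5."""
    return ctx[-1] + 1 if ctx[-1] % 5 else ctx[-1] + 2


tokens, steps = generate([0], draft_next, target_next, n_tokens=10)
```

With greedy verification the output is identical to what the target model would produce on its own; the draft only changes how many target passes are needed. In this toy run, 10 tokens are produced in 6 verify steps.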

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The dual-use MTP head suggests that auxiliary prediction objectives can be engineered to serve inference acceleration as well, reducing the need for separate draft models in other architectures.
If the efficiency pattern holds at larger scales, similar MoE ratios could make high-performance coding agents accessible on consumer hardware.
The curriculum shift from broad web data to curated code and math may generalize to other narrow domains where high-quality data is limited.

Load-bearing premise

Ablations that prioritized inference efficiency on commodity GPUs correctly identified architecture and curriculum choices that deliver the claimed benchmark competitiveness.

What would settle it

Head-to-head evaluation on the reported benchmarks where Mellum 2 falls materially behind the 4B-14B open-weight baselines when both are measured at equivalent per-token compute.

Original abstract

We present Mellum 2, an open-weight 12B-parameter Mixture-of-Experts (MoE) language model with 2.5B active parameters per token. Mellum 2 is a general-purpose language model specialized in software engineering, spanning code generation and editing, debugging, multi-step reasoning, tool use and function calling, agentic coding, and conversational programming assistance, and it is the successor to the completion-focused 4B dense Mellum model. The architecture builds on the Mixture-of-Experts (64 experts, 8 active) and combines Grouped-Query Attention with 4 KV heads, Sliding Window Attention on three of every four layers, and a single Multi-Token Prediction head that doubles as both an auxiliary pre-training objective and a built-in draft model for speculative decoding; each choice was validated by ablation with inference efficiency on commodity GPUs as a design constraint. Pre-training spans approximately 10.6 trillion tokens through a three-phase curriculum that progressively shifts the mixture from diverse web data toward curated code and mathematical content, optimized with Muon under FP8 hybrid precision and a Warmup-Hold-Decay schedule with linear decay to zero. The pre-trained base is extended to a 128K context window via a layer-selective YaRN and then post-trained in two stages (supervised fine-tuning followed by RLVR), yielding two released variants: an Instruct model that answers directly and a Thinking model that emits an explicit reasoning trace before its final answer. Across code generation, math and reasoning, tool use, knowledge, and safety benchmarks, Mellum 2 is competitive with open-weight baselines in the 4B-14B range while running at the per-token compute of a 2.5B dense model. We release the base, instruct, and thinking checkpoints, together with this report on the architecture decisions, data pipeline, and training recipe behind them, under the Apache 2.0 license.
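The Warmup-Hold-Decay schedule with linear decay to zero mentioned in the abstract can be sketched as a piecewise function of the training step. The shape (linear warmup, constant hold, linear decay ending at zero) follows the abstract's description; the function name, the peak learning rate, and the step counts below are hypothetical, since the abstract does not state them.

```python
def whd_lr(step, total_steps, peak_lr, warmup_steps, decay_steps):
    """Warmup-Hold-Decay: linear warmup, constant hold, linear decay to zero."""
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    if decay_steps <= 0 or warmup_steps + decay_steps > total_steps:
        raise ValueError("invalid warmup/decay lengths")
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        return peak_lr * step / warmup_steps            # warmup phase
    if step < decay_start:
        return peak_lr                                  # hold phase
    return peak_lr * (total_steps - step) / decay_steps  # decay phase


# Hypothetical run: 1000 steps, 100 warmup steps, 200 decay steps.
schedule = [whd_lr(s, 1000, 1.0, 100, 200) for s in range(1001)]
```

Unlike a cosine schedule, the hold phase keeps the learning rate at its peak for most of training, and the decay length can be chosen independently of the total run length.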

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Mellum 2 is a new open 12B MoE code model with claimed efficiency gains, but the report supplies no benchmark numbers or ablation tables to support the competitiveness claim.

Hi,

The main thing to know is that this is a technical report releasing Mellum 2, a 12B-parameter MoE model with 2.5B active parameters per token, built for code generation, editing, tool use, and related tasks. It combines standard pieces like 64 experts with 8 active, GQA with 4 KV heads, sliding window attention on three-quarters of layers, and a single MTP head, trained on a 10.6T token curriculum that shifts toward code and math, then extended to 128K context and post-trained into instruct and thinking variants. The report itself adds no new principles.

What is actually new is the specific artifact: the exact architecture choices, the three-phase data mix, the Muon optimizer setup in FP8, the layer-selective YaRN extension, and the dual post-trained checkpoints, all released under Apache 2.0. The focus on inference cost for commodity GPUs and the decision to make the MTP head serve both as auxiliary loss and speculative draft model are practical engineering details that could be reused.

The report does a reasonable job describing the training recipe, data pipeline, and the rationale for each component with efficiency as a constraint. That kind of concrete documentation is useful when people want to replicate or adapt similar setups.

The clear soft spot is the missing evidence. The abstract states that Mellum 2 is competitive with 4B-14B open baselines across code, math, reasoning, tool use, knowledge, and safety benchmarks, yet the text contains zero scores, no baseline references, no error bars, and no ablation results despite claiming those ablations were run. The stress-test note is accurate on this point; without the numbers the central claim has no observable support in the document. This turns the report into more of an announcement than a verifiable record.

This is mainly for practitioners who want an open code-specialized model to experiment with or for groups studying training recipes at this scale. Readers looking for new methods or falsifiable empirical results will not find enough here.

I would not recommend sending it for peer review. It functions as a model release note rather than a paper with testable claims.

Referee Report

2 major / 0 minor

Summary. The manuscript presents Mellum 2, an open-weight 12B-parameter Mixture-of-Experts language model with 2.5B active parameters per token, specialized in software engineering tasks including code generation, editing, debugging, reasoning, tool use, and agentic coding. It details an architecture using 64 experts (8 active), GQA with 4 KV heads, sliding-window attention on three of every four layers, and a single multi-token prediction head; pre-training on ~10.6T tokens via a three-phase curriculum shifting toward code/math data, optimized with Muon in FP8; context extension to 128K via layer-selective YaRN; and post-training with SFT followed by RLVR to produce Instruct and Thinking variants. The central claim is that the model matches open-weight 4B-14B baselines on code, math, reasoning, tool-use, knowledge, and safety benchmarks at 2.5B dense-model per-token cost. Checkpoints and the report are released under Apache 2.0.

Significance. If the benchmark competitiveness and ablation results hold, the work would supply a practical open-weight model optimized for coding workflows at reduced inference cost on commodity GPUs, with explicit architecture and curriculum choices that could inform efficient MoE design. The release of multiple variants plus the training recipe supports reproducibility. The absence of any supporting numerical data, however, prevents evaluation of whether these contributions are realized.

major comments (2)

[Abstract] The claim that 'Mellum 2 is competitive with open-weight baselines in the 4B-14B range' across code generation, math/reasoning, tool use, knowledge, and safety benchmarks is stated without any scores, tables, baseline citations, error bars, or evaluation details, rendering the central empirical assertion unverifiable from the supplied text.

[Architecture and Training sections] The statement that 'each choice was validated by ablation with inference efficiency on commodity GPUs as a design constraint' for the 64-expert/8-active, GQA-4, SWA-3/4, and single-MTP configuration is unsupported by any ablation tables, metrics, or comparisons; likewise, the three-phase data curriculum is described but not accompanied by validation results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and for highlighting the need for explicit empirical support. We agree that the submitted manuscript text lacks the necessary benchmark scores, citations, and ablation results to substantiate the central claims, and we will revise accordingly.

Referee: [Abstract] The claim that 'Mellum 2 is competitive with open-weight baselines in the 4B-14B range' across code generation, math/reasoning, tool use, knowledge, and safety benchmarks is stated without any scores, tables, baseline citations, error bars, or evaluation details, rendering the central empirical assertion unverifiable from the supplied text.

Authors: We agree that the abstract as submitted contains no numerical results, table references, or baseline citations, making the competitiveness claim impossible to verify from the text alone. In the revised version we will insert a concise summary of key benchmark scores (with citations to the full evaluation tables and baseline papers) directly into the abstract so that the central empirical assertion is immediately verifiable.

Revision: yes

Referee: [Architecture and Training sections] The statement that 'each choice was validated by ablation with inference efficiency on commodity GPUs as a design constraint' for the 64-expert/8-active, GQA-4, SWA-3/4, and single-MTP configuration is unsupported by any ablation tables, metrics, or comparisons; likewise, the three-phase data curriculum is described but not accompanied by validation results.

Authors: The referee is correct that the submitted manuscript describes the architectural decisions and three-phase curriculum without presenting the supporting ablation tables, performance metrics, or validation curves. We will add a dedicated ablation subsection (or appendix) that reports the relevant metrics for the expert count/active experts, GQA head count, sliding-window pattern, multi-token prediction head, and the data-mixture curriculum, including the commodity-GPU inference measurements that informed the final design.

Revision: yes

Circularity Check

0 steps flagged

Technical report of model training with no mathematical derivation or fitted prediction presented as a result.

The paper is a technical report describing an MoE model architecture, data curriculum, and training recipe. All load-bearing claims (architecture choices validated by ablation, benchmark competitiveness) rest on empirical training runs and unreported ablation results rather than any equations, first-principles derivations, or predictions that could reduce to their own inputs by construction. No self-definitional loops, fitted inputs renamed as predictions, or self-citation chains appear in the supplied text. The absence of benchmark tables or ablation numbers is a separate evidentiary gap, not a circularity issue.

Axiom & Free-Parameter Ledger

3 free parameters · 2 axioms · 0 invented entities

This is an engineering report on model training and release; the central claim rests on empirical benchmark results and internal ablations rather than derivations. Design choices such as expert count and token volume are stated but not fitted parameters in the statistical sense. No new physical or mathematical entities are postulated.

free parameters (3)

Expert count: 64 experts chosen as part of MoE design for the target active-parameter budget.
Active experts per token: 8 active experts selected to achieve 2.5B active parameters.
Pre-training token count: Approximately 10.6 trillion tokens used in the three-phase curriculum.

axioms (2)

Domain assumption: Mixture-of-Experts with sparse activation preserves modeling capacity while reducing compute. Invoked as the basis for the 64-expert, 8-active design.
Domain assumption: Sliding window attention and grouped-query attention maintain quality at lower cost. Used to justify the attention configuration on three of four layers.

Reference graph

Works this paper leans on

81 extracted references · 61 canonical work pages · 46 internal anchors

[1]

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

J. Ainslie, J. Lee-Thorp, M. de Jong, Y. Zemlyanskiy, F. Lebron, and S. Sanghai. “GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints”. In:arXiv preprint arXiv:2305.13245 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

L. B. Allal, A. Lozhkov, et al. “SmolLM2: When Smol Goes Big – Data-Centric Training of a Small Language Model”. In:arXiv preprint arXiv:2502.02737 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Stop Unnecessary Reflection: ARLCP for Concision-Aware Reward Shaping in Rea- soning Models

Anonymous. “Stop Unnecessary Reflection: ARLCP for Concision-Aware Reward Shaping in Rea- soning Models”. In:arXiv preprint arXiv:2602.12113 (2026)

work page arXiv 2026
[4]

Program Synthesis with Large Language Models

J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, and C. Sutton. “Program Synthesis with Large Language Models”. In:arXiv preprint arXiv:2108.07732 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021
[5]

Efficient Training of Language Models to Fill in the Middle

M. Bavarian, H. Jun, N. Tezak, et al. “Efficient Training of Language Models to Fill in the Middle”. In: arXiv preprint arXiv:2207.14255 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[6]

Longformer: The Long-Document Transformer

I. Beltagy, M. E. Peters, and A. Cohan. “Longformer: The Long-Document Transformer”. In:arXiv preprint arXiv:2004.05150 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2004
[7]

Seed-Coder: Let the Code Model Curate Data for Itself

ByteDance Seed, Y. Zhang, J. Su, Y. Sun, C. Xi, et al. “Seed-Coder: Let the Code Model Curate Data for Itself”. In:arXiv preprint arXiv:2506.03524 (2025)

work page arXiv 2025
[8]

MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation

F. Cassano, J. Gouwar, D. Nguyen, S. Nguyen, L. Phipps-Costin, D. Pinckney, M.-H. Yee, Y. Zi, C. J. Anderson, M. Q. Feldman, A. Guha, M. Greenberg, and A. Jangda. “MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation”. In:IEEE Transactions on Software Engineering 49.7 (2023), pp. 3675–3691

2023
[9]

Evaluating Large Language Models Trained on Code

M. Chen, J. Tworek, H. Jun, et al. “Evaluating Large Language Models Trained on Code”. In:arXiv preprint arXiv:2107.03374 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021
[10]

Unified Scaling Laws for Routed Language Models

A. Clark, D. de las Casas, A. Guy, A. Mensch, M. Paganini, J. Hoffmann, B. Damoc, B. Hechtman, T. Cai, S. Borgeaud, G. van den Driessche, E. Rutherford, T. Hennigan, M. Johnson, K. Millican, A. Cassirer, C. Jones, E. Buchatskaya, D. Budden, L. Sifre, S. Osindero, O. Vinyals, J. Rae, E. Elsen, K. Kavukcuoglu, and K. Simonyan. “Unified Scaling Laws for Rout...

work page arXiv 2022
[11]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. “Think You Have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge”. In:arXiv preprint arXiv:1803.05457 (2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018
[12]

Training Verifiers to Solve Math Word Problems

K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. “Training Verifiers to Solve Math Word Problems”. In:arXiv preprint arXiv:2110.14168 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021
[13]

Every Sample Matters: Leveraging Mixture-of-Experts and High-Quality Data for Efficient and Accurate Code LLM

Codefuse, Ling Team, W. Cai, Y. Cao, C. Chen, C. Chen, S. Chen, Q. Cui, P. Di, J. Fang, Z. Gong, T. Guo, Z. He, Y. Huang, C. Li, J. Li, Z. Li, S. Lian, B. Liu, S. Luo, S. Mao, M. Shen, J. Wu, J. Yang, W. Yang, T. Ye, H. Yu, W. Zhang, Z. Zhang, H. Zhao, X. Zheng, and J. Zhou. “Every Sample Matters: Leveraging Mixture-of-Experts and High-Quality Data for Ef...

work page arXiv 2025
[14]

DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

D. Dai, C. Deng, C. Zhao, et al. “DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture- of-Experts Language Models”. In:arXiv preprint arXiv:2401.06066 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

DeepSeek-AI. “DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model”. In:arXiv preprint arXiv:2405.04434 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[16]

DeepSeek-V3 Technical Report

DeepSeek-AI. “DeepSeek-V3 Technical Report”. In:arXiv preprint arXiv:2412.19437 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

Fewer Truncations Im- prove Language Modeling

H. Ding, Z. Wang, G. Paolini, V. Kumar, A. Deoras, D. Roth, and S. Soatto. “Fewer Truncations Im- prove Language Modeling”. In:Proceedings of the 41st International Conference on Machine Learning (ICML). 2024. arXiv: 2404.10830 [cs.CL]. 27 Mellum 2 Technical RepoRt v1.0 · May 2026

work page arXiv 2024
[18]

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

W. Fedus, B. Zoph, and N. Shazeer. “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity”. In:Journal of Machine Learning Research 23.120 (2022), pp. 1– 40

2022
[19]

MegaBlocks: Efficient Sparse Training with Mixture-of-Experts

T. Gale, D. Narayanan, C. Young, and M. Zaharia. “MegaBlocks: Efficient Sparse Training with Mixture-of-Experts”. In:Proceedings of the Sixth Conference on Machine Learning and Systems (ML- Sys). 2023

2023
[20]

Are we done with mmlu?arXiv preprint arXiv:2406.04127,

A. P. Gema, J. O. J. Leang, G. Hong, A. Devoto, A. C. M. Mancino, R. Saxena, X. He, Y. Zhao, X. Du, M. R. G. Madani, C. Barale, R. McHardy, J. Harris, J. Kaddour, E. van Krieken, and P. Minervini. “Are We Done with MMLU?” In:arXiv preprint arXiv:2406.04127 (2024)

work page arXiv 2024
[21]

Gemma 3 Technical Report

Gemma Team. “Gemma 3 Technical Report”. In:arXiv preprint arXiv:2503.19786 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

Better & Faster Large Language Models via Multi-token Prediction

F. Gloeckle, B. Y. Idrissi, B. Rozière, D. Lopez-Paz, and G. Synnaeve. “Better & Faster Large Lan- guage Models via Multi-token Prediction”. In:arXiv preprint arXiv:2404.19737 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

The Llama 3 Herd of Models

A. Grattafiori, A. Dubey, et al. “The Llama 3 Herd of Models”. In:arXiv preprint arXiv:2407.21783 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution

A. Gu, B. Rozière, H. Leather, A. Solar-Lezama, G. Synnaeve, and S. I. Wang. “CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution”. In:arXiv preprint arXiv:2401.03065 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[25]

Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations

A. Hägele, E. Bakouch, A. Kosson, L. B. Allal, L. Von Werra, and M. Jaggi. “Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations”. In:arXiv preprint arXiv:2405.18392 (2024)

work page arXiv 2024
[26]

Measuring Massive Multitask Language Understanding

D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. “Measuring Massive Multitask Language Understanding”. In:arXiv preprint arXiv:2009.03300 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2009
[27]

Measuring Mathematical Problem Solving With the MATH Dataset

D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. “Mea- suring Mathematical Problem Solving With the MATH Dataset”. In:arXiv preprint arXiv:2103.03874 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021
[28]

Query-Key Normalization for Transform- ers

A. Henry, P. R. Dachapally, S. S. Pawar, and Y. Chen. “Query-Key Normalization for Transform- ers”. In: Findings of the Association for Computational Linguistics: EMNLP 2020 . Association for Computational Linguistics, 2020, pp. 4246–4253

2020
[29]

RULER: What's the Real Context Size of Your Long-Context Language Models?

C.-P. Hsieh, S. Sun, S. Kriman, S. Acharya, D. Rekesh, F. Jia, Y. Zhang, and B. Ginsburg. “RULER: What’s the Real Context Size of Your Long-Context Language Models?” In:arXiv preprint arXiv:2404.06654 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[30]

MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

S. Hu, Y. Tu, X. Han, et al. “MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies”. In:arXiv preprint arXiv:2404.06395 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[31]

Qwen2.5-Coder Technical Report

B. Hui, J. Yang, Z. Cui, et al. “Qwen2.5-Coder Technical Report”. In:arXiv preprint arXiv:2409.12186 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[32]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

N. Jain, K. Han, A. Gu, W.-D. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica. “LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code”. In: arXiv preprint arXiv:2403.07974 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[33]

Mistral 7B

A. Q. Jiang, A. Sablayrolles, A. Mensch, et al. “Mistral 7B”. In:arXiv preprint arXiv:2310.06825 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[34]

Jordan, Y

K. Jordan, Y. Jin, V. Boza, J. You, F. Cesista, L. Newhouse, and J. Bernstein.Muon: An optimizer for hidden layers in neural networks . https://kellerjordan.github.io/posts/muon/. 2024

2024
[35]

Scaling Laws for Fine-Grained Mixture of Experts

J. Krajewski, J. Ludziejewski, K. Adamczewski, M. Pióro, M. Krutul, S. Antoniak, K. Ciebiera, K. Król, T. Odrzygóźdź, P. Sankowski, M. Cygan, and S. Jaszczur. “Scaling Laws for Fine-Grained Mixture of Experts”. In:arXiv preprint arXiv:2402.07871 (2024)

work page arXiv 2024
[36]

Efficient Memory Management for Large Language Model Serving with PagedAttention

W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica. “Efficient Memory Management for Large Language Model Serving with PagedAttention”. In:Pro- ceedings of the 29th Symposium on Operating Systems Principles (SOSP) . ACM, 2023, pp. 611–626. 28 Mellum 2 Technical RepoRt v1.0 · May 2026

2023
[37]

Deduplicating Training Data Makes Language Models Better

K. Lee, D. Ippolito, A. Nystrom, C. Zhang, D. Eck, C. Callison-Burch, and N. Carlini. “Deduplicating Training Data Makes Language Models Better”. In:Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) . Association for Computational Linguistics, 2022, pp. 8424–8445

2022
[38]

Leviathan, M

Y. Leviathan, M. Kalman, and Y. Matias.Fast Inference from Transformers via Speculative Decoding
[39]

Fast Inference from Transformers via Speculative Decoding

arXiv: 2211.17192 [cs.LG]. uRl: https://arxiv.org/abs/2211.17192

work page internal anchor Pith review Pith/arXiv arXiv
[40]

GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers

Q. Li, L. Cui, X. Zhao, L. Kong, and W. Bi. “GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers”. In:arXiv preprint arXiv:2402.19255 (2024)

work page arXiv 2024
[41]

StarCoder: may the source be with you!

R. Li et al. “StarCoder: May the Source Be with You!” In:arXiv preprint arXiv:2305.06161 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[42]

TruthfulQA: Measuring How Models Mimic Human Falsehoods

S. Lin, J. Hilton, and O. Evans. “TruthfulQA: Measuring How Models Mimic Human Falsehoods”. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) . Association for Computational Linguistics, 2022, pp. 3214–3252

2022
[43]

Ring-1T Technical Report

Ling Team. “Ring-1T Technical Report”. In:arXiv preprint arXiv:2510.18855 (2025)

work page arXiv 2025
[44]

Ministral 3

A. H. Liu et al. “Ministral 3”. In:arXiv preprint arXiv:2601.08584 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[45]

Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation

J. Liu, C. S. Xia, Y. Wang, and L. Zhang. “Is Your Code Generated by ChatGPT Really Correct? Rigor- ous Evaluation of Large Language Models for Code Generation”. In:arXiv preprint arXiv:2305.01210 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[46]

Muon is Scalable for LLM Training

J. Liu, J. Su, X. Yao, et al. “Muon is Scalable for LLM Training”. In:arXiv preprint arXiv:2502.16982 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[47]

Understanding R1-Zero-Like Training: A Critical Perspective

Z. Liu, C. Chen, W. Li, P. Qi, T. Pang, C. Du, W. S. Lee, and M. Lin. “Understanding R1-Zero-Like Training: A Critical Perspective”. In:Conference on Language Modeling (COLM) . 2025

2025
[48]

Decoupled Weight Decay Regularization

I. Loshchilov and F. Hutter. “Decoupled Weight Decay Regularization”. In:International Confer- ence on Learning Representations (ICLR) . 2019. uRl: https : / / openreview . net / forum ? id = Bkg6RiCqY7

2019
[49]

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

M. Mazeika, L. Phan, X. Yin, A. Zou, Z. Wang, N. Mu, E. Sakhaee, N. Li, S. Basart, B. Li, D. Forsyth, and D. Hendrycks. “HarmBench: A Standardized Evaluation Framework for Automated Red Team- ing and Robust Refusal”. In:arXiv preprint arXiv:2402.04249 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[50]

FP8 Formats for Deep Learning

P. Micikevicius, D. Stosic, N. Burgess, et al. “FP8 Formats for Deep Learning”. In:arXiv preprint arXiv:2209.05433 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[51]

MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures

J. Ni, F. Xue, X. Yue, Y. Deng, M. Shah, K. Jain, G. Neubig, and Y. You. “MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures”. In:arXiv preprint arXiv:2406.06565 (2024)

work page arXiv 2024
[52]

NeMo Gym: An Open Source Framework for Scaling Reinforcement Learning Environments for LLM

NVIDIA. NeMo Gym: An Open Source Framework for Scaling Reinforcement Learning Environments for LLM. https://github.com/NVIDIA-NeMo/Gym. GitHub repository. 2025

2025
[53]

NeMo RL: A Scalable and Efficient Post-Training Library

NVIDIA. NeMo RL: A Scalable and Efficient Post-Training Library . https://github.com/NVIDIA- NeMo/RL. GitHub repository. 2025

2025
[54]

NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model

NVIDIA. “NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Rea- soning Model”. In:arXiv preprint arXiv:2508.14444 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[55]

The Berkeley Function Calling Leaderboard (BFCL): From Tool Use to Agentic Evaluation of Large Language Models

S. G. Patil, H. Mao, F. Yan, C. C.-J. Ji, V. Suresh, I. Stoica, and J. E. Gonzalez. “The Berkeley Function Calling Leaderboard (BFCL): From Tool Use to Agentic Evaluation of Large Language Models”. In: Proceedings of the 42nd International Conference on Machine Learning . 2025, pp. 48371–48392

2025
[56]

Mellum: Production-Grade in-IDE Contextual Code Completion with Multi-File Project Understanding

N. Pavlichenko, I. Nazarov, I. Dolgov, E. Garanina, D. Ustalov, I. Bondyrev, K. Lysaniuk, E. Vu, K. Chekmenev, J. Shtok, Y. Golubev, A. Semenkin, and U. Sazanovich. “Mellum: Production-Grade in-IDE Contextual Code Completion with Multi-File Project Understanding”. In:arXiv preprint arXiv:2510.05788 (2025). 29 Mellum 2 Technical RepoRt v1.0 · May 2026

work page arXiv 2025
[57]

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

G. Penedo, H. Kydlicek, L. B. Allal, et al. “The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale”. In: arXiv preprint arXiv:2406.17557 (2024)

[58]

YaRN: Efficient Context Window Extension of Large Language Models

B. Peng, J. Quesnelle, H. Fan, and E. Shippole. “YaRN: Efficient Context Window Extension of Large Language Models”. In: arXiv preprint arXiv:2309.00071 (2024)

[59]

Qwen2.5 Technical Report

Qwen Team. “Qwen2.5 Technical Report”. In: arXiv preprint arXiv:2412.15115 (2024)

[60]

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman. “GPQA: A Graduate-Level Google-Proof Q&A Benchmark”. In: arXiv preprint arXiv:2311.12022 (2023)

[61]

XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models

P. Röttger, H. R. Kirk, B. Vidgen, G. Attanasio, F. Bianchi, and D. Hovy. “XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models”. In: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). Association for ...

[62]

WinoGrande: An Adversarial Winograd Schema Challenge at Scale

K. Sakaguchi, R. Le Bras, C. Bhagavatula, and Y. Choi. “WinoGrande: An Adversarial Winograd Schema Challenge at Scale”. In: Communications of the ACM 64.9 (2021), pp. 99–106

[63]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. K. Li, Y. Wu, and D. Guo. “DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models”. In: arXiv preprint arXiv:2402.03300 (2024)

[64]

GLU Variants Improve Transformer

N. Shazeer. “GLU Variants Improve Transformer”. In: arXiv preprint arXiv:2002.05202 (2020)

[65]

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro. “Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism”. In: arXiv preprint arXiv:1909.08053 (2020)

[66]

Arcee Trinity Large Technical Report

V. Singh, L. Krauss, S. Jaghouar, M. Sirovatka, C. Goddard, F. Obied, J. M. Ong, J. Straube, A. Harley, C. Stewart, C. Kealty, M. Panahi, S. Kirsten, A. Deshpande, A. Vij, A. Bresnu, P. Veldurthi, R. Ravishankar, H. Bishnoi, M. McQuade, J. Hagemann, and L. Atkins. “Arcee Trinity Large Technical Report”. In: arXiv preprint arXiv:2602.17004 (2026)

[67]

Reasoning Gym: Reasoning Environments for Reinforcement Learning with Verifiable Rewards

Z. Stojanovski, O. Stanley, J. Sharratt, R. Jones, A. Adefioye, J. Kaddour, and A. Köpf. Reasoning Gym: Reasoning Environments for Reinforcement Learning with Verifiable Rewards. 2025. arXiv: 2505.24760 [cs.LG]. URL: https://arxiv.org/abs/2505.24760

[68]

Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset

D. Su, K. Kong, Y. Lin, J. Jennings, B. Norick, M. Kliegl, M. Patwary, M. Shoeybi, and B. Catanzaro. “Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset”. In: arXiv preprint arXiv:2412.02595 (2024)

[69]

RoFormer: Enhanced Transformer with Rotary Position Embedding

J. Su, M. Ahmed, Y. Lu, S. Pan, B. Wen, and Y. Liu. “RoFormer: Enhanced Transformer with Rotary Position Embedding”. In: Neurocomputing 568 (2024), p. 127063

[70]

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. V. Le, E. H. Chi, D. Zhou, and J. Wei. “Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them”. In: arXiv preprint arXiv:2210.09261 (2022)

[71]

Qwen3.5: Towards Native Multimodal Agents

Q. Team. “Qwen3.5: Towards Native Multimodal Agents”. Feb. 2026

[72]

Olmo 3

Team Olmo, A. Ettinger, A. Bertsch, B. Kuehl, et al. “Olmo 3”. In: arXiv preprint arXiv:2512.13961 (2025)

[73]

Llama 2: Open Foundation and Fine-Tuned Chat Models

H. Touvron, L. Martin, K. Stone, et al. “Llama 2: Open Foundation and Fine-Tuned Chat Models”. In: arXiv preprint arXiv:2307.09288 (2023)

[74]

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang, T. Li, M. Ku, K. Wang, A. Zhuang, R. Fan, X. Yue, and W. Chen. “MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark”. In: arXiv preprint arXiv:2406.01574 (2024)

[75]

Qwen3 Technical Report

A. Yang, A. Yang, B. Yang, et al. “Qwen3 Technical Report”. In: arXiv preprint arXiv:2505.09388 (2025)

[76]

Gated Delta Networks: Improving Mamba2 with Delta Rule

S. Yang, J. Kautz, and A. Hatamizadeh. “Gated Delta Networks: Improving Mamba2 with Delta Rule”. In: International Conference on Learning Representations (ICLR). arXiv:2412.06464. 2025

[77]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, W. Dai, T. Fan, G. Liu, L. Liu, et al. “DAPO: An Open-Source LLM Reinforcement Learning System at Scale”. In: arXiv preprint arXiv:2503.14476 (2025)

[78]

HellaSwag: Can a Machine Really Finish Your Sentence?

R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi. “HellaSwag: Can a Machine Really Finish Your Sentence?” In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2019, pp. 4791–4800

[79]

Root Mean Square Layer Normalization

B. Zhang and R. Sennrich. “Root Mean Square Layer Normalization”. In: Advances in Neural Information Processing Systems. Vol. 32. 2019, pp. 12360–12371

[80]

Instruction-Following Evaluation for Large Language Models

J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y. Luan, D. Zhou, and L. Hou. “Instruction-Following Evaluation for Large Language Models”. In: arXiv preprint arXiv:2311.07911 (2023)


[1]

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

J. Ainslie, J. Lee-Thorp, M. de Jong, Y. Zemlyanskiy, F. Lebron, and S. Sanghai. “GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints”. In: arXiv preprint arXiv:2305.13245 (2023)

[2]

SmolLM2: When Smol Goes Big – Data-Centric Training of a Small Language Model

L. B. Allal, A. Lozhkov, et al. “SmolLM2: When Smol Goes Big – Data-Centric Training of a Small Language Model”. In: arXiv preprint arXiv:2502.02737 (2025)

[3]

Stop Unnecessary Reflection: ARLCP for Concision-Aware Reward Shaping in Reasoning Models

Anonymous. “Stop Unnecessary Reflection: ARLCP for Concision-Aware Reward Shaping in Reasoning Models”. In: arXiv preprint arXiv:2602.12113 (2026)

[4]

Program Synthesis with Large Language Models

J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, and C. Sutton. “Program Synthesis with Large Language Models”. In: arXiv preprint arXiv:2108.07732 (2021)

[5]

Efficient Training of Language Models to Fill in the Middle

M. Bavarian, H. Jun, N. Tezak, et al. “Efficient Training of Language Models to Fill in the Middle”. In: arXiv preprint arXiv:2207.14255 (2022)

[6]

Longformer: The Long-Document Transformer

I. Beltagy, M. E. Peters, and A. Cohan. “Longformer: The Long-Document Transformer”. In: arXiv preprint arXiv:2004.05150 (2020)

[7]

Seed-Coder: Let the Code Model Curate Data for Itself

ByteDance Seed, Y. Zhang, J. Su, Y. Sun, C. Xi, et al. “Seed-Coder: Let the Code Model Curate Data for Itself”. In: arXiv preprint arXiv:2506.03524 (2025)

[8]

MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation

F. Cassano, J. Gouwar, D. Nguyen, S. Nguyen, L. Phipps-Costin, D. Pinckney, M.-H. Yee, Y. Zi, C. J. Anderson, M. Q. Feldman, A. Guha, M. Greenberg, and A. Jangda. “MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation”. In: IEEE Transactions on Software Engineering 49.7 (2023), pp. 3675–3691

[9]

Evaluating Large Language Models Trained on Code

M. Chen, J. Tworek, H. Jun, et al. “Evaluating Large Language Models Trained on Code”. In: arXiv preprint arXiv:2107.03374 (2021)

[10]

Unified Scaling Laws for Routed Language Models

A. Clark, D. de las Casas, A. Guy, A. Mensch, M. Paganini, J. Hoffmann, B. Damoc, B. Hechtman, T. Cai, S. Borgeaud, G. van den Driessche, E. Rutherford, T. Hennigan, M. Johnson, K. Millican, A. Cassirer, C. Jones, E. Buchatskaya, D. Budden, L. Sifre, S. Osindero, O. Vinyals, J. Rae, E. Elsen, K. Kavukcuoglu, and K. Simonyan. “Unified Scaling Laws for Rout...

[11]

Think You Have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. “Think You Have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge”. In: arXiv preprint arXiv:1803.05457 (2018)

[12]

Training Verifiers to Solve Math Word Problems

K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. “Training Verifiers to Solve Math Word Problems”. In: arXiv preprint arXiv:2110.14168 (2021)

[13]

Every Sample Matters: Leveraging Mixture-of-Experts and High-Quality Data for Efficient and Accurate Code LLM

Codefuse, Ling Team, W. Cai, Y. Cao, C. Chen, C. Chen, S. Chen, Q. Cui, P. Di, J. Fang, Z. Gong, T. Guo, Z. He, Y. Huang, C. Li, J. Li, Z. Li, S. Lian, B. Liu, S. Luo, S. Mao, M. Shen, J. Wu, J. Yang, W. Yang, T. Ye, H. Yu, W. Zhang, Z. Zhang, H. Zhao, X. Zheng, and J. Zhou. “Every Sample Matters: Leveraging Mixture-of-Experts and High-Quality Data for Ef...

[14]

DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

D. Dai, C. Deng, C. Zhao, et al. “DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models”. In: arXiv preprint arXiv:2401.06066 (2024)

[15]

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

DeepSeek-AI. “DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model”. In: arXiv preprint arXiv:2405.04434 (2024)

[16]

DeepSeek-V3 Technical Report

DeepSeek-AI. “DeepSeek-V3 Technical Report”. In: arXiv preprint arXiv:2412.19437 (2025)

[17]

Fewer Truncations Improve Language Modeling

H. Ding, Z. Wang, G. Paolini, V. Kumar, A. Deoras, D. Roth, and S. Soatto. “Fewer Truncations Improve Language Modeling”. In: Proceedings of the 41st International Conference on Machine Learning (ICML). 2024. arXiv: 2404.10830 [cs.CL]

[18]

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

W. Fedus, B. Zoph, and N. Shazeer. “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity”. In: Journal of Machine Learning Research 23.120 (2022), pp. 1–40

[19]

MegaBlocks: Efficient Sparse Training with Mixture-of-Experts

T. Gale, D. Narayanan, C. Young, and M. Zaharia. “MegaBlocks: Efficient Sparse Training with Mixture-of-Experts”. In: Proceedings of the Sixth Conference on Machine Learning and Systems (MLSys). 2023

[20]

Are We Done with MMLU?

A. P. Gema, J. O. J. Leang, G. Hong, A. Devoto, A. C. M. Mancino, R. Saxena, X. He, Y. Zhao, X. Du, M. R. G. Madani, C. Barale, R. McHardy, J. Harris, J. Kaddour, E. van Krieken, and P. Minervini. “Are We Done with MMLU?” In: arXiv preprint arXiv:2406.04127 (2024)

[21]

Gemma 3 Technical Report

Gemma Team. “Gemma 3 Technical Report”. In: arXiv preprint arXiv:2503.19786 (2025)

[22]

Better & Faster Large Language Models via Multi-token Prediction

F. Gloeckle, B. Y. Idrissi, B. Rozière, D. Lopez-Paz, and G. Synnaeve. “Better & Faster Large Language Models via Multi-token Prediction”. In: arXiv preprint arXiv:2404.19737 (2024)

[23]

The Llama 3 Herd of Models

A. Grattafiori, A. Dubey, et al. “The Llama 3 Herd of Models”. In: arXiv preprint arXiv:2407.21783 (2024)

[24]

CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution

A. Gu, B. Rozière, H. Leather, A. Solar-Lezama, G. Synnaeve, and S. I. Wang. “CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution”. In: arXiv preprint arXiv:2401.03065 (2024)

[25]

Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations

A. Hägele, E. Bakouch, A. Kosson, L. B. Allal, L. Von Werra, and M. Jaggi. “Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations”. In: arXiv preprint arXiv:2405.18392 (2024)

[26]

Measuring Massive Multitask Language Understanding

D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. “Measuring Massive Multitask Language Understanding”. In: arXiv preprint arXiv:2009.03300 (2021)

[27]

Measuring Mathematical Problem Solving With the MATH Dataset

D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. “Measuring Mathematical Problem Solving With the MATH Dataset”. In: arXiv preprint arXiv:2103.03874 (2021)

[28]

Query-Key Normalization for Transformers

A. Henry, P. R. Dachapally, S. S. Pawar, and Y. Chen. “Query-Key Normalization for Transformers”. In: Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics, 2020, pp. 4246–4253

[29]

RULER: What's the Real Context Size of Your Long-Context Language Models?

C.-P. Hsieh, S. Sun, S. Kriman, S. Acharya, D. Rekesh, F. Jia, Y. Zhang, and B. Ginsburg. “RULER: What’s the Real Context Size of Your Long-Context Language Models?” In: arXiv preprint arXiv:2404.06654 (2024)

[30]

MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

S. Hu, Y. Tu, X. Han, et al. “MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies”. In: arXiv preprint arXiv:2404.06395 (2024)

[31]

Qwen2.5-Coder Technical Report

B. Hui, J. Yang, Z. Cui, et al. “Qwen2.5-Coder Technical Report”. In: arXiv preprint arXiv:2409.12186 (2024)

[32]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

N. Jain, K. Han, A. Gu, W.-D. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica. “LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code”. In: arXiv preprint arXiv:2403.07974 (2024)

[33]

Mistral 7B

A. Q. Jiang, A. Sablayrolles, A. Mensch, et al. “Mistral 7B”. In: arXiv preprint arXiv:2310.06825 (2023)

[34]

Muon: An optimizer for hidden layers in neural networks

K. Jordan, Y. Jin, V. Boza, J. You, F. Cesista, L. Newhouse, and J. Bernstein. Muon: An optimizer for hidden layers in neural networks. https://kellerjordan.github.io/posts/muon/. 2024

[35]

Scaling Laws for Fine-Grained Mixture of Experts

J. Krajewski, J. Ludziejewski, K. Adamczewski, M. Pióro, M. Krutul, S. Antoniak, K. Ciebiera, K. Król, T. Odrzygóźdź, P. Sankowski, M. Cygan, and S. Jaszczur. “Scaling Laws for Fine-Grained Mixture of Experts”. In: arXiv preprint arXiv:2402.07871 (2024)

[36]

Efficient Memory Management for Large Language Model Serving with PagedAttention

W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica. “Efficient Memory Management for Large Language Model Serving with PagedAttention”. In: Proceedings of the 29th Symposium on Operating Systems Principles (SOSP). ACM, 2023, pp. 611–626

[37]

Deduplicating Training Data Makes Language Models Better

K. Lee, D. Ippolito, A. Nystrom, C. Zhang, D. Eck, C. Callison-Burch, and N. Carlini. “Deduplicating Training Data Makes Language Models Better”. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2022, pp. 8424–8445

[38]

Fast Inference from Transformers via Speculative Decoding

Y. Leviathan, M. Kalman, and Y. Matias. Fast Inference from Transformers via Speculative Decoding. arXiv: 2211.17192 [cs.LG]. URL: https://arxiv.org/abs/2211.17192

[40]

GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers

Q. Li, L. Cui, X. Zhao, L. Kong, and W. Bi. “GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers”. In: arXiv preprint arXiv:2402.19255 (2024)

[41]

StarCoder: may the source be with you!

R. Li et al. “StarCoder: May the Source Be with You!” In: arXiv preprint arXiv:2305.06161 (2023)

[42]

TruthfulQA: Measuring How Models Mimic Human Falsehoods

S. Lin, J. Hilton, and O. Evans. “TruthfulQA: Measuring How Models Mimic Human Falsehoods”. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2022, pp. 3214–3252

[43]

Ring-1T Technical Report

Ling Team. “Ring-1T Technical Report”. In: arXiv preprint arXiv:2510.18855 (2025)

[44]

Ministral 3

A. H. Liu et al. “Ministral 3”. In: arXiv preprint arXiv:2601.08584 (2026)

[45]

Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation

J. Liu, C. S. Xia, Y. Wang, and L. Zhang. “Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation”. In: arXiv preprint arXiv:2305.01210 (2023)

[46]

Muon is Scalable for LLM Training

J. Liu, J. Su, X. Yao, et al. “Muon is Scalable for LLM Training”. In: arXiv preprint arXiv:2502.16982 (2025)

[47]

Understanding R1-Zero-Like Training: A Critical Perspective

Z. Liu, C. Chen, W. Li, P. Qi, T. Pang, C. Du, W. S. Lee, and M. Lin. “Understanding R1-Zero-Like Training: A Critical Perspective”. In: Conference on Language Modeling (COLM). 2025

[48]

Decoupled Weight Decay Regularization

I. Loshchilov and F. Hutter. “Decoupled Weight Decay Regularization”. In: International Conference on Learning Representations (ICLR). 2019. URL: https://openreview.net/forum?id=Bkg6RiCqY7

[49]

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

M. Mazeika, L. Phan, X. Yin, A. Zou, Z. Wang, N. Mu, E. Sakhaee, N. Li, S. Basart, B. Li, D. Forsyth, and D. Hendrycks. “HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal”. In: arXiv preprint arXiv:2402.04249 (2024)

[50]

FP8 Formats for Deep Learning

P. Micikevicius, D. Stosic, N. Burgess, et al. “FP8 Formats for Deep Learning”. In: arXiv preprint arXiv:2209.05433 (2022)
