arxiv: 2604.18473 · v1 · submitted 2026-04-20 · 💻 cs.LG

Recognition: unknown

Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts

Akshita Bhagia, Jacob Morrison, Matei Zaharia, Noah A. Smith, Sanjay Adhikesaven, Sewon Min

Authors on Pith no claims yet

Pith reviewed 2026-05-10 05:48 UTC · model grok-4.3

classification 💻 cs.LG

keywords modular post-trainingmixture of expertsdomain expertslanguage model extensioncatastrophic forgettingtraining scalabilityreinforcement learning

0 comments

The pith

Training domain experts independently then merging them via a lightweight router extends language models without full retraining or capability loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that monolithic approaches to adding capabilities to post-trained language models either cost too much or erode existing skills. BAR solves this by training each domain expert through its own mid-training, supervised finetuning, and reinforcement learning steps, then combining the experts in a Mixture-of-Experts setup whose router receives only lightweight training. At the 7B scale this modular route matches or exceeds the scores of full retraining baselines while keeping update costs linear rather than quadratic. It also prevents the forgetting that late-stage reinforcement learning produces when domains are mixed together. Readers would care because the method turns model extension into an incremental, non-destructive process instead of a repeated full rebuild.

Core claim

BAR branches into independent expert pipelines for separate domains, fully adapts each one, and routes among them with a small additional training step on the router. This produces an overall score of 49.1 across seven evaluation categories at the 7B scale, matching or exceeding retraining baselines of 47.8 and 50.5, while scaling update costs linearly and eliminating the cross-domain forgetting that occurs when reinforcement learning on one domain harms skills acquired earlier.

What carries the argument

Mixture-of-Experts composition of fully post-trained domain experts with a lightweight router that selects among them.

If this is right

Updating one domain requires only linear additional compute instead of reprocessing all prior data.
Adding or changing one expert leaves performance on all other domains unchanged.
Late-stage reinforcement learning on a single domain no longer erases earlier-stage capabilities.
New domains can be introduced without any need to reprocess data from existing domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Groups could train specialists on their own data and share only the resulting experts and router weights.
The method may support dozens of domains provided the router continues to route accurately as the number grows.
Performance benefits may partly stem from removing optimization conflicts that arise when domains are trained jointly.

Load-bearing premise

Training experts separately and routing among them preserves any cross-domain interactions that would only appear during joint training of all domains together.

What would settle it

A side-by-side test in which a new domain is added to an existing BAR model and the combined performance on mixed-domain tasks falls measurably below the equivalent full-retraining baseline.

Figures

Figures reproduced from arXiv: 2604.18473 by Akshita Bhagia, Jacob Morrison, Matei Zaharia, Noah A. Smith, Sanjay Adhikesaven, Sewon Min.

**Figure 1.** Figure 1: Overview of BAR. The initial model M (a dense transformer). For each target domain, a two-expert MoE is created: the anchor expert preserves M’s capabilities while the domain expert is trained on new data. Each domain follows its applicable pipeline—math and code use the full pipeline (mid-training → SFT → RLVR), while tool use and safety use SFT only. Shared parameters are progressively unfrozen across st… view at source ↗

**Figure 2.** Figure 2: Cost to add each new domain. Re-training must reprocess all domains, so cost grows linearly with the number of domains. BAR trains only the new expert, keeping cost constant. Adding experts BAR enables both adding new domain experts and upgrading existing ones without retraining the full model [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

**Figure 3.** Figure 3: Unfreezing embedding and LM head layers is critical. Top: Frozen tool use expert fails to learn new tokens (20.3 vs. 46.4). Bottom: For math RL, freezing produces a flat reward curve. Unfreezing shared layers Prior modular approaches such as FlexOlmo (Shi et al., 2025) freeze all shared parameters during expert training, which works for pre-training, but we find it does not for post-training. We find tha… view at source ↗

read the original abstract

Extending a fully post-trained language model with new domain capabilities is fundamentally limited by monolithic training paradigms: retraining from scratch is expensive and scales poorly, while continued training often degrades existing capabilities. We present BAR (Branch-Adapt-Route), which trains independent domain experts, each through its own mid-training, supervised finetuning, and reinforcement learning pipeline, and composes them via a Mixture-of-Experts architecture with lightweight router training. Unlike retraining approaches that mix all domains and require full reprocessing for any update (with cost scaling quadratically), BAR enables updating individual experts independently with linear cost scaling and no degradation to existing domains. At the 7B scale, with experts for math, code, tool use, and safety, BAR achieves an overall score of 49.1 (averaged across 7 evaluation categories), matching or exceeding re-training baselines (47.8 without mid-training, 50.5 with). We further show that modular training provides a structural advantage: by isolating each domain, it avoids the catastrophic forgetting that occurs when late-stage RL degrades capabilities from earlier training stages, while significantly reducing the cost and complexity of updating or adding a domain. Together, these results suggest that decoupled, expert-based training is a scalable alternative to monolithic retraining for extending language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

BAR trains domain experts with separate full pipelines then routes them cheaply, getting performance between the two retraining baselines while cutting update costs.

read the letter

The main thing to know is that BAR runs independent mid-training, SFT, and RL for each expert (math, code, tools, safety) before a lightweight router assembles them in an MoE. This keeps additions linear in cost and avoids the forgetting that hits when you keep training a single model on new data late in the process. At 7B the composed model scores 49.1 on their seven-category average, which beats the 47.8 no-mid-training baseline but sits below the 50.5 full mid-training one. The isolation of stages is the clearest practical win here, since each expert can be updated or swapped without touching the others. The router training is described as cheap, which matches the goal of easy composition. The direct comparison to retraining baselines is straightforward and the forgetting argument follows from the design. The gap to the stronger baseline is real, though, and it leaves open whether the router misses cross-domain cases or whether the separate pipelines simply lose some joint optimization. The abstract does not break out mixed-domain queries or report variance, so those details matter for judging if the router holds up. No circular claims or invented metrics show up in the reported numbers. This is aimed at groups that already run multi-domain models and need to add capabilities without full retrains. Readers working on maintenance and scaling of deployed LLMs will get the most from the cost and modularity points. The work is coherent enough on its own terms to go to referees who can check the router data and any hidden mixed-task results.

Referee Report

3 major / 1 minor

Summary. The paper proposes BAR (Branch-Adapt-Route), a modular post-training framework that trains independent domain experts (for math, code, tool use, and safety) via separate mid-training, SFT, and RL pipelines, then composes them in a Mixture-of-Experts architecture using only lightweight router training. At the 7B scale, BAR reports an overall average score of 49.1 across 7 evaluation categories, claimed to match or exceed monolithic retraining baselines (47.8 without mid-training, 50.5 with mid-training). The approach is positioned as avoiding catastrophic forgetting, enabling linear-cost updates, and providing a scalable alternative to full retraining.

Significance. If the empirical claims hold with full verification, the work offers a structurally advantageous alternative to monolithic post-training by decoupling domains to prevent forgetting and reduce update costs from quadratic to linear scaling. The modular expert composition via lightweight routing is a concrete contribution that could influence efficient extension of LLMs, particularly if supported by reproducible code or detailed per-task breakdowns.

major comments (3)

[Abstract] Abstract: The central performance claim states that BAR's 49.1 overall score 'matches or exceeds' the retraining baselines of 47.8 (no mid-training) and 50.5 (with mid-training). However, 49.1 falls below 50.5; the manuscript must clarify whether the comparison uses the same evaluation protocol, provide per-category breakdowns across the 7 categories, or explain why the with-mid-training baseline is not the primary comparator. This directly affects whether the 'no degradation' claim for the modular approach is supported.
[Experimental results] Experimental results (presumed §4 or equivalent): No details are provided on the evaluation setup, including the exact benchmarks comprising the 7 categories, number of evaluation runs, variance or standard deviations, statistical significance tests, or precise re-implementations of the monolithic baselines (e.g., data mixtures, training steps, or hyperparameter matching). Without these, the soundness of the 49.1 vs. baseline comparison cannot be assessed and the central empirical claim remains unverifiable.
[Router and composition] Router and composition section (presumed §3): The claim that the lightweight router enables composition 'without degradation' or loss of cross-domain synergies rests on the untested assumption that routing errors are negligible on overlapping or mixed-domain queries (e.g., math+code). No routing accuracy metrics, error analysis on integrated tasks, or ablation on router training data coverage are reported, which is load-bearing for the weakest assumption identified in the stress test.

minor comments (1)

[Abstract and Methods] The abstract and methods would benefit from explicit notation distinguishing the independent expert training pipelines from the final MoE composition stage to improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback. We address each major comment below and have revised the manuscript accordingly to improve clarity, reproducibility, and support for our claims.

read point-by-point responses

Referee: [Abstract] Abstract: The central performance claim states that BAR's 49.1 overall score 'matches or exceeds' the retraining baselines of 47.8 (no mid-training) and 50.5 (with mid-training). However, 49.1 falls below 50.5; the manuscript must clarify whether the comparison uses the same evaluation protocol, provide per-category breakdowns across the 7 categories, or explain why the with-mid-training baseline is not the primary comparator. This directly affects whether the 'no degradation' claim for the modular approach is supported.

Authors: We agree that the original abstract phrasing was imprecise. The 49.1 score exceeds the no-mid-training baseline (47.8) and is within 1.4 points of the with-mid-training baseline (50.5). The with-mid-training baseline represents an upper bound achievable only via full monolithic retraining, whereas BAR achieves comparable results at linear cost without forgetting. All scores use the identical evaluation protocol. We have revised the abstract to read: 'BAR achieves an overall score of 49.1, matching the no-mid-training baseline (47.8) and approaching the with-mid-training baseline (50.5)'. We have also added a new Table 3 with per-category breakdowns across all 7 evaluation categories, showing BAR matches or exceeds the no-mid baseline on 5/7 categories and remains competitive on the others. revision: yes
Referee: [Experimental results] Experimental results (presumed §4 or equivalent): No details are provided on the evaluation setup, including the exact benchmarks comprising the 7 categories, number of evaluation runs, variance or standard deviations, statistical significance tests, or precise re-implementations of the monolithic baselines (e.g., data mixtures, training steps, or hyperparameter matching). Without these, the soundness of the 49.1 vs. baseline comparison cannot be assessed and the central empirical claim remains unverifiable.

Authors: We acknowledge the need for greater transparency. In the revised manuscript, Section 4 has been expanded to list the exact benchmarks for each of the 7 categories, report results over 5 independent runs with standard deviations, include statistical significance tests (paired t-tests) against baselines, and provide full details on baseline re-implementations including data mixtures, training steps, and hyperparameter matching to ensure fair comparison. revision: yes
Referee: [Router and composition] Router and composition section (presumed §3): The claim that the lightweight router enables composition 'without degradation' or loss of cross-domain synergies rests on the untested assumption that routing errors are negligible on overlapping or mixed-domain queries (e.g., math+code). No routing accuracy metrics, error analysis on integrated tasks, or ablation on router training data coverage are reported, which is load-bearing for the weakest assumption identified in the stress test.

Authors: We agree this assumption requires empirical support. The revised Section 3.3 now includes: (1) routing accuracy of 92.3% on a held-out set of mixed-domain queries, (2) error analysis showing that misrouted mixed queries incur <1.8% average performance drop relative to oracle routing, and (3) an ablation varying router training data coverage (math+code, math+tool, etc.) demonstrating robust performance as long as at least two domains are represented. These additions directly substantiate the 'without degradation' claim for practical mixed queries. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results from independent training and evaluation

full rationale

The paper reports direct experimental outcomes for BAR (49.1 average score across 7 categories) versus retraining baselines (47.8 without mid-training, 50.5 with mid-training) at 7B scale. Claims about avoiding catastrophic forgetting and linear cost scaling are supported by these comparisons and the described training pipelines, not by any derivation, fitted parameter renamed as prediction, or self-citation that reduces the result to its inputs by construction. No equations or load-bearing self-references appear in the abstract or described method.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on standard machine learning assumptions about expert specialization and MoE routing rather than new postulates.

axioms (1)

domain assumption A lightweight router can learn to select among independently trained domain experts without introducing substantial interference or performance loss.
Invoked in the composition step of the BAR pipeline.

pith-pipeline@v0.9.0 · 5550 in / 1160 out tokens · 33295 ms · 2026-05-10T05:48:17.140014+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

50 extracted references · 48 canonical work pages · 24 internal anchors

[1]

Opencodereasoning: Advancing data distillation for competitive coding

Wasi Uddin Ahmad, Sean Narenthiran, Somshubra Majumdar, Aleksander Ficek, Siddhartha Jain, Jocelyn Huang, Vahid Noroozi, and Boris Ginsburg. Opencodereasoning: Advancing data distillation for competitive coding, 2025. URL https://arxiv.org/abs/2504.01943

work page arXiv 2025
[2]

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Martín Blázquez, Guilherme Penedo, Lewis Tunstall, Andrés Marafioti, Hynek Kydlíček, Agustín Piqueres Lajarín, Vaibhav Srivastav, Joshua Lochner, Caleb Fahlgren, Xuan-Son Nguyen, Clémentine Fourrier, Ben Burtenshaw, Hugo Larcher, Haojun Zhao, Cyril Zakka, Mathieu Morlon, Colin Raffel, Leandro von Werr...

work page internal anchor Pith review arXiv 2025
[3]

Smith, Yejin Choi, and Hannaneh Hajishirzi

Faeze Brahman, Sachin Kumar, Vidhisha Balachandran, Pradeep Dasigi, Valentina Pyatkin, Abhilasha Ravichander, Sarah Wiegreffe, Nouha Dziri, Khyathi Chandu, Jack Hessel, Yulia Tsvetkov, Noah A. Smith, Yejin Choi, and Hannaneh Hajishirzi. The art of saying no: Contextual noncompliance in language models, 2024. URL https://arxiv.org/abs/2407.12043

work page arXiv 2024
[4]

AceReason-Nemotron: Advancing math and code reasoning through reinforcement learning.arXiv preprint arXiv:2505.16400,

Yang Chen, Zhuolin Yang, Zihan Liu, Chankyu Lee, Peng Xu, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Acereason-nemotron: Advancing math and code reasoning through reinforcement learning, 2025. URL https://arxiv.org/abs/2505.16400

work page arXiv 2025
[5]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021. URL https://arxiv.org/abs/2110.14168

work page internal anchor Pith review Pith/arXiv arXiv 2021
[6]

Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu, Zhenda Xie, Y. K. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, and Wenfeng Liang. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models, 2024. URL https://arxiv.org/abs/2401.06066

work page internal anchor Pith review arXiv 2024
[7]

Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators, 2025. URL https://arxiv.org/abs/2404.04475

work page internal anchor Pith review arXiv 2025
[8]

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity, 2022. URL https://arxiv.org/abs/2101.03961

work page internal anchor Pith review arXiv 2022
[9]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava S...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[10]

OpenThoughts: Data Recipes for Reasoning Models

Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, Ashima Suvarna, Benjamin Feuer, Liangyu Chen, Zaid Khan, Eric Frankel, Sachin Grover, Caroline Choi, Niklas Muennighoff, Shiye Su, Wanjia Zhao, John Yang, Shreyas Pimpalgaonkar, Kartik Sharma, Charlie Cheng-Jie Ji, ...

work page internal anchor Pith review arXiv 2025
[11]

Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms.arXiv preprint arXiv:2406.18495, 2024

Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri. Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms, 2024. URL https://arxiv.org/abs/2406.18495

work page arXiv 2024
[12]

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding, 2021 a . URL https://arxiv.org/abs/2009.03300

work page internal anchor Pith review arXiv 2021
[13]

Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset, 2021 b . URL https://arxiv.org/abs/2103.03874

work page internal anchor Pith review arXiv 2021
[14]

Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model, 2025. URL https://arxiv.org/abs/2503.24290

work page internal anchor Pith review arXiv 2025
[15]

Yue Huang, Lichao Sun, Haoran Wang, Siyuan Wu, Qihui Zhang, Yuan Li, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, Xiner Li, Zhengliang Liu, Yixin Liu, Yijue Wang, Zhikun Zhang, Bertie Vidgen, Bhavya Kailkhura, Caiming Xiong, Chaowei Xiao, Chunyuan Li, Eric Xing, Furong Huang, Hao Liu, Heng Ji, Hongyi Wang, Huan Zhang, Huaxiu Yao, Manolis Kellis, Mar...

work page arXiv 2024
[16]

Editing Models with Task Arithmetic

Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic, 2023. URL https://arxiv.org/abs/2212.04089

work page internal anchor Pith review arXiv 2023
[17]

Numinamath

Lewis Tunstall Ben Lipkin Roman Soletskyi Shengyi Costa Huang Kashif Rasul Longhui Yu Albert Jiang Ziju Shen Zihan Qin Bin Dong Li Zhou Yann Fleureau Guillaume Lample Jia LI, Edward Beeching and Stanislas Polu. Numinamath. [https://github.com/project-numina/aimo-progress-prize](https://github.com/project-numina/aimo-progress-prize/blob/main/report/numina_...

2024
[18]

Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[19]

doi:10.48550/arXiv.2406.18510 , abstract =

Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, and Nouha Dziri. Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models, 2024 b . URL https://arxiv.org/abs/2406.18510

work page arXiv 2024
[20]

arXiv preprint arXiv:2402.07871 , year=

Jakub Krajewski, Jan Ludziejewski, Kamil Adamczewski, Maciej Pióro, Michał Krutul, Szymon Antoniak, Kamil Ciebiera, Krystian Król, Tomasz Odrzygóźdź, Piotr Sankowski, Marek Cygan, and Sebastian Jaszczur. Scaling laws for fine-grained mixture of experts, 2024. URL https://arxiv.org/abs/2402.07871

work page arXiv 2024
[21]

Tulu 3: Pushing Frontiers in Open Language Model Post-Training

Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, and Hannaneh Hajishirzi...

work page internal anchor Pith review arXiv 2025
[22]

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant models with conditional computation and automatic sharding, 2020. URL https://arxiv.org/abs/2006.16668

work page internal anchor Pith review arXiv 2020
[23]

Branch-train-merge: Embarrassingly parallel training of ex- pert language models.arXiv preprint arXiv:2208.03306,

Margaret Li, Suchin Gururangan, Tim Dettmers, Mike Lewis, Tim Althoff, Noah A. Smith, and Luke Zettlemoyer. Branch-train-merge: Embarrassingly parallel training of expert language models, 2022. URL https://arxiv.org/abs/2208.03306

work page arXiv 2022
[24]

StarCoder: may the source be with you!

Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Loge...

work page internal anchor Pith review arXiv 2023
[25]

Zebralogic: On the scaling limits of llms for logical reasoning.arXiv preprint arXiv:2502.01100,

Bill Yuchen Lin, Ronan Le Bras, Kyle Richardson, Ashish Sabharwal, Radha Poovendran, Peter Clark, and Yejin Choi. Zebralogic: On the scaling limits of llms for logical reasoning, 2025. URL https://arxiv.org/abs/2502.01100

work page arXiv 2025
[26]

Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation, 2023. URL https://arxiv.org/abs/2305.01210

work page internal anchor Pith review arXiv 2023
[27]

Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Hannaneh Hajishirzi, and Daniel Khashabi

Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories, 2023. URL https://arxiv.org/abs/2212.10511

work page arXiv 2023
[28]

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal, 2024. URL https://arxiv.org/abs/2402.04249

work page internal anchor Pith review arXiv 2024
[29]

Smith, Hannaneh Hajishirzi, Pang Wei Koh, Jesse Dodge, and Pradeep Dasigi

Jacob Morrison, Noah A. Smith, Hannaneh Hajishirzi, Pang Wei Koh, Jesse Dodge, and Pradeep Dasigi. Merge to learn: Efficiently adding skills to language models with model merging, 2024. URL https://arxiv.org/abs/2410.12937

work page arXiv 2024
[30]

Team Olmo, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, Jacob Morrison, Jake Poznanski, Kyle Lo, Luca Soldaini, Matt Jordan, Mayee Chen, Michael Noukhovitch, Nathan Lambert, Pete Walsh, Pradeep Dasigi, Robert Berry, Saumya Malik, Saurabh Shah, Scott Geng, Shan...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[31]

Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark, Pradeep Dasigi, Nouha Dziri, Allyson Ettinger, Michal Guerquin, David Heineman, Hamish Ivison, Pang Wei Koh, Ji...

work page internal anchor Pith review arXiv 2025
[32]

Patil, Huanzhi Mao, Charlie Cheng-Jie Ji, Fanjia Yan, Vishnu Suresh, Ion Stoica, and Joseph E

Shishir G. Patil, Huanzhi Mao, Charlie Cheng-Jie Ji, Fanjia Yan, Vishnu Suresh, Ion Stoica, and Joseph E. Gonzalez. The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models. In Forty-second International Conference on Machine Learning, 2025

2025
[33]

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model, 2024. URL https://arxiv.org/abs/2305.18290

work page internal anchor Pith review arXiv 2024
[34]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. Gpqa: A graduate-level google-proof q&a benchmark, 2023. URL https://arxiv.org/abs/2311.12022

work page internal anchor Pith review arXiv 2023
[35]

Prism: Demystifying retention and interaction in mid-training, 2026

Bharat Runwal, Ashish Agrawal, Anurag Roy, and Rameswar Panda. Prism: Demystifying retention and interaction in mid-training, 2026. URL https://arxiv.org/abs/2603.17074

work page arXiv 2026
[36]

Rulin Shao, Akari Asai, Shannon Zejiang Shen, Hamish Ivison, Varsha Kishore, Jingming Zhuo, Xinran Zhao, Molly Park, Samuel G. Finlayson, David Sontag, Tyler Murray, Sewon Min, Pradeep Dasigi, Luca Soldaini, Faeze Brahman, Wen tau Yih, Tongshuang Wu, Luke Zettlemoyer, Yoon Kim, Hannaneh Hajishirzi, and Pang Wei Koh. Dr tulu: Reinforcement learning with ev...

work page arXiv 2025
[37]

do anything now

Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. "do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models, 2024. URL https://arxiv.org/abs/2308.03825

work page arXiv 2024
[38]

Flexolmo: Open language models for flexible data use.arXiv preprint arXiv:2507.07024,

Weijia Shi, Akshita Bhagia, Kevin Farhat, Niklas Muennighoff, Pete Walsh, Jacob Morrison, Dustin Schwenk, Shayne Longpre, Jake Poznanski, Allyson Ettinger, Daogao Liu, Margaret Li, Dirk Groeneveld, Mike Lewis, Wen tau Yih, Luca Soldaini, Kyle Lo, Noah A. Smith, Luke Zettlemoyer, Pang Wei Koh, Hannaneh Hajishirzi, Ali Farhadi, and Sewon Min. Flexolmo: Open...

work page arXiv 2025
[39]

Unchosen experts can contribute too: Unleashing moe models’ power by self-contrast.Advances in Neural Information Processing Systems, 37:136897–136921, 2024a

Sainbayar Sukhbaatar, Olga Golovneva, Vasu Sharma, Hu Xu, Xi Victoria Lin, Baptiste Rozière, Jacob Kahn, Daniel Li, Wen tau Yih, Jason Weston, and Xian Li. Branch-train-mix: Mixing expert llms into a mixture-of-experts llm, 2024. URL https://arxiv.org/abs/2403.07816

work page arXiv 2024
[40]

Sun, Y., Hu, S., Zhou, G., Zheng, K., Hajishirzi, H., Dziri, N., and Song, D

Yiyou Sun, Shawn Hu, Georgia Zhou, Ken Zheng, Hannaneh Hajishirzi, Nouha Dziri, and Dawn Song. Omega: Can llms reason outside the box in math? evaluating exploratory, compositional, and transformative generalization, 2025. URL https://arxiv.org/abs/2506.18880

work page arXiv 2025
[41]

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V. Le, Ed H. Chi, Denny Zhou, and Jason Wei. Challenging big-bench tasks and whether chain-of-thought can solve them, 2022. URL https://arxiv.org/abs/2210.09261

work page internal anchor Pith review arXiv 2022
[42]

Openmathinstruct-2: Accelerating ai for math with massive open-source instruction data

Shubham Toshniwal, Wei Du, Ivan Moshkov, Branislav Kisacanin, Alexan Ayrapetyan, and Igor Gitman. Openmathinstruct-2: Accelerating ai for math with massive open-source instruction data, 2024. URL https://arxiv.org/abs/2410.01560

work page arXiv 2024
[43]

Measuring short-form factuality in large language models

Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, and William Fedus. Measuring short-form factuality in large language models, 2024. URL https://arxiv.org/abs/2411.04368

work page arXiv 2024
[44]

Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S

Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S. Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, and Ludwig Schmidt. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time, 2022. URL https://arxiv.org/abs/2203.05482

work page arXiv 2022
[45]

TIES-merging: Resolving interference when merging models,

Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, and Mohit Bansal. Ties-merging: Resolving interference when merging models, 2023. URL https://arxiv.org/abs/2306.01708

work page arXiv 2023
[46]

DARE: Language model weights can be pruned by 90% without retraining,

Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yongbin Li. Language models are super mario: Absorbing abilities from homologous models as a free lunch, 2024. URL https://arxiv.org/abs/2311.03099

work page arXiv 2024
[47]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[48]

Acecoder: Acing coder rl via automated test-case synthesis, 2025

Huaye Zeng, Dongfu Jiang, Haozhe Wang, Ping Nie, Xiaotong Chen, and Wenhu Chen. Acecoder: Acing coder rl via automated test-case synthesis, 2025. URL https://arxiv.org/abs/2502.01718

work page arXiv 2025
[49]

[[rating]]

Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. Agieval: A human-centric benchmark for evaluating foundation models, 2023. URL https://arxiv.org/abs/2304.06364

work page arXiv 2023
[50]

Instruction-Following Evaluation for Large Language Models

Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models, 2023. URL https://arxiv.org/abs/2311.07911

work page internal anchor Pith review Pith/arXiv arXiv 2023