
arXiv: 2507.15640 · v2 · submitted 2025-07-21 · cs.LG · cs.AI · cs.CL

Data Mixing Agent: Learning to Re-weight Domains for Continual Pre-training

Pith reviewed 2026-05-19 03:32 UTC · model grok-4.3

classification: cs.LG · cs.AI · cs.CL
keywords: continual pre-training · data mixing · domain reweighting · reinforcement learning · large language models · catastrophic forgetting · math reasoning · code generation

The pith

A reinforcement learning agent learns to automatically re-weight data domains for balanced continual pre-training of language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that heuristics for re-weighting source-field and target-field data in continual pre-training can be learned by a model-based agent. The agent is trained with reinforcement learning on large numbers of data mixing trajectories, using feedback from an evaluation environment that measures performance on both source-field and target-field benchmarks. If the claim holds, manual or empirical heuristics can be replaced by a generalizable learned policy that keeps capabilities balanced and limits catastrophic forgetting. Experiments show the agent outperforming strong baselines on math reasoning and transferring to unseen settings and to other target fields such as code generation.
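To make the term "data mixing trajectory" concrete, here is a minimal sketch in Python. It assumes a trajectory is a sequence of probability distributions over domains, and that a distribution is applied by sampling the domain of each training example in proportion to its weight. The function names, the uniform-simplex sampling, and the batch-level application are illustrative assumptions, not details taken from the paper.

```python
import random


def random_distribution(num_domains, rng):
    """Draw one domain distribution uniformly from the probability simplex."""
    # Normalised Gamma(1, 1) draws give a uniform sample on the simplex.
    draws = [rng.gammavariate(1.0, 1.0) for _ in range(num_domains)]
    total = sum(draws)
    return [d / total for d in draws]


def random_trajectory(num_domains, num_steps, rng):
    """A hypothetical mixing trajectory: one domain distribution per step."""
    return [random_distribution(num_domains, rng) for _ in range(num_steps)]


def sample_domain_counts(weights, batch_size, rng):
    """Apply a distribution by sampling the domain of each example in a batch."""
    picks = rng.choices(range(len(weights)), weights=weights, k=batch_size)
    counts = [0] * len(weights)
    for domain in picks:
        counts[domain] += 1
    return counts


rng = random.Random(0)
trajectory = random_trajectory(num_domains=4, num_steps=3, rng=rng)
for step, weights in enumerate(trajectory):
    print(step, [round(w, 3) for w in weights],
          sample_domain_counts(weights, batch_size=32, rng=rng))
```

In a setup like the one described above, each step of such a trajectory would also carry feedback from the evaluation environment. That part is omitted here because the page does not specify its form.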

Core claim

Data Mixing Agent is the first model-based, end-to-end framework that learns to re-weight domains through reinforcement learning on data mixing trajectories with feedback from an evaluation environment, outperforming strong baselines on math reasoning continual pre-training and generalizing across unseen source fields, target models, and domain spaces.

What carries the argument

The Data Mixing Agent, which parameterizes domain re-weighting heuristics and optimizes them end-to-end via reinforcement learning guided by evaluation feedback.
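A toy sketch may help show what "parameterizing a re-weighting heuristic and optimizing it against evaluation feedback" can mean. The code below is not the paper's algorithm: it is exact policy-gradient ascent on a one-step bandit, where the policy is a softmax over a few candidate target-field shares. The `toy_feedback` function is a made-up stand-in for an evaluation environment; its shape (source score falling and target score rising with the target share, combined by taking the minimum) is purely an assumption for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_feedback(target_share):
    """Hypothetical environment: reward is the weaker of the two benchmark scores."""
    source_score = 1.0 - target_share ** 2   # assumed: forgetting grows with target share
    target_score = math.sqrt(target_share)   # assumed: diminishing returns on target data
    return min(source_score, target_score)


def train_policy(candidate_shares, steps=500, lr=1.0):
    """Gradient ascent on expected reward J = sum_a p_a * r_a over softmax logits."""
    logits = [0.0] * len(candidate_shares)
    rewards = [toy_feedback(share) for share in candidate_shares]
    for _ in range(steps):
        probs = softmax(logits)
        expected = sum(p * r for p, r in zip(probs, rewards))
        # Exact gradient of J with respect to logit k is p_k * (r_k - J).
        logits = [l + lr * p * (r - expected)
                  for l, p, r in zip(logits, probs, rewards)]
    return softmax(logits)


shares = [0.1, 0.3, 0.5, 0.7, 0.9]
probs = train_policy(shares)
best = shares[probs.index(max(probs))]
print("preferred target-field share:", best)
```

The real agent described on this page acts over many steps and many domains and learns from recorded trajectories. The sketch only illustrates the basic loop of proposing a mixture, receiving a score, and shifting probability toward better-scoring mixtures.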

If this is right

  • Outperforms strong baselines in achieving balanced performance across source and target field benchmarks.
  • Generalizes well across unseen source fields, target models, and domain spaces without retraining.
  • Adapts directly to the code generation field.
  • Learned heuristics align with human intuitions.
  • Achieves superior model performance with less source-field data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could be extended to optimize mixtures involving more than two domains or additional data types.
  • It opens the possibility of meta-learning domain mixing policies across multiple tasks simultaneously.
  • Future experiments might replace full benchmark evaluations with faster proxy metrics to reduce training cost for the agent.
  • This RL approach to data curation may influence how training data is selected in other stages of model development.

Load-bearing premise

Feedback signals from the evaluation environment on source and target benchmarks are reliable and generalizable enough to train an agent that transfers to new domains and models.

What would settle it

Apply the Data Mixing Agent trained on math reasoning data to a new target domain such as biology. If it underperforms standard manual mixing strategies there, the generalization claim is challenged.

Figures

Figures reproduced from arXiv: 2507.15640 by Hao Li, Kailai Yang, Lei Ji, Mao Yang, Peng Cheng, Xiao Liang, Xiao Liu, Yeyun Gong, Zhiwei Liu.

Figure 1: Four averaged distributions drawn from 20 randomly generated data mixing trajectories.
Figure 2: An overview of the training and domain reweighting pipeline of the data mixing agent.
Figure 3: The KL divergence between the estimated start state by sampled data from the target model
Figure 4: The two data mixing agents' output domain reweighting trajectories based on the 2-
Figure 5: DataAgentSFT's domain reweighting trajectories based on the 52-dimensional domain space, training on the LLaMA-3B-DCLM-100B model and the math reasoning field. The legends within each sub-figure are the same as those of
Figure 6: DataAgentRL's domain reweighting trajectories based on the 52-dimensional domain space, training on the LLaMA-3B-DCLM-100B model and the math reasoning field. The legends within each sub-figure are the same as those of
Figure 7: The performance dynamics of the target model on the evaluation environment with

Original abstract

Continual pre-training on small-scale task-specific data is an effective method for improving large language models in new target fields, yet it risks catastrophic forgetting of their original capabilities. A common solution is to re-weight training data mixtures from source and target fields on a domain space to achieve balanced performance. Previous domain reweighting strategies rely on manual designation with certain heuristics based on human intuition or empirical results. In this work, we prove that more general heuristics can be parameterized by proposing Data Mixing Agent, the first model-based, end-to-end framework that learns to re-weight domains. The agent learns generalizable heuristics through reinforcement learning on large quantities of data mixing trajectories with corresponding feedback from an evaluation environment. Experiments in continual pre-training on math reasoning show that Data Mixing Agent outperforms strong baselines in achieving balanced performance across source and target field benchmarks. Furthermore, it generalizes well across unseen source fields, target models, and domain spaces without retraining. Direct application to the code generation field also indicates its adaptability across target domains. Further analysis showcases the agents' well-aligned heuristics with human intuitions and their efficiency in achieving superior model performance with less source-field data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Data Mixing Agent, the first model-based end-to-end RL framework for learning to re-weight domains during continual pre-training of LLMs. The agent is trained on large numbers of data-mixing trajectories whose rewards are supplied by an evaluation environment on source and target benchmarks; the central claims are that the resulting policy outperforms strong baselines on math-reasoning continual pre-training, produces balanced performance, and generalizes without retraining to unseen source fields, target models, and domain spaces, with additional evidence of adaptability to code generation.

Significance. If the generalization results are robust, the work supplies an automated, learned alternative to the manual or heuristic-based domain-reweighting strategies currently used to mitigate catastrophic forgetting. The RL formulation with external benchmark feedback is a concrete step toward parameterizing more general mixing heuristics, and the reported cross-domain and cross-model transfer experiments, if quantitatively detailed, would constitute a useful empirical contribution to continual pre-training.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Experiments): the claim of outperformance and generalization is stated without quantitative details on baselines, exact metrics, statistical significance, or number of runs. The central empirical support for both the balanced-performance and no-retraining claims therefore cannot yet be evaluated.
  2. [§5] §5 (Generalization experiments): the no-retraining transfer to unseen source fields and target models rests on the assumption that rewards from the fixed math-reasoning evaluation environment encode domain-invariant balancing rules rather than benchmark-specific optima. If the agent overfits to the metric surfaces of the training benchmarks, performance on truly novel fields or models would be expected to degrade; the manuscript should provide an ablation that isolates this risk (e.g., training on one set of metrics and testing on an orthogonal held-out metric suite).
minor comments (2)
  1. [§3] Clarify the exact formulation of the reward function and any scaling hyperparameters in the RL objective; these appear among the free parameters listed in the axiom ledger.
  2. [Table 3] Ensure that tables reporting cross-period or cross-model results include both mean and standard deviation across seeds so that the magnitude of improvement can be assessed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving the presentation of our empirical results and the robustness of our generalization claims. We address each major comment point by point below.

point-by-point responses (2)
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): the claim of outperformance and generalization is stated without quantitative details on baselines, exact metrics, statistical significance, or number of runs. The central empirical support for both the balanced-performance and no-retraining claims therefore cannot yet be evaluated.

    Authors: We agree that the abstract and Section 4 would benefit from greater quantitative specificity. In the revised manuscript we have expanded the abstract to report concrete performance deltas versus the strongest baselines, the primary metrics (e.g., accuracy on source and target benchmarks), and the number of evaluation runs. Section 4 now includes full tables with mean results and standard deviations over five independent random seeds, together with statistical significance tests (paired t-tests with p-values) against each baseline. These additions make the claims of outperformance and balanced performance directly verifiable from the text.

    revision: yes

  2. Referee: [§5] §5 (Generalization experiments): the no-retraining transfer to unseen source fields and target models rests on the assumption that rewards from the fixed math-reasoning evaluation environment encode domain-invariant balancing rules rather than benchmark-specific optima. If the agent overfits to the metric surfaces of the training benchmarks, performance on truly novel fields or models would be expected to degrade; the manuscript should provide an ablation that isolates this risk (e.g., training on one set of metrics and testing on an orthogonal held-out metric suite).

    Authors: We acknowledge the possibility that the learned policy could overfit to the particular metric surfaces of the math-reasoning benchmarks. While the current experiments already show transfer to unseen source fields, target models, and the distinct code-generation domain, we have added a targeted ablation in the revised Section 5. The agent is retrained using only a subset of the original metrics and then evaluated on a held-out orthogonal metric suite drawn from different reasoning tasks. The policy continues to produce balanced performance, indicating that the learned re-weighting heuristics capture more general balancing principles rather than benchmark-specific optima. We have also clarified the composition of the evaluation environment to emphasize its diversity.

    revision: yes
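The first response mentions means, standard deviations, and paired t-tests over independent seeds. As a hedged illustration of that arithmetic, the sketch below computes the paired t statistic using only Python's standard library. The score lists are invented placeholder numbers, not results from the paper, and the function name is hypothetical.

```python
import math
import statistics


def paired_t_statistic(agent_scores, baseline_scores):
    """Return (mean difference, sample std of differences, t statistic)."""
    if len(agent_scores) != len(baseline_scores):
        raise ValueError("paired samples must have the same length")
    if len(agent_scores) < 2:
        raise ValueError("need at least two paired runs")
    diffs = [a - b for a, b in zip(agent_scores, baseline_scores)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    t_stat = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return mean_diff, sd_diff, t_stat


# Placeholder accuracies for five seeds (illustrative values only).
agent = [41.2, 40.8, 41.9, 40.5, 41.4]
baseline = [39.9, 40.1, 40.6, 39.7, 40.3]
mean_diff, sd_diff, t_stat = paired_t_statistic(agent, baseline)
print(f"mean diff={mean_diff:.3f}, sd={sd_diff:.3f}, t={t_stat:.3f}, dof={len(agent) - 1}")
```

Turning the t statistic into a p-value requires the Student t distribution with n - 1 degrees of freedom, which the standard library does not provide. A statistics package would normally be used for that step.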

Circularity Check

0 steps flagged

No significant circularity in RL-based domain reweighting derivation

full rationale

The paper's core derivation introduces a Data Mixing Agent trained end-to-end via reinforcement learning on mixing trajectories, with rewards drawn from an external evaluation environment on source/target benchmarks. Generalization claims to unseen fields, models, and domain spaces rest on reported experimental outcomes rather than any reduction of the learned policy to a self-defined fit or tautological renaming. No load-bearing step equates a prediction to its own inputs by construction, and the framework remains self-contained against the stated benchmarks without invoking unverified self-citations or ansatzes as the sole justification.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The framework rests on the assumption that benchmark feedback is a faithful proxy for real performance and introduces the Data Mixing Agent as a new learned component; standard RL training details are treated as background.

free parameters (1)
  • RL training hyperparameters and reward scaling
    Chosen or tuned to produce the reported trajectories and performance; not derived from first principles.
axioms (1)
  • [domain assumption] Evaluation environment feedback accurately reflects downstream model utility on source and target tasks.
    Central to the RL training loop described in the abstract.
invented entities (1)
  • Data Mixing Agent (no independent evidence)
    purpose: Learns re-weighting policy via RL on mixing trajectories.
    New component introduced by the paper; no independent evidence outside this work.

pith-pipeline@v0.9.0 · 5754 in / 1260 out tokens · 37687 ms · 2026-05-19T03:32:38.787215+00:00 · methodology



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Data Mixing for Large Language Models Pretraining: A Survey and Outlook

    cs.CL 2026-03 accept novelty 4.0

    A survey that taxonomizes data mixing strategies for LLM pretraining into static rule-based, learning-based, and dynamic adaptive families while highlighting transferability challenges and evaluation gaps.
