Pith: machine review for the scientific record.

arxiv: 2309.05653 · v3 · pith:25UWFI7Snew · submitted 2023-09-11 · 💻 cs.CL

MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning

Pith reviewed 2026-05-17 23:42 UTC · model grok-4.3

classification 💻 cs.CL
keywords: mathematical reasoning, instruction tuning, chain-of-thought, program-of-thought, large language models, math problem solving, open-source models, hybrid rationales

The pith

Training on a hybrid of chain-of-thought and program-of-thought rationales builds open-source math models that outperform prior open-source models on nine benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors create MAmmoTH models by instruction-tuning on MathInstruct, a dataset assembled from thirteen math sources that combines verbal chain-of-thought steps with executable program-of-thought code. Six of the sources receive newly written rationales to broaden topic coverage. The hybrid format lets a single model switch between natural-language reasoning and code execution depending on the problem at hand. The paper reports consistent accuracy lifts across scales, including a 7B model reaching 33 percent on the competition-level MATH dataset and a 34B model reaching 44 percent, which exceeds GPT-4's chain-of-thought score on the same set.

Core claim

The paper claims that instruction tuning on MathInstruct, which mixes chain-of-thought and program-of-thought rationales across thirteen datasets with wide math-field coverage, yields MAmmoTH models that substantially outperform existing open-source models on nine mathematical reasoning datasets at every scale, delivering average accuracy gains of 16 to 32 percent. The 7B version scores 33 percent on MATH, 23 points above the previous best open-source 7B model, while the 34B version scores 44 percent on MATH and surpasses GPT-4's CoT result.
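
The headline numbers mix absolute and relative readings, so a small worked calculation may help. The sketch below assumes the 23-point gap is an absolute difference in percentage points (the abstract writes it as "23%"); the function names are illustrative and nothing here comes from the paper's code.

```python
def implied_baseline(score_pct: float, gain_points: float) -> float:
    """Baseline accuracy implied by a score and an absolute gain in points."""
    return score_pct - gain_points


def relative_gain_pct(score_pct: float, baseline_pct: float) -> float:
    """Improvement over the baseline, expressed relative to the baseline."""
    return 100.0 * (score_pct - baseline_pct) / baseline_pct


mammoth_7b_math = 33.0  # MATH accuracy stated in the summary above
gain_points = 23.0      # assumption: an absolute gap in percentage points

baseline = implied_baseline(mammoth_7b_math, gain_points)
relative = relative_gain_pct(mammoth_7b_math, baseline)

print(f"implied baseline accuracy: {baseline:.0f}%")
print(f"relative improvement over baseline: {relative:.0f}%")
```

Under that reading, the previous best 7B model would sit near 10 percent on MATH, which makes the jump far larger in relative terms than the "23" figure alone suggests.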

What carries the argument

MathInstruct, the instruction-tuning dataset that presents a hybrid of chain-of-thought and program-of-thought rationales compiled from thirteen math datasets.
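
To make the hybrid format concrete, here is a minimal sketch of how one chain-of-thought record and one program-of-thought record might sit side by side in a single training set. The field names, the sample problem, the source label, and the prompt template are all assumptions made for illustration; the summary above does not specify MathInstruct's actual schema.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class RationaleExample:
    source: str                    # originating dataset (hypothetical label)
    instruction: str               # the math problem posed to the model
    rationale: str                 # verbal steps (CoT) or a program (PoT)
    style: Literal["cot", "pot"]   # which kind of rationale this record holds


examples = [
    RationaleExample(
        source="hypothetical_word_problems",
        instruction="A box holds 12 pencils. How many pencils are in 7 boxes?",
        rationale="Each box holds 12 pencils, so 7 boxes hold 7 * 12 = 84. The answer is 84.",
        style="cot",
    ),
    RationaleExample(
        source="hypothetical_word_problems",
        instruction="A box holds 12 pencils. How many pencils are in 7 boxes? "
                    "Write a program to solve it.",
        rationale="boxes = 7\npencils_per_box = 12\nprint(boxes * pencils_per_box)",
        style="pot",
    ),
]


def to_training_text(example: RationaleExample) -> str:
    """Render one record with an assumed instruction/response template."""
    return f"### Instruction:\n{example.instruction}\n\n### Response:\n{example.rationale}"


for example in examples:
    print(f"[{example.style}]")
    print(to_training_text(example))
    print()
```

Both styles share one template here, so the only signal telling the model which mode to use is the wording of the instruction itself.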

If this is right

  • Models gain the ability to apply either verbal steps or code execution depending on the math problem.
  • The program-of-thought component increases the potential for tool use during reasoning.
  • Open-source models reach higher accuracy on competition-level tasks such as MATH.
  • Broad coverage across math fields supports stronger generalization to new problems.
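
The first two consequences above can be sketched as a small dispatcher that executes a program-style output and otherwise parses a verbal answer. This is an illustration under stated assumptions, not the authors' inference pipeline: the "parses as Python and calls print" heuristic, the "answer is" pattern, and all function names are hypothetical. Running model-generated code in a subprocess is also not a security sandbox.

```python
import re
import subprocess
import sys
from typing import Optional


def looks_like_program(text: str) -> bool:
    """Heuristic (an assumption): the text compiles as Python and prints a result."""
    try:
        compile(text, "<rationale>", "exec")  # parses only; does not execute
    except (SyntaxError, ValueError):
        return False
    return "print(" in text


def run_program(code: str, timeout_s: float = 5.0) -> Optional[str]:
    """Run code in a separate Python process; return stdout, or None on failure."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None


def extract_cot_answer(text: str) -> Optional[str]:
    """Pull the number that follows 'answer is' from a verbal rationale."""
    match = re.search(r"answer is\s*([-+]?\d+(?:\.\d+)?)", text, flags=re.IGNORECASE)
    return match.group(1) if match else None


def answer_from_output(model_output: str) -> Optional[str]:
    """Execute program-style outputs; otherwise fall back to parsing the text."""
    if looks_like_program(model_output):
        executed = run_program(model_output)
        if executed is not None:
            return executed
    return extract_cot_answer(model_output)


print(answer_from_output("boxes = 7\npencils_per_box = 12\nprint(boxes * pencils_per_box)"))
print(answer_from_output("Each box holds 12 pencils, so 7 boxes hold 84. The answer is 84."))
```

Both sample calls are expected to print 84: the first by executing the program, the second by reading the stated answer from the text.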

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same hybrid rationale mix could be applied to scientific reasoning tasks that also mix explanation and simulation.
  • Curating high-quality rationales may matter more than raw data volume when specializing models for reasoning.
  • Adding verification steps to the program-of-thought outputs could further reduce calculation errors.
  • Smaller models trained this way might serve educational tools that need both text explanations and runnable code.

Load-bearing premise

The measured accuracy gains result specifically from the hybrid CoT-PoT format and the newly curated rationales rather than from dataset size, model scale, or other training choices.

What would settle it

Train identical base models on matched volumes of data that contain only CoT rationales, only PoT rationales, or the original uncurated sources, then check whether the reported gains on the nine evaluation datasets disappear.
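
One way to set up the matched-volume arms described above is sketched below. It matches arms by example count for simplicity, which is an assumption; matching by token count would be a tighter control. The arm names and the `(style, text)` record shape are hypothetical.

```python
import random
from collections import defaultdict


def build_matched_arms(examples, seed=0):
    """Split (style, text) pairs into equal-sized CoT-only, PoT-only, and hybrid arms."""
    by_style = defaultdict(list)
    for style, text in examples:
        by_style[style].append((style, text))

    # Every arm gets the same number of examples, limited by the scarcer style.
    n = min(len(by_style["cot"]), len(by_style["pot"]))
    rng = random.Random(seed)
    cot = rng.sample(by_style["cot"], n)
    pot = rng.sample(by_style["pot"], n)

    # The hybrid arm reuses the same sampled examples, about half from each style.
    half = n // 2
    hybrid = cot[:half] + pot[: n - half]
    return {"cot_only": cot, "pot_only": pot, "hybrid": hybrid}


pool = [("cot", f"verbal rationale {i}") for i in range(10)]
pool += [("pot", f"print({i})") for i in range(6)]

arms = build_matched_arms(pool)
for name, arm in arms.items():
    styles = sorted({style for style, _ in arm})
    print(name, len(arm), styles)
```

Training the same base model on each arm with an identical schedule would then isolate the rationale format from data volume.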

Original abstract

We introduce MAmmoTH, a series of open-source large language models (LLMs) specifically tailored for general math problem-solving. The MAmmoTH models are trained on MathInstruct, our meticulously curated instruction tuning dataset. MathInstruct is compiled from 13 math datasets with intermediate rationales, six of which have rationales newly curated by us. It presents a unique hybrid of chain-of-thought (CoT) and program-of-thought (PoT) rationales, and also ensures extensive coverage of diverse fields in math. The hybrid of CoT and PoT not only unleashes the potential of tool use but also allows different thought processes for different math problems. As a result, the MAmmoTH series substantially outperform existing open-source models on nine mathematical reasoning datasets across all scales with an average accuracy gain between 16% and 32%. Remarkably, our MAmmoTH-7B model reaches 33% on MATH (a competition-level dataset), which exceeds the best open-source 7B model (WizardMath) by 23%, and the MAmmoTH-34B model achieves 44% accuracy on MATH, even surpassing GPT-4's CoT result. Our work underscores the importance of diverse problem coverage and the use of hybrid rationales in developing superior math generalist models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces the MAmmoTH series of open-source LLMs for general mathematical problem-solving. These models are fine-tuned on MathInstruct, a hybrid instruction dataset compiled from 13 sources (six with newly curated rationales) that mixes chain-of-thought (CoT) and program-of-thought (PoT) formats. The central claim is that this hybrid approach yields substantial gains over prior open-source models on nine math reasoning benchmarks, with average accuracy improvements of 16-32%, including 33% on MATH for the 7B variant (23 points above WizardMath) and 44% for the 34B variant (exceeding GPT-4 CoT).

Significance. If the gains are robustly attributable to the hybrid CoT-PoT format and curated rationales rather than scale or unstated factors, the work would provide a practical recipe for improving mathematical reasoning in open models and underscore the value of diverse rationale styles. The release of models and dataset supports reproducibility and follow-up research.

major comments (1)
  1. [§4 and Table 2] §4 (Experiments) and Table 2: End-to-end results are reported against WizardMath and other baselines, but no ablation holds base model, training schedule, and total token count fixed while varying only the presence of PoT examples versus pure CoT or the six newly curated rationales. Without this isolation, the 16-32% average gains and the specific MATH jumps cannot be confidently attributed to the hybrid format as claimed in the abstract and §3.
minor comments (2)
  1. [§3.2] §3.2: The description of how the six new rationales were curated could be expanded with explicit quality-control steps or inter-annotator agreement metrics to strengthen the claim of 'meticulously curated'.
  2. [Figure 1 and §3] Figure 1 and §3: The mixture proportions across the 13 sources are not tabulated; adding a breakdown of example counts or token shares per source would clarify the 'extensive coverage' assertion.
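
The breakdown requested in the second minor comment could be produced with a few lines of counting. The sketch below is illustrative only: the source names and records are made up, and whitespace-separated words stand in for tokens, which is a rough proxy rather than a real tokenizer count.

```python
from collections import Counter


def mixture_table(records):
    """Return (source, n_examples, example_share, token_share) rows, largest first."""
    counts = Counter()
    tokens = Counter()
    for source, text in records:
        counts[source] += 1
        tokens[source] += len(text.split())  # whitespace words as a token proxy

    total_examples = sum(counts.values())
    total_tokens = sum(tokens.values()) or 1  # avoid dividing by zero on empty text
    return [
        (source, n, n / total_examples, tokens[source] / total_tokens)
        for source, n in counts.most_common()
    ]


# Hypothetical records: (source name, rationale text).
records = [
    ("source_a", "7 * 12 = 84"),
    ("source_a", "print(7 * 12)"),
    ("source_b", "The answer is 84"),
]

for source, n, example_share, token_share in mixture_table(records):
    print(f"{source:10s} {n:3d} {example_share:6.1%} {token_share:6.1%}")
```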

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We address the major comment point by point below, providing clarifications and committing to revisions where appropriate to strengthen the paper.

  1. Referee: [§4 and Table 2] §4 (Experiments) and Table 2: End-to-end results are reported against WizardMath and other baselines, but no ablation holds base model, training schedule, and total token count fixed while varying only the presence of PoT examples versus pure CoT or the six newly curated rationales. Without this isolation, the 16-32% average gains and the specific MATH jumps cannot be confidently attributed to the hybrid format as claimed in the abstract and §3.

    Authors: We appreciate the referee's emphasis on isolating the contribution of the hybrid CoT-PoT format and the newly curated rationales. Our primary comparisons are to WizardMath and similar baselines that build on base models from the same Llama-2 family (including Code Llama at 34B) and use comparable fine-tuning setups, with the key distinction being our use of MathInstruct's hybrid rationales versus their predominantly CoT-based data. However, we acknowledge that a more tightly controlled ablation (fixing base model, training schedule, and total token count while varying only PoT inclusion or the six curated sources) would provide stronger attribution. In the revised manuscript, we will add such an ablation study in §4, reporting results for pure-CoT, pure-PoT, and hybrid variants under matched conditions. This will directly test the claims in the abstract and §3 regarding the benefits of hybrid rationales. revision: yes

Circularity Check

0 steps flagged

No significant circularity; results rest on external benchmarks


The paper trains models on the newly compiled MathInstruct mixture and reports accuracy on nine standard held-out mathematical reasoning benchmarks. These evaluation sets are distinct from the training sources, and the reported gains are measured against external baselines rather than being derived from any fitted parameter or self-referential definition. No equations, uniqueness theorems, or ansatzes are invoked that reduce the central performance claims to the inputs by construction. The absence of ablations is a limitation on causal attribution but does not constitute circularity under the defined criteria.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard LLM fine-tuning assumptions and the value of hybrid reasoning formats; no new entities are postulated and free parameters are limited to routine training choices.

free parameters (1)
  • LLM training hyperparameters
    Learning rate, epochs, and batch size chosen during fine-tuning but not central to the hybrid-rationale claim.
axioms (1)
  • domain assumption: Instruction tuning on curated datasets with rationales improves LLM reasoning performance
    Invoked throughout the training and evaluation approach described in the abstract.



Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AI co-mathematician: Accelerating mathematicians with agentic AI

    cs.AI 2026-05 unverdicted novelty 7.0

    An interactive AI workbench for mathematicians achieves 48% on FrontierMath Tier 4 and helped solve open problems in early tests.

  2. Why Supervised Fine-Tuning Fails to Learn: A Systematic Study of Incomplete Learning in Large Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    Supervised fine-tuning of LLMs often fails to fully internalize all training instances due to five recurring causes including missing prerequisites and data conflicts, as diagnosed via a new framework across multiple models.

  3. Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing

    cs.CL 2024-06 unverdicted novelty 7.0

    Magpie synthesizes 300K high-quality alignment instructions from Llama-3-Instruct via auto-regressive prompting on partial templates, enabling fine-tuned models to match official instruct performance on AlpacaEval, Ar...

  4. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    cs.CL 2024-05 unverdicted novelty 7.0

    DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

  5. MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

    cs.CV 2024-03 conditional novelty 7.0

    MathVerse is a benchmark that tests multi-modal LLMs on visual math by providing each problem in six versions with progressively less diagram and text information to measure true visual understanding.

  6. AI co-mathematician: Accelerating mathematicians with agentic AI

    cs.AI 2026-05 unverdicted novelty 6.0

    An interactive AI workbench called the AI co-mathematician supports open-ended mathematical research and achieves a new high score of 48% on FrontierMath Tier 4.

  7. CeRA: Overcoming the Linear Ceiling of Low-Rank Adaptation via Capacity Expansion

    cs.LG 2026-02 unverdicted novelty 6.0

    CeRA overcomes LoRA's linear ceiling by injecting non-linear SiLU gating and dropout, outperforming high-rank LoRA on complex math reasoning with 1/8 the parameters.

  8. Vision-aligned Latent Reasoning for Multi-modal Large Language Model

    cs.CV 2026-02 unverdicted novelty 6.0

    VaLR generates vision-aligned latent tokens before each reasoning step to preserve perceptual cues, improving VSI-Bench accuracy from 33.0% to 52.9%.

  9. SmolVLM: Redefining small and efficient multimodal models

    cs.AI 2025-04 unverdicted novelty 6.0

    SmolVLM-256M outperforms a 300-times larger model using under 1 GB GPU memory, while the 2.2B version matches state-of-the-art VLMs at half the memory cost.

  10. Muon is Scalable for LLM Training

    cs.LG 2025-02 unverdicted novelty 6.0

    Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.

  11. Process Reinforcement through Implicit Rewards

    cs.LG 2025-02 conditional novelty 6.0

    PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 1...

  12. Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

    cs.AI 2023-12 conditional novelty 6.0

    Math-Shepherd is an automatically trained process reward model that scores solution steps to verify and reinforce LLMs, lifting Mistral-7B from 77.9% to 89.1% on GSM8K and 28.6% to 43.5% on MATH.

  13. Rethinking Layer Relevance in Large Language Models Beyond Cosine Similarity

    cs.LG 2026-05 unverdicted novelty 5.0

    Cosine similarity poorly predicts performance degradation from layer removal in LLMs, making direct accuracy-drop ablation a more reliable relevance metric.

  14. NoisyCoconut: Counterfactual Consensus via Latent Space Reasoning

    cs.LG 2026-05 unverdicted novelty 5.0

    Injecting noise into LLM latent trajectories creates diverse reasoning paths whose agreement acts as a confidence signal for selective abstention, cutting error rates from 40-70% to under 15% on math tasks.

  15. Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs

    cs.LG 2026-04 unverdicted novelty 5.0

    Guardian-as-an-Advisor prepends risk labels and explanations from a guardian model to queries, improving LLM safety compliance and reducing over-refusal while adding minimal compute overhead.

  16. NVIDIA Nemotron 3: Efficient and Open Intelligence

    cs.CL 2025-12 unverdicted novelty 5.0

    NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.

  17. SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

    cs.CL 2025-02 unverdicted novelty 5.0

    SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.

  18. DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

    cs.CL 2024-01 unverdicted novelty 4.0

    DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.

  19. A Survey on Knowledge Distillation of Large Language Models

    cs.CL 2024-02 accept novelty 3.0

    A comprehensive survey of knowledge distillation for LLMs structured around algorithms, skill enhancement, and vertical applications, highlighting data augmentation as a key enabler.

Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages · cited by 18 Pith papers · 32 internal anchors

  1. [1]

    M ath QA : Towards interpretable math word problem solving with operation-based formalisms

    Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. M ath QA : Towards interpretable math word problem solving with operation-based formalisms. In Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long...

  2. [2]

    PaLM 2 Technical Report

    Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. ArXiv preprint, abs/2305.10403, 2023. URL https://arxiv.org/abs/2305.10403

  3. [3]

    Constitutional AI: Harmlessness from AI Feedback

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. ArXiv preprint, abs/2212.08073, 2022. URL https://arxiv.org/abs/2212.08073

  4. [4]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. ArXiv preprint, abs/2107.03374, 2021. URL https://arxiv.org/abs/2107.03374

  5. [5]

    Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

    Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. ArXiv preprint, abs/2211.12588, 2022. URL https://arxiv.org/abs/2211.12588

  6. [6]

    Theoremqa: A theorem-driven question answering dataset

    Wenhu Chen, Ming Yin, Max Ku, Elaine Wan, Xueguang Ma, Jianyu Xu, Tony Xia, Xinyi Wang, and Pan Lu. Theoremqa: A theorem-driven question answering dataset. ArXiv preprint, abs/2305.12524, 2023. URL https://arxiv.org/abs/2305.12524

  7. [7]

    Scaling Instruction-Finetuned Language Models

    Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. ArXiv preprint, abs/2210.11416, 2022. URL https://arxiv.org/abs/2210.11416

  8. [8]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. ArXiv preprint, abs/2110.14168, 2021. URL https://arxiv.org/abs/2110.14168

  9. [9]

    Advancing mathematics by guiding human intuition with ai

    Alex Davies, Petar Veli c kovi \'c , Lars Buesing, Sam Blackwell, Daniel Zheng, Nenad Toma s ev, Richard Tanburn, Peter Battaglia, Charles Blundell, Andr \'a s Juh \'a sz, et al. Advancing mathematics by guiding human intuition with ai. Nature, 600 0 (7887): 0 70--74, 2021. URL https://www.nature.com/articles/s41586-021-04086-x

  10. [10]

    QLoRA: Efficient Finetuning of Quantized LLMs

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. ArXiv preprint, abs/2305.14314, 2023. URL https://arxiv.org/abs/2305.14314

  11. [11]

    a rli, Ekin Aky \

    Andrew Drozdov, Nathanael Sch \"a rli, Ekin Aky \"u rek, Nathan Scales, Xinying Song, Xinyun Chen, Olivier Bousquet, and Denny Zhou. Compositional semantic parsing with large language models. International Conference on Learning Representations (ICLR), 2023. URL https://openreview.net/forum?id=gJW8hSGBys8

  12. [12]

    Pal: Program-aided language models

    Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: Program-aided language models. In International Conference on Machine Learning, pp.\ 10764--10799. PMLR, 2023. URL https://proceedings.mlr.press/v202/gao23f/gao23f.pdf

  13. [13]

    CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing

    Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Nan Duan, and Weizhu Chen. Critic: Large language models can self-correct with tool-interactive critiquing. ArXiv preprint, abs/2305.11738, 2023. URL https://arxiv.org/abs/2305.11738

  14. [14]

    Measuring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021 , 2021 a . URL https://openreview.net/forum?id=d7KBjmI3GmQ

  15. [15]

    Measuring mathematical problem solving with the math dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021 b . URL https://datasets-benchmarks-proceedings.neurips.cc/paper...

  16. [16]

    Learning to solve arithmetic word problems with verb categorization

    Mohammad Javad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman. Learning to solve arithmetic word problems with verb categorization. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing ( EMNLP ) , pp.\ 523--533, 2014. doi:10.3115/v1/D14-1058. URL https://aclanthology.org/D14-1058

  17. [17]

    Large language models are zero-shot reasoners

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. NeurIPS, 2022

  18. [18]

    Parsing algebraic word problems into equations

    Rik Koncel-Kedziorski, Hannaneh Hajishirzi, Ashish Sabharwal, Oren Etzioni, and Siena Dumas Ang. Parsing algebraic word problems into equations. Transactions of the Association for Computational Linguistics, 3: 0 585--597, 2015. doi:10.1162/tacl_a_00160. URL https://aclanthology.org/Q15-1042

  19. [19]

    MAWPS : A math word problem repository

    Rik Koncel-Kedziorski, Subhro Roy, Aida Amini, Nate Kushman, and Hannaneh Hajishirzi. MAWPS : A math word problem repository. In Proceedings of the 2016 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies , pp.\ 1152--1157, 2016. doi:10.18653/v1/N16-1136. URL https://aclanthology.org/N16-1136

  20. [20]

    Platypus: Quick, cheap, and powerful refinement of llms

    Ariel N Lee, Cole J Hunter, and Nataniel Ruiz. Platypus: Quick, cheap, and powerful refinement of llms. ArXiv preprint, abs/2308.07317, 2023. URL https://arxiv.org/abs/2308.07317

  21. [21]

    Solving quantitative reasoning problems with language models

    Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35: 0 3843--3857, 2022. URL https://openreview.net/pdf?id=IFXTZERXdM7

  22. [22]

    CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society

    Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for" mind" exploration of large scale language model society. ArXiv preprint, abs/2303.17760, 2023 a . URL https://arxiv.org/abs/2303.17760

  23. [23]

    Making language models better reasoners with step-aware verifier

    Yifei Li, Zeqi Lin, Shizhuo Zhang, Qiang Fu, Bei Chen, Jian-Guang Lou, and Weizhu Chen. Making language models better reasoners with step-aware verifier. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 5315--5333, 2023 b . URL https://aclanthology.org/2023.acl-long.291.pdf

  24. [24]

    Program induction by rationale generation: Learning to solve and explain algebraic word problems

    Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. Program induction by rationale generation: Learning to solve and explain algebraic word problems. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 158--167, 2017. doi:10.18653/v1/P17-1015. URL https://aclanthology.org/P17-1015

  25. [25]

    The flan collection: Designing data and methods for effective instruction tuning

    Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V Le, Barret Zoph, Jason Wei, et al. The flan collection: Designing data and methods for effective instruction tuning. ICML, 2023. URL https://openreview.net/pdf?id=ZX4uS605XV

  26. [26]

    WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct

    Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. ArXiv preprint, abs/2308.09583, 2023. URL https://arxiv.org/abs/2308.09583

  27. [27]

    Language models of code are few-shot commonsense learners

    Aman Madaan, Shuyan Zhou, Uri Alon, Yiming Yang, and Graham Neubig. Language models of code are few-shot commonsense learners. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp.\ 1384--1403, 2022. URL https://aclanthology.org/2022.emnlp-main.90.pdf

  28. [28]

    LILA : A unified benchmark for mathematical reasoning

    Swaroop Mishra, Matthew Finlayson, Pan Lu, Leonard Tang, Sean Welleck, Chitta Baral, Tanmay Rajpurohit, Oyvind Tafjord, Ashish Sabharwal, Peter Clark, and Ashwin Kalyan. LILA : A unified benchmark for mathematical reasoning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp.\ 5807--5832, 2022 a . URL https://acl...

  29. [29]

    N um GLUE : A suite of fundamental yet challenging mathematical reasoning tasks

    Swaroop Mishra, Arindam Mitra, Neeraj Varshney, Bhavdeep Sachdeva, Peter Clark, Chitta Baral, and Ashwin Kalyan. N um GLUE : A suite of fundamental yet challenging mathematical reasoning tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 3505--3523, 2022 b . doi:10.18653/v1/2022....

  30. [30]

    Orca: Progressive Learning from Complex Explanation Traces of GPT-4

    Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, and Ahmed Awadallah. Orca: Progressive learning from complex explanation traces of gpt-4. ArXiv preprint, abs/2306.02707, 2023. URL https://arxiv.org/abs/2306.02707

  31. [31]

    Codegen: An open large language model for code with multi-turn program synthesis

    Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. Codegen: An open large language model for code with multi-turn program synthesis. In International Conference on Learning Representations (ICLR), 2023. URL https://openreview.net/pdf?id=iaYcJKpY2B_

  32. [32]

    Show Your Work: Scratchpads for Intermediate Computation with Language Models

    Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. Show your work: Scratchpads for intermediate computation with language models. In Deep Learning for Code Workshop, 2022. URL https://arxiv.org/abs/2112.00114

  33. [33]

    GPT-4 Technical Report

    OpenAI. Gpt-4 technical report. ArXiv preprint, abs/2303.08774, 2023. URL https://arxiv.org/abs/2303.08774

  34. [34]

    Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are NLP models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.\ 2080--2094, 2021. doi:10.18653/v1/2021.naacl-main.168. URL https://aclanthology.org/2021.na...

  35. [35]

    The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

    Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The refinedweb dataset for falcon llm: outperforming curated corpora with web data, and web data only. ArXiv preprint, abs/2306.01116, 2023. URL https://arxiv.org/abs/2306.01116

  36. [36]

    Instruction Tuning with GPT-4

    Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with gpt-4. ArXiv preprint, abs/2304.03277, 2023. URL https://arxiv.org/abs/2304.03277

  37. [37]

    Zero: Memory optimizations toward training trillion parameter models

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp.\ 1--16. IEEE, 2020. URL https://dl.acm.org/doi/10.5555/3433701.3433727

  38. [38]

    Solving general arithmetic word problems

    Subhro Roy and Dan Roth. Solving general arithmetic word problems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp.\ 1743--1752, 2015. doi:10.18653/v1/D15-1202. URL https://aclanthology.org/D15-1202

  39. [39]

    Code Llama: Open Foundation Models for Code

    Baptiste Rozi \`e re, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, J \'e r \'e my Rapin, et al. Code llama: Open foundation models for code. ArXiv preprint, abs/2308.12950, 2023. URL https://arxiv.org/abs/2308.12950

  40. [40]

    Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal V. Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian - Jian Jiang, Han Wang, Matteo Manica,...

  41. [41]

    Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

    Mirac Suzgun, Nathan Scales, Nathanael Sch \"a rli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, et al. Challenging big-bench tasks and whether chain-of-thought can solve them. ArXiv preprint, abs/2210.09261, 2022. URL https://arxiv.org/abs/2210.09261

  42. [42]

    Hashimoto

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023

  43. [43]

    Galactica: A Large Language Model for Science

    Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. Galactica: A large language model for science. ArXiv preprint, abs/2211.09085, 2022. URL https://arxiv.org/abs/2211.09085

  44. [44]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. ArXiv preprint, abs/2302.13971, 2023a. URL https://arxiv.org/abs/2302.13971

  45. [45]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. ArXiv preprint, abs/2307.09288, 2023b. URL https://arxiv.org/abs/2307.09288

  46. [46]

    Iteratively prompt pre-trained language models for chain of thought

    Boshi Wang, Xiang Deng, and Huan Sun. Iteratively prompt pre-trained language models for chain of thought. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 2714--2730. Association for Computational Linguistics, 2022a. URL https://aclanthology.org/2022.emnlp-main.174

  47. [47]

    Towards understanding chain-of-thought prompting: An empirical study of what matters

    Boshi Wang, Sewon Min, Xiang Deng, Jiaming Shen, You Wu, Luke Zettlemoyer, and Huan Sun. Towards understanding chain-of-thought prompting: An empirical study of what matters. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2717--2739. Association for Computational Linguistics, 2023 a...

  48. [48]

    Can chatgpt defend the truth? automatic dialectical evaluation elicits llms' deficiencies in reasoning

    Boshi Wang, Xiang Yue, and Huan Sun. Can chatgpt defend the truth? automatic dialectical evaluation elicits llms' deficiencies in reasoning. ArXiv preprint, abs/2305.13160, 2023b. URL https://arxiv.org/abs/2305.13160

  49. [49]

    Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models

    Lei Wang, Wanyu Xu, Yihuai Lan, Zhiqiang Hu, Yunshi Lan, Roy Ka-Wei Lee, and Ee-Peng Lim. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. ArXiv preprint, abs/2305.04091, 2023c. URL https://arxiv.org/abs/2305.04091

  50. [50]

    Making large language models better reasoners with alignment

    Peiyi Wang, Lei Li, Liang Chen, Feifan Song, Binghuai Lin, Yunbo Cao, Tianyu Liu, and Zhifang Sui. Making large language models better reasoners with alignment. ArXiv preprint, abs/2309.02144, 2023d. URL https://arxiv.org/abs/2309.02144

  51. [51]

    SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models

    Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R Loomba, Shichang Zhang, Yizhou Sun, and Wei Wang. Scibench: Evaluating college-level scientific problem-solving abilities of large language models. ArXiv preprint, abs/2307.10635, 2023e. URL https://arxiv.org/abs/2307.10635

  52. [52]

    Self-consistency improves chain of thought reasoning in language models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. International Conference on Learning Representations (ICLR), 2023f. URL https://openreview.net/pdf?id=1PL1NIMMrw

  53. [53]

    Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks

    Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Atharva Naik, Arjun Ashok, Arut Selvan Dhanasekaran, Anjana Arunkumar, David Stap, Eshaan Pathak, Giannis Karamanolakis, Haizhi Lai, Ishan Purohit, Ishani Mondal, Jacob Anderson, Kirby Kuznia, Krima Doshi, Kuntal Kumar Pal, Maitreya Patel, Mehrad Moradshahi, Mihir Parma...

  54. [54]

    How far can camels go? exploring the state of instruction tuning on open resources

    Yizhong Wang, Hamish Ivison, Pradeep Dasigi, Jack Hessel, Tushar Khot, Khyathi Raghavi Chandu, David Wadden, Kelsey MacMillan, Noah A Smith, Iz Beltagy, et al. How far can camels go? exploring the state of instruction tuning on open resources. ArXiv preprint, abs/2306.04751, 2023g. URL https://arxiv.org/abs/2306.04751

  55. [55]

    Self-instruct: Aligning language model with self generated instructions

    Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language model with self generated instructions. The 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023), 2023h. URL https://aclanthology.org/2023.acl-long.754.pdf

  56. [56]

    Codet5+: Open code large language models for code understanding and generation

    Yue Wang, Hung Le, Akhilesh Deepak Gotmare, Nghi DQ Bui, Junnan Li, and Steven CH Hoi. Codet5+: Open code large language models for code understanding and generation. ArXiv preprint, abs/2305.07922, 2023i. URL https://arxiv.org/abs/2305.07922

  57. [57]

    Finetuned language models are zero-shot learners

    Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022, 2022a. URL https://openreview.net/forum?id=gEZrGCozdqR

  58. [58]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824--24837, 2022b. URL https://openreview.net/pdf?id=_VjQlMeSB_J

  59. [59]

    Simple synthetic data reduces sycophancy in large language models

    Jerry Wei, Da Huang, Yifeng Lu, Denny Zhou, and Quoc V Le. Simple synthetic data reduces sycophancy in large language models. ArXiv preprint, abs/2308.03958, 2023. URL https://arxiv.org/abs/2308.03958

  60. [60]

    HuggingFace's Transformers: State-of-the-art Natural Language Processing

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Huggingface's transformers: State-of-the-art natural language processing. ArXiv preprint, abs/1910.03771, 2019. URL https://arxiv.org/abs/1910.03771

  61. [61]

    An explanation of in-context learning as implicit bayesian inference

    Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. An explanation of in-context learning as implicit bayesian inference. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022, 2022. URL https://openreview.net/forum?id=RdJVFCHjUMI

  62. [62]

    Decomposition enhances reasoning via self-evaluation guided decoding

    Yuxi Xie, Kenji Kawaguchi, Yiran Zhao, Xu Zhao, Min-Yen Kan, Junxian He, and Qizhe Xie. Decomposition enhances reasoning via self-evaluation guided decoding. ArXiv preprint, abs/2305.00633, 2023. URL https://arxiv.org/abs/2305.00633

  63. [63]

    WizardLM: Empowering large pre-trained language models to follow complex instructions

    Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. Wizardlm: Empowering large language models to follow complex instructions. ArXiv preprint, abs/2304.12244, 2023. URL https://arxiv.org/abs/2304.12244

  64. [64]

    Gpt can solve mathematical problems without a calculator

    Zhen Yang, Ming Ding, Qingsong Lv, Zhihuan Jiang, Zehai He, Yuyi Guo, Jinfeng Bai, and Jie Tang. Gpt can solve mathematical problems without a calculator. ArXiv preprint, abs/2309.03241, 2023. URL https://arxiv.org/abs/2309.03241

  65. [65]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023. URL https://openreview.net/pdf?id=WE_vluYUL-X

  66. [66]

    CrossFit: A few-shot learning challenge for cross-task generalization in NLP

    Qinyuan Ye, Bill Yuchen Lin, and Xiang Ren. CrossFit: A few-shot learning challenge for cross-task generalization in NLP. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 7163--7189, 2021. doi:10.18653/v1/2021.emnlp-main.572. URL https://aclanthology.org/2021.emnlp-main.572

  67. [67]

    MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

    Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. Metamath: Bootstrap your own mathematical questions for large language models. ArXiv preprint, abs/2309.12284, 2023. URL https://arxiv.org/abs/2309.12284

  68. [68]

    Scaling Relationship on Learning Mathematical Reasoning with Large Language Models

    Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Chuanqi Tan, and Chang Zhou. Scaling relationship on learning mathematical reasoning with large language models. ArXiv preprint, abs/2308.01825, 2023. URL https://arxiv.org/abs/2308.01825

  69. [69]

    OPT: Open Pre-trained Transformer Language Models

    Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. ArXiv preprint, abs/2205.01068, 2022. URL https://arxiv.org/abs/2205.01068

  70. [70]

    Progressive-hint prompting improves reasoning in large language models

    Chuanyang Zheng, Zhengying Liu, Enze Xie, Zhenguo Li, and Yu Li. Progressive-hint prompting improves reasoning in large language models. ArXiv preprint, abs/2304.09797, 2023a. URL https://arxiv.org/abs/2304.09797

  71. [71]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. ArXiv preprint, abs/2306.05685, 2023b. URL https://arxiv.org/abs/2306.05685

  72. [72]

    AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models

    Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. Agieval: A human-centric benchmark for evaluating foundation models. ArXiv preprint, abs/2304.06364, 2023. URL https://arxiv.org/abs/2304.06364

  73. [73]

    Solving challenging math word problems using gpt-4 code interpreter with code-based self-verification

    Aojun Zhou, Ke Wang, Zimu Lu, Weikang Shi, Sichun Luo, Zipeng Qin, Shaoqing Lu, Anya Jia, Linqi Song, Mingjie Zhan, et al. Solving challenging math word problems using gpt-4 code interpreter with code-based self-verification. ArXiv preprint, abs/2308.07921, 2023a. URL https://arxiv.org/abs/2308.07921

  74. [74]

    LIMA: Less Is More for Alignment

    Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. Lima: Less is more for alignment. ArXiv preprint, abs/2305.11206, 2023b. URL https://arxiv.org/abs/2305.11206

  75. [75]

    Least-to-most prompting enables complex reasoning in large language models

    Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Olivier Bousquet, Quoc Le, and Ed Chi. Least-to-most prompting enables complex reasoning in large language models. International Conference on Learning Representations (ICLR), 2023c. URL https://openreview.net/pdf?id=WZH7099tgfM