arxiv: 2309.05653 · v3 · pith:25UWFI7Snew · submitted 2023-09-11 · 💻 cs.CL

MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning

Xiang Yue, Xingwei Qu, Ge Zhang, Yao Fu, Wenhao Huang, Huan Sun, Yu Su, Wenhu Chen

Pith reviewed 2026-05-17 23:42 UTC · model grok-4.3

classification 💻 cs.CL

keywords mathematical reasoning · instruction tuning · chain-of-thought · program-of-thought · large language models · math problem solving · open-source models · hybrid rationales

The pith

Training on a hybrid of chain-of-thought and program-of-thought rationales builds open-source math models that outperform prior open-source models on nine benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors create MAmmoTH models by instruction-tuning on MathInstruct, a dataset assembled from thirteen math sources that mixes verbal chain-of-thought rationales with executable program-of-thought rationales. Six of the sources receive newly written rationales to broaden topic coverage. The hybrid format lets a single model switch between natural-language reasoning and code execution depending on the problem at hand. This produces consistent accuracy lifts across scales, including a 7B model reaching 33 percent on the competition-level MATH dataset and a 34B model reaching 44 percent, which exceeds GPT-4's chain-of-thought score on the same set.
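
To make the two rationale styles concrete, here is a minimal Python sketch of how an answer might be read off each one. It is an illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, and the rule "run the program first, fall back to the verbal rationale if it fails" is an assumption introduced here, since the summary above only says the model can switch between the two styles.

```python
import contextlib
import io
import re
from typing import Optional


def run_program_rationale(program: str) -> Optional[str]:
    """Execute a program-of-thought rationale and return what it prints.

    Returns None if the program raises or prints nothing.
    Toy only: real model output should run in a sandbox, never via bare exec.
    """
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(program, {})
    except Exception:
        return None
    output = buffer.getvalue().strip()
    return output or None


def last_number(text: str) -> Optional[str]:
    """Treat the final number in a chain-of-thought rationale as its answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return numbers[-1] if numbers else None


def hybrid_answer(pot_rationale: str, cot_rationale: str) -> Optional[str]:
    """Assumed fallback rule: prefer the executed program, else the verbal steps."""
    executed = run_program_rationale(pot_rationale)
    return executed if executed is not None else last_number(cot_rationale)


cot = "Driving 60 miles per hour for 2.5 hours covers 60 * 2.5 = 150 miles."
print(hybrid_answer("print(60 * 2.5)", cot))  # program runs: prints 150.0
print(hybrid_answer("print(60 * )", cot))     # program is broken: prints 150
```

The second call shows why a verbal rationale is useful as a backstop: a syntactically broken program yields no answer at all, whereas the text still carries one.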

Core claim

The paper claims that instruction tuning on MathInstruct, which mixes chain-of-thought and program-of-thought rationales across thirteen datasets with wide math-field coverage, yields MAmmoTH models that substantially outperform existing open-source models on nine mathematical reasoning datasets at every scale, delivering average accuracy gains of 16 to 32 percent. The 7B version scores 33 percent on MATH, 23 points above the previous best open-source 7B model, while the 34B version scores 44 percent on MATH and surpasses GPT-4's CoT result.
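
The "23 points" figure is easy to misread as a relative improvement, so a short worked calculation may help. The sketch below takes the rounded numbers in the claim at face value (33 percent for the 7B model, a 23-point margin) and derives the implied baseline; the function names are hypothetical and the result is only as precise as those rounded inputs.

```python
def implied_baseline(score_pct: float, gain_points: float) -> float:
    """Baseline accuracy implied by a score and an absolute gain in percentage points."""
    return score_pct - gain_points


def relative_gain_pct(score_pct: float, baseline_pct: float) -> float:
    """Relative improvement over the baseline, expressed in percent."""
    return (score_pct - baseline_pct) * 100.0 / baseline_pct


baseline = implied_baseline(33, 23)
print(baseline)                         # 10
print(relative_gain_pct(33, baseline))  # 230.0
```

Read this way, a 23-point absolute gain over a baseline of roughly 10 percent is more than a threefold increase in accuracy, which is a much larger effect than a 23 percent relative gain would be.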

What carries the argument

MathInstruct, the instruction-tuning dataset that presents a hybrid of chain-of-thought and program-of-thought rationales compiled from thirteen math datasets.

If this is right

Models gain the ability to apply either verbal steps or code execution depending on the math problem.
The program-of-thought component increases the potential for tool use during reasoning.
Open-source models reach higher accuracy on competition-level tasks such as MATH.
Broad coverage across math fields supports stronger generalization to new problems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same hybrid rationale mix could be applied to scientific reasoning tasks that also mix explanation and simulation.
Curating high-quality rationales may matter more than raw data volume when specializing models for reasoning.
Adding verification steps to the program-of-thought outputs could further reduce calculation errors.
Smaller models trained this way might serve educational tools that need both text explanations and runnable code.

Load-bearing premise

The measured accuracy gains result specifically from the hybrid CoT-PoT format and the newly curated rationales rather than from dataset size, model scale, or other training choices.

What would settle it

Train identical base models on matched volumes of data that contain only CoT rationales, only PoT rationales, or the original uncurated sources, then check whether the reported gains on the nine evaluation datasets disappear.
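
A minimal sketch of how the matched training sets for such a test could be drawn. It assumes each example carries a hypothetical "style" field set to "cot" or "pot", and it matches on example count only; matching on token volume, as the test calls for in spirit, would need a tokenizer and is left out here.

```python
import random


def matched_ablation_splits(examples, seed=0):
    """Build CoT-only, PoT-only, and hybrid subsets with the same number of examples.

    The budget is the size of the smaller style pool, so every split can be filled
    without repeating an example.
    """
    cot = [e for e in examples if e["style"] == "cot"]
    pot = [e for e in examples if e["style"] == "pot"]
    budget = min(len(cot), len(pot))
    rng = random.Random(seed)

    half = budget // 2
    hybrid = rng.sample(cot, half) + rng.sample(pot, budget - half)
    rng.shuffle(hybrid)

    return {
        "cot_only": rng.sample(cot, budget),
        "pot_only": rng.sample(pot, budget),
        "hybrid": hybrid,
    }


toy = [{"style": "cot", "id": i} for i in range(10)]
toy += [{"style": "pot", "id": i} for i in range(6)]
splits = matched_ablation_splits(toy)
print({name: len(subset) for name, subset in splits.items()})
```

Fine-tuning the same base model on each split with an identical schedule would then isolate the rationale format as the only variable.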

Original abstract

We introduce MAmmoTH, a series of open-source large language models (LLMs) specifically tailored for general math problem-solving. The MAmmoTH models are trained on MathInstruct, our meticulously curated instruction tuning dataset. MathInstruct is compiled from 13 math datasets with intermediate rationales, six of which have rationales newly curated by us. It presents a unique hybrid of chain-of-thought (CoT) and program-of-thought (PoT) rationales, and also ensures extensive coverage of diverse fields in math. The hybrid of CoT and PoT not only unleashes the potential of tool use but also allows different thought processes for different math problems. As a result, the MAmmoTH series substantially outperform existing open-source models on nine mathematical reasoning datasets across all scales with an average accuracy gain between 16% and 32%. Remarkably, our MAmmoTH-7B model reaches 33% on MATH (a competition-level dataset), which exceeds the best open-source 7B model (WizardMath) by 23%, and the MAmmoTH-34B model achieves 44% accuracy on MATH, even surpassing GPT-4's CoT result. Our work underscores the importance of diverse problem coverage and the use of hybrid rationales in developing superior math generalist models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

MAmmoTH delivers a new hybrid CoT-PoT math dataset and solid benchmark gains over prior open-source models, but the improvements are not yet isolated from data volume or curation effects.

The main thing to know is that this paper builds a new instruction dataset called MathInstruct from 13 sources, adds fresh rationales to six of them, mixes chain-of-thought and program-of-thought formats, and trains models that beat earlier open-source math LLMs on nine benchmarks. The 7B version hits 33% on MATH and the 34B version reaches 44%, which is a noticeable step up from things like WizardMath at similar scales. That is the concrete advance here: more diverse coverage plus the hybrid rationale style, with the dataset itself as the reusable piece.

Referee Report

1 major / 2 minor

Summary. The paper introduces the MAmmoTH series of open-source LLMs for general mathematical problem-solving. These models are fine-tuned on MathInstruct, a hybrid instruction dataset compiled from 13 sources (six with newly curated rationales) that mixes chain-of-thought (CoT) and program-of-thought (PoT) formats. The central claim is that this hybrid approach yields substantial gains over prior open-source models on nine math reasoning benchmarks, with average accuracy improvements of 16-32%, including 33% on MATH for the 7B variant (23 points above WizardMath) and 44% for the 34B variant (exceeding GPT-4 CoT).

Significance. If the gains are robustly attributable to the hybrid CoT-PoT format and curated rationales rather than scale or unstated factors, the work would provide a practical recipe for improving mathematical reasoning in open models and underscore the value of diverse rationale styles. The release of models and dataset supports reproducibility and follow-up research.

major comments (1)

[§4 and Table 2] §4 (Experiments) and Table 2: End-to-end results are reported against WizardMath and other baselines, but no ablation holds base model, training schedule, and total token count fixed while varying only the presence of PoT examples versus pure CoT or the six newly curated rationales. Without this isolation, the 16-32% average gains and the specific MATH jumps cannot be confidently attributed to the hybrid format as claimed in the abstract and §3.

minor comments (2)

[§3.2] §3.2: The description of how the six new rationales were curated could be expanded with explicit quality-control steps or inter-annotator agreement metrics to strengthen the claim of 'meticulously curated'.
[Figure 1 and §3] Figure 1 and §3: The mixture proportions across the 13 sources are not tabulated; adding a breakdown of example counts or token shares per source would clarify the 'extensive coverage' assertion.
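
The breakdown requested in the second minor comment is cheap to produce once the training examples are loaded. The sketch below assumes each example has a hypothetical "source" field naming the dataset it came from; the source names and counts in the usage lines are placeholders, not figures from the paper.

```python
from collections import Counter


def mixture_table(examples):
    """Return (source, example_count, share_in_percent) rows, largest source first."""
    counts = Counter(e["source"] for e in examples)
    total = sum(counts.values())
    return [(source, n, 100.0 * n / total) for source, n in counts.most_common()]


# Placeholder data: three examples from one source, one from another.
toy = [{"source": "source_a"}] * 3 + [{"source": "source_b"}]
for source, n, share in mixture_table(toy):
    print(f"{source}\t{n}\t{share:.1f}%")
```

Reporting token shares instead of example counts would follow the same shape, with a token count summed per source in place of the counter.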

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We address the major comment point by point below, providing clarifications and committing to revisions where appropriate to strengthen the paper.

Referee: [§4 and Table 2] §4 (Experiments) and Table 2: End-to-end results are reported against WizardMath and other baselines, but no ablation holds base model, training schedule, and total token count fixed while varying only the presence of PoT examples versus pure CoT or the six newly curated rationales. Without this isolation, the 16-32% average gains and the specific MATH jumps cannot be confidently attributed to the hybrid format as claimed in the abstract and §3.

Authors: We appreciate the referee's emphasis on isolating the contribution of the hybrid CoT-PoT format and the newly curated rationales. Our primary comparisons are to WizardMath and similar baselines that use the same base models (Llama-2-7B/34B) and comparable fine-tuning setups, with the key distinction being our use of MathInstruct's hybrid rationales versus their predominantly CoT-based data. However, we acknowledge that a more tightly controlled ablation (fixing base model, training schedule, and total token count while varying only PoT inclusion or the six curated sources) would provide stronger attribution. In the revised manuscript, we will add such an ablation study in §4, reporting results for pure-CoT, pure-PoT, and hybrid variants under matched conditions. This will directly support the claims in the abstract and §3 regarding the benefits of hybrid rationales.
revision: yes

Circularity Check

0 steps flagged

No significant circularity; results rest on external benchmarks

The paper trains models on the newly compiled MathInstruct mixture and reports accuracy on nine standard held-out mathematical reasoning benchmarks. These evaluation sets are distinct from the training sources, and the reported gains are measured against external baselines rather than being derived from any fitted parameter or self-referential definition. No equations, uniqueness theorems, or ansatzes are invoked that reduce the central performance claims to the inputs by construction. The absence of ablations is a limitation on causal attribution but does not constitute circularity under the defined criteria.
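
The statement that evaluation sets are distinct from the training sources can be spot-checked mechanically. The sketch below is one simple way to do so, assuming questions are available as plain strings; it flags only near-exact matches after normalizing case, punctuation, and whitespace, so paraphrased overlap would go undetected. The function names are hypothetical.

```python
import re


def normalize(question: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    stripped = re.sub(r"[^a-z0-9 ]", " ", question.lower())
    return re.sub(r"\s+", " ", stripped).strip()


def overlapping_questions(train_questions, eval_questions):
    """Return evaluation questions whose normalized text also appears in training."""
    seen = {normalize(q) for q in train_questions}
    return [q for q in eval_questions if normalize(q) in seen]


train = ["What is 2 + 2?"]
evals = ["what is 2+2 ?", "What is 3 + 3?"]
print(overlapping_questions(train, evals))  # ['what is 2+2 ?']
```

An empty result on all nine benchmarks would support the audit's reading; any hits would need to be removed or reported alongside the accuracy numbers.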

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard LLM fine-tuning assumptions and the value of hybrid reasoning formats; no new entities are postulated and free parameters are limited to routine training choices.

free parameters (1)

LLM training hyperparameters
Learning rate, epochs, and batch size chosen during fine-tuning but not central to the hybrid-rationale claim.

axioms (1)

domain assumption: Instruction tuning on curated datasets with rationales improves LLM reasoning performance
Invoked throughout the training and evaluation approach described in the abstract.

pith-pipeline@v0.9.0 · 5781 in / 1233 out tokens · 42127 ms · 2026-05-17T23:42:49.995983+00:00 · methodology

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AI co-mathematician: Accelerating mathematicians with agentic AI
cs.AI 2026-05 unverdicted novelty 7.0

An interactive AI workbench for mathematicians achieves 48% on FrontierMath Tier 4 and helped solve open problems in early tests.
Why Supervised Fine-Tuning Fails to Learn: A Systematic Study of Incomplete Learning in Large Language Models
cs.CL 2026-04 unverdicted novelty 7.0

Supervised fine-tuning of LLMs often fails to fully internalize all training instances due to five recurring causes including missing prerequisites and data conflicts, as diagnosed via a new framework across multiple models.
Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing
cs.CL 2024-06 unverdicted novelty 7.0

Magpie synthesizes 300K high-quality alignment instructions from Llama-3-Instruct via auto-regressive prompting on partial templates, enabling fine-tuned models to match official instruct performance on AlpacaEval, Ar...
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
cs.CL 2024-05 unverdicted novelty 7.0

DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
cs.CV 2024-03 conditional novelty 7.0

MathVerse is a benchmark that tests multi-modal LLMs on visual math by providing each problem in six versions with progressively less diagram and text information to measure true visual understanding.
AI co-mathematician: Accelerating mathematicians with agentic AI
cs.AI 2026-05 unverdicted novelty 6.0

An interactive AI workbench called the AI co-mathematician supports open-ended mathematical research and achieves a new high score of 48% on FrontierMath Tier 4.
CeRA: Overcoming the Linear Ceiling of Low-Rank Adaptation via Capacity Expansion
cs.LG 2026-02 unverdicted novelty 6.0

CeRA overcomes LoRA's linear ceiling by injecting non-linear SiLU gating and dropout, outperforming high-rank LoRA on complex math reasoning with 1/8 the parameters.
Vision-aligned Latent Reasoning for Multi-modal Large Language Model
cs.CV 2026-02 unverdicted novelty 6.0

VaLR generates vision-aligned latent tokens before each reasoning step to preserve perceptual cues, improving VSI-Bench accuracy from 33.0% to 52.9%.
SmolVLM: Redefining small and efficient multimodal models
cs.AI 2025-04 unverdicted novelty 6.0

SmolVLM-256M outperforms a 300-times larger model using under 1 GB GPU memory, while the 2.2B version matches state-of-the-art VLMs at half the memory cost.
Muon is Scalable for LLM Training
cs.LG 2025-02 unverdicted novelty 6.0

Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.
Process Reinforcement through Implicit Rewards
cs.LG 2025-02 conditional novelty 6.0

PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 1...
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
cs.AI 2023-12 conditional novelty 6.0

Math-Shepherd is an automatically trained process reward model that scores solution steps to verify and reinforce LLMs, lifting Mistral-7B from 77.9% to 89.1% on GSM8K and 28.6% to 43.5% on MATH.
Rethinking Layer Relevance in Large Language Models Beyond Cosine Similarity
cs.LG 2026-05 unverdicted novelty 5.0

Cosine similarity poorly predicts performance degradation from layer removal in LLMs, making direct accuracy-drop ablation a more reliable relevance metric.
NoisyCoconut: Counterfactual Consensus via Latent Space Reasoning
cs.LG 2026-05 unverdicted novelty 5.0

Injecting noise into LLM latent trajectories creates diverse reasoning paths whose agreement acts as a confidence signal for selective abstention, cutting error rates from 40-70% to under 15% on math tasks.
Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs
cs.LG 2026-04 unverdicted novelty 5.0

Guardian-as-an-Advisor prepends risk labels and explanations from a guardian model to queries, improving LLM safety compliance and reducing over-refusal while adding minimal compute overhead.
NVIDIA Nemotron 3: Efficient and Open Intelligence
cs.CL 2025-12 unverdicted novelty 5.0

NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
cs.CL 2025-02 unverdicted novelty 5.0

SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
cs.CL 2024-01 unverdicted novelty 4.0

DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.
A Survey on Knowledge Distillation of Large Language Models
cs.CL 2024-02 accept novelty 3.0

A comprehensive survey of knowledge distillation for LLMs structured around algorithms, skill enhancement, and vertical applications, highlighting data augmentation as a key enabler.

Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages · cited by 18 Pith papers · 32 internal anchors

[1]

M ath QA : Towards interpretable math word problem solving with operation-based formalisms

Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. M ath QA : Towards interpretable math word problem solving with operation-based formalisms. In Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long...

work page doi:10.18653/v1/n19-1245 2019
[2]

PaLM 2 Technical Report

Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. ArXiv preprint, abs/2305.10403, 2023. URL https://arxiv.org/abs/2305.10403

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. ArXiv preprint, abs/2212.08073, 2022. URL https://arxiv.org/abs/2212.08073

work page internal anchor Pith review Pith/arXiv arXiv 2022
[4]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. ArXiv preprint, abs/2107.03374, 2021. URL https://arxiv.org/abs/2107.03374

work page internal anchor Pith review Pith/arXiv arXiv 2021
[5]

Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. ArXiv preprint, abs/2211.12588, 2022. URL https://arxiv.org/abs/2211.12588

work page internal anchor Pith review Pith/arXiv arXiv 2022
[6]

Theoremqa: A theorem-driven question answering dataset

Wenhu Chen, Ming Yin, Max Ku, Elaine Wan, Xueguang Ma, Jianyu Xu, Tony Xia, Xinyi Wang, and Pan Lu. Theoremqa: A theorem-driven question answering dataset. ArXiv preprint, abs/2305.12524, 2023. URL https://arxiv.org/abs/2305.12524

work page arXiv 2023
[7]

Scaling Instruction-Finetuned Language Models

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. ArXiv preprint, abs/2210.11416, 2022. URL https://arxiv.org/abs/2210.11416

work page internal anchor Pith review Pith/arXiv arXiv 2022
[8]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. ArXiv preprint, abs/2110.14168, 2021. URL https://arxiv.org/abs/2110.14168

work page internal anchor Pith review Pith/arXiv arXiv 2021
[9]

Advancing mathematics by guiding human intuition with ai

Alex Davies, Petar Veli c kovi \'c , Lars Buesing, Sam Blackwell, Daniel Zheng, Nenad Toma s ev, Richard Tanburn, Peter Battaglia, Charles Blundell, Andr \'a s Juh \'a sz, et al. Advancing mathematics by guiding human intuition with ai. Nature, 600 0 (7887): 0 70--74, 2021. URL https://www.nature.com/articles/s41586-021-04086-x

work page 2021
[10]

QLoRA: Efficient Finetuning of Quantized LLMs

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. ArXiv preprint, abs/2305.14314, 2023. URL https://arxiv.org/abs/2305.14314

work page internal anchor Pith review Pith/arXiv arXiv 2023
[11]

a rli, Ekin Aky \

Andrew Drozdov, Nathanael Sch \"a rli, Ekin Aky \"u rek, Nathan Scales, Xinying Song, Xinyun Chen, Olivier Bousquet, and Denny Zhou. Compositional semantic parsing with large language models. International Conference on Learning Representations (ICLR), 2023. URL https://openreview.net/forum?id=gJW8hSGBys8

work page 2023
[12]

Pal: Program-aided language models

Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: Program-aided language models. In International Conference on Machine Learning, pp.\ 10764--10799. PMLR, 2023. URL https://proceedings.mlr.press/v202/gao23f/gao23f.pdf

work page 2023
[13]

CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing

Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Nan Duan, and Weizhu Chen. Critic: Large language models can self-correct with tool-interactive critiquing. ArXiv preprint, abs/2305.11738, 2023. URL https://arxiv.org/abs/2305.11738

work page internal anchor Pith review Pith/arXiv arXiv 2023
[14]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021 , 2021 a . URL https://openreview.net/forum?id=d7KBjmI3GmQ

work page 2021
[15]

Measuring mathematical problem solving with the math dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021 b . URL https://datasets-benchmarks-proceedings.neurips.cc/paper...

work page 2021
[16]

Learning to solve arithmetic word problems with verb categorization

Mohammad Javad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman. Learning to solve arithmetic word problems with verb categorization. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing ( EMNLP ) , pp.\ 523--533, 2014. doi:10.3115/v1/D14-1058. URL https://aclanthology.org/D14-1058

work page doi:10.3115/v1/d14-1058 2014
[17]

Large language models are zero-shot reasoners

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. NeurIPS, 2022

work page 2022
[18]

Parsing algebraic word problems into equations

Rik Koncel-Kedziorski, Hannaneh Hajishirzi, Ashish Sabharwal, Oren Etzioni, and Siena Dumas Ang. Parsing algebraic word problems into equations. Transactions of the Association for Computational Linguistics, 3: 0 585--597, 2015. doi:10.1162/tacl_a_00160. URL https://aclanthology.org/Q15-1042

work page doi:10.1162/tacl_a_00160 2015
[19]

MAWPS : A math word problem repository

Rik Koncel-Kedziorski, Subhro Roy, Aida Amini, Nate Kushman, and Hannaneh Hajishirzi. MAWPS : A math word problem repository. In Proceedings of the 2016 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies , pp.\ 1152--1157, 2016. doi:10.18653/v1/N16-1136. URL https://aclanthology.org/N16-1136

work page doi:10.18653/v1/n16-1136 2016
[20]

Platypus: Quick, cheap, and powerful refinement of llms

Ariel N Lee, Cole J Hunter, and Nataniel Ruiz. Platypus: Quick, cheap, and powerful refinement of llms. ArXiv preprint, abs/2308.07317, 2023. URL https://arxiv.org/abs/2308.07317

work page arXiv 2023
[21]

Solving quantitative reasoning problems with language models

Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35: 0 3843--3857, 2022. URL https://openreview.net/pdf?id=IFXTZERXdM7

work page 2022
[22]

CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society

Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for" mind" exploration of large scale language model society. ArXiv preprint, abs/2303.17760, 2023 a . URL https://arxiv.org/abs/2303.17760

work page internal anchor Pith review Pith/arXiv arXiv 2023
[23]

Making language models better reasoners with step-aware verifier

Yifei Li, Zeqi Lin, Shizhuo Zhang, Qiang Fu, Bei Chen, Jian-Guang Lou, and Weizhu Chen. Making language models better reasoners with step-aware verifier. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 5315--5333, 2023 b . URL https://aclanthology.org/2023.acl-long.291.pdf

work page 2023
[24]

Program induction by rationale generation: Learning to solve and explain algebraic word problems

Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. Program induction by rationale generation: Learning to solve and explain algebraic word problems. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 158--167, 2017. doi:10.18653/v1/P17-1015. URL https://aclanthology.org/P17-1015

work page doi:10.18653/v1/p17-1015 2017
[25]

The flan collection: Designing data and methods for effective instruction tuning

Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V Le, Barret Zoph, Jason Wei, et al. The flan collection: Designing data and methods for effective instruction tuning. ICML, 2023. URL https://openreview.net/pdf?id=ZX4uS605XV

work page 2023
[26]

WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct

Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. ArXiv preprint, abs/2308.09583, 2023. URL https://arxiv.org/abs/2308.09583

work page internal anchor Pith review Pith/arXiv arXiv 2023
[27]

Language models of code are few-shot commonsense learners

Aman Madaan, Shuyan Zhou, Uri Alon, Yiming Yang, and Graham Neubig. Language models of code are few-shot commonsense learners. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp.\ 1384--1403, 2022. URL https://aclanthology.org/2022.emnlp-main.90.pdf

work page 2022
[28]

LILA : A unified benchmark for mathematical reasoning

Swaroop Mishra, Matthew Finlayson, Pan Lu, Leonard Tang, Sean Welleck, Chitta Baral, Tanmay Rajpurohit, Oyvind Tafjord, Ashish Sabharwal, Peter Clark, and Ashwin Kalyan. LILA : A unified benchmark for mathematical reasoning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp.\ 5807--5832, 2022 a . URL https://acl...

work page 2022
[29]

N um GLUE : A suite of fundamental yet challenging mathematical reasoning tasks

Swaroop Mishra, Arindam Mitra, Neeraj Varshney, Bhavdeep Sachdeva, Peter Clark, Chitta Baral, and Ashwin Kalyan. N um GLUE : A suite of fundamental yet challenging mathematical reasoning tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 3505--3523, 2022 b . doi:10.18653/v1/2022....

work page doi:10.18653/v1/2022.acl-long.246 2022
[30]

Orca: Progressive Learning from Complex Explanation Traces of GPT-4

Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, and Ahmed Awadallah. Orca: Progressive learning from complex explanation traces of gpt-4. ArXiv preprint, abs/2306.02707, 2023. URL https://arxiv.org/abs/2306.02707

work page internal anchor Pith review Pith/arXiv arXiv 2023
[31]

Codegen: An open large language model for code with multi-turn program synthesis

Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. Codegen: An open large language model for code with multi-turn program synthesis. In International Conference on Learning Representations (ICLR), 2023. URL https://openreview.net/pdf?id=iaYcJKpY2B_

work page 2023
[32]

Show Your Work: Scratchpads for Intermediate Computation with Language Models

Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. Show your work: Scratchpads for intermediate computation with language models. In Deep Learning for Code Workshop, 2022. URL https://arxiv.org/abs/2112.00114

work page internal anchor Pith review Pith/arXiv arXiv 2022
[33]

GPT-4 Technical Report

OpenAI. Gpt-4 technical report. ArXiv preprint, abs/2303.08774, 2023. URL https://arxiv.org/abs/2303.08774

work page internal anchor Pith review Pith/arXiv arXiv 2023
[34]

Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are NLP models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.\ 2080--2094, 2021. doi:10.18653/v1/2021.naacl-main.168. URL https://aclanthology.org/2021.na...

work page internal anchor Pith review doi:10.18653/v1/2021.naacl-main.168 2021
[35]

The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The refinedweb dataset for falcon llm: outperforming curated corpora with web data, and web data only. ArXiv preprint, abs/2306.01116, 2023. URL https://arxiv.org/abs/2306.01116

work page internal anchor Pith review Pith/arXiv arXiv 2023
[36]

Instruction Tuning with GPT-4

Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with gpt-4. ArXiv preprint, abs/2304.03277, 2023. URL https://arxiv.org/abs/2304.03277

work page internal anchor Pith review Pith/arXiv arXiv 2023
[37]

Zero: Memory optimizations toward training trillion parameter models

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp.\ 1--16. IEEE, 2020. URL https://dl.acm.org/doi/10.5555/3433701.3433727

work page doi:10.5555/3433701.3433727 2020
[38]

Solving general arithmetic word problems

Subhro Roy and Dan Roth. Solving general arithmetic word problems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp.\ 1743--1752, 2015. doi:10.18653/v1/D15-1202. URL https://aclanthology.org/D15-1202

work page doi:10.18653/v1/d15-1202 2015
[39]

Code Llama: Open Foundation Models for Code

Baptiste Rozi \`e re, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, J \'e r \'e my Rapin, et al. Code llama: Open foundation models for code. ArXiv preprint, abs/2308.12950, 2023. URL https://arxiv.org/abs/2308.12950

[40]

Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal V. Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica,...

[41]

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, et al. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. ArXiv preprint, abs/2210.09261, 2022. URL https://arxiv.org/abs/2210.09261

[42]

Stanford Alpaca: An instruction-following LLaMA model

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023

[43]

Galactica: A Large Language Model for Science

Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. Galactica: A large language model for science. ArXiv preprint, abs/2211.09085, 2022. URL https://arxiv.org/abs/2211.09085

[44]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. ArXiv preprint, abs/2302.13971, 2023a. URL https://arxiv.org/abs/2302.13971

[45]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. ArXiv preprint, abs/2307.09288, 2023b. URL https://arxiv.org/abs/2307.09288

[46]

Iteratively prompt pre-trained language models for chain of thought

Boshi Wang, Xiang Deng, and Huan Sun. Iteratively prompt pre-trained language models for chain of thought. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 2714–2730. Association for Computational Linguistics, 2022a. URL https://aclanthology.org/2022.emnlp-main.174

[47]

Towards understanding chain-of-thought prompting: An empirical study of what matters

Boshi Wang, Sewon Min, Xiang Deng, Jiaming Shen, You Wu, Luke Zettlemoyer, and Huan Sun. Towards understanding chain-of-thought prompting: An empirical study of what matters. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2717–2739. Association for Computational Linguistics, 2023a...

doi:10.18653/v1/2023.acl-long.153
[48]

Can ChatGPT defend the truth? Automatic dialectical evaluation elicits LLMs' deficiencies in reasoning

Boshi Wang, Xiang Yue, and Huan Sun. Can ChatGPT defend the truth? Automatic dialectical evaluation elicits LLMs' deficiencies in reasoning. ArXiv preprint, abs/2305.13160, 2023b. URL https://arxiv.org/abs/2305.13160

[49]

Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models

Lei Wang, Wanyu Xu, Yihuai Lan, Zhiqiang Hu, Yunshi Lan, Roy Ka-Wei Lee, and Ee-Peng Lim. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. ArXiv preprint, abs/2305.04091, 2023c. URL https://arxiv.org/abs/2305.04091

[50]

Making large language models better reasoners with alignment

Peiyi Wang, Lei Li, Liang Chen, Feifan Song, Binghuai Lin, Yunbo Cao, Tianyu Liu, and Zhifang Sui. Making large language models better reasoners with alignment. ArXiv preprint, abs/2309.02144, 2023d. URL https://arxiv.org/abs/2309.02144

[51]

SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models

Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R Loomba, Shichang Zhang, Yizhou Sun, and Wei Wang. SciBench: Evaluating college-level scientific problem-solving abilities of large language models. ArXiv preprint, abs/2307.10635, 2023e. URL https://arxiv.org/abs/2307.10635

[52]

Self-consistency improves chain of thought reasoning in language models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. International Conference on Learning Representations (ICLR), 2023f. URL https://openreview.net/pdf?id=1PL1NIMMrw

[53]

Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks

Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Atharva Naik, Arjun Ashok, Arut Selvan Dhanasekaran, Anjana Arunkumar, David Stap, Eshaan Pathak, Giannis Karamanolakis, Haizhi Lai, Ishan Purohit, Ishani Mondal, Jacob Anderson, Kirby Kuznia, Krima Doshi, Kuntal Kumar Pal, Maitreya Patel, Mehrad Moradshahi, Mihir Parma...

[54]

How far can camels go? exploring the state of instruction tuning on open resources

Yizhong Wang, Hamish Ivison, Pradeep Dasigi, Jack Hessel, Tushar Khot, Khyathi Raghavi Chandu, David Wadden, Kelsey MacMillan, Noah A Smith, Iz Beltagy, et al. How far can camels go? Exploring the state of instruction tuning on open resources. ArXiv preprint, abs/2306.04751, 2023g. URL https://arxiv.org/abs/2306.04751

[55]

Self-instruct: Aligning language model with self generated instructions

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language model with self generated instructions. The 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023), 2023h. URL https://aclanthology.org/2023.acl-long.754.pdf

[56]

CodeT5+: Open code large language models for code understanding and generation

Yue Wang, Hung Le, Akhilesh Deepak Gotmare, Nghi DQ Bui, Junnan Li, and Steven CH Hoi. CodeT5+: Open code large language models for code understanding and generation. ArXiv preprint, abs/2305.07922, 2023i. URL https://arxiv.org/abs/2305.07922

[57]

Finetuned language models are zero-shot learners

Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022, 2022a. URL https://openreview.net/forum?id=gEZrGCozdqR

[58]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022b. URL https://openreview.net/pdf?id=_VjQlMeSB_J

[59]

Simple synthetic data reduces sycophancy in large language models

Jerry Wei, Da Huang, Yifeng Lu, Denny Zhou, and Quoc V Le. Simple synthetic data reduces sycophancy in large language models. ArXiv preprint, abs/2308.03958, 2023. URL https://arxiv.org/abs/2308.03958

[60]

HuggingFace's Transformers: State-of-the-art Natural Language Processing

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv preprint, abs/1910.03771, 2019. URL https://arxiv.org/abs/1910.03771

[61]

An explanation of in-context learning as implicit Bayesian inference

Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. An explanation of in-context learning as implicit Bayesian inference. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022, 2022. URL https://openreview.net/forum?id=RdJVFCHjUMI

[62]

Decomposition enhances reasoning via self-evaluation guided decoding

Yuxi Xie, Kenji Kawaguchi, Yiran Zhao, Xu Zhao, Min-Yen Kan, Junxian He, and Qizhe Xie. Decomposition enhances reasoning via self-evaluation guided decoding. ArXiv preprint, abs/2305.00633, 2023. URL https://arxiv.org/abs/2305.00633

[63]

WizardLM: Empowering large pre-trained language models to follow complex instructions

Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. WizardLM: Empowering large language models to follow complex instructions. ArXiv preprint, abs/2304.12244, 2023. URL https://arxiv.org/abs/2304.12244

[64]

GPT can solve mathematical problems without a calculator

Zhen Yang, Ming Ding, Qingsong Lv, Zhihuan Jiang, Zehai He, Yuyi Guo, Jinfeng Bai, and Jie Tang. GPT can solve mathematical problems without a calculator. ArXiv preprint, abs/2309.03241, 2023. URL https://arxiv.org/abs/2309.03241

[65]

ReAct: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023. URL https://openreview.net/pdf?id=WE_vluYUL-X

[66]

CrossFit: A few-shot learning challenge for cross-task generalization in NLP

Qinyuan Ye, Bill Yuchen Lin, and Xiang Ren. CrossFit: A few-shot learning challenge for cross-task generalization in NLP. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 7163–7189, 2021. doi:10.18653/v1/2021.emnlp-main.572. URL https://aclanthology.org/2021.emnlp-main.572

[67]

MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. MetaMath: Bootstrap your own mathematical questions for large language models. ArXiv preprint, abs/2309.12284, 2023. URL https://arxiv.org/abs/2309.12284

[68]

Scaling Relationship on Learning Mathematical Reasoning with Large Language Models

Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Chuanqi Tan, and Chang Zhou. Scaling relationship on learning mathematical reasoning with large language models. ArXiv preprint, abs/2308.01825, 2023. URL https://arxiv.org/abs/2308.01825

[69]

OPT: Open Pre-trained Transformer Language Models

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. OPT: Open pre-trained transformer language models. ArXiv preprint, abs/2205.01068, 2022. URL https://arxiv.org/abs/2205.01068

[70]

Progressive-hint prompting improves reasoning in large language models

Chuanyang Zheng, Zhengying Liu, Enze Xie, Zhenguo Li, and Yu Li. Progressive-hint prompting improves reasoning in large language models. ArXiv preprint, abs/2304.09797, 2023a. URL https://arxiv.org/abs/2304.09797

[71]

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. ArXiv preprint, abs/2306.05685, 2023b. URL https://arxiv.org/abs/2306.05685

[72]

AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models

Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. AGIEval: A human-centric benchmark for evaluating foundation models. ArXiv preprint, abs/2304.06364, 2023. URL https://arxiv.org/abs/2304.06364

[73]

Solving challenging math word problems using GPT-4 code interpreter with code-based self-verification

Aojun Zhou, Ke Wang, Zimu Lu, Weikang Shi, Sichun Luo, Zipeng Qin, Shaoqing Lu, Anya Jia, Linqi Song, Mingjie Zhan, et al. Solving challenging math word problems using GPT-4 code interpreter with code-based self-verification. ArXiv preprint, abs/2308.07921, 2023a. URL https://arxiv.org/abs/2308.07921

[74]

LIMA: Less Is More for Alignment

Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. LIMA: Less is more for alignment. ArXiv preprint, abs/2305.11206, 2023b. URL https://arxiv.org/abs/2305.11206

[75]

Least-to-most prompting enables complex reasoning in large language models

Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Olivier Bousquet, Quoc Le, and Ed Chi. Least-to-most prompting enables complex reasoning in large language models. International Conference on Learning Representations (ICLR), 2023c. URL https://openreview.net/pdf?id=WZH7099tgfM
