ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving
Pith reviewed 2026-05-19 09:14 UTC · model grok-4.3
The pith
ToRA agents combine natural language reasoning with external tool calls to solve complex math problems, reaching accuracy levels that open models had not previously attained.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ToRA models are trained on curated interactive tool-use trajectories using imitation learning and output space shaping, so that they integrate natural language reasoning with external tools. They outperform open-source baselines on ten mathematical reasoning datasets by 13 to 19 percent absolute on average. ToRA-7B reaches 44.6 percent on the MATH dataset, while ToRA-Code-34B exceeds 50 percent and surpasses GPT-4's chain-of-thought result.
What carries the argument
The Tool-integrated Reasoning Agent that interleaves language reasoning steps with calls to external tools such as code interpreters and symbolic solvers.
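To make the interleaving concrete, here is a minimal sketch of one way such a loop could look. It is an illustration under stated assumptions, not the paper's implementation: `scripted_model` is a hypothetical stand-in for a language model, `run_python` is a toy executor that is not safe for untrusted code, and the `[code]` / `[output]` markers are invented for this example.

```python
import contextlib
import io


def run_python(code: str) -> str:
    """Run a snippet and return what it printed (toy executor, not a real sandbox)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})
    return buffer.getvalue().strip()


def scripted_model(context: str) -> dict:
    """Hypothetical stand-in for a language model: picks the next step from the context."""
    if "[output]" not in context:
        # No tool result yet: write a rationale and request a computation.
        return {
            "rationale": "Summing the first 100 positive integers is easiest with code.",
            "code": "print(sum(range(1, 101)))",
        }
    # A tool result is available: read it back and conclude.
    last_output = context.rsplit("[output]", 1)[1].strip()
    return {"rationale": f"The tool returned {last_output}.", "answer": last_output}


def solve(question: str, model=scripted_model, max_rounds: int = 3) -> str:
    """Alternate between model rationales and tool executions until an answer appears."""
    context = question
    for _ in range(max_rounds):
        step = model(context)
        context += "\n" + step["rationale"]
        if "answer" in step:
            return step["answer"]
        output = run_python(step["code"])
        # Feed the tool output back so the next rationale can build on it.
        context += f"\n[code]\n{step['code']}\n[output]\n{output}"
    raise RuntimeError("no final answer within the round budget")


final_answer = solve("What is the sum of the first 100 positive integers?")
```

The point of the sketch is the control flow: each tool result is appended to the context before the next reasoning step, so language and computation can alternate for several rounds.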
If this is right
- Open models as small as 7B parameters can exceed the math performance of 70B models trained without tools.
- An open-source model can reach above 50 percent accuracy on the MATH benchmark for the first time.
- Hybrid reasoning that mixes text and tool calls becomes competitive with closed models using programs on the same tasks.
Where Pith is reading between the lines
- The same trajectory-curation plus imitation approach could be tested on non-math reasoning domains that also benefit from precise external verification.
- Future work might examine whether the learned tool-calling patterns remain effective when the underlying solver libraries are updated or replaced.
- Scaling the method to larger base models or richer tool sets might further close the gap with frontier closed systems.
Load-bearing premise
The interactive tool-use trajectories collected during data curation supply high-quality supervision that imitation learning can generalize to unseen math problems.
What would settle it
Training a ToRA-style model on the same trajectories and measuring no accuracy gain over a strong baseline on a fresh set of competition math problems not seen during trajectory collection.
Original abstract
Large language models have made significant progress in various language tasks, yet they still struggle with complex mathematics. In this paper, we propose ToRA, a series of Tool-integrated Reasoning Agents designed to solve challenging mathematical problems by seamlessly integrating natural language reasoning with the utilization of external tools (e.g., computation libraries and symbolic solvers), thereby amalgamating the analytical prowess of language and the computational efficiency of tools. To train ToRA, we curate interactive tool-use trajectories on mathematical datasets, apply imitation learning on the annotations, and propose output space shaping to further refine models' reasoning behavior. As a result, ToRA models significantly outperform open-source models on 10 mathematical reasoning datasets across all scales with 13%-19% absolute improvements on average. Notably, ToRA-7B reaches 44.6% on the competition-level dataset MATH, surpassing the best open-source model WizardMath-70B by 22% absolute. ToRA-Code-34B is also the first open-source model that achieves an accuracy exceeding 50% on MATH, which significantly outperforms GPT-4's CoT result, and is competitive with GPT-4 solving problems with programs. Additionally, we conduct a comprehensive analysis of the benefits and remaining challenges of tool interaction for mathematical reasoning, providing valuable insights for future research.
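The curation step can be pictured with a small sketch. The abstract does not spell out how trajectories are selected, so the rule below (keep a trajectory only if its final answer matches the reference answer) is an assumption made for illustration. The `Trajectory` class, the `normalize` helper, and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    question: str
    steps: list          # interleaved rationale / code / tool-output strings
    final_answer: str


def normalize(answer: str) -> str:
    """Crude normalization: trim whitespace, drop a trailing period and thousands separators."""
    return answer.strip().rstrip(".").replace(",", "")


def keep_correct(trajectories, references):
    """Keep trajectories whose final answer matches the reference for that question."""
    kept = []
    for t in trajectories:
        ref = references.get(t.question)
        if ref is not None and normalize(t.final_answer) == normalize(ref):
            kept.append(t)
    return kept


# Hypothetical reference answers and sampled trajectories.
references = {"2+2?": "4", "10*100?": "1000"}
candidates = [
    Trajectory("2+2?", ["Add the numbers.", "print(2 + 2)", "4"], "4"),
    Trajectory("2+2?", ["Guess."], "5"),
    Trajectory("10*100?", ["Multiply.", "print(10 * 100)", "1000"], "1,000."),
]
curated = keep_correct(candidates, references)
```

A filter of this kind only checks the final answer, so a trajectory with flawed intermediate reasoning can still pass. That is one reason the quality of the supervision is listed above as the load-bearing premise.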
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ToRA, a series of Tool-integrated Reasoning Agents that combine natural language reasoning with external tools such as computation libraries and symbolic solvers to solve complex mathematical problems. Training involves curating interactive tool-use trajectories on datasets like GSM8K and MATH, followed by imitation learning and output space shaping. The models demonstrate substantial gains over open-source baselines on 10 mathematical reasoning datasets: ToRA-7B achieves 44.6% accuracy on MATH (surpassing WizardMath-70B by 22% absolute), and ToRA-Code-34B exceeds 50% on MATH, which outperforms GPT-4's CoT result and is competitive with GPT-4 solving problems with programs.
Significance. If the results hold and the improvements stem from robust tool integration rather than memorization of trajectory patterns, this would be a significant contribution to mathematical reasoning in LLMs by showing how tool use can be effectively integrated via imitation learning. The work highlights the potential for open-source models to approach or exceed proprietary model performance on competition-level math problems and provides analysis of tool interaction benefits and challenges.
major comments (2)
- [Section 3.2] The trajectory collection process, which uses GPT-4 prompting on training splits followed by filtering and output space shaping, is described only at a high level. Without an ablation that removes the specific tool-call format while preserving the reasoning content, it remains unclear whether the reported gains (13-19% absolute on average, 22% absolute over WizardMath-70B on MATH) reflect generalizable tool integration or exploitation of recurring syntactic patterns in the curated trajectories.
- [Experimental results] Headline performance numbers (e.g., 44.6% and >50% on MATH) are reported without error bars, multiple random seeds, or full details on training hyperparameters and data splits, which limits assessment of the robustness and reproducibility of the central performance claims across the ten datasets.
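The ablation requested in the first major comment could be set up along the following lines. This is a hedged sketch of one possible data transformation, not a procedure taken from the manuscript: the `(kind, text)` trajectory representation, the delimiter strings, and the function names are all hypothetical.

```python
# Each trajectory is a list of (kind, text) pairs; kinds are "text", "code", "output".
trajectory = [
    ("text", "Compute the product with code."),
    ("code", "print(12 * 13)"),
    ("output", "156"),
    ("text", "So the answer is 156."),
]


def render(segments, delimiters):
    """Serialize a trajectory, wrapping code and output segments in the given delimiters."""
    parts = []
    for kind, text in segments:
        if kind == "text":
            parts.append(text)
        else:
            open_tag, close_tag = delimiters[kind]
            parts.append(f"{open_tag}{text}{close_tag}")
    return "\n".join(parts)


def strip_tool_syntax(segments):
    """Ablation variant: drop code and restate each tool output as a plain sentence."""
    result = []
    for kind, text in segments:
        if kind == "text":
            result.append(("text", text))
        elif kind == "output":
            result.append(("text", f"The computation gives {text}."))
    return result


# Hypothetical delimiter sets: the training format and a renamed variant.
ORIGINAL = {"code": ("<code>", "</code>"), "output": ("<output>", "</output>")}
RENAMED = {"code": ("[[run]]", "[[/run]]"), "output": ("[[result]]", "[[/result]]")}

with_tools = render(trajectory, ORIGINAL)                        # baseline format
renamed_syntax = render(trajectory, RENAMED)                     # same content, new syntax
no_tool_syntax = render(strip_tool_syntax(trajectory), ORIGINAL)  # reasoning only
```

Training one model per variant and comparing accuracy would then separate the effect of the surface tool-call format from the effect of the reasoning content. Note that the stripped variant still carries the computed result, so it tests syntax removal rather than tool removal.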
minor comments (2)
- [Abstract] The statement that ToRA-Code-34B is 'the first open-source model' to exceed 50% on MATH should explicitly list the prior open-source models considered in the comparison.
- [Tables] Ensure all result tables include standard deviations or confidence intervals alongside accuracy metrics to support the cross-scale and cross-dataset claims.
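As an illustration of the kind of interval the comments above ask for, the sketch below computes a Wilson score interval for a single accuracy figure. It assumes the 44.6% MATH result is measured on a test set of 5,000 problems, a size that is not stated on this page. The interval reflects only sampling noise from the finite test set, not variance across training seeds.

```python
import math


def wilson_interval(accuracy: float, n: int, z: float = 1.96):
    """Wilson score interval for a proportion measured on n independent items."""
    denom = 1 + z**2 / n
    center = (accuracy + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Assumed inputs: reported accuracy and an assumed test-set size of 5,000.
low, high = wilson_interval(0.446, 5000)
print(f"95% interval: {low:.3f} to {high:.3f}")
```

Under these assumptions the interval is roughly 1.4 points on either side of the reported accuracy. That is small relative to the 22-point gap to WizardMath-70B, but it says nothing about run-to-run variation.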
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which helps clarify the contributions of tool integration and strengthens the reporting of our results. We address each major comment below and describe the revisions we will incorporate.
Point-by-point responses
- Referee: [Section 3.2] The trajectory collection process, which uses GPT-4 prompting on training splits followed by filtering and output space shaping, is described only at a high level. Without an ablation that removes the specific tool-call format while preserving the reasoning content, it remains unclear whether the reported gains (13-19% absolute on average, 22% absolute over WizardMath-70B on MATH) reflect generalizable tool integration or exploitation of recurring syntactic patterns in the curated trajectories.
  Authors: We appreciate the referee's point on distinguishing tool integration from potential pattern memorization. Section 3.2 describes the GPT-4-based trajectory curation and output space shaping, while Section 5 analyzes tool-use benefits through case studies and error breakdowns showing improved handling of computation and symbolic steps. To directly address the concern, we will add a new ablation in the revised manuscript: we will generate parallel trajectory sets that preserve reasoning content but modify or remove the specific tool-call syntax, then compare the resulting model performance to isolate the contribution of the tool format.
  Revision: yes
- Referee: [Experimental results] Headline performance numbers (e.g., 44.6% and >50% on MATH) are reported without error bars, multiple random seeds, or full details on training hyperparameters and data splits, which limits assessment of the robustness and reproducibility of the central performance claims across the ten datasets.
  Authors: We agree that expanded experimental details improve reproducibility. In the revision we will add full specifications of training hyperparameters, data splits, and implementation choices. For variance, we will report any repeated-run statistics we can obtain and discuss the consistency of gains across scales and datasets. However, running multiple independent random seeds for every model size and all ten datasets is computationally prohibitive given the resources needed to train models of up to 34B parameters; we will explicitly note this limitation.
  Revision: partial
Circularity Check
No circularity: empirical results on external benchmarks
The paper presents an empirical method: curating tool-use trajectories via GPT-4 on training splits of public datasets, applying imitation learning, and reporting accuracy on held-out test sets of MATH, GSM8K and eight other standard benchmarks. All performance numbers (e.g., ToRA-7B at 44.6% on MATH) are direct measurements against independent external baselines such as WizardMath-70B and GPT-4. No equations, uniqueness theorems, fitted parameters renamed as predictions, or self-citation chains appear in the provided text; the central claim is therefore a set of falsifiable experimental outcomes rather than a derivation that reduces to its own inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- model scale choices (7B, 34B)
axioms (1)
- domain assumption: Imitation learning on curated tool-use trajectories transfers to improved performance on unseen mathematical problems.
Forward citations
Cited by 18 Pith papers
- Fine-Tuning Small Reasoning Models for Quantum Field Theory
  Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.
- FaSTA$^*$: Fast-Slow Toolpath Agent with Subroutine Mining for Efficient Multi-turn Image Editing
  FaSTA* combines LLM fast planning with A* search and inductive subroutine mining to create an efficient agent for multi-turn image editing tasks.
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
  DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
- When to Trust Tools? Adaptive Tool Trust Calibration For Tool-Integrated Math Reasoning
  ATTC reduces 'Tool Ignored' errors in tool-integrated reasoning by adaptively trusting tool results according to generated code confidence, yielding 4.1-7.5% gains across models and datasets.
- Beyond the Final Answer: Evaluating the Reasoning Trajectories of Tool-Augmented Agents
  TRACE is a reference-free multi-dimensional evaluation framework for tool-augmented LLM reasoning trajectories that uses an evidence bank and is validated on a new meta-evaluation dataset of flawed trajectories.
- Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
  Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
- ToolRL: Reward is All Tool Learning Needs
  A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across benchmarks.
- Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs
  Step-DPO performs preference optimization on individual reasoning steps rather than complete answers, producing nearly 3% accuracy gains on MATH for 70B+ parameter models with 10K preference pairs.
- Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
  Math-Shepherd is an automatically trained process reward model that scores solution steps to verify and reinforce LLMs, lifting Mistral-7B from 77.9% to 89.1% on GSM8K and 28.6% to 43.5% on MATH.
- LLMs with in-context learning for Algorithmic Theoretical Physics
  Frontier LLMs with in-context learning and CAS integration solve most algorithmic tasks in theoretical physics when supplied with worked examples.
- UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
  UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.
- Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
  The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
- DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence
  DeepSeek-Coder open-source models trained on 2T code tokens with fill-in-the-blank pretraining achieve SOTA results among open models and surpass closed-source Codex and GPT-3.5 on code benchmarks.
- Rethinking Wireless Communications through Formal Mathematical AI Reasoning
  Proposes a three-layer framework using formal AI reasoning for verification, derivation, and discovery in wireless communications theory.
- Adaptive Multi-Expert Reasoning via Difficulty-Aware Routing and Uncertainty-Guided Aggregation
  AMR uses difficulty-aware routing and uncertainty-guided aggregation across three experts plus a neural verifier to reach 75.28% accuracy on GSM8K without synthetic training data.
- Red Skills or Blue Skills? A Dive Into Skills Published on ClawHub
  Analysis of ClawHub shows language-based functional divides in agent skills, with over 30% flagged suspicious and submission-time documentation enabling 73% accurate risk prediction.
- DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
  DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.
- A Survey on the Memory Mechanism of Large Language Model based Agents
  A systematic review of memory designs, evaluation methods, applications, limitations, and future directions for LLM-based agents.