Recognition: 3 theorem links · Lean Theorem
ReTool: Reinforcement Learning for Strategic Tool Use in LLMs
Pith reviewed 2026-05-13 18:38 UTC · model grok-4.3
The pith
ReTool trains large language models to dynamically interleave code execution into reasoning chains using only task-outcome rewards.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ReTool enables dynamic interleaving of real-time code execution within natural language reasoning processes and employs an automated RL paradigm that performs policy rollouts with multi-turn code execution, using task outcomes as rewards to let the model learn optimal tool invocation patterns without human priors on timing or method.
What carries the argument
The ReTool training pipeline: supervised fine-tuning on synthetic cold-start traces of code-augmented reasoning, followed by RL that optimizes task-success rewards over multi-turn tool-use rollouts.
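To make the rollout idea concrete, here is a minimal sketch of interleaving generated text with interpreter feedback. It is an illustration under stated assumptions, not the paper's implementation: the `<code>` and `<answer>` tags appear in the paper's data-curation prompt, but the `<interpreter>` tag, the `rollout` and `run_snippet` helpers, and the scripted stand-in policy are all hypothetical, and `exec` is not a real sandbox.

```python
import contextlib
import io
import re

# <code> and <answer> come from the paper's prompt; <interpreter> is assumed here.
CODE_RE = re.compile(r"<code>(.*?)</code>", re.DOTALL)


def run_snippet(source: str) -> str:
    """Execute a snippet and return what it printed (NOT a real sandbox)."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(source, {})
    except Exception as exc:
        # Errors are returned as text so the policy could react to them.
        return f"{type(exc).__name__}: {exc}"
    return buffer.getvalue().strip()


def rollout(policy, prompt: str, max_turns: int = 4) -> str:
    """Alternate policy segments and interpreter output until an answer appears."""
    transcript = prompt
    for _ in range(max_turns):
        segment = policy(transcript)
        transcript += segment
        if "<answer>" in segment:
            break
        match = CODE_RE.search(segment)
        if match is None:
            break
        result = run_snippet(match.group(1))
        transcript += f"<interpreter>{result}</interpreter>"
    return transcript


def scripted_policy(transcript: str) -> str:
    """Hypothetical stand-in for an LLM: emits code first, then an answer."""
    feedback = re.findall(r"<interpreter>(.*?)</interpreter>", transcript, re.DOTALL)
    if not feedback:
        return "Let me compute this.<code>print(sum(range(1, 101)))</code>"
    return f"The sum is {feedback[-1]}.<answer>\\boxed{{{feedback[-1]}}}</answer>"


trace = rollout(scripted_policy, "What is 1 + 2 + ... + 100? ")
print(trace)
```

In an RL setting the finished transcript would be scored once, at the end, which is what lets the policy decide for itself when a tool call is worth making.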
If this is right
- A 32B model reaches 67 percent accuracy on AIME after 400 training steps compared with 40 percent after 1080 steps for text-only RL.
- Extended training yields 72.5 percent accuracy, exceeding o1-preview by 27.9 percentage points.
- The model develops emergent behaviors such as code self-correction during training.
- Outcome-driven reinforcement learning produces efficient autonomous discovery of tool-invocation strategies for complex mathematical reasoning.
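The headline figures above can be cross-checked with simple arithmetic. The sketch below uses only numbers quoted on this page. Note that the abstract states the o1-preview margin as "27.9%" without saying whether it is absolute or relative, so both readings are computed; the implied o1-preview scores are derived values, not figures reported in the source.

```python
# Figures quoted in the summary above.
retool_steps, baseline_steps = 400, 1080
retool_acc, baseline_acc = 67.0, 40.0
extended_acc, margin = 72.5, 27.9

# Fraction of the baseline's training steps that ReTool used.
step_ratio = retool_steps / baseline_steps

# Absolute accuracy gap between ReTool and the text-only RL baseline.
gap_points = retool_acc - baseline_acc

# Implied o1-preview accuracy if the margin is in percentage points.
implied_if_points = extended_acc - margin

# Implied o1-preview accuracy if the margin is a relative improvement.
implied_if_relative = extended_acc / (1 + margin / 100)

print(f"steps used vs. baseline: {step_ratio:.1%}")
print(f"accuracy gap: {gap_points:.1f} points")
print(f"implied o1-preview (absolute reading): {implied_if_points:.1f}%")
print(f"implied o1-preview (relative reading): {implied_if_relative:.1f}%")
```

The two readings differ by roughly twelve points, which is why the unit of the margin matters when comparing against independently reported o1-preview scores.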
Where Pith is reading between the lines
- The same outcome-reward approach could be applied to other external tools such as symbolic solvers or data-analysis libraries.
- Autonomous tool mastery might lower the need for hand-crafted examples when teaching models structured problem-solving skills.
- Hybrid language-plus-tool systems may scale to domains like physics modeling or algorithmic optimization that mix symbolic and numerical steps.
Load-bearing premise
Task-outcome rewards alone are enough to teach the model the right times and ways to call code tools without any human guidance on tool-use patterns.
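A reward that depends only on the task outcome can be very small. The following sketch assumes the final answer is wrapped in `\boxed{...}` (as in the paper's data-curation prompt) and uses +1 and -1 as reward values; those values, the exact-string comparison, and the function names are assumptions for illustration, since this page does not spell out the reward definition.

```python
import re

# Matches \boxed{...} with no nested braces (a simplification).
BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_answer(transcript: str):
    """Return the content of the last boxed expression, or None if absent."""
    found = BOXED_RE.findall(transcript)
    return found[-1].strip() if found else None


def outcome_reward(transcript: str, ground_truth: str) -> float:
    """Score the whole rollout by its final answer only (assumed +1 / -1)."""
    predicted = extract_answer(transcript)
    if predicted is not None and predicted == ground_truth.strip():
        return 1.0
    return -1.0


print(outcome_reward("reasoning...<answer>\\boxed{204}</answer>", "204"))
print(outcome_reward("reasoning with no final answer", "204"))
```

Nothing in this signal says whether or when code was called, which is the sense in which tool-use timing has to be learned rather than specified.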
What would settle it
ReTool-trained models showing no accuracy advantage or no reduction in required training steps over text-only RL baselines when evaluated on AIME problems that require computation.
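One way to run such a check is a paired bootstrap over per-problem outcomes for the two models. The sketch below is a generic statistical recipe, not a procedure from the paper; the outcome lists are placeholders invented for illustration and the helper names are hypothetical.

```python
import random


def accuracy(outcomes):
    """Fraction of problems solved (outcomes are 0 or 1)."""
    return sum(outcomes) / len(outcomes)


def paired_bootstrap_gap(tool, text, n_resamples=2000, seed=0):
    """Return (observed gap, lower, upper) with a 95% percentile interval."""
    assert len(tool) == len(text)
    rng = random.Random(seed)
    n = len(tool)
    gaps = []
    for _ in range(n_resamples):
        # Resample problems with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        gaps.append(sum(tool[i] - text[i] for i in idx) / n)
    gaps.sort()
    lower = gaps[int(0.025 * n_resamples)]
    upper = gaps[int(0.975 * n_resamples) - 1]
    return accuracy(tool) - accuracy(text), lower, upper


# Placeholder per-problem outcomes (1 = solved); NOT results from the paper.
tool_outcomes = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
text_outcomes = [1, 0, 0, 1, 0, 0, 1, 0, 1, 0]

gap, lo, hi = paired_bootstrap_gap(tool_outcomes, text_outcomes)
print(f"gap={gap:.2f}, 95% interval=({lo:.2f}, {hi:.2f})")
```

An interval that comfortably includes zero on computation-heavy problems would count against the claimed advantage.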
Original abstract
While reasoning models (e.g., DeepSeek R1) trained with reinforcement learning (RL) excel in textual reasoning, they struggle in scenarios requiring structured problem-solving, such as geometric reasoning, concise computation, or complex equation solving, areas where computational tools like code interpreters (CI) demonstrate distinct advantages. To bridge this gap, we propose ReTool, which enhances long-form reasoning with tool-integrated learning, including two key features: (1) dynamic interleaving of real-time code execution within natural language reasoning processes, and (2) an automated RL paradigm that allows policy rollouts with multi-turn real-time code execution and teaches the model when and how to invoke tools based on outcome feedback. ReTool employs a systematic training framework, beginning with synthetic cold-start data generation to produce code-augmented long-form reasoning traces for fine-tuning base models. Subsequent RL training leverages task outcomes as rewards to iteratively refine the model's tool use strategy, enabling autonomous discovery of optimal tool invocation patterns without human priors. Experiments on the challenging MATH Olympiad benchmark AIME demonstrate ReTool's superiority: our 32B model achieves 67% accuracy with 400 training steps, outperforming the text-based RL baseline (40% accuracy, 1080 steps) in efficiency and performance. Remarkably, ReTool-32B attains 72.5% accuracy in extended settings, surpassing OpenAI's o1-preview by 27.9%. Further analysis reveals emergent behaviors such as code self-correction, signaling an ''aha moment'' in which the model autonomously masters adaptive tool use. These findings highlight the promise of outcome-driven tool integration for advancing complex mathematical reasoning and offer new insights into hybrid neuro-symbolic systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ReTool, a two-stage framework for training LLMs on strategic tool use (specifically code interpreters) in long-form mathematical reasoning. It begins with synthetic cold-start data generation to produce code-augmented reasoning traces for supervised fine-tuning, followed by RL that uses only task-outcome rewards to enable dynamic interleaving of code execution and autonomous discovery of tool-invocation patterns. On the AIME benchmark, the 32B model reaches 67% accuracy after 400 RL steps (vs. 40% for a text-only RL baseline after 1080 steps) and 72.5% in extended settings, exceeding o1-preview by 27.9%; emergent behaviors such as code self-correction are reported.
Significance. If the empirical gains are reproducible and the 'no human priors' claim holds after full disclosure of the cold-start pipeline, the work would supply concrete evidence that outcome-only RL can produce adaptive neuro-symbolic reasoning strategies, with implications for hybrid systems that combine language models with external tools.
major comments (2)
- [Abstract] Abstract: The central claim that 'outcome feedback' alone enables 'autonomous discovery of optimal tool invocation patterns without human priors' is load-bearing for the novelty argument, yet the synthetic cold-start data generation step (which precedes RL) is described only at high level; without explicit details on prompting templates, few-shot examples, or code-interleaving heuristics used to create the traces, it is impossible to verify that the subsequent 400-step RL phase discovers rather than refines pre-embedded strategies.
- [Abstract] Abstract and Experiments section: The reported efficiency advantage (67% accuracy at 400 steps vs. 40% at 1080 steps for the text baseline) and the 72.5% extended-setting result are presented without rollout mechanics, reward-shaping details, variance across seeds, or exact baseline implementations; these omissions make it difficult to assess whether the performance gap is robust or sensitive to initialization from the synthetic traces.
minor comments (1)
- [Abstract] Abstract: The phrase 'signaling an ''aha moment''' contains inconsistent quotation marks that should be standardized for readability.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate the requested clarifications for improved transparency and reproducibility.
Point-by-point responses
-
Referee: [Abstract] Abstract: The central claim that 'outcome feedback' alone enables 'autonomous discovery of optimal tool invocation patterns without human priors' is load-bearing for the novelty argument, yet the synthetic cold-start data generation step (which precedes RL) is described only at high level; without explicit details on prompting templates, few-shot examples, or code-interleaving heuristics used to create the traces, it is impossible to verify that the subsequent 400-step RL phase discovers rather than refines pre-embedded strategies.
Authors: We agree that the current high-level description of the synthetic cold-start data generation limits verification of the novelty claim. In the revised manuscript we will expand the Methods section to include the exact prompting templates, few-shot examples, and code-interleaving heuristics used to generate the initial traces. This will make explicit that the cold-start supplies basic tool-use exposure while the subsequent RL stage, using only task-outcome rewards and no additional human-designed signals, enables further autonomous refinement and discovery of invocation patterns. revision: yes
-
Referee: [Abstract] Abstract and Experiments section: The reported efficiency advantage (67% accuracy at 400 steps vs. 40% at 1080 steps for the text baseline) and the 72.5% extended-setting result are presented without rollout mechanics, reward-shaping details, variance across seeds, or exact baseline implementations; these omissions make it difficult to assess whether the performance gap is robust or sensitive to initialization from the synthetic traces.
Authors: We acknowledge these omissions hinder reproducibility assessment. The revised manuscript will add a dedicated subsection detailing the multi-turn rollout mechanics for real-time code execution, confirm that rewards are strictly outcome-based with no shaping, report performance variance across multiple random seeds, and provide exact implementation specifications for the text-only RL baseline. These additions will allow readers to evaluate the robustness of the efficiency and accuracy gains. revision: yes
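The variance reporting promised above could be as simple as the following sketch. The per-seed accuracies are hypothetical placeholders, and the benchmark size of 30 problems is an assumption (one AIME year has 30 problems); neither comes from the paper. The second helper shows why seed variance matters: on a small benchmark, sampling noise alone is several points wide.

```python
import math
import statistics


def summarize_seeds(accuracies):
    """Mean and sample standard deviation of accuracy across training seeds."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)


def binomial_standard_error(accuracy, n_problems):
    """Standard error of an accuracy estimate measured on n_problems items."""
    return math.sqrt(accuracy * (1 - accuracy) / n_problems)


# Hypothetical per-seed accuracies; NOT values reported in the paper.
seed_accuracies = [0.63, 0.67, 0.70]
mean, spread = summarize_seeds(seed_accuracies)

# Assumed benchmark size of 30 problems.
se = binomial_standard_error(0.67, 30)

print(f"mean={mean:.3f}, std across seeds={spread:.3f}")
print(f"binomial standard error at 67% on 30 problems: {se:.3f}")
```

With a standard error near 0.086 at this benchmark size, a single-run difference of a few points would not be distinguishable from noise, whereas the reported 27-point gap over the text-only baseline is much larger than that.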
Circularity Check
No significant circularity; empirical pipeline rests on external benchmarks and standard RL rewards.
Full rationale
The paper describes an empirical training process: synthetic cold-start data generation for fine-tuning, followed by RL using only task-outcome rewards. Performance is measured on the external AIME benchmark with direct comparisons to baselines. No equations, derivations, or claims reduce by construction to fitted parameters, self-citations, or renamed inputs. The central assertions about autonomous tool-use discovery are supported by reported accuracies and observed behaviors rather than self-referential definitions or load-bearing self-citations.
Lean theorems connected to this paper
-
IndisputableMonolith.Cost.FunctionalEquation washburn_uniqueness_aczel · echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
enabling autonomous discovery of optimal tool invocation patterns without human priors
-
IndisputableMonolith.Foundation.DAlembert.Inevitability bilinear_family_forced · contradicts
CONTRADICTS: the theorem conflicts with this paper passage, or marks a claim that would need revision before publication.
synthetic cold-start data generation to produce code-augmented long-form reasoning traces
-
IndisputableMonolith.Foundation.DimensionForcing dimension_forced · unclear
UNCLEAR: Relation between the paper passage and the cited Recognition theorem.
ReTool-32B attains 72.5% accuracy... outperforming OpenAI's o1-preview by 27.9%
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 26 Pith papers
-
GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation
GEAR reshapes GRPO trajectory advantages using divergence signals from a ground-truth-conditioned teacher to create adaptive token- and segment-level credit regions.
-
RewardHarness: Self-Evolving Agentic Post-Training
RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.
-
Teaching Language Models to Think in Code
ThinC trains small models to reason primarily in code rather than natural language, outperforming tool-integrated baselines and even larger models on competition math benchmarks.
-
Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis
DataPRM is a new process reward model for data analysis agents that detects silent errors via environment interaction and ternary rewards, yielding 7-11% gains on benchmarks and further RL improvements.
-
Does RL Expand the Capability Boundary of LLM Agents? A PASS@(k,T) Analysis
RL expands the capability boundary of LLM agents on compositional tool-use tasks, shown by non-converging pass curves at large k with increasing T, while SFT regresses it and the effect is absent on simpler tasks.
-
Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning
COVERT generates verifiable synthetic tool-use environments for RL by validated trajectory synthesis and oracle-preserving augmentations, improving tool-use accuracy on BFCL v3 and ACEBench while remaining complementa...
-
Video-R1: Reinforcing Video Reasoning in MLLMs
Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.
-
ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents
ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.
-
PruneTIR: Inference-Time Tool Call Pruning for Effective yet Efficient Tool-Integrated Reasoning
PruneTIR prunes erroneous tool-call trajectories during LLM inference via three trigger-based components to raise Pass@1 accuracy and efficiency while shortening context.
-
SOD: Step-wise On-policy Distillation for Small Language Model Agents
SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.
-
Teaching Language Models to Think in Code
ThinC trains smaller language models to reason entirely in code after minimal NL planning, outperforming tool-integrated baselines and even much larger models on competition math benchmarks.
-
Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning
A training recipe for tool-integrated reasoning models achieves state-of-the-art open-source results on math benchmarks such as 96.7% and 99.2% on AIME 2025 at 4B and 30B scales by balancing tool-use trajectories and ...
-
See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection
ForeSight lets VLMs use low-level visual cues and mask-based visual feedback within an RL loop to reason more accurately, with the 7B model beating same-scale peers and some closed-source SOTA on a new benchmark.
-
JigsawRL: Assembling RL Pipelines for Efficient LLM Post-Training
JigsawRL achieves up to 1.85x higher throughput in LLM RL pipelines via pipeline multiplexing, sub-stage graphs, and look-ahead scheduling compared to prior systems.
-
Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.
-
AutoOR: Scalably Post-training LLMs to Autoformalize Operations Research Problems
AutoOR uses synthetic data generation and RL post-training with solver feedback to enable 8B LLMs to autoformalize linear, mixed-integer, and non-linear OR problems, matching larger models on benchmarks.
-
When to Trust Tools? Adaptive Tool Trust Calibration For Tool-Integrated Math Reasoning
ATTC reduces 'Tool Ignored' errors in tool-integrated reasoning by adaptively trusting tool results according to generated code confidence, yielding 4.1-7.5% gains across models and datasets.
-
Training LLMs for Multi-Step Tool Orchestration with Constrained Data Synthesis and Graduated Rewards
A constrained-synthesis RL method with graduated rewards for atomic validity and orchestration consistency improves LLM turn accuracy on multi-step tool benchmarks and transfers to new API sets.
-
Reinforced Collaboration in Multi-Agent Flow Networks
MANGO optimizes multi-agent LLM workflows via flow networks, RL, and textual gradients, delivering up to 12.8% higher performance and 47.4% better efficiency while generalizing to new domains.
-
E3-TIR: Enhanced Experience Exploitation for Tool-Integrated Reasoning
E3-TIR integrates expert prefixes, guided branches, and self-exploration via mix policy optimization to deliver 6% better tool-use performance with under 10% of the usual synthetic data and 1.46x ROI.
-
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.
-
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
-
SepSeq: A Training-Free Framework for Long Numerical Sequence Processing in LLMs
SepSeq improves LLM accuracy on long numerical sequences by an average of 35.6% by inserting separator tokens that serve as attention sinks while cutting token usage by 16.4%.
-
A Survey of Context Engineering for Large Language Models
The survey organizes Context Engineering into retrieval, processing, management, and integrated systems like RAG and multi-agent setups while identifying an asymmetry where LLMs handle complex inputs well but struggle...
-
A Brief Overview: Agentic Reinforcement Learning In Large Language Models
The paper surveys the conceptual foundations, methodological innovations, challenges, and future directions of agentic reinforcement learning frameworks that embed cognitive capabilities like meta-reasoning and self-r...
-
A Brief Overview: Agentic Reinforcement Learning In Large Language Models
This review synthesizes conceptual foundations, methods, challenges, and future directions for agentic reinforcement learning in large language models.
Figure 8: Template Prompt for Data Curation
- Identify sections where code execution could speed up the reasoning process or make the calculation more accurate
- Replace the manual calculation steps with code snippets and the corresponding interpreter's execution results
- Keep the logical flow of the reasoning process intact, including any failed exploration attempts that were part of the initial process
- The code snippets should be complete scripts, including necessary imports, and should not contain markdown symbols like <code> ```python code snippet ``` </code>
- Outputs in the code snippets must explicitly call the print function
- Execution results should match the model's output exactly, with no extra or missing tokens
- If the Original Thinking Process does not include an <answer> section at the end, please add it in the Revised Thinking Process: <answer> \boxed{’The final answer goes here.’} </answer>
Revised Thinking Process (With code interpreter’s support):
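Several of the template rules above are mechanically checkable. The sketch below is a hypothetical validator for a curated trace, not part of the paper's pipeline: it assumes code sits in `<code>` tags followed by recorded output in `<interpreter>` tags (the latter tag name is an assumption), and it uses `exec` purely for illustration where a real pipeline would need a sandbox.

```python
import contextlib
import io
import re

FENCE = "`" * 3  # markdown fence marker, built indirectly for readability
PAIR_RE = re.compile(
    r"<code>(.*?)</code>\s*<interpreter>(.*?)</interpreter>", re.DOTALL
)


def captured_stdout(source: str) -> str:
    """Run a snippet and return its printed output (illustration only)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, {})
    return buffer.getvalue().strip()


def check_trace(trace: str) -> list:
    """Return a list of rule violations; an empty list means the trace passes."""
    problems = []
    pairs = PAIR_RE.findall(trace)
    if not pairs:
        problems.append("no code/interpreter pair found")
    for code, recorded in pairs:
        if FENCE in code:
            problems.append("code contains markdown fence")
            continue
        if "print(" not in code:
            problems.append("code never calls print")
            continue
        try:
            actual = captured_stdout(code)
        except Exception:
            problems.append("code raised an exception")
            continue
        if actual != recorded.strip():
            problems.append("recorded output differs from execution")
    if "<answer>" not in trace:
        problems.append("missing <answer> section")
    return problems


good = ("We need 12 squared.<code>print(12 ** 2)</code>"
        "<interpreter>144</interpreter><answer>\\boxed{144}</answer>")
bad = "<code>x = 12 ** 2</code><interpreter>144</interpreter>"

print(check_trace(good))
print(check_trace(bad))
```

A filter of this kind would only confirm that traces are well formed; it says nothing about how much tool-use strategy the template itself encodes, which is the referee's concern.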