arxiv: 2504.11536 · v2 · submitted 2025-04-15 · 💻 cs.CL · cs.AI

Recognition: 3 theorem links · Lean Theorem

ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

Jiazhan Feng , Shijue Huang , Xingwei Qu , Ge Zhang , Yujia Qin , Baoquan Zhong , Chengquan Jiang , Jinxin Chi , Wanjun Zhong


Pith reviewed 2026-05-13 18:38 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords reinforcement learning · tool use · large language models · mathematical reasoning · code interpreter · AIME benchmark · hybrid reasoning


The pith

ReTool trains large language models to dynamically interleave code execution into reasoning chains using only task-outcome rewards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ReTool as a way to overcome the limits of text-only reasoning models on problems that benefit from structured computation. It begins with synthetic data to create initial traces that mix natural language steps with code calls, then applies reinforcement learning where the model receives reward solely from whether the final answer is correct. This setup lets the model discover on its own when and how to invoke a code interpreter during multi-turn rollouts. A reader would care because the approach promises better performance on hard math benchmarks with substantially fewer training steps than pure language reinforcement learning.

Core claim

ReTool enables dynamic interleaving of real-time code execution within natural language reasoning processes and employs an automated RL paradigm that performs policy rollouts with multi-turn code execution, using task outcomes as rewards to let the model learn optimal tool invocation patterns without human priors on timing or method.

What carries the argument

The ReTool training pipeline of synthetic cold-start code-augmented reasoning traces followed by RL optimization on task-success rewards for multi-turn tool-use rollouts.
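
The multi-turn rollout described above can be sketched as a loop that alternates between text generation and code execution. This is a minimal illustration under stated assumptions, not the authors' implementation: the `<interpreter>` tag, the `toy_policy` stand-in for the model, and the in-process `exec`-based interpreter are all hypothetical choices made for the example. A real system would call an LLM and a sandboxed interpreter.

```python
import contextlib
import io
import re

CODE_RE = re.compile(r"<code>(.*?)</code>", re.DOTALL)


def run_code(source: str) -> str:
    """Execute a snippet and capture stdout. NOT a sandbox; illustration only."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(source, {})
    except Exception as exc:
        # Errors are returned as text so a policy could react to them.
        return f"{type(exc).__name__}: {exc}"
    return buffer.getvalue().strip()


def toy_policy(context: str) -> str:
    """Hypothetical stand-in for the LLM: one code call, then a final answer."""
    if "<interpreter>" not in context:
        return "Let me compute this.\n<code>print(sum(range(1, 101)))</code>"
    results = re.findall(r"<interpreter>(.*?)</interpreter>", context, re.DOTALL)
    return f"<answer>{results[-1].strip()}</answer>"


def rollout(prompt: str, policy, max_turns: int = 4) -> str:
    """Alternate text generation and code execution until an answer appears."""
    context = prompt
    for _ in range(max_turns):
        segment = policy(context)
        context += "\n" + segment
        if "<answer>" in segment:
            break
        match = CODE_RE.search(segment)
        if match:
            # Append interpreter output so the next turn can condition on it.
            context += f"\n<interpreter>{run_code(match.group(1))}</interpreter>"
    return context


trajectory = rollout("What is 1 + 2 + ... + 100?", toy_policy)
print(trajectory)
```

Feeding execution errors back into the context, rather than aborting, is what would give a trained policy the chance to repair its own code on a later turn.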

If this is right

A 32B model reaches 67 percent accuracy on AIME after 400 training steps compared with 40 percent after 1080 steps for text-only RL.
In extended settings, ReTool-32B reaches 72.5 percent accuracy, exceeding o1-preview by 27.9 percentage points.
The model develops emergent behaviors such as code self-correction during training.
Outcome-driven reinforcement learning produces efficient autonomous discovery of tool-invocation strategies for complex mathematical reasoning.
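
The headline numbers can be checked with simple arithmetic. The sketch below uses only figures quoted in the abstract and assumes the 27.9 margin over o1-preview is an absolute difference in percentage points; the variable names are illustrative.

```python
# Figures quoted in the abstract (percent accuracy on AIME, RL training steps).
retool_acc, retool_steps = 67.0, 400
text_rl_acc, text_rl_steps = 40.0, 1080
retool_extended_acc, margin_over_o1_preview = 72.5, 27.9

# Absolute accuracy gain over the text-only RL baseline, in percentage points.
gain_points = retool_acc - text_rl_acc
# Share of the baseline's step budget that ReTool used.
step_fraction = retool_steps / text_rl_steps
# o1-preview accuracy implied by the stated margin, if the margin is absolute.
implied_o1_preview = retool_extended_acc - margin_over_o1_preview

print(f"gain: {gain_points:.1f} points")
print(f"steps used: {step_fraction:.0%} of baseline")
print(f"implied o1-preview accuracy: {implied_o1_preview:.1f}%")
```

Under that reading, ReTool gains 27 points while using roughly 37% of the baseline's training steps, and the implied o1-preview accuracy is 44.6%.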

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same outcome-reward approach could be applied to other external tools such as symbolic solvers or data-analysis libraries.
Autonomous tool mastery might lower the need for hand-crafted examples when teaching models structured problem-solving skills.
Hybrid language-plus-tool systems may scale to domains like physics modeling or algorithmic optimization that mix symbolic and numerical steps.

Load-bearing premise

Task-outcome rewards alone are enough to teach the model the right times and ways to call code tools without any human guidance on tool-use patterns.
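
To make the premise concrete, an outcome-only reward can be sketched as a function that looks at nothing but the final boxed answer. This is an illustrative assumption rather than the paper's code: the +1/-1 reward values, the exact-string comparison, and the helper names `extract_boxed` and `outcome_reward` are hypothetical.

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    r"""Return the content of the last \boxed{...} in text, handling nested braces."""
    marker = r"\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth, chars = 1, []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
        i += 1
    return None  # unbalanced braces: treat as no answer


def outcome_reward(trajectory: str, ground_truth: str) -> float:
    """Assumed scheme: +1 if the final boxed answer matches the reference, else -1."""
    predicted = extract_boxed(trajectory)
    if predicted is not None and predicted == ground_truth.strip():
        return 1.0
    return -1.0


print(outcome_reward(r"... <answer> \boxed{204} </answer>", "204"))
print(outcome_reward("I could not finish the problem.", "204"))
```

Nothing in this signal says when or how code was called, which is why the premise is load-bearing: any tool-use behavior has to be learned indirectly through answer correctness.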

What would settle it

ReTool-trained models showing no accuracy advantage or no reduction in required training steps over text-only RL baselines when evaluated on AIME problems that require computation.

Original abstract

While reasoning models (e.g., DeepSeek R1) trained with reinforcement learning (RL), excel in textual reasoning, they struggle in scenarios requiring structured problem-solving, such as geometric reasoning, concise computation, or complex equation solving-areas where computational tools like code interpreters (CI) demonstrate distinct advantages. To bridge this gap, we propose ReTool, which enhances long-form reasoning with tool-integrated learning, including two key features: (1) dynamic interleaving of real-time code execution within natural language reasoning processes, and (2) an automated RL paradigm that allows policy rollouts with multi-turn real-time code execution and teaches the model in learning when and how to invoke tools based on outcome feedback. ReTool employs a systematic training framework, beginning with synthetic cold-start data generation to produce code-augmented long-form reasoning traces for fine-tuning base models. Subsequent RL training leverages task outcomes as rewards to iteratively refine the model's tool use strategy, enabling autonomous discovery of optimal tool invocation patterns without human priors. Experiments on the challenging MATH Olympiad benchmark AIME demonstrate ReTool's superiority: Our 32B model achieves 67% accuracy with 400 training steps, outperforming text-based RL baseline (40% accuracy, 1080 steps) in efficiency and performance. Remarkably, ReTool-32B attains 72.5% accuracy in extended settings, surpassing OpenAI's o1-preview by 27.9%. Further analysis reveals emergent behaviors such as code self-correction, signaling an ''aha moment'' in which the model autonomously masters adaptive tool use. These findings highlight the promise of outcome-driven tool integration for advancing complex mathematical reasoning and offer new insights into hybrid neuro-symbolic systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ReTool shows faster convergence on AIME by interleaving real-time code in RL rollouts, but the synthetic cold-start traces probably embed the tool patterns the paper claims emerge from outcome rewards alone.


ReTool trains a 32B model to interleave code execution inside multi-turn reasoning and uses only task outcome rewards for the RL stage. It reports 67% accuracy on AIME after 400 steps versus 40% for a text-only RL baseline that took 1080 steps, and 72.5% in extended settings, 27.9 points above o1-preview. The setup starts with synthetic code-augmented traces for fine-tuning, then runs RL to refine tool invocation.

Referee Report

2 major / 1 minor

Summary. The paper introduces ReTool, a two-stage framework for training LLMs on strategic tool use (specifically code interpreters) in long-form mathematical reasoning. It begins with synthetic cold-start data generation to produce code-augmented reasoning traces for supervised fine-tuning, followed by RL that uses only task-outcome rewards to enable dynamic interleaving of code execution and autonomous discovery of tool-invocation patterns. On the AIME benchmark, the 32B model reaches 67% accuracy after 400 RL steps (vs. 40% for a text-only RL baseline after 1080 steps) and 72.5% in extended settings, exceeding o1-preview by 27.9%; emergent behaviors such as code self-correction are reported.

Significance. If the empirical gains are reproducible and the 'no human priors' claim holds after full disclosure of the cold-start pipeline, the work would supply concrete evidence that outcome-only RL can produce adaptive neuro-symbolic reasoning strategies, with implications for hybrid systems that combine language models with external tools.

major comments (2)

[Abstract] Abstract: The central claim that 'outcome feedback' alone enables 'autonomous discovery of optimal tool invocation patterns without human priors' is load-bearing for the novelty argument, yet the synthetic cold-start data generation step (which precedes RL) is described only at a high level; without explicit details on prompting templates, few-shot examples, or code-interleaving heuristics used to create the traces, it is impossible to verify that the subsequent 400-step RL phase discovers rather than refines pre-embedded strategies.
[Abstract] Abstract and Experiments section: The reported efficiency advantage (67% accuracy at 400 steps vs. 40% at 1080 steps for the text baseline) and the 72.5% extended-setting result are presented without rollout mechanics, reward-shaping details, variance across seeds, or exact baseline implementations; these omissions make it difficult to assess whether the performance gap is robust or sensitive to initialization from the synthetic traces.

minor comments (1)

[Abstract] Abstract: The phrase 'signaling an ''aha moment''' contains inconsistent quotation marks that should be standardized for readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate the requested clarifications for improved transparency and reproducibility.


Referee: [Abstract] Abstract: The central claim that 'outcome feedback' alone enables 'autonomous discovery of optimal tool invocation patterns without human priors' is load-bearing for the novelty argument, yet the synthetic cold-start data generation step (which precedes RL) is described only at a high level; without explicit details on prompting templates, few-shot examples, or code-interleaving heuristics used to create the traces, it is impossible to verify that the subsequent 400-step RL phase discovers rather than refines pre-embedded strategies.

Authors: We agree that the current high-level description of the synthetic cold-start data generation limits verification of the novelty claim. In the revised manuscript we will expand the Methods section to include the exact prompting templates, few-shot examples, and code-interleaving heuristics used to generate the initial traces. This will make explicit that the cold-start supplies basic tool-use exposure while the subsequent RL stage, using only task-outcome rewards and no additional human-designed signals, enables further autonomous refinement and discovery of invocation patterns. revision: yes

Referee: [Abstract] Abstract and Experiments section: The reported efficiency advantage (67% accuracy at 400 steps vs. 40% at 1080 steps for the text baseline) and the 72.5% extended-setting result are presented without rollout mechanics, reward-shaping details, variance across seeds, or exact baseline implementations; these omissions make it difficult to assess whether the performance gap is robust or sensitive to initialization from the synthetic traces.

Authors: We acknowledge these omissions hinder reproducibility assessment. The revised manuscript will add a dedicated subsection detailing the multi-turn rollout mechanics for real-time code execution, confirm that rewards are strictly outcome-based with no shaping, report performance variance across multiple random seeds, and provide exact implementation specifications for the text-only RL baseline. These additions will allow readers to evaluate the robustness of the efficiency and accuracy gains. revision: yes
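
As an illustration of the seed-variance reporting promised above, per-seed accuracies could be summarized as a mean and a sample standard deviation. The helper name `summarize_seeds` is hypothetical, and the input values are placeholders chosen for the example, not results from the paper.

```python
import statistics


def summarize_seeds(accuracies):
    """Return (mean, sample standard deviation) of per-seed accuracies."""
    if len(accuracies) < 2:
        raise ValueError("need at least two seeds to estimate variance")
    return statistics.mean(accuracies), statistics.stdev(accuracies)


# Placeholder values for illustration only; these are NOT results from the paper.
placeholder_runs = [0.60, 0.70, 0.65]
mean_acc, std_acc = summarize_seeds(placeholder_runs)
print(f"{mean_acc:.3f} +/- {std_acc:.3f} over {len(placeholder_runs)} seeds")
```

A gap between two methods is easier to trust when it is several times larger than the per-seed spread reported this way.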

Circularity Check

0 steps flagged

No significant circularity; empirical pipeline rests on external benchmarks and standard RL rewards.


The paper describes an empirical training process: synthetic cold-start data generation for fine-tuning, followed by RL using only task-outcome rewards. Performance is measured on the external AIME benchmark with direct comparisons to baselines. No equations, derivations, or claims reduce by construction to fitted parameters, self-citations, or renamed inputs. The central assertions about autonomous tool-use discovery are supported by reported accuracies and observed behaviors rather than self-referential definitions or load-bearing self-citations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only; no explicit free parameters, axioms, or invented entities are detailed beyond standard RL assumptions and synthetic data generation.

pith-pipeline@v0.9.0 · 5636 in / 1087 out tokens · 55454 ms · 2026-05-13T18:38:15.451027+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Cost.FunctionalEquation · washburn_uniqueness_aczel · echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Paper passage: enabling autonomous discovery of optimal tool invocation patterns without human priors

IndisputableMonolith.Foundation.DAlembert.Inevitability · bilinear_family_forced · contradicts
CONTRADICTS: the theorem conflicts with this paper passage, or marks a claim that would need revision before publication.
Paper passage: synthetic cold-start data generation to produce code-augmented long-form reasoning traces

IndisputableMonolith.Foundation.DimensionForcing · dimension_forced · unclear
Relation between the paper passage and the cited Recognition theorem.
Paper passage: ReTool-32B attains 72.5% accuracy... outperforming OpenAI's o1-preview by 27.9%

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 26 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation
cs.LG 2026-05 unverdicted novelty 7.0
GEAR reshapes GRPO trajectory advantages using divergence signals from a ground-truth-conditioned teacher to create adaptive token- and segment-level credit regions.

RewardHarness: Self-Evolving Agentic Post-Training
cs.AI 2026-05 unverdicted novelty 7.0
RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.

Teaching Language Models to Think in Code
cs.CL 2026-05 unverdicted novelty 7.0
ThinC trains small models to reason primarily in code rather than natural language, outperforming tool-integrated baselines and even larger models on competition math benchmarks.

Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis
cs.CL 2026-04 unverdicted novelty 7.0
DataPRM is a new process reward model for data analysis agents that detects silent errors via environment interaction and ternary rewards, yielding 7-11% gains on benchmarks and further RL improvements.

Does RL Expand the Capability Boundary of LLM Agents? A PASS@(k,T) Analysis
cs.LG 2026-04 unverdicted novelty 7.0
RL expands the capability boundary of LLM agents on compositional tool-use tasks, shown by non-converging pass curves at large k with increasing T, while SFT regresses it and the effect is absent on simpler tasks.

Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning
cs.AI 2026-04 unverdicted novelty 7.0
COVERT generates verifiable synthetic tool-use environments for RL by validated trajectory synthesis and oracle-preserving augmentations, improving tool-use accuracy on BFCL v3 and ACEBench while remaining complementa...

Video-R1: Reinforcing Video Reasoning in MLLMs
cs.CV 2025-03 conditional novelty 7.0
Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.

ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents
cs.AI 2026-05 unverdicted novelty 6.0
ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.

PruneTIR: Inference-Time Tool Call Pruning for Effective yet Efficient Tool-Integrated Reasoning
cs.CL 2026-05 unverdicted novelty 6.0
PruneTIR prunes erroneous tool-call trajectories during LLM inference via three trigger-based components to raise Pass@1 accuracy and efficiency while shortening context.

SOD: Step-wise On-policy Distillation for Small Language Model Agents
cs.CL 2026-05 unverdicted novelty 6.0
SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.

Teaching Language Models to Think in Code
cs.CL 2026-05 unverdicted novelty 6.0
ThinC trains smaller language models to reason entirely in code after minimal NL planning, outperforming tool-integrated baselines and even much larger models on competition math benchmarks.

Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning
cs.CL 2026-05 unverdicted novelty 6.0
A training recipe for tool-integrated reasoning models achieves state-of-the-art open-source results on math benchmarks such as 96.7% and 99.2% on AIME 2025 at 4B and 30B scales by balancing tool-use trajectories and ...

See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection
cs.CV 2026-04 unverdicted novelty 6.0
ForeSight lets VLMs use low-level visual cues and mask-based visual feedback within an RL loop to reason more accurately, with the 7B model beating same-scale peers and some closed-source SOTA on a new benchmark.

JigsawRL: Assembling RL Pipelines for Efficient LLM Post-Training
cs.LG 2026-04 unverdicted novelty 6.0
JigsawRL achieves up to 1.85x higher throughput in LLM RL pipelines via pipeline multiplexing, sub-stage graphs, and look-ahead scheduling compared to prior systems.

Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
cs.AI 2026-04 unverdicted novelty 6.0
Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.

AutoOR: Scalably Post-training LLMs to Autoformalize Operations Research Problems
cs.LG 2026-04 unverdicted novelty 6.0
AutoOR uses synthetic data generation and RL post-training with solver feedback to enable 8B LLMs to autoformalize linear, mixed-integer, and non-linear OR problems, matching larger models on benchmarks.

When to Trust Tools? Adaptive Tool Trust Calibration For Tool-Integrated Math Reasoning
cs.CL 2026-04 unverdicted novelty 6.0
ATTC reduces 'Tool Ignored' errors in tool-integrated reasoning by adaptively trusting tool results according to generated code confidence, yielding 4.1-7.5% gains across models and datasets.

Training LLMs for Multi-Step Tool Orchestration with Constrained Data Synthesis and Graduated Rewards
cs.LG 2026-03 unverdicted novelty 6.0
A constrained-synthesis RL method with graduated rewards for atomic validity and orchestration consistency improves LLM turn accuracy on multi-step tool benchmarks and transfers to new API sets.

Reinforced Collaboration in Multi-Agent Flow Networks
cs.LG 2026-05 unverdicted novelty 5.0
MANGO optimizes multi-agent LLM workflows via flow networks, RL, and textual gradients, delivering up to 12.8% higher performance and 47.4% better efficiency while generalizing to new domains.

E3-TIR: Enhanced Experience Exploitation for Tool-Integrated Reasoning
cs.AI 2026-04 unverdicted novelty 5.0
E3-TIR integrates expert prefixes, guided branches, and self-exploration via mix policy optimization to deliver 6% better tool-use performance with under 10% of the usual synthetic data and 1.46x ROI.

UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
cs.AI 2025-09 conditional novelty 5.0
UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.

Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
cs.AI 2025-03 unverdicted novelty 5.0
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

SepSeq: A Training-Free Framework for Long Numerical Sequence Processing in LLMs
cs.CL 2026-04 unverdicted novelty 4.0
SepSeq improves LLM accuracy on long numerical sequences by an average of 35.6% by inserting separator tokens that serve as attention sinks while cutting token usage by 16.4%.

A Survey of Context Engineering for Large Language Models
cs.CL 2025-07 accept novelty 4.0
The survey organizes Context Engineering into retrieval, processing, management, and integrated systems like RAG and multi-agent setups while identifying an asymmetry where LLMs handle complex inputs well but struggle...

A Brief Overview: Agentic Reinforcement Learning In Large Language Models
cs.AI 2026-04 unverdicted novelty 2.0
The paper surveys the conceptual foundations, methodological innovations, challenges, and future directions of agentic reinforcement learning frameworks that embed cognitive capabilities like meta-reasoning and self-r...

A Brief Overview: Agentic Reinforcement Learning In Large Language Models
cs.AI 2026-04 unverdicted novelty 2.0
This review synthesizes conceptual foundations, methods, challenges, and future directions for agentic reinforcement learning in large language models.

Figure 8: Template Prompt for Data Curation

Identify sections where code execution could speed up the reasoning process or make the calculation more accurate
Replace the manual calculation steps with code snippets and the corresponding interpreter's execution results
Keep the logical flow of the reasoning process intact, including any failed exploration attempts that were part of the initial process
The code snippets should be complete scripts, including necessary imports, and should not contain markdown symbols like <code> ```python code snippet ``` </code>
Outputs in the code snippets must explicitly call the print function
Execution results should match the model's output exactly, with no extra or missing tokens
If the Original Thinking Process does not include an <answer> section at the end, please add it in the Revised Thinking Process: <answer> \boxed{’The final answer goes here.’} </answer>

Revised Thinking Process (With code interpreter’s support):
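
The snippet rules in the template prompt above (complete scripts, no markdown fences, explicit `print` calls, recorded results matching real execution) lend themselves to an automatic check. The sketch below is one hypothetical way to verify a curated snippet; `check_snippet` is an invented helper, `exec` is used without sandboxing for brevity, and treating "match exactly" as byte-for-byte equality of stdout (including the trailing newline) is an assumption.

```python
import contextlib
import io


def check_snippet(code: str, recorded_output: str) -> list:
    """Return a list of rule violations for one curated code snippet."""
    problems = []
    if "```" in code:
        problems.append("contains markdown fence")
    if "print(" not in code:
        problems.append("no explicit print call")
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})  # illustration only; use a sandbox for untrusted code
    except Exception as exc:
        problems.append(f"does not run as a complete script: {type(exc).__name__}")
    else:
        # Only compare outputs when the script actually ran to completion.
        if buffer.getvalue() != recorded_output:
            problems.append("recorded output differs from actual execution")
    return problems


print(check_snippet("import math\nprint(math.factorial(4))", "24\n"))
print(check_snippet("x = 1 + 1", ""))
print(check_snippet("print(2 + 2)", "5\n"))
```

Running the check on every snippet in a trace would flag hallucinated interpreter output before the trace is used for fine-tuning.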