pith. machine review for the scientific record.

arxiv: 2605.03862 · v3 · submitted 2026-05-05 · 💻 cs.AI · cs.CL

Recognition: 2 theorem links · Lean Theorem

Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 18:12 UTC · model grok-4.3

classification: 💻 cs.AI · cs.CL
keywords: reasoning planners · executor-grounded rewards · reinforcement learning · reasoning reward models · math benchmarks · code generation · trace evaluation · planner-executor systems

The pith

Reasoning planners improve when trained on rewards that measure how much their traces actually help a frozen executor reach correct answers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models trained only on final-answer correctness can learn reasoning traces that succeed for the wrong reasons or contain hidden flaws that hinder the model that consumes them. The paper introduces a two-stage planner-executor setup in which the planner generates tagged reasoning traces, a frozen executor consumes them to produce the final output, and training uses a combined reward. This reward multiplies a score from a rubric-based reasoning quality model by the measured performance uplift the trace provides to the executor. A supporting dataset groups high-quality reference traces with localized flawed variants of the same problems. Experiments on code and math benchmarks show the resulting planners outperform those trained with execution-only signals.
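
To make the mechanism concrete, here is a minimal sketch of how such a combined reward could be computed for a single problem. The callable names (planner, executor, reasoning_rm, verifier) are hypothetical stand-ins, and the simple with/without-trace uplift is one plausible reading; the paper's exact uplift estimator and any normalization are not given in the material above.

```python
def executor_grounded_reward(problem, planner, executor, reasoning_rm, verifier):
    """One TraceLift-style reward computation for a single problem (sketch).

    All callables are hypothetical stand-ins:
      planner(problem)             -> tagged reasoning trace (str)
      executor(problem, trace)     -> final answer or program; trace may be None
      reasoning_rm(problem, trace) -> rubric-based quality score, e.g. in [0, 1]
      verifier(problem, output)    -> 1.0 if the final output is correct, else 0.0
    """
    trace = planner(problem)

    # The frozen executor produces the final artifact with and without the trace.
    score_with_trace = verifier(problem, executor(problem, trace))
    score_without_trace = verifier(problem, executor(problem, None))

    # Uplift: how much the trace helps the executor (one plausible definition;
    # how the paper handles negative uplift or normalization is not specified above).
    uplift = score_with_trace - score_without_trace

    # Rubric-based quality score from the Reasoning Reward Model.
    rm_score = reasoning_rm(problem, trace)

    # Executor-grounded reward: quality multiplied by measured usefulness.
    return trace, rm_score * uplift
```

In the full framework this reward would then drive a reinforcement-learning update of the planner while the executor stays frozen.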

Core claim

Reasoning traces should be supervised by their measured usefulness to a consumer model rather than by final correctness alone. The TraceLift framework achieves this by training the planner with an executor-grounded reward, formed by multiplying a rubric-based Reasoning Reward Model score by the uplift the trace delivers to a frozen executor; the result is more effective intermediate reasoning on math and code tasks.

What carries the argument

The executor-grounded reward, which multiplies a rubric-based Reasoning Reward Model score by the performance uplift the reasoning trace provides to a frozen executor.

If this is right

  • The two-stage planner-executor system achieves higher accuracy on code and math benchmarks than training with execution-only rewards.
  • Reasoning quality becomes directly learnable from groups of high-quality and perturbed flawed traces.
  • Planners are incentivized to generate traces that support the consumer model rather than merely appearing correct in isolation.
  • Intermediate reasoning artifacts receive supervision that penalizes shortcuts and flawed intermediate states.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same grounding approach could apply to other multi-step systems where one model produces intermediate artifacts consumed by another.
  • Performance may vary if the executor is updated or replaced after planner training, suggesting a need for executor-robust reward design.
  • The method highlights a general distinction between surface quality of reasoning and its functional support for downstream computation.

Load-bearing premise

The frozen executor supplies an unbiased and generalizable signal of how useful any given reasoning trace is.

What would settle it

Train the planner with the executor-grounded reward, then replace the original frozen executor with a different model or human solver on the same tasks and measure whether the reported gains in planner performance disappear.
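
A sketch of how that swap test could be run, reusing the hypothetical planner/executor/verifier interfaces from the earlier sketch; evaluate_uplift and the executor handles are illustrative names, not the paper's.

```python
def evaluate_uplift(problems, planner, executor, verifier):
    """Mean uplift the planner's traces give a particular executor (sketch)."""
    gains = []
    for problem in problems:
        trace = planner(problem)
        with_trace = verifier(problem, executor(problem, trace))
        without_trace = verifier(problem, executor(problem, None))
        gains.append(with_trace - without_trace)
    return sum(gains) / len(gains)

# Usage sketch (all handles hypothetical):
#   u_train   = evaluate_uplift(test_problems, planner, training_executor, verifier)
#   u_heldout = evaluate_uplift(test_problems, planner, replacement_executor, verifier)
# If u_train is large but u_heldout collapses, the planner likely learned traces
# tailored to the training executor rather than generally useful reasoning.
```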

read the original abstract

Reinforcement learning with verifiable rewards has become a common way to improve explicit reasoning in large language models, but final-answer correctness alone does not reveal whether the reasoning trace is faithful, reliable, or useful to the model that consumes it. This outcome-only signal can reinforce traces that are right for the wrong reasons, overstate reasoning gains by rewarding shortcuts, and propagate flawed intermediate states in multi-step systems. To this end, we propose TraceLift, a planner-executor training framework that treats reasoning as a consumable intermediate artifact. During planner training, the planner emits tagged reasoning. A frozen executor turns this reasoning into the final artifact for verifier feedback, while an executor-grounded reward shapes the intermediate trace. This reward multiplies a rubric-based Reasoning Reward Model (RM) score by measured uplift on the same frozen executor, crediting traces that are both high-quality and useful. To make reasoning quality directly learnable, we introduce TRACELIFT-GROUPS, a rubric-annotated reason-only dataset built from math and code seed problems. Each example is a same-problem group containing a high-quality reference trace and multiple plausible flawed traces with localized perturbations that reduce reasoning quality or solution support while preserving task relevance. Extensive experiments on code and math benchmarks show that this executor-grounded reasoning reward improves the two-stage planner-executor system over execution-only training, suggesting that reasoning supervision should evaluate not only whether a trace looks good, but also whether it helps the model that consumes it. Our code is available at: https://github.com/MasaiahHan/TraceLift
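
Reading the abstract's description of TRACELIFT-GROUPS literally, one same-problem group might be organized roughly as below; every field name here is an assumption for illustration, not a schema taken from the released dataset.

```python
from dataclasses import dataclass, field

@dataclass
class FlawedTrace:
    trace: str           # perturbed reasoning text
    perturbation: str    # localized flaw, e.g. "dropped case split" or "wrong loop bound"
    rubric_score: float  # reduced reasoning quality / solution support

@dataclass
class TraceGroup:
    """One same-problem group in a TRACELIFT-GROUPS-style dataset (illustrative shape)."""
    problem: str                   # math or code seed problem
    domain: str                    # "math" or "code"
    reference_trace: str           # the high-quality reference trace
    reference_rubric_score: float  # rubric annotation for the reference
    flawed_traces: list[FlawedTrace] = field(default_factory=list)
```

Groups shaped like this would let the rubric-based RM be trained to score the reference trace above its perturbed siblings, which is presumably how reasoning quality becomes directly learnable.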

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom & free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces the TraceLift framework for training reasoning planners in a two-stage planner-executor architecture. It defines an executor-grounded reward as the product of a rubric-based Reasoning Reward Model (RM) score and the measured performance uplift that the planner's reasoning trace provides to a frozen executor. To support direct supervision of reasoning quality, the authors release TRACELIFT-GROUPS, a dataset of same-problem groups containing one high-quality reference trace and multiple locally perturbed flawed traces for math and code problems. Experiments on code and math benchmarks are reported to show that this reward yields better end-to-end performance than execution-only training.

Significance. If the reported gains hold under scrutiny, the work usefully shifts emphasis from final-answer correctness to the downstream utility of intermediate reasoning traces. The introduction of TRACELIFT-GROUPS supplies a concrete resource for learning reasoning quality, and the public code release aids reproducibility. These elements could inform future process-supervision and multi-step reasoning research, provided the executor-grounded signal proves robust beyond the training executor.

major comments (2)
  1. [§3] §3 (Reward formulation): The executor-grounded reward multiplies the RM score by uplift measured on the identical frozen executor that later consumes the trace. Because the uplift signal is generated by the same model that will execute the planner's output, the planner can learn to emit traces that compensate for that executor's specific failure modes rather than producing generally useful reasoning. This coupling is load-bearing for the central claim that the reward improves reasoning fidelity; the manuscript should either demonstrate generalization to held-out executors or provide an ablation that decouples the uplift term from the training executor.
  2. [§4] §4 (Experimental results): The claim of improvement over execution-only training rests on benchmark gains, yet no ablation isolates the contribution of the uplift component versus the RM score alone, and no statistical significance, variance across seeds, or cross-executor transfer results are described. Without these, it is unclear whether the reported uplift is robust or merely reflects adaptation to the particular frozen executor used during training.
minor comments (3)
  1. [Abstract] The abstract introduces the terms 'TraceLift framework' and 'TRACELIFT-GROUPS' without a one-sentence gloss; a brief parenthetical definition on first use would improve readability for readers unfamiliar with the acronyms.
  2. [§3] The reward function is described in prose; adding an explicit equation (e.g., R = RM_score × uplift) with variable definitions would make the formulation precise and easier to reference (one illustrative rendering is sketched after this list).
  3. The GitHub link is provided; the repository should include the exact scripts, hyperparameters, and dataset construction code used for the reported experiments to support full reproducibility.
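
If the authors adopt minor comment 2, the formulation might be written roughly as follows; the notation is assumed here, not taken from the paper. For problem x, planner trace t, frozen executor E, and verifier v:

```latex
% Illustrative formalization only; symbols and the exact form of the uplift term are assumptions.
% x: problem, t: planner trace, E: frozen executor, v: verifier, \varnothing: no trace supplied.
\[
  R(x, t) \;=\;
  \underbrace{s_{\mathrm{RM}}(x, t)}_{\text{rubric-based RM score}}
  \cdot
  \underbrace{\Big( v\big(E(x, t)\big) \;-\; v\big(E(x, \varnothing)\big) \Big)}_{\text{uplift on the frozen executor}}
\]
```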

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The concerns about potential specialization to the training executor and the need for more rigorous ablations and statistical reporting are valid, and we will revise the manuscript to address them directly.

read point-by-point responses
  1. Referee: [§3] §3 (Reward formulation): The executor-grounded reward multiplies the RM score by uplift measured on the identical frozen executor that later consumes the trace. Because the uplift signal is generated by the same model that will execute the planner's output, the planner can learn to emit traces that compensate for that executor's specific failure modes rather than producing generally useful reasoning. This coupling is load-bearing for the central claim that the reward improves reasoning fidelity; the manuscript should either demonstrate generalization to held-out executors or provide an ablation that decouples the uplift term from the training executor.

    Authors: We agree that tying the uplift measurement to the training executor introduces a risk of specialization to its particular weaknesses. While this coupling is intentional to optimize the planner for the downstream executor in the two-stage system, we will add a new ablation that decouples the terms by computing uplift on a held-out executor during planner training and then measuring transfer performance on the original executor. We will also report cross-executor results to demonstrate whether the learned reasoning generalizes beyond the training executor. revision: yes

  2. Referee: [§4] §4 (Experimental results): The claim of improvement over execution-only training rests on benchmark gains, yet no ablation isolates the contribution of the uplift component versus the RM score alone, and no statistical significance, variance across seeds, or cross-executor transfer results are described. Without these, it is unclear whether the reported uplift is robust or merely reflects adaptation to the particular frozen executor used during training.

    Authors: We acknowledge that the current experiments lack these controls. In the revision we will add: (i) an explicit ablation comparing the full reward (RM score × uplift) against RM-only and uplift-only variants on the same benchmarks; (ii) statistical significance testing (e.g., paired t-tests) on the reported gains; (iii) results averaged over at least three random seeds with standard deviations; and (iv) cross-executor transfer experiments. These additions will isolate the contribution of each reward component and quantify robustness. revision: yes
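
As a sketch of the ablation arms the authors commit to here, reusing the hypothetical reward components from the first sketch; the variant names are ours, and which arm drives the gains is exactly what the ablation would reveal.

```python
def reward_variants(rm_score: float, uplift: float) -> dict:
    """Ablation arms for the executor-grounded reward (sketch; variant names assumed)."""
    return {
        "full": rm_score * uplift,  # RM score x uplift: the proposed reward
        "rm_only": rm_score,        # rubric quality alone
        "uplift_only": uplift,      # executor uplift alone
    }

# Significance sketch: per-benchmark scores from two arms, averaged over seeds,
# could be compared with a paired test, e.g. scipy.stats.ttest_rel(scores_a, scores_b).
```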

Circularity Check

0 steps flagged

No significant circularity in the proposed framework or reward definition

full rationale

The paper introduces TraceLift as an empirical training framework: a frozen executor computes uplift (performance delta when consuming the planner's trace), which is multiplied by an independent rubric-based RM score to form the reward signal for RL on the planner. This reward is externally defined and applied to train a separate component; the subsequent experiments compare against execution-only baselines on held-out benchmarks. No equations or steps reduce the claimed improvement to a fitted parameter or self-referential definition by construction. No self-citations, uniqueness theorems, or ansatzes are invoked in the provided text. The method remains self-contained with independent experimental validation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 3 invented entities

The framework introduces new training components and a custom dataset without citing prior independent evidence for their effectiveness; no free parameters or axioms are explicitly stated in the abstract.

invented entities (3)
  • TraceLift framework · no independent evidence
    purpose: Planner-executor training with grounded rewards
    Newly proposed method combining planner output with executor feedback.
  • TRACELIFT-GROUPS dataset · no independent evidence
    purpose: Rubric-annotated groups of high-quality and flawed reasoning traces
    Introduced as a new resource built from math and code seed problems.
  • Reasoning Reward Model (RM) · no independent evidence
    purpose: Scores reasoning quality for the grounded reward
    Component of the proposed reward signal.

pith-pipeline@v0.9.0 · 5594 in / 1251 out tokens · 37886 ms · 2026-05-08T18:12:21.174348+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

41 extracted references · 17 canonical work pages · 8 internal anchors

  1. [1] Program Synthesis with Large Language Models
     Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
  2. [2] Evaluating Large Language Models Trained on Code
     Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
  3. [3] Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks
     Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. Transactions on Machine Learning Research, 2023. URL https://openreview.net/forum?id=YfZ4ZPt8zd.
  4. [4] Deep Reinforcement Learning from Human Preferences
     Paul F. Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, volume 30, 2017.
  5. [5] Training Verifiers to Solve Math Word Problems
     Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
  6. [6] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
     DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
  7. [7] PAL: Program-aided Language Models
     Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. PAL: Program-aided language models. arXiv preprint arXiv:2211.10435, 2022.
  8. [8] The Llama 3 Herd of Models
     Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
  9. [9] The Instinctive Bias: Spurious Images Lead to Illusion in MLLMs
     Tianyang Han, Qing Lian, Rui Pan, Renjie Pi, Jipeng Zhang, Shizhe Diao, Yong Lin, and Tong Zhang. The instinctive bias: Spurious images lead to illusion in MLLMs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 16163–16177, 2024.
  10. [10] Beyond Words and Pixels: A Benchmark for Implicit World Knowledge Reasoning in Generative Models
     Tianyang Han, Junhao Su, Junjie Hu, Peizhen Yang, Hengyu Shi, Junfeng Luo, and Jialin Gao. Beyond words and pixels: A benchmark for implicit world knowledge reasoning in generative models. arXiv, abs/2511.18271.
  11. [11] URL https://api.semanticscholar.org/CorpusID:283243797
  12. [12] LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
     Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. In International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=chfJJYC3iL.
  13. [13] Large Language Models are Zero-Shot Reasoners
     Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems, volume 35, 2022.
  14. [14] CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning
     Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven C. H. Hoi. CodeRL: Mastering code generation through pretrained models and deep reinforcement learning. In Advances in Neural Information Processing Systems, volume 35, 2022.
  15. [15] Competition-Level Code Generation with AlphaCode
     Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with AlphaCode. Science, 378(6624):1092–1097, 2022. doi: 10.1126/science.abq1158.
  16. [16] Let's Verify Step by Step
     Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. arXiv preprint arXiv:2305.20050, 2023.
  17. [17] RLTF: Reinforcement Learning from Unit Test Feedback
     Jiate Liu, Yiqin Zhu, Kaiwen Xiao, Qiang Fu, Xiao Han, Yang Wei, and Deheng Ye. RLTF: Reinforcement learning from unit test feedback. Transactions on Machine Learning Research, 2023. URL https://openreview.net/forum?id=hjYmsV6nXZ.
  18. [18] Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation
     Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. In Advances in Neural Information Processing Systems, volume 36, 2023.
  19. [19] OpenAI o1 System Card
     OpenAI. OpenAI o1 system card. Technical report, OpenAI, 2024. URL https://openai.com/index/openai-o1-system-card/.
  20. [20] Training Language Models to Follow Instructions with Human Feedback
     Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, volume 35, 2022.
  21. [21] Are NLP Models Really Able to Solve Simple Math Word Problems?
     Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are NLP models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2080–2094, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/...
  22. [22] Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization
     Renjie Pi, Tianyang Han, Wei Xiong, Jipeng Zhang, Runtao Liu, Rui Pan, and Tong Zhang. Strengthening multimodal large language model with bootstrapped preference optimization. In European Conference on Computer Vision, pages 382–398. Springer, 2024.
  23. [23] MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance
     Renjie Pi, Tianyang Han, Jianshu Zhang, Yueqi Xie, Rui Pan, Qing Lian, Hanze Dong, Jipeng Zhang, and Tong Zhang. MLLM-Protector: Ensuring MLLM's safety without hurting performance. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 16012–16027, 2024.
  24. [24] Direct Preference Optimization: Your Language Model is Secretly a Reward Model
     Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems, volume 36, pages 53728–53741, 2023.
  25. [25] Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning
     Amrith Rajagopal Setlur, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, Rishabh Agarwal, Alekh Agarwal, Jonathan Berant, and Aviral Kumar. Rewarding progress: Scaling automated process verifiers for LLM reasoning. In International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=A6Y7AqlzLW.
  26. [26] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
     Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
  27. [27] Reflexion: Language Agents with Verbal Reinforcement Learning
     Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik R. Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems, volume 36, 2023.
  28. [28] Execution-Based Code Generation Using Deep Reinforcement Learning
     Parshin Shojaee, Aneesh Jain, Sindhu Tipirneni, and Chandan K. Reddy. Execution-based code generation using deep reinforcement learning. Transactions on Machine Learning Research, 2023. URL https://openreview.net/forum?id=0XBuaxqEcG.
  29. [29] Momentum Auxiliary Network for Supervised Local Learning
     Junhao Su, Changpeng Cai, Feiyu Zhu, Chenghao He, Xiaojie Xu, Dongzhi Guan, and Chenyang Si. Momentum auxiliary network for supervised local learning. In European Conference on Computer Vision, pages 276–292. Springer, 2024.
  30. [30] Failure Makes the Agent Stronger: Enhancing Accuracy through Structured Reflection for Reliable Tool Interactions
     Junhao Su, Yuanliang Wan, Junwei Yang, Hengyu Shi, Tianyang Han, Junfeng Luo, and Yurui Qiu. Failure makes the agent stronger: Enhancing accuracy through structured reflection for reliable tool interactions. arXiv preprint arXiv:2509.18847, 2025.
  31. [31] Man++: Scaling Momentum Auxiliary Network for Supervised Local Learning in Vision Tasks
     Junhao Su, Feiyu Zhu, Hengyu Shi, Tianyang Han, Yurui Qiu, Junfeng Luo, Xiaoming Wei, and Jialin Gao. Man++: Scaling momentum auxiliary network for supervised local learning in vision tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026.
  32. [32] Solving Math Word Problems with Process- and Outcome-Based Feedback
     Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process- and outcome-based feedback. arXiv preprint arXiv:2211.14275, 2022.
  33. [33] Math-Shepherd: Verify and Reinforce LLMs Step-by-Step without Human Annotations
     Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-Shepherd: Verify and reinforce LLMs step-by-step without human annotations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, pages 9426–9439, Bangkok, Thailand, 2024. Association for Computational Linguistics.
  34. [34] Self-Consistency Improves Chain of Thought Reasoning in Language Models
     Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In International Conference on Learning Representations, 2023.
  35. [35] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
     Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, 2022.
  36. [36] Qwen2.5 Technical Report
     An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.
  37. [37] Qwen3 Technical Report
     An Yang, Baosong Yang, Beichen Ge, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
  38. [38] Tree of Thoughts: Deliberate Problem Solving with Large Language Models
     Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In Advances in Neural Information Processing Systems, volume 36, 2023.
  39. [39] ReAct: Synergizing Reasoning and Acting in Language Models
     Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations, 2023.
  40. [40] STaR: Bootstrapping Reasoning with Reasoning
     Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D. Goodman. STaR: Bootstrapping reasoning with reasoning. In Advances in Neural Information Processing Systems, volume 35, 2022.
  41. [41] Fine-Tuning Language Models from Human Preferences
     Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.