Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
Pith reviewed 2026-05-14 22:29 UTC · model grok-4.3
The pith
Math-Shepherd trains reward models on auto-generated step labels to verify and reinforce LLM math solutions without human annotations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Math-Shepherd is a process-oriented reward model trained on automatically constructed process-wise supervision data that labels individual reasoning steps as correct or incorrect. Applied to verification (reranking LLM outputs) or to step-by-step PPO reinforcement learning, it yields measurable accuracy gains: step-by-step PPO raises Mistral-7B from 77.9% to 84.1% on GSM8K and from 28.6% to 33.0% on MATH, and adding verification lifts these further to 89.1% and 43.5%.
What carries the argument
Math-Shepherd, a process reward model that assigns a scalar score to each reasoning step using automatically generated supervision signals.
Load-bearing premise
Automatically constructed process-wise supervision data accurately labels correct versus incorrect reasoning steps without systematic bias or noise from the generation process itself.
What would settle it
A large-scale human annotation study on held-out solution steps: if the automatic labels disagreed with expert judgments on a substantial fraction of steps, the central claim would be falsified.
read the original abstract
In this paper, we present an innovative process-oriented math process reward model called Math-Shepherd, which assigns a reward score to each step of math problem solutions. The training of Math-Shepherd is achieved using automatically constructed process-wise supervision data, breaking the bottleneck of heavy reliance on manual annotation in existing work. We explore the effectiveness of Math-Shepherd in two scenarios: 1) Verification: Math-Shepherd is utilized for reranking multiple outputs generated by Large Language Models (LLMs); 2) Reinforcement Learning: Math-Shepherd is employed to reinforce LLMs with step-by-step Proximal Policy Optimization (PPO). With Math-Shepherd, a series of open-source LLMs demonstrates exceptional performance. For instance, the step-by-step PPO with Math-Shepherd significantly improves the accuracy of Mistral-7B (77.9% → 84.1% on GSM8K and 28.6% → 33.0% on MATH). The accuracy can be further enhanced to 89.1% and 43.5% on GSM8K and MATH with the verification of Math-Shepherd, respectively. We believe that automatic process supervision holds significant potential for the future evolution of LLMs.
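The abstract's first scenario, verification, is best-of-N reranking. Below is a minimal sketch of how a process reward model plugs into that loop; `prm_score_steps` is a hypothetical stand-in for the Math-Shepherd scorer, and minimum-over-steps is one common aggregation rule, not necessarily the paper's exact choice.

```python
# Hedged sketch of PRM-based best-of-N reranking. `prm_score_steps` is a
# hypothetical step scorer returning one score in [0, 1] per step; the
# min aggregation is an assumption, not confirmed as the paper's rule.
from typing import Callable, List

def rerank_best_of_n(
    candidates: List[List[str]],  # N candidate solutions, each a list of steps
    prm_score_steps: Callable[[List[str]], List[float]],
) -> List[str]:
    """Return the candidate whose weakest step the PRM trusts most."""
    def solution_score(steps: List[str]) -> float:
        scores = prm_score_steps(steps)
        # A chain of reasoning is only as strong as its weakest step.
        return min(scores) if scores else 0.0
    return max(candidates, key=solution_score)
```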
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Math-Shepherd, a process reward model for step-level supervision in mathematical reasoning. It is trained solely on automatically constructed process-wise labels derived from sampling multiple solution trajectories per problem and back-propagating final-answer correctness, without human annotations. The model is applied in two settings: verification via reranking of LLM outputs and reinforcement learning via step-by-step PPO. Reported results include accuracy gains for Mistral-7B from 77.9% to 84.1% on GSM8K and 28.6% to 33.0% on MATH via PPO, with further lifts to 89.1% and 43.5% when combined with verification.
Significance. If the automatic labeling procedure reliably captures step correctness, the work would be significant for scaling process supervision in LLMs by removing the annotation bottleneck. The concrete benchmark lifts on GSM8K and MATH, achieved with open-source models and reproducible PPO training, would demonstrate practical value for both verification and RL pipelines in mathematical reasoning.
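The second scenario, step-by-step PPO, replaces the single terminal reward of outcome-supervised RL with per-step credit. A hedged sketch of that reward shaping follows; `step_end_indices` and `prm_step_scores` are illustrative assumptions about the interface, not the paper's exact implementation.

```python
# Hedged sketch of step-wise reward assignment for PPO: instead of one
# terminal reward from final-answer matching, each reasoning step receives
# the PRM's score at its boundary token; all other tokens get zero.
from typing import List

def stepwise_rewards(
    num_tokens: int,
    step_end_indices: List[int],   # token index where each step ends (assumed)
    prm_step_scores: List[float],  # PRM score per step, aligned with indices
) -> List[float]:
    rewards = [0.0] * num_tokens
    for idx, score in zip(step_end_indices, prm_step_scores):
        rewards[idx] = score  # dense, step-level credit instead of one outcome reward
    return rewards
```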
major comments (2)
- [§3] Process-wise supervision construction: the label assignment procedure samples trajectories and assigns step rewards solely from final-answer match to ground truth. This is vulnerable to systematic noise: correct early steps followed by later errors receive negative labels, and incorrect steps compensated later receive positive labels (a sketch of this labeling and its failure modes follows this list). No quantitative validation of label accuracy against human step-level annotations is reported, which directly undermines the central claim that gains arise from genuine process supervision rather than improved outcome filtering.
- [Experimental results] Experimental setup and results: no analysis or controls are described for potential data leakage between the automatically generated training trajectories and the GSM8K/MATH test sets, nor for error rates in the auto-labeling pipeline. These omissions are load-bearing: without such checks, the reported PPO improvements (e.g., the Mistral-7B GSM8K lift) cannot be confidently attributed to step-level credit assignment.
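The labeling procedure attacked in the first major comment can be made concrete. Below is a minimal sketch of outcome back-propagation, where every step inherits the trajectory's final-answer label; both noise modes the referee names fall directly out of the `extend` line.

```python
# Hedged sketch of the outcome-back-propagation labeling the referee
# critiques: every step in a trajectory inherits the final-answer label.
# A correct early step in a failing trajectory is mislabeled negative;
# a wrong step repaired later is mislabeled positive.
from typing import List, Tuple

def label_steps_by_outcome(
    trajectories: List[Tuple[List[str], str]],  # (steps, final_answer) pairs
    ground_truth: str,
) -> List[Tuple[str, int]]:
    labeled = []
    for steps, answer in trajectories:
        outcome = 1 if answer == ground_truth else 0
        # Noise source: all steps share the trajectory's outcome label.
        labeled.extend((step, outcome) for step in steps)
    return labeled
```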
minor comments (2)
- [Abstract] Abstract and §4: results are detailed only for Mistral-7B while the text refers to 'a series of open-source LLMs'; listing the full set of evaluated models and their individual gains would improve completeness.
- [§3] Notation in §3: the precise definition of the step reward (binary vs. continuous) and how it is aggregated across sampled trajectories should be stated explicitly to allow reproduction (two candidate definitions are sketched below).
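For the notation comment, the two natural step-reward definitions are a binary (hard) label, marking a step correct if any sampled continuation from it reaches the right answer, and a continuous (soft) label equal to the fraction that do. A hedged sketch of both follows; which one the paper actually uses is precisely what the referee asks it to state.

```python
# Hedged sketch of two candidate step-reward definitions, aggregated over
# N sampled continuations from a given step prefix. `continuation_correct`
# is an illustrative list of final-answer checks; assumes N >= 1.
from typing import List

def hard_step_label(continuation_correct: List[bool]) -> int:
    # Binary: the step is "correct" if any continuation recovers the answer.
    return int(any(continuation_correct))

def soft_step_label(continuation_correct: List[bool]) -> float:
    # Continuous: the empirical success rate over sampled continuations.
    return sum(continuation_correct) / len(continuation_correct)
```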
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our work. We address each major comment below and commit to revisions that strengthen the presentation of our automatic labeling approach and experimental controls.
read point-by-point responses
- Referee: [§3] Process-wise supervision construction: the label assignment procedure samples trajectories and assigns step rewards solely from final-answer match to ground truth. This is vulnerable to systematic noise (correct early steps followed by later errors receive negative labels; incorrect steps compensated later receive positive labels). No quantitative validation of label accuracy against human step-level annotations is reported, which directly undermines the central claim that gains arise from genuine process supervision rather than improved outcome filtering.
Authors: We acknowledge that propagating final-answer correctness to individual steps can introduce label noise, as early correct steps in failing trajectories receive negative labels and later compensating errors in successful trajectories receive positive labels. This is an inherent trade-off of our fully automatic method that avoids human annotations. We maintain that the resulting process reward model still provides net-positive step-level signals on average, as shown by consistent gains in both the verification (reranking) and RL (PPO) settings. In revision we will add an explicit limitations subsection discussing this noise source and will include a small-scale human validation study on a random sample of 200 steps to report estimated label accuracy. (revision: partial)
- Referee: [Experimental results] Experimental setup and results: no analysis or controls are described for potential data leakage between the automatically generated training trajectories and the GSM8K/MATH test sets, nor for error rates in the auto-labeling pipeline. These omissions are load-bearing: without such checks, the reported PPO improvements (e.g., the Mistral-7B GSM8K lift) cannot be confidently attributed to step-level credit assignment.
Authors: All training trajectories are generated exclusively from the official training splits of GSM8K and MATH; the test sets are held out entirely. We will add an explicit statement and table confirming this separation in the revised experimental setup. For auto-labeling error rates we will add an analysis that measures label consistency across multiple independent trajectory samples per problem and reports the fraction of steps whose label flips when a different successful or unsuccessful trajectory is chosen. These additions will allow readers to assess the reliability of the step-level credit assignment. (revision: partial)
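The consistency analysis the authors propose reduces to a flip-rate computation. A minimal sketch, assuming the labeling pipeline is re-run with independent trajectory samples and produces binary labels for a fixed set of steps:

```python
# Hedged sketch of the proposed auto-label consistency check: re-run the
# labeling pipeline k times with independent trajectory samples and report
# the fraction of steps whose binary label is not unanimous across runs.
from typing import List

def label_flip_rate(labels_per_run: List[List[int]]) -> float:
    """labels_per_run[r][s] is the label of step s in resampling run r.

    Assumes at least one run and equal-length label lists across runs.
    """
    num_steps = len(labels_per_run[0])
    flipped = sum(
        1 for s in range(num_steps)
        if len({run[s] for run in labels_per_run}) > 1  # label disagrees somewhere
    )
    return flipped / num_steps
```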
Circularity Check
No significant circularity; empirical gains on external benchmarks
full rationale
The paper's central claims rest on measured accuracy improvements for Mistral-7B and other LLMs on the fixed external benchmarks GSM8K and MATH. The automatic construction of process-wise labels (via sampling trajectories and final-answer matching to ground truth) is an input to training the reward model; the subsequent PPO and verification steps are evaluated against those same independent benchmarks rather than against quantities defined from the fitted model itself. No equation or derivation reduces a claimed prediction to a fitted parameter by construction, and no load-bearing self-citation chain is required for the reported results. The central claims therefore remain exposed to external falsification rather than resting on self-referential quantities.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Automatically generated process-wise supervision data accurately distinguishes correct from incorrect reasoning steps.
Forward citations
Cited by 22 Pith papers
- MedPRMBench: A Fine-grained Benchmark for Process Reward Models in Medical Reasoning
  MedPRMBench is the first fine-grained benchmark for process reward models in medical reasoning, featuring 6500 questions, 13000 chains, 113910 step labels, and a baseline that improves downstream QA accuracy by 3.2-6....
- POSTCONDBENCH: Benchmarking Correctness and Completeness in Formal Postcondition Inference
  POSTCONDBENCH is a new multilingual benchmark that evaluates LLM postcondition generation on real code using defect discrimination to assess completeness beyond surface matching.
- Fine-Tuning Small Reasoning Models for Quantum Field Theory
  Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.
- Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
  This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
- Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models
  Omni-MATH supplies 4428 human-verified Olympiad math problems that expose top LLMs achieving only 52.55% to 60.54% accuracy on the most difficult items.
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
  DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
- CROP: Expert-Aligned Image Cropping via Compositional Reasoning and Optimizing Preference
  CROP uses compositional reasoning and expert preference alignment in VLMs to produce aesthetic crops that match human experts more closely than previous methods.
- Confidence-Aware Alignment Makes Reasoning LLMs More Reliable
  CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and sp...
- Process Supervision of Confidence Margin for Calibrated LLM Reasoning
  RLCM trains LLMs with a margin-enhanced process reward that widens the gap between correct and incorrect reasoning steps, improving calibration on math, code, logic, and science tasks without hurting accuracy.
- Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning
  A new RL paradigm for reasoning where models generate their own internal process supervision from outcome feedback by recycling failed trajectories.
- SeLaR: Selective Latent Reasoning in Large Language Models
  SeLaR selectively applies latent soft reasoning in LLMs via entropy gating and contrastive regularization, outperforming standard CoT on five benchmarks without training.
- InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
  InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
- InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
  InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
- Process Reinforcement through Implicit Rewards
  PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 1...
- Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
  Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.
- Improve Mathematical Reasoning in Language Models by Automated Process Supervision
  OmegaPRM automates collection of 1.5 million process supervision labels via binary-search MCTS, raising Gemini Pro math accuracy from 51% to 69.4% on MATH500 and Gemma2 27B from 42.3% to 58.2%.
- ReMedi: Reasoner for Medical Clinical Prediction
  ReMedi boosts LLM performance on EHR clinical predictions by up to 19.9% F1 through ground-truth-guided rationale regeneration and fine-tuning.
- Prioritizing the Best: Incentivizing Reliable Multimodal Reasoning by Rewarding Beyond Answer Correctness
  Groupwise Ranking Reward reduces reasoning-answer inconsistency in multimodal models and raises reliability-conditioned accuracy from 47.4% to 54.7% over standard RLVR.
- Placing Puzzle Pieces Where They Matter: A Question Augmentation Framework for Reinforcement Learning
  PieceHint strategically scores and injects critical reasoning hints in RL training to let a 1.5B model match 32B baselines on math benchmarks while preserving pass@k diversity.
- A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence
  The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.
- DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
  DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.
- Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models