Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
Pith reviewed 2026-05-14 22:29 UTC · model grok-4.3
The pith
Math-Shepherd trains reward models on auto-generated step labels to verify and reinforce LLM math solutions without human annotations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Math-Shepherd is a process-oriented reward model trained on automatically constructed process-wise supervision data that labels individual reasoning steps as correct or incorrect. Applied to verification (reranking LLM outputs) or to step-by-step PPO reinforcement learning, it yields measurable accuracy gains: step-by-step PPO raises Mistral-7B from 77.9% to 84.1% on GSM8K and from 28.6% to 33.0% on MATH, and adding verification lifts these further to 89.1% and 43.5%.
What carries the argument
Math-Shepherd, a process reward model that assigns a scalar score to each reasoning step using automatically generated supervision signals.
Load-bearing premise
Automatically constructed process-wise supervision data accurately labels correct versus incorrect reasoning steps without systematic bias or noise from the generation process itself.
What would settle it
A large-scale human annotation study on held-out solution steps: if the automatic labels disagreed with expert judgments on a substantial fraction of steps, the central claim would be falsified.
read the original abstract
In this paper, we present an innovative process-oriented math process reward model called Math-Shepherd, which assigns a reward score to each step of math problem solutions. The training of Math-Shepherd is achieved using automatically constructed process-wise supervision data, breaking the bottleneck of heavy reliance on manual annotation in existing work. We explore the effectiveness of Math-Shepherd in two scenarios: 1) Verification: Math-Shepherd is utilized for reranking multiple outputs generated by Large Language Models (LLMs); 2) Reinforcement Learning: Math-Shepherd is employed to reinforce LLMs with step-by-step Proximal Policy Optimization (PPO). With Math-Shepherd, a series of open-source LLMs demonstrates exceptional performance. For instance, the step-by-step PPO with Math-Shepherd significantly improves the accuracy of Mistral-7B (77.9% → 84.1% on GSM8K and 28.6% → 33.0% on MATH). The accuracy can be further enhanced to 89.1% and 43.5% on GSM8K and MATH with the verification of Math-Shepherd, respectively. We believe that automatic process supervision holds significant potential for the future evolution of LLMs.
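The abstract's first scenario, verification, is best-of-N reranking. Below is a minimal sketch of how a process reward model plugs into that loop; `prm_score_steps` is a hypothetical stand-in for the Math-Shepherd scorer, and minimum-over-steps is one common aggregation rule, not necessarily the paper's exact choice.

```python
# Hedged sketch of PRM-based best-of-N reranking. `prm_score_steps` is a
# hypothetical step scorer returning one score in [0, 1] per step; the
# min aggregation is an assumption, not confirmed as the paper's rule.
from typing import Callable, List

def rerank_best_of_n(
    candidates: List[List[str]],  # N candidate solutions, each a list of steps
    prm_score_steps: Callable[[List[str]], List[float]],
) -> List[str]:
    """Return the candidate whose weakest step the PRM trusts most."""
    def solution_score(steps: List[str]) -> float:
        scores = prm_score_steps(steps)
        # A chain of reasoning is only as strong as its weakest step.
        return min(scores) if scores else 0.0
    return max(candidates, key=solution_score)
```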
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Math-Shepherd, a process reward model for step-level supervision in mathematical reasoning. It is trained solely on automatically constructed process-wise labels derived from sampling multiple solution trajectories per problem and back-propagating final-answer correctness, without human annotations. The model is applied in two settings: verification via reranking of LLM outputs and reinforcement learning via step-by-step PPO. Reported results include accuracy gains for Mistral-7B from 77.9% to 84.1% on GSM8K and 28.6% to 33.0% on MATH via PPO, with further lifts to 89.1% and 43.5% when combined with verification.
Significance. If the automatic labeling procedure reliably captures step correctness, the work would be significant for scaling process supervision in LLMs by removing the annotation bottleneck. The concrete benchmark lifts on GSM8K and MATH, achieved with open-source models and reproducible PPO training, would demonstrate practical value for both verification and RL pipelines in mathematical reasoning.
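The second scenario, step-by-step PPO, replaces the single terminal reward of outcome-supervised RL with per-step credit. A hedged sketch of that reward shaping follows; `step_end_indices` and `prm_step_scores` are illustrative assumptions about the interface, not the paper's exact implementation.

```python
# Hedged sketch of step-wise reward assignment for PPO: instead of one
# terminal reward from final-answer matching, each reasoning step receives
# the PRM's score at its boundary token; all other tokens get zero.
from typing import List

def stepwise_rewards(
    num_tokens: int,
    step_end_indices: List[int],   # token index where each step ends (assumed)
    prm_step_scores: List[float],  # PRM score per step, aligned with indices
) -> List[float]:
    rewards = [0.0] * num_tokens
    for idx, score in zip(step_end_indices, prm_step_scores):
        rewards[idx] = score  # dense, step-level credit instead of one outcome reward
    return rewards
```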
major comments (2)
- [§3] Process-wise supervision construction: the label assignment procedure samples trajectories and assigns step rewards solely from final-answer match to ground truth. This is vulnerable to systematic noise: correct early steps followed by later errors receive negative labels, and incorrect steps compensated later receive positive labels (a sketch of this labeling and its failure modes follows this list). No quantitative validation of label accuracy against human step-level annotations is reported, which directly undermines the central claim that gains arise from genuine process supervision rather than improved outcome filtering.
- [Experimental results] Experimental setup and results: no analysis or controls are described for potential data leakage between the automatically generated training trajectories and the GSM8K/MATH test sets, nor for error rates in the auto-labeling pipeline. These omissions are load-bearing: without such checks, the reported PPO improvements (e.g., the Mistral-7B GSM8K lift) cannot be confidently attributed to step-level credit assignment.
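The labeling procedure attacked in the first major comment can be made concrete. Below is a minimal sketch of outcome back-propagation, where every step inherits the trajectory's final-answer label; both noise modes the referee names fall directly out of the `extend` line.

```python
# Hedged sketch of the outcome-back-propagation labeling the referee
# critiques: every step in a trajectory inherits the final-answer label.
# A correct early step in a failing trajectory is mislabeled negative;
# a wrong step repaired later is mislabeled positive.
from typing import List, Tuple

def label_steps_by_outcome(
    trajectories: List[Tuple[List[str], str]],  # (steps, final_answer) pairs
    ground_truth: str,
) -> List[Tuple[str, int]]:
    labeled = []
    for steps, answer in trajectories:
        outcome = 1 if answer == ground_truth else 0
        # Noise source: all steps share the trajectory's outcome label.
        labeled.extend((step, outcome) for step in steps)
    return labeled
```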
minor comments (2)
- [Abstract] Abstract and §4: results are detailed only for Mistral-7B while the text refers to 'a series of open-source LLMs'; listing the full set of evaluated models and their individual gains would improve completeness.
- [§3] Notation in §3: the precise definition of the step reward (binary vs. continuous) and how it is aggregated across sampled trajectories should be stated explicitly to allow reproduction (two candidate definitions are sketched below).
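For the notation comment, the two natural step-reward definitions are a binary (hard) label, marking a step correct if any sampled continuation from it reaches the right answer, and a continuous (soft) label equal to the fraction that do. A hedged sketch of both follows; which one the paper actually uses is precisely what the referee asks it to state.

```python
# Hedged sketch of two candidate step-reward definitions, aggregated over
# N sampled continuations from a given step prefix. `continuation_correct`
# is an illustrative list of final-answer checks; assumes N >= 1.
from typing import List

def hard_step_label(continuation_correct: List[bool]) -> int:
    # Binary: the step is "correct" if any continuation recovers the answer.
    return int(any(continuation_correct))

def soft_step_label(continuation_correct: List[bool]) -> float:
    # Continuous: the empirical success rate over sampled continuations.
    return sum(continuation_correct) / len(continuation_correct)
```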
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our work. We address each major comment below and commit to revisions that strengthen the presentation of our automatic labeling approach and experimental controls.
read point-by-point responses
- Referee: [§3] Process-wise supervision construction: the label assignment procedure samples trajectories and assigns step rewards solely from final-answer match to ground truth. This is vulnerable to systematic noise (correct early steps followed by later errors receive negative labels; incorrect steps compensated later receive positive labels). No quantitative validation of label accuracy against human step-level annotations is reported, which directly undermines the central claim that gains arise from genuine process supervision rather than improved outcome filtering.
Authors: We acknowledge that propagating final-answer correctness to individual steps can introduce label noise, as early correct steps in failing trajectories receive negative labels and later compensating errors in successful trajectories receive positive labels. This is an inherent trade-off of our fully automatic method that avoids human annotations. We maintain that the resulting process reward model still provides net-positive step-level signals on average, as shown by consistent gains in both the verification (reranking) and RL (PPO) settings. In revision we will add an explicit limitations subsection discussing this noise source and will include a small-scale human validation study on a random sample of 200 steps to report estimated label accuracy. (revision: partial)
- Referee: [Experimental results] Experimental setup and results: no analysis or controls are described for potential data leakage between the automatically generated training trajectories and the GSM8K/MATH test sets, nor for error rates in the auto-labeling pipeline. These omissions are load-bearing: without such checks, the reported PPO improvements (e.g., the Mistral-7B GSM8K lift) cannot be confidently attributed to step-level credit assignment.
Authors: All training trajectories are generated exclusively from the official training splits of GSM8K and MATH; the test sets are held out entirely. We will add an explicit statement and table confirming this separation in the revised experimental setup. For auto-labeling error rates we will add an analysis that measures label consistency across multiple independent trajectory samples per problem and reports the fraction of steps whose label flips when a different successful or unsuccessful trajectory is chosen. These additions will allow readers to assess the reliability of the step-level credit assignment. (revision: partial)
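The consistency analysis the authors propose reduces to a flip-rate computation. A minimal sketch, assuming the labeling pipeline is re-run with independent trajectory samples and produces binary labels for a fixed set of steps:

```python
# Hedged sketch of the proposed auto-label consistency check: re-run the
# labeling pipeline k times with independent trajectory samples and report
# the fraction of steps whose binary label is not unanimous across runs.
from typing import List

def label_flip_rate(labels_per_run: List[List[int]]) -> float:
    """labels_per_run[r][s] is the label of step s in resampling run r.

    Assumes at least one run and equal-length label lists across runs.
    """
    num_steps = len(labels_per_run[0])
    flipped = sum(
        1 for s in range(num_steps)
        if len({run[s] for run in labels_per_run}) > 1  # label disagrees somewhere
    )
    return flipped / num_steps
```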
Circularity Check
No significant circularity; empirical gains on external benchmarks
full rationale
The paper's central claims rest on measured accuracy improvements for Mistral-7B and other LLMs on the fixed external benchmarks GSM8K and MATH. The automatic construction of process-wise labels (via sampling trajectories and final-answer matching to ground truth) is an input to training the reward model; the subsequent PPO and verification steps are evaluated against those same independent benchmarks rather than against quantities defined from the fitted model itself. No equation or derivation reduces a claimed prediction to a fitted parameter by construction, and no load-bearing self-citation chain is required for the reported results. The central claims therefore remain exposed to external falsification rather than resting on self-referential quantities.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Automatically generated process-wise supervision data accurately distinguishes correct from incorrect reasoning steps.
Forward citations
Cited by 22 Pith papers
- MedPRMBench: A Fine-grained Benchmark for Process Reward Models in Medical Reasoning
  MedPRMBench is the first fine-grained benchmark for process reward models in medical reasoning, featuring 6500 questions, 13000 chains, 113910 step labels, and a baseline that improves downstream QA accuracy by 3.2-6....
- POSTCONDBENCH: Benchmarking Correctness and Completeness in Formal Postcondition Inference
  POSTCONDBENCH is a new multilingual benchmark that evaluates LLM postcondition generation on real code using defect discrimination to assess completeness beyond surface matching.
- Fine-Tuning Small Reasoning Models for Quantum Field Theory
  Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.
- Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
  This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
- Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models
  Omni-MATH supplies 4428 human-verified Olympiad math problems that expose top LLMs achieving only 52.55% to 60.54% accuracy on the most difficult items.
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
  DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
- CROP: Expert-Aligned Image Cropping via Compositional Reasoning and Optimizing Preference
  CROP uses compositional reasoning and expert preference alignment in VLMs to produce aesthetic crops that match human experts more closely than previous methods.
- Confidence-Aware Alignment Makes Reasoning LLMs More Reliable
  CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and sp...
- Process Supervision of Confidence Margin for Calibrated LLM Reasoning
  RLCM trains LLMs with a margin-enhanced process reward that widens the gap between correct and incorrect reasoning steps, improving calibration on math, code, logic, and science tasks without hurting accuracy.
- Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning
  A new RL paradigm for reasoning where models generate their own internal process supervision from outcome feedback by recycling failed trajectories.
- SeLaR: Selective Latent Reasoning in Large Language Models
  SeLaR selectively applies latent soft reasoning in LLMs via entropy gating and contrastive regularization, outperforming standard CoT on five benchmarks without training.
- InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
  InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
- InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
  InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
- Process Reinforcement through Implicit Rewards
  PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 1...
- Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
  Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.
- Improve Mathematical Reasoning in Language Models by Automated Process Supervision
  OmegaPRM automates collection of 1.5 million process supervision labels via binary-search MCTS, raising Gemini Pro math accuracy from 51% to 69.4% on MATH500 and Gemma2 27B from 42.3% to 58.2%.
- ReMedi: Reasoner for Medical Clinical Prediction
  ReMedi boosts LLM performance on EHR clinical predictions by up to 19.9% F1 through ground-truth-guided rationale regeneration and fine-tuning.
- Prioritizing the Best: Incentivizing Reliable Multimodal Reasoning by Rewarding Beyond Answer Correctness
  Groupwise Ranking Reward reduces reasoning-answer inconsistency in multimodal models and raises reliability-conditioned accuracy from 47.4% to 54.7% over standard RLVR.
- Placing Puzzle Pieces Where They Matter: A Question Augmentation Framework for Reinforcement Learning
  PieceHint strategically scores and injects critical reasoning hints in RL training to let a 1.5B model match 32B baselines on math benchmarks while preserving pass@k diversity.
- A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence
  The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.
- DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
  DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.
- Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models