Solving math word problems with process- and outcome-based feedback
Pith reviewed 2026-05-24 11:07 UTC · model grok-4.3
The pith
Outcome supervision matches process supervision on final-answer accuracy for math word problems while using fewer labels, but cutting reasoning errors requires process supervision or a reward model that emulates it.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Pure outcome-based supervision produces final-answer error rates similar to those of process-based supervision, with less label supervision. For correct reasoning steps, however, process-based supervision or supervision from learned reward models that emulate process-based feedback is necessary. Taken together, the paper's methods improve the previous best results from 16.8 percent to 12.7 percent final-answer error and from 14.0 percent to 3.4 percent reasoning error among final-answer-correct solutions.
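To put the two headline improvements on a common scale, the short sketch below converts each before/after pair into a relative error reduction. The four percentages come from the claim above; the helper name `relative_reduction` is an illustrative choice, not something the paper defines.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original error rate that was removed."""
    return (before - after) / before


# Error rates (in percent) as quoted in the core claim.
final_answer = relative_reduction(16.8, 12.7)
reasoning = relative_reduction(14.0, 3.4)

print(f"final-answer error: {final_answer:.1%} relative reduction")
print(f"reasoning error:    {reasoning:.1%} relative reduction")
```

Read this way, the reasoning-error gain (roughly three quarters of the errors removed) is much larger than the final-answer gain (roughly one quarter), which is consistent with the claim that the supervision type matters most for reasoning quality.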
What carries the argument
The distinction between outcome-based supervision (final answer only) and process-based supervision (each reasoning step) on GSM8K, including reward models trained to emulate process labels.
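The difference between the two label types can be illustrated with a toy data structure. This is a minimal sketch, assuming one boolean label per reasoning step; the class name, field names, and the example problem are all hypothetical and are not taken from the paper or from GSM8K.

```python
from dataclasses import dataclass


@dataclass
class LabeledSolution:
    steps: list[str]          # model-written reasoning steps
    final_answer: str         # answer extracted from the last step
    reference_answer: str     # ground-truth answer from the dataset
    step_labels: list[bool]   # assumed: one human correctness label per step

    def outcome_label(self) -> bool:
        """Outcome supervision sees only whether the final answer matches."""
        return self.final_answer.strip() == self.reference_answer.strip()

    def has_reasoning_error(self) -> bool:
        """Process supervision can flag a flawed step even if the answer is right."""
        return not all(self.step_labels)


# Made-up problem: 3 boxes of 4 apples, 2 eaten, how many remain? (answer: 10)
solution = LabeledSolution(
    steps=[
        "Each box holds 4 apples.",
        "3 * 4 = 14 apples in total.",   # arithmetic slip
        "14 - 2 = 10 apples remain.",    # second slip that cancels the first
    ],
    final_answer="10",
    reference_answer="10",
    step_labels=[True, False, False],
)

print(solution.outcome_label())        # outcome label
print(solution.has_reasoning_error())  # process-level verdict
```

The example is chosen so that the outcome label accepts the solution while the step labels reject it, which is the case the paper's reasoning-error metric is meant to catch.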
If this is right
- Outcome supervision can deliver comparable final accuracy at lower labeling cost.
- Process supervision or its reward-model proxy is required to minimize reasoning mistakes even when the answer is correct.
- Learned reward models can serve as a practical substitute for full process annotations.
- The resulting error rates set a new benchmark on GSM8K for both final answers and reasoning quality.
Where Pith is reading between the lines
- The same supervision distinction may matter for other step-by-step reasoning domains such as code generation.
- Reward models could let process supervision scale without a matching increase in human step-by-step labels.
- In tutoring applications, reduced reasoning error would lower the chance of correct answers reached by flawed logic.
Load-bearing premise
The human process annotations accurately identify correct reasoning and the learned reward models do not introduce new error modes when used for supervision.
What would settle it
A replication experiment in which models trained with process supervision or reward models show the same reasoning-error rate as pure outcome models, or in which the reward models increase errors beyond direct process labels.
Original abstract
Recent work has shown that asking language models to generate reasoning steps improves performance on many reasoning tasks. When moving beyond prompting, this raises the question of how we should supervise such models: outcome-based approaches which supervise the final result, or process-based approaches which supervise the reasoning process itself? Differences between these approaches might naturally be expected not just in final-answer errors but also in reasoning errors, which can be difficult to detect and are problematic in many real-world domains such as education. We run the first comprehensive comparison between process- and outcome-based approaches trained on a natural language task, GSM8K. We find that pure outcome-based supervision produces similar final-answer error rates with less label supervision. However, for correct reasoning steps we find it necessary to use process-based supervision or supervision from learned reward models that emulate process-based feedback. In total, we improve the previous best results from 16.8% $\to$ 12.7% final-answer error and 14.0% $\to$ 3.4% reasoning error among final-answer-correct solutions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts the first comprehensive empirical comparison of process-based versus outcome-based supervision for training language models to solve math word problems on the GSM8K dataset. It claims that pure outcome supervision achieves comparable final-answer accuracy with reduced labeling cost, but that process supervision (or reward models trained to emulate it) is required to achieve low rates of reasoning errors even among correct final answers, yielding new state-of-the-art figures of 12.7% final-answer error and 3.4% reasoning error.
Significance. If the results are robust, the work supplies concrete evidence that process-level feedback is necessary for high-quality reasoning chains in arithmetic tasks and that learned reward models can serve as scalable proxies for human process annotations. The reported gains over prior best results (16.8% → 12.7% final error; 14.0% → 3.4% reasoning error) would be practically relevant for domains such as education where both answer correctness and reasoning transparency matter.
major comments (2)
- [Abstract / Results] Abstract and reported results: headline metrics are given without error bars, data-split details, training hyperparameters, or statistical significance tests. Because the central claim rests on the differential performance between outcome and process regimes, the absence of these elements prevents assessment of whether the observed improvements are reliable.
- [Process annotation / Reward model evaluation] Process annotation and reward-model sections: the claim that process supervision (or its reward-model emulation) is required to drive reasoning error from 14.0% to 3.4% depends on the assumption that the human process labels are a faithful proxy for correctness and that the learned reward model generalizes beyond the annotation distribution. Only a small-pilot inter-annotator agreement figure is mentioned, and no out-of-distribution probe (e.g., adversarial or model-generated incorrect chains) is described.
minor comments (2)
- Define precisely how 'reasoning error among final-answer-correct solutions' is operationalized and measured, including any inter-annotator protocol used at scale.
- Clarify the exact amount of label supervision used in each regime so that the statement 'with less label supervision' can be quantified.
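One plausible way to operationalize the two metrics the referee asks about is sketched below. It assumes each evaluated solution carries two boolean judgments, `answer_correct` and `reasoning_correct`; these field names and the toy records are hypothetical, and the paper's actual annotation protocol may differ.

```python
def final_answer_error_rate(records: list[dict]) -> float:
    """Share of all solutions whose final answer is wrong."""
    return sum(not r["answer_correct"] for r in records) / len(records)


def reasoning_error_rate(records: list[dict]) -> float:
    """Share of final-answer-correct solutions whose reasoning is flawed."""
    correct = [r for r in records if r["answer_correct"]]
    if not correct:
        raise ValueError("no final-answer-correct solutions to evaluate")
    return sum(not r["reasoning_correct"] for r in correct) / len(correct)


# Toy records, invented for illustration only.
records = [
    {"answer_correct": True, "reasoning_correct": True},
    {"answer_correct": True, "reasoning_correct": True},
    {"answer_correct": True, "reasoning_correct": False},
    {"answer_correct": True, "reasoning_correct": True},
    {"answer_correct": False, "reasoning_correct": False},
]

print(final_answer_error_rate(records))  # 1 of 5 answers wrong
print(reasoning_error_rate(records))     # 1 of 4 correct answers has flawed reasoning
```

The key design point is the denominator: the reasoning error rate is conditioned on a correct final answer, so it is not directly comparable to the final-answer error rate computed over all solutions.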
Simulated Author's Rebuttal
We thank the referee for their insightful comments on our work comparing process- and outcome-based supervision for math word problems. We address each of the major comments below and outline the revisions we will make to strengthen the manuscript.
Point-by-point responses
-
Referee: [Abstract / Results] Abstract and reported results: headline metrics are given without error bars, data-split details, training hyperparameters, or statistical significance tests. Because the central claim rests on the differential performance between outcome and process regimes, the absence of these elements prevents assessment of whether the observed improvements are reliable.
Authors: We agree with the referee that including error bars, data-split details, training hyperparameters, and statistical significance tests is important for assessing the reliability of our results, particularly given the central comparison between supervision regimes. In the revised version of the paper, we will add error bars from multiple training runs with different random seeds, provide explicit details on the data splits used for training and evaluation, include a comprehensive list of training hyperparameters, and report statistical significance tests (such as bootstrap confidence intervals or t-tests) for the key performance differences. These additions will directly address the concern about the robustness of the observed improvements.
Revision: yes
-
Referee: [Process annotation / Reward model evaluation] Process annotation and reward-model sections: the claim that process supervision (or its reward-model emulation) is required to drive reasoning error from 14.0% to 3.4% depends on the assumption that the human process labels are a faithful proxy for correctness and that the learned reward model generalizes beyond the annotation distribution. Only a small-pilot inter-annotator agreement figure is mentioned, and no out-of-distribution probe (e.g., adversarial or model-generated incorrect chains) is described.
Authors: We acknowledge that the validity of our claims hinges on the quality and generalizability of the human process annotations. The manuscript does report a small-pilot inter-annotator agreement figure, which we will expand with additional details on the annotation guidelines and agreement statistics in the revised version. To address the generalization concern for the reward model, we will include new experiments with out-of-distribution probes, such as evaluating the reward model on adversarial examples and model-generated incorrect reasoning chains. This will provide stronger evidence that the reward model serves as a reliable proxy for process-based feedback beyond the original annotation distribution.
Revision: yes
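The bootstrap confidence intervals mentioned in the first response could take roughly the following shape. This is a sketch under stated assumptions, not the authors' procedure: it assumes both models are evaluated on the same test problems with a 0/1 error indicator per problem, uses a paired percentile bootstrap, and runs on synthetic data invented for the example. The function name `bootstrap_diff_ci` and its parameters are hypothetical.

```python
import random


def bootstrap_diff_ci(errors_a, errors_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(errors_a) - mean(errors_b), paired by problem."""
    if len(errors_a) != len(errors_b):
        raise ValueError("paired bootstrap needs one entry per problem for each model")
    rng = random.Random(seed)
    n = len(errors_a)
    diffs = []
    for _ in range(n_resamples):
        # Resample problems with replacement, keeping each pair together.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(errors_a[i] - errors_b[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Synthetic indicators (1 = error), NOT results from the paper:
# model A errs on 30 of 200 problems, model B on 10 of the same 200.
errors_a = [1] * 30 + [0] * 170
errors_b = [1] * 10 + [0] * 190

lower, upper = bootstrap_diff_ci(errors_a, errors_b)
print(f"95% CI for error-rate difference: [{lower:.3f}, {upper:.3f}]")
```

If the interval excludes zero, the difference between regimes is unlikely to be resampling noise on this test set; it says nothing about variance across training seeds, which the response proposes to address separately with multiple runs.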
Circularity Check
Empirical comparison of supervision regimes shows no circular derivation
full rationale
The paper reports experimental results from training and evaluating language models on GSM8K under outcome-based versus process-based supervision (and reward-model variants). All headline metrics—final-answer error rates and reasoning-step error rates—are obtained by direct measurement against held-out ground truth, with no equations, fitted parameters, or self-citations that reduce the reported improvements to the experimental inputs by construction. The central claims rest on observable performance differences rather than any self-definitional or load-bearing self-referential step.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: GSM8K constitutes a representative natural-language reasoning task for comparing supervision methods.
- Domain assumption: Human process annotations accurately reflect correct reasoning steps.
Forward citations
Cited by 60 Pith papers
-
The Last Word Often Wins: A Format Confound in Chain-of-Thought Corruption Studies
Corruption studies of CoT faithfulness largely measure explicit answer placement in prompt format rather than computational importance of reasoning steps.
-
MedPRMBench: A Fine-grained Benchmark for Process Reward Models in Medical Reasoning
MedPRMBench is the first fine-grained benchmark for process reward models in medical reasoning, featuring 6500 questions, 13000 chains, 113910 step labels, and a baseline that improves downstream QA accuracy by 3.2-6....
-
EDGE-OPD: Internalizing Privileged Context with Evidence Guided On-Policy Distillation
EDGE-OPD adds guided rollouts and evidence masking to on-policy self-distillation, enabling successful learning of target identities where standard OPSD and RLSD fail.
-
Argus: Evidence Assembly for Scalable Deep Research Agents
Argus coordinates a Navigator and multiple Searchers via an evidence graph to assemble complete, source-traced answers, yielding benchmark gains up to 12.7 points with 8 parallel agents and 86.2 on BrowseComp with 64 agents.
-
AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation
10.7% of passing SWE-agent trajectories are Lucky Passes with chaotic behaviors, and a quality score based on process references changes model rankings across eight backends.
-
Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why
Distillation signals align better with ideal updates on incorrect student rollouts than correct ones, with optimal teacher context depending on student capacity and task.
-
The Last Word Often Wins: A Format Confound in Chain-of-Thought Corruption Studies
Corruption studies on CoT chains detect the position of explicit answer statements rather than computational steps, as evidenced by format ablations collapsing suffix sensitivity 19x and models following conflicting a...
-
Unsupervised Process Reward Models
Unsupervised PRMs derived from LLM probabilities achieve up to 15% better error detection than LLM judges and match supervised PRMs in verification and RL tasks.
-
Distributional Process Reward Models: Calibrated Prediction of Future Rewards via Conditional Optimal Transport
Conditional optimal transport calibrates PRMs by learning monotonic conditional quantile functions over success probabilities conditioned on hidden states, yielding improved calibration and downstream Best-of-N perfor...
-
Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost
Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.
-
Logic-Regularized Verifier Elicits Reasoning from LLMs
LOVER creates an unsupervised logic-regularized verifier that reaches 95% of supervised verifier performance on reasoning tasks across 10 datasets.
-
Maximizing Rollout Informativeness under a Fixed Budget: A Submodular View of Tree Search for Tool-Use Agentic Reinforcement Learning
InfoTree casts intermediate state selection in tree search as monotone submodular maximization under fixed rollout budgets, yielding closed-form UUCB terms and lifting mixed-outcome ratios while outperforming flat GRP...
-
Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards
TraceLift trains reasoning planners with executor-grounded rewards that multiply a rubric-based reasoning quality score by measured uplift on a frozen executor, outperforming execution-only training on math and code b...
-
Generalized Distributional Alignment Games for Unbiased Answer-Level Fine-Tuning
Generalized Bregman alignment games plus U-statistics and optimal minimax polynomial estimators remove Jensen bias and achieve optimal statistical rates for unbiased answer-level fine-tuning.
-
Decoding-Time Debiasing via Process Reward Models: From Controlled Fill-in to Open-Ended Generation
Decoding-time use of process reward models for bias mitigation raises fairness scores by up to 0.40 on a bilingual benchmark while preserving fluency across four LLMs and extends to open-ended generation with low overhead.
-
AgentEval: DAG-Structured Step-Level Evaluation for Agentic Workflows with Error Propagation Tracking
AgentEval evaluates agentic workflows via DAGs with step metrics, a 21-category failure taxonomy, and error propagation tracking, yielding 2.17x higher failure recall than end-to-end methods and strong human agreement.
-
Fine-Tuning Small Reasoning Models for Quantum Field Theory
Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.
-
Navigating the Conceptual Multiverse
The conceptual multiverse system with a verification framework for decision structures helps users in philosophy, AI alignment, and poetry build clearer working maps of open-ended problems by making implicit LLM choic...
-
Does RL Expand the Capability Boundary of LLM Agents? A PASS@(k,T) Analysis
RL expands the capability boundary of LLM agents on compositional tool-use tasks, shown by non-converging pass curves at large k with increasing T, while SFT regresses it and the effect is absent on simpler tasks.
-
AI Achieves a Perfect LSAT Score
Language models achieve a perfect LSAT score, with experiments showing that internal thinking phases and a fine-tuned process reward model are key to high performance on logical reasoning questions.
-
Structural Evaluation Metrics for SVG Generation via Leave-One-Out Analysis
Element-level leave-one-out analysis yields per-element quality scores and four structural metrics (purity, coverage, compactness, locality) that quantify SVG modularity and enable artifact detection.
-
Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
-
Sampling for Quality: Training-Free Reward-Guided LLM Decoding via Sequential Monte Carlo
Sequential Monte Carlo sampling from a reward-augmented sequence distribution improves LLM performance on HumanEval by up to 54.9% and MATH500 by up to 8.8%, outperforming standard sampling and GRPO.
-
WMF-AM: Probing LLM Working Memory via Depth-Parameterized Cumulative State Tracking
WMF-AM is a depth-parameterized benchmark that measures LLMs' cumulative state tracking ability without scratchpads, validated on 28 models across arithmetic and non-arithmetic tasks with ablations confirming the construct.
-
ToolPRM: Fine-Grained Inference Scaling of Structured Outputs for Function Calling
ToolPRM provides fine-grained intra-call process supervision via a new dataset and reward model, outperforming outcome and coarse-grained alternatives on function-calling benchmarks.
-
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation
Chain-of-thought monitoring detects reward hacking in frontier reasoning models, but strong optimization against the monitor produces obfuscated misbehavior that remains hard to detect.
-
Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement
Seg-Zero uses cognitive reinforcement learning on a decoupled reasoning-plus-segmentation architecture to produce explicit reasoning chains and reach 57.5 zero-shot accuracy on ReasonSeg, beating prior supervised LISA...
-
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs
o1-like models overthink easy tasks; self-training reduces compute use without accuracy loss on GSM8K, MATH500, GPQA, and AIME.
-
GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
LLMs display high variance and major accuracy drops on GSM-Symbolic variants of grade-school math problems, indicating they replicate training patterns rather than execute logical reasoning.
-
Let's Verify Step by Step
Process supervision significantly outperforms outcome supervision for training models on the MATH dataset, achieving 78% accuracy on a representative test subset with active learning and a released 800k step-label dataset.
-
Tailoring Teaching to Aptitude: Direction-Adaptive Self-Distillation for LLM Reasoning
DASD improves math reasoning in LLMs by adaptively directing self-distillation based on per-token entropy to balance exploration and step accuracy, outperforming prior self-distillation and RLVR baselines on six benchmarks.
-
Manifold-Guided Attention Steering
MAGS learns low-dimensional subspaces from correct versus incorrect reasoning traces and applies targeted projection corrections to attention heads when they deviate from the correctness manifold during inference.
-
Rewarding Beliefs, Not Actions: Consistency-Guided Credit Assignment for Long-Horizon Agents
ReBel uses belief-consistency supervision and belief-aware grouping to improve credit assignment in long-horizon RL for LLM agents, achieving up to 20.4 percentage points higher success and 2.1x better sample efficien...
-
Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution
SCA framework applies Information Bottleneck to assign step-level confidence in black-box LLM reasoning traces, flagging errors and boosting self-correction success by up to 13.5% on math and QA tasks.
-
SAPO: Step-Aligned Policy Optimization for Reasoning-Based Generative Recommendation
SAPO computes per-reasoning-step group-relative advantages in RL to improve credit assignment for structured generation of semantic identifiers in recommendation systems.
-
Argus: Evidence Assembly for Scalable Deep Research Agents
Argus coordinates a Navigator and multiple Searchers via an evidence graph for deep research, reporting average gains of 5.5 points with one Searcher and 12.7 points with eight parallel Searchers across eight benchmar...
-
Process Rewards with Learned Reliability
BetaPRM learns distributional step rewards with explicit reliability via Beta-Binomial modeling, enabling ACA that cuts token use by up to 33.57% while raising final-answer accuracy on reasoning benchmarks.
-
LEAP: Trajectory-Level Evaluation of LLMs in Iterative Scientific Design
LEAPBench shows trajectory scoring changes best-model rankings on 53% of tasks, LLMs do not beat Bayesian optimization, and domain-aware prompting underperforms domain-agnostic on biology tasks aligned with published ...
-
Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards
CIPO jointly optimizes standard RLVR rewards with correction samples derived from the model's own failed attempts, yielding better reasoning and self-correction on math and code benchmarks.
-
LLMs Know When They Know, but Do Not Act on It: A Metacognitive Harness for Test-time Scaling
A metacognitive harness uses LLMs' pre- and post-solution self-monitoring signals to control test-time reasoning, raising pooled accuracy from 48.3% to 56.9% on text, code, and multimodal benchmarks.
-
Distribution Corrected Offline Data Distillation for Large Language Models
A distribution-correction framework for offline LLM reasoning distillation improves accuracy on math benchmarks by adaptively aligning teacher supervision with the student's inference-time distribution.
-
STRIDE: Learnable Stepwise Language Feedback for LLM Reasoning
STRIDE co-trains generator and verifier on outcome rewards alone to deliver learnable stepwise language feedback that redirects LLM reasoning trajectories and outperforms scalar-reward baselines.
-
When Reasoning Traces Become Performative: Step-Level Evidence that Chain-of-Thought Is an Imperfect Oversight Channel
CoT traces align with internal answer commitment in only 61.9% of steps on average, dominated by confabulated continuations after commitment has stabilized.
-
Verifiable Process Rewards for Agentic Reasoning
Verifiable Process Rewards (VPR) converts symbolic oracles into dense turn-level supervision for reinforcement learning in agentic reasoning, outperforming outcome-only rewards and transferring to general benchmarks.
-
Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning
Rubric-grounded RL with LLM judges on document-derived criteria raises Llama-3.1-8B normalized reward to 71.7% on held-out rubrics and improves performance on GSM8K, MATH, and GPQA benchmarks.
-
Distributional Process Reward Models: Calibrated Prediction of Future Rewards via Conditional Optimal Transport
Conditional optimal transport is used to turn raw PRM outputs into monotonic quantile functions that improve calibration and downstream Best-of-N performance on MATH-500 and AIME.
-
RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization
RLearner-LLM's Hybrid-DPO fuses DeBERTa NLI and LLM verifier scores to deliver up to 6x higher NLI entailment than standard SFT while preserving answer coverage across academic domains.
-
RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization
RLearner-LLM achieves up to 6x gains in NLI entailment over standard fine-tuning by using an automated hybrid DPO pipeline that balances logic and fluency across multiple model sizes and domains.
-
Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards
TraceLift trains reasoning planners using rewards that credit traces for both rubric quality and actual performance gains on a frozen executor, outperforming final-answer-only training on math and code tasks.
-
DGPO: Distribution Guided Policy Optimization for Fine Grained Credit Assignment
DGPO reinterprets distribution deviation as a guiding signal in a critic-free policy optimization framework to enable fine-grained credit assignment for LLM chain-of-thought reasoning.
-
DGPO: Distribution Guided Policy Optimization for Fine Grained Credit Assignment
DGPO is a critic-free RL framework that uses bounded Hellinger distance and entropy-gated advantage redistribution to enable fine-grained token-level credit assignment in long CoT generations for LLM alignment, report...
-
Controllable and Verifiable Process Data Synthesis for Process Reward Models
A controllable synthesis method creates prefix-invalid yet trajectory-consistent process supervision data for training and evaluating process reward models by injecting verifiable errors into symbolic reasoning chains.
-
Distilling Long-CoT Reasoning through Collaborative Step-wise Multi-Teacher Decoding
CoRD uses collaborative multi-teacher step-wise decoding with perplexity-guided beam search to generate higher-quality Long-CoT data that lets smaller models reach near-teacher performance with less supervision.
-
Adaptive Test-Time Compute Allocation with Evolving In-Context Demonstrations
An adaptive test-time framework uses a warm-up phase on the test set to build evolving in-context examples, then concentrates compute on unresolved queries to outperform static baselines on math, coding, and reasoning...
-
TPS-CalcBench: A Benchmark and Diagnostic Evaluation Framework for LLM Analytical Calculation Competence in Hypersonic Thermal Protection System Engineering
TPS-CalcBench is a new benchmark and evaluation framework that tests LLMs on analytical calculations in hypersonic aerodynamics and gas dynamics, using dual-track scoring and interventions to detect physically invalid...
-
Process Reward Models Meet Planning: Generating Precise and Scalable Datasets for Step-Level Rewards
PDDL planning problems are used to generate about one million precise reasoning steps for training Process Reward Models, and adding this data to existing datasets improves LLM performance on both mathematical and non...
-
Adaptive Test-Time Compute Allocation for Reasoning LLMs via Constrained Policy Optimization
A Lagrangian-relaxation plus imitation-learning pipeline adaptively allocates test-time compute to LLMs, outperforming uniform baselines by up to 12.8% relative accuracy on MATH while staying within a fixed average budget.
-
The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping
MEDS improves LLM RL performance by up to 4.13 pass@1 and 4.37 pass@128 points by dynamically penalizing rollouts matching prevalent historical error clusters identified via memory-stored representations and density c...
-
PRISM-MCTS: Learning from Reasoning Trajectories with Metacognitive Reflection
PRISM-MCTS improves MCTS-based reasoning efficiency by maintaining a shared memory of heuristics and fallacies reinforced by a process reward model, halving required trajectories on GPQA while outperforming prior methods.
-
Relative Density Ratio Optimization for Stable and Statistically Consistent Model Alignment
Relative density ratio optimization stabilizes direct density ratio estimation for language model alignment while preserving statistical consistency without assuming a Bradley-Terry preference model.
Reference graph
Works this paper leans on
-
[1]
Maximum a Posteriori Policy Optimisation
A. Abdolmaleki, J. T. Springenberg, Y. Tassa, R. Munos, N. Heess, and M. Riedmiller. Maximum a posteriori policy optimisation.arXiv preprint arXiv:1806.06920,
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms
A. Amini, S. Gabriel, P. Lin, R. Koncel-Kedziorski, Y. Choi, and H. Hajishirzi. Mathqa: Towards interpretable math word problem solving with operation-based formalisms. arXiv preprint arXiv:1905.13319,
work page internal anchor Pith review Pith/arXiv arXiv 1905
-
[3]
Concrete Problems in AI Safety
D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané. Concrete problems in ai safety. arXiv preprint arXiv:1606.06565,
work page internal anchor Pith review Pith/arXiv arXiv
- [4]
-
[5]
J. Cai, R. Shin, and D. Song. Making neural programming architectures generalize via recursion. arXiv preprint arXiv:1704.06611,
work page internal anchor Pith review Pith/arXiv arXiv
-
[7]
URLhttps://arxiv.org/abs/2107.03374. A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, et al. Palm: Scaling language modeling with pathways.arXiv preprint arXiv:2204.02311,
work page internal anchor Pith review Pith/arXiv arXiv
-
[8]
Supervising strong learners by amplifying weak experts
P. Christiano, B. Shlegeris, and D. Amodei. Supervising strong learners by amplifying weak experts. arXiv preprint arXiv:1810.08575,
work page internal anchor Pith review Pith/arXiv arXiv
-
[9]
Training Verifiers to Solve Math Word Problems
K. Cobbe, V. Kosaraju, M. Bavarian, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168,
work page internal anchor Pith review Pith/arXiv arXiv
-
[10]
16 Solving math word problems with process- and outcome-based feedback A
URLhttps://www.alignmentforum.org/posts/pRkFkzwKZ2zfa3R6H/ without-specific-countermeasures-the-easiest-path-to . 16 Solving math word problems with process- and outcome-based feedback A. Creswell, M. Shanahan, and I. Higgins. Selection-inference: Exploiting large language models for interpretable logical reasoning.arXiv preprint arXiv:2205.09712,
- [11]
-
[12]
M. Dehghani, S. Gouws, O. Vinyals, J. Uszkoreit, and Ł. Kaiser. Universal transformers.arXiv preprint arXiv:1807.03819,
work page internal anchor Pith review Pith/arXiv arXiv
- [13]
-
[14]
Reinforcement Learning with a Corrupted Reward Channel
T. Everitt, V. Krakovna, L. Orseau, M. Hutter, and S. Legg. Reinforcement learning with a corrupted reward channel.arXiv preprint arXiv:1705.08417,
work page internal anchor Pith review Pith/arXiv arXiv
-
[15]
Weakly-supervised Semantic Parsing with Abstract Examples
O. Goldman, V. Latcinnik, U. Naveh, A. Globerson, and J. Berant. Weakly-supervised semantic parsing with abstract examples.arXiv preprint arXiv:1711.05240,
work page internal anchor Pith review Pith/arXiv arXiv
-
[16]
Adaptive Computation Time for Recurrent Neural Networks
A.Graves. Adaptivecomputationtimeforrecurrentneuralnetworks. arXivpreprintarXiv:1603.08983 ,
work page internal anchor Pith review Pith/arXiv arXiv
-
[17]
A. Graves, G. Wayne, and I. Danihelka. Neural turing machines.arXiv preprint arXiv:1410.5401,
work page internal anchor Pith review Pith/arXiv arXiv
-
[18]
Measuring Mathematical Problem Solving With the MATH Dataset
D.Hendrycks,C.Burns,S.Kadavath,A.Arora,S.Basart,E.Tang,D.Song,andJ.Steinhardt. Measuring mathematical problem solving with the math dataset.arXiv preprint arXiv:2103.03874,
work page internal anchor Pith review Pith/arXiv arXiv
-
[19]
Training Compute-Optimal Large Language Models
J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. v. d. Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, J. W. Rae, O. Vinyals, and L. Sifre. Training compute- optimal large language models.arXiv preprint arXiv:2203.15556,
work page internal anchor Pith review Pith/arXiv arXiv
-
[20]
G. Irving, P. Christiano, and D. Amodei. Ai safety via debate.arXiv preprint arXiv:1805.00899,
work page internal anchor Pith review Pith/arXiv arXiv
- [21]
-
[22]
Large Language Models are Zero-Shot Reasoners
17 Solving math word problems with process- and outcome-based feedback T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916,
work page internal anchor Pith review Pith/arXiv arXiv
- [23]
-
[24]
URLhttps://arxiv.org/abs/2206.14858. C. Li, D. Tarlow, A. L. Gaunt, M. Brockschmidt, and N. Kushman. Neural program lattices
work page internal anchor Pith review Pith/arXiv arXiv
- [25]
-
[26]
W. Ling, D. Yogatama, C. Dyer, and P. Blunsom. Program induction by rationale generation: Learning to solve and explain algebraic word problems.arXiv preprint arXiv:1705.04146,
work page internal anchor Pith review Pith/arXiv arXiv
-
[27]
Decoupled Weight Decay Regularization
I. Loshchilov and F. Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101,
work page internal anchor Pith review Pith/arXiv arXiv
-
[28]
Teaching language models to support answers with verified quotes
J. Menick, M. Trebacz, V. Mikulik, J. Aslanides, F. Song, M. Chadwick, M. Glaese, S. Young, L. Campbell-Gillingham, G. Irving, et al. Teaching language models to support answers with verified quotes. arXiv preprint arXiv:2203.11147,
-
[29]
WebGPT: Browser-assisted question-answering with human feedback
R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, C. Hesse, S. Jain, V. Kosaraju, W. Saunders, et al. WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332,
-
[30]
M. Nye, A. J. Andreassen, G. Gur-Ari, H. Michalewski, J. Austin, D. Bieber, D. Dohan, A. Lewkowycz, M. Bosma, D. Luan, et al. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114,
-
[31]
URL https://arxiv.org/abs/2203.02155. E. Perez, P. Lewis, W.-t. Yih, K. Cho, and D. Kiela. Unsupervised question decomposition for question answering. arXiv preprint arXiv:2002.09758,
-
[32]
URL https://arxiv.org/abs/2009.03393. M. Rauh, J. Mellor, J. Uesato, P.-S. Huang, J. Welbl, L. Weidinger, S. Dathathri, A. Glaese, G. Irving, I. Gabriel, et al. Characteristics of harmful text: Towards rigorous benchmarking of language models. arXiv preprint arXiv:2206.08325,
-
[33]
Neural Programmer-Interpreters
S. Reed and N. De Freitas. Neural programmer-interpreters. arXiv preprint arXiv:1511.06279,
-
[34]
V. Shwartz, P. West, R. L. Bras, C. Bhagavatula, and Y. Choi. Unsupervised commonsense question answering with self-talk. arXiv preprint arXiv:2004.05483,
-
[35]
URL https://ought.org/updates/2022-04-06-process. I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. Advances in neural information processing systems, 27,
-
[36]
O. Tafjord, B. D. Mishra, and P. Clark. ProofWriter: Generating implications, proofs, and abductive statements over natural language. arXiv preprint arXiv:2012.13048,
- [37]
-
[38]
X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, and D. Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171,
-
[39]
J. Wu, L. Ouyang, D. M. Ziegler, N. Stiennon, R. Lowe, J. Leike, and P. Christiano. Recursively summarizing books with human feedback. arXiv preprint arXiv:2109.10862,
-
[40]
Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. W. Cohen, R. Salakhutdinov, and C. D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600,
-
[41]
E. Zelikman, Y. Wu, and N. D. Goodman. STaR: Bootstrapping reasoning with reasoning. arXiv preprint arXiv:2203.14465,
-
[42]
D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, and G. Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593,
-
[43]
A. Example GSM8K problems and solutions
We include several examples to provide a qualitative sense of the task and learned model behavior. Table 4 contains 10 randomly sampled problems, and the output of the SFT+ORM-RL model with ORM reranking. Table 5 contains 5 trace errors, where th...
-
[44]
Table 6 | An example question and answer from the Pre-algebra split of the MATH dataset (Hendrycks et al., 2021).
Model                        Final-answer error
Few-shot+Final-Answer RL     65.7
SFT                          67.6
SFT, ORM reranking           65.4
SFT, PRM reranking           67.7
SFT+Final-Answer RL          63.3
SFT+ORM-RL                   63.2
Table 7 | Final-answer error on MATH Pre-algebra. Unless RM reranking is specified, all number...