Solving math word problems with process- and outcome-based feedback

Antonia Creswell; Francis Song; Geoffrey Irving; Irina Higgins; Jonathan Uesato; Lisa Wang; Nate Kushman; Noah Siegel; Ramana Kumar

arxiv: 2211.14275 · v1 · submitted 2022-11-25 · cs.LG · cs.AI · cs.CL


Pith reviewed 2026-05-24 11:07 UTC · model grok-4.3


keywords math word problems · process supervision · outcome supervision · language models · GSM8K · reasoning errors · reward models · final answer accuracy


The pith

Outcome supervision matches process supervision on final-answer accuracy for math word problems while using less labeling, but cutting reasoning errors requires process supervision or a reward model that emulates it.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares training language models to solve math word problems by supervising only the final answer versus supervising each reasoning step. It finds that outcome-only supervision reaches similar final-answer accuracy while requiring fewer labels. However, to ensure the reasoning steps themselves are correct, either direct process supervision or reward models trained to mimic it becomes necessary. On the GSM8K benchmark this combination reduces final-answer error to 12.7 percent and reasoning error among correct solutions to 3.4 percent.

Core claim

Pure outcome-based supervision produces similar final-answer error rates with less label supervision. For correct reasoning steps, process-based supervision or supervision from learned reward models that emulate process-based feedback is necessary. This improves previous best results from 16.8 percent to 12.7 percent final-answer error and from 14.0 percent to 3.4 percent reasoning error among final-answer-correct solutions.
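
The size of the two improvements is easier to compare as a relative reduction. The following is a minimal arithmetic sketch using only the four percentages quoted above; the function name `relative_reduction` is introduced here for illustration.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction of the earlier error rate that was removed."""
    return (before - after) / before


# Error rates (in percent) quoted in the core claim.
final_answer = relative_reduction(16.8, 12.7)
reasoning = relative_reduction(14.0, 3.4)

print(f"final-answer error: {final_answer:.1%} relative reduction")
print(f"reasoning error:    {reasoning:.1%} relative reduction")
```

Under this reading, the final-answer improvement is roughly a quarter of the previous error, while the reasoning-error improvement removes roughly three quarters of it.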

What carries the argument

The distinction between outcome-based supervision (final answer only) and process-based supervision (each reasoning step) on GSM8K, including reward models trained to emulate process labels.
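
To make the distinction concrete, here is a small sketch of the two kinds of label for a single solution. The `Solution` class, the toy problem, and `toy_checker` are hypothetical stand-ins: in the paper's setting the step judgments come from human annotators or a learned reward model, not from an arithmetic parser.

```python
from dataclasses import dataclass


@dataclass
class Solution:
    steps: list[str]      # reasoning steps, in order
    final_answer: str     # answer extracted from the last step


def outcome_label(sol: Solution, reference_answer: str) -> bool:
    """Outcome-based: one label per solution, from the final answer only."""
    return sol.final_answer.strip() == reference_answer.strip()


def process_labels(sol: Solution, step_is_correct) -> list[bool]:
    """Process-based: one label per reasoning step."""
    return [step_is_correct(step) for step in sol.steps]


def toy_checker(step: str) -> bool:
    """Hypothetical checker: verifies the arithmetic in 'a op b = c' steps only."""
    lhs, rhs = step.split("=")
    a, op, b = lhs.split()
    value = {"+": int(a) + int(b), "-": int(a) - int(b), "*": int(a) * int(b)}[op]
    return value == int(rhs)


# Toy problem: 3 boxes of 4 apples, 2 eaten; the reference answer is 10.
# The two arithmetic slips below cancel out, so the final answer is still right.
sol = Solution(steps=["3 * 4 = 14", "14 - 2 = 10"], final_answer="10")

print(outcome_label(sol, "10"))          # outcome label sees nothing wrong
print(process_labels(sol, toy_checker))  # process labels flag both steps
```

The example is the case the comparison cares about: an outcome label accepts the solution, while step-level labels expose the flawed reasoning.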

If this is right

Outcome supervision can deliver comparable final accuracy at lower labeling cost.
Process supervision or its reward-model proxy is required to minimize reasoning mistakes even when the answer is correct.
Learned reward models can serve as a practical substitute for full process annotations.
The resulting error rates set a new benchmark on GSM8K for both final answers and reasoning quality.
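
One common way a step-level reward model can stand in for human step labels is to rerank sampled solutions by their estimated step correctness. This page does not describe the paper's exact procedure, so the sketch below is purely illustrative: `rerank`, `solution_score`, and the lookup-table scorer are assumptions, not the authors' implementation.

```python
import math
from typing import Callable, Sequence

# A step scorer returns an estimated probability that a step is correct.
StepScorer = Callable[[str], float]


def solution_score(steps: Sequence[str], score_step: StepScorer) -> float:
    """Sum of log-probabilities: high only if every step looks correct."""
    return sum(math.log(max(score_step(s), 1e-9)) for s in steps)


def rerank(candidates: Sequence[Sequence[str]], score_step: StepScorer) -> Sequence[str]:
    """Pick the candidate solution whose steps the scorer trusts most."""
    return max(candidates, key=lambda steps: solution_score(steps, score_step))


# Hypothetical scores standing in for a learned reward model.
toy_scores = {"a": 0.9, "b": 0.2, "c": 0.8, "d": 0.7}
candidates = [["a", "b"], ["c", "d"]]

best = rerank(candidates, toy_scores.__getitem__)
print(best)
```

Summing log-probabilities means a single doubtful step drags the whole solution down, which matches the goal of penalizing flawed reasoning even when other steps look fine.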

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same supervision distinction may matter for other step-by-step reasoning domains such as code generation.
Reward models could let process supervision scale without a matching increase in human step-by-step labels.
In tutoring applications, reduced reasoning error would lower the chance of correct answers reached by flawed logic.

Load-bearing premise

The human process annotations accurately identify correct reasoning and the learned reward models do not introduce new error modes when used for supervision.

What would settle it

A replication experiment in which models trained with process supervision or reward models show the same reasoning-error rate as pure outcome models, or in which the reward models increase errors beyond direct process labels.

Original abstract

Recent work has shown that asking language models to generate reasoning steps improves performance on many reasoning tasks. When moving beyond prompting, this raises the question of how we should supervise such models: outcome-based approaches which supervise the final result, or process-based approaches which supervise the reasoning process itself? Differences between these approaches might naturally be expected not just in final-answer errors but also in reasoning errors, which can be difficult to detect and are problematic in many real-world domains such as education. We run the first comprehensive comparison between process- and outcome-based approaches trained on a natural language task, GSM8K. We find that pure outcome-based supervision produces similar final-answer error rates with less label supervision. However, for correct reasoning steps we find it necessary to use process-based supervision or supervision from learned reward models that emulate process-based feedback. In total, we improve the previous best results from 16.8% $\to$ 12.7% final-answer error and 14.0% $\to$ 3.4% reasoning error among final-answer-correct solutions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Outcome supervision gets similar final-answer accuracy on GSM8K with less labeling, but process supervision or its reward-model proxy is needed to cut reasoning-step errors from 14% to 3.4%.


The paper runs the first head-to-head comparison of process versus outcome supervision on GSM8K. Pure outcome training matches process training on final-answer error while using fewer labels, but the authors report that only process supervision (or a learned reward model trained to emulate it) reduces reasoning errors among solutions with correct final answers. They improve on the previous best results, reaching 12.7% final-answer error and 3.4% reasoning error. That distinction is the concrete new contribution, and it is useful for anyone choosing a supervision strategy where step-by-step correctness matters, such as in education or technical problem solving.

The experimental design is a straightforward empirical comparison with externally measured error rates, so there is no obvious circularity. The soft spots are exactly where the stress-test note flags them: the abstract supplies no data splits, error bars, or significance tests, and the central claim rests on two assumptions, namely that the process annotations are a faithful proxy for correct reasoning and that the reward model does not introduce new error modes. Only a small pilot for inter-annotator agreement is mentioned, with no out-of-distribution probe for the reward model. If those assumptions are shaky, the reported differential benefit shrinks.

This is for readers working on supervision choices for language-model reasoning. It deserves a serious referee because it supplies a direct, quantitative comparison on a standard task, even if the full paper must still demonstrate that the process labels and reward model are robust.

Referee Report

2 major / 2 minor

Summary. The paper conducts the first comprehensive empirical comparison of process-based versus outcome-based supervision for training language models to solve math word problems on the GSM8K dataset. It claims that pure outcome supervision achieves comparable final-answer accuracy with reduced labeling cost, but that process supervision (or reward models trained to emulate it) is required to achieve low rates of reasoning errors even among correct final answers, yielding new state-of-the-art figures of 12.7% final-answer error and 3.4% reasoning error.

Significance. If the results are robust, the work supplies concrete evidence that process-level feedback is necessary for high-quality reasoning chains in arithmetic tasks and that learned reward models can serve as scalable proxies for human process annotations. The reported gains over prior best results (16.8% → 12.7% final error; 14.0% → 3.4% reasoning error) would be practically relevant for domains such as education where both answer correctness and reasoning transparency matter.

major comments (2)

[Abstract / Results] Abstract and reported results: headline metrics are given without error bars, data-split details, training hyperparameters, or statistical significance tests. Because the central claim rests on the differential performance between outcome and process regimes, the absence of these elements prevents assessment of whether the observed improvements are reliable.
[Process annotation / Reward model evaluation] Process annotation and reward-model sections: the claim that process supervision (or its reward-model emulation) is required to drive reasoning error from 14.0% to 3.4% depends on the assumption that the human process labels are a faithful proxy for correctness and that the learned reward model generalizes beyond the annotation distribution. Only a small-pilot inter-annotator agreement figure is mentioned, and no out-of-distribution probe (e.g., adversarial or model-generated incorrect chains) is described.

minor comments (2)

Define precisely how 'reasoning error among final-answer-correct solutions' is operationalized and measured, including any inter-annotator protocol used at scale.
Clarify the exact amount of label supervision used in each regime so that the statement 'with less label supervision' can be quantified.
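
Both minor comments ask for quantities that are simple to pin down once per-step judgments exist. The sketch below shows one plausible operationalization, assuming a solution counts as a reasoning error if any step is judged incorrect; that rule, the `GradedSolution` class, and the toy data are assumptions for illustration and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class GradedSolution:
    final_answer_correct: bool
    step_correct: list[bool]   # one judgment per reasoning step


def final_answer_error(graded: list[GradedSolution]) -> float:
    """Share of solutions whose final answer is wrong."""
    return sum(not g.final_answer_correct for g in graded) / len(graded)


def reasoning_error_among_correct(graded: list[GradedSolution]) -> float:
    """Share of final-answer-correct solutions with at least one bad step."""
    correct = [g for g in graded if g.final_answer_correct]
    if not correct:
        raise ValueError("no final-answer-correct solutions to evaluate")
    flawed = sum(not all(g.step_correct) for g in correct)
    return flawed / len(correct)


def label_counts(graded: list[GradedSolution]) -> dict[str, int]:
    """Labels needed per regime: one per solution versus one per step."""
    return {
        "outcome": len(graded),
        "process": sum(len(g.step_correct) for g in graded),
    }


# Synthetic toy data, for illustration only.
graded = [
    GradedSolution(True, [True, True]),
    GradedSolution(True, [True, False, True]),   # right answer, flawed step
    GradedSolution(False, [False, True]),
    GradedSolution(True, [True, True, True]),
]

print(final_answer_error(graded))
print(reasoning_error_among_correct(graded))
print(label_counts(graded))
```

Reporting the label counts alongside the two error rates would let a reader quantify "with less label supervision" directly.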

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments on our work comparing process- and outcome-based supervision for math word problems. We address each of the major comments below and outline the revisions we will make to strengthen the manuscript.


Referee: [Abstract / Results] Abstract and reported results: headline metrics are given without error bars, data-split details, training hyperparameters, or statistical significance tests. Because the central claim rests on the differential performance between outcome and process regimes, the absence of these elements prevents assessment of whether the observed improvements are reliable.

Authors: We agree with the referee that including error bars, data-split details, training hyperparameters, and statistical significance tests is important for assessing the reliability of our results, particularly given the central comparison between supervision regimes. In the revised version of the paper, we will add error bars from multiple training runs with different random seeds, provide explicit details on the data splits used for training and evaluation, include a comprehensive list of training hyperparameters, and report statistical significance tests (such as bootstrap confidence intervals or t-tests) for the key performance differences. These additions will directly address the concern about the robustness of the observed improvements.

revision: yes

Referee: [Process annotation / Reward model evaluation] Process annotation and reward-model sections: the claim that process supervision (or its reward-model emulation) is required to drive reasoning error from 14.0% to 3.4% depends on the assumption that the human process labels are a faithful proxy for correctness and that the learned reward model generalizes beyond the annotation distribution. Only a small-pilot inter-annotator agreement figure is mentioned, and no out-of-distribution probe (e.g., adversarial or model-generated incorrect chains) is described.

Authors: We acknowledge that the validity of our claims hinges on the quality and generalizability of the human process annotations. The manuscript does report a small-pilot inter-annotator agreement figure, which we will expand with additional details on the annotation guidelines and agreement statistics in the revised version. To address the generalization concern for the reward model, we will include new experiments with out-of-distribution probes, such as evaluating the reward model on adversarial examples and model-generated incorrect reasoning chains. This will provide stronger evidence that the reward model serves as a reliable proxy for process-based feedback beyond the original annotation distribution.

revision: yes
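
The bootstrap confidence intervals mentioned in the first response could look roughly like the following. This is a minimal sketch of a paired percentile bootstrap over evaluation problems, using only the standard library; the function name, the resample count, and the synthetic 0/1 error indicators are all assumptions, and nothing here reproduces the paper's data.

```python
import random


def bootstrap_diff_ci(errors_a, errors_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(errors_a) - mean(errors_b).

    errors_a[i] and errors_b[i] are 0/1 error indicators for the same
    evaluation problem i under two supervision regimes (paired resampling).
    """
    if len(errors_a) != len(errors_b):
        raise ValueError("inputs must be paired by problem")
    rng = random.Random(seed)
    n = len(errors_a)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]   # resample problems with replacement
        diffs.append(sum(errors_a[i] - errors_b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int(round((alpha / 2) * n_resamples))]
    hi = diffs[int(round((1 - alpha / 2) * n_resamples)) - 1]
    return lo, hi


# Synthetic indicators for 100 problems: regime A errs on 30, regime B on 10.
errors_a = [1] * 30 + [0] * 70
errors_b = [1] * 10 + [0] * 90

lo, hi = bootstrap_diff_ci(errors_a, errors_b)
print(f"95% CI for the error-rate difference: [{lo:.3f}, {hi:.3f}]")
```

An interval that excludes zero would support the claim that the two regimes genuinely differ on that metric. Resampling over problems captures evaluation-set noise only, so variation across training seeds would still need the separate multi-run error bars the authors promise.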

Circularity Check

0 steps flagged

Empirical comparison of supervision regimes shows no circular derivation


The paper reports experimental results from training and evaluating language models on GSM8K under outcome-based versus process-based supervision (and reward-model variants). All headline metrics—final-answer error rates and reasoning-step error rates—are obtained by direct measurement against held-out ground truth, with no equations, fitted parameters, or self-citations that reduce the reported improvements to the experimental inputs by construction. The central claims rest on observable performance differences rather than any self-definitional or load-bearing self-referential step.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Based solely on the abstract, the central claim rests on the representativeness of GSM8K as a natural-language reasoning benchmark and on the assumption that human-provided process labels are both accurate and sufficient to train reliable reward models.

axioms (2)

domain assumption: GSM8K constitutes a representative natural-language reasoning task for comparing supervision methods
The comparison and claimed improvements are presented as generalizable from this single dataset.
domain assumption: Human process annotations accurately reflect correct reasoning steps
Process-based supervision and the learned reward models that emulate it depend on this labeling quality.
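
The second axiom is usually probed with an inter-annotator agreement statistic. This page does not say which statistic the paper's pilot used, so the sketch below shows Cohen's kappa as one standard choice, computed on synthetic step labels from two hypothetical annotators.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b) -> float:
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both annotators labeled independently at their own base rates.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0   # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Synthetic step judgments (1 = step correct, 0 = step incorrect).
annotator_a = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
annotator_b = [1, 1, 0, 0, 0, 1, 0, 1, 1, 1]

kappa = cohens_kappa(annotator_a, annotator_b)
print(f"kappa = {kappa:.3f}")
```

Here the annotators agree on 8 of 10 steps, but with chance agreement at 0.52 the kappa comes out near 0.58, which illustrates why raw agreement alone can overstate label quality.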


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The Last Word Often Wins: A Format Confound in Chain-of-Thought Corruption Studies
cs.LG 2026-05 accept novelty 8.0

Corruption studies of CoT faithfulness largely measure explicit answer placement in prompt format rather than computational importance of reasoning steps.
MedPRMBench: A Fine-grained Benchmark for Process Reward Models in Medical Reasoning
cs.CL 2026-04 unverdicted novelty 8.0

MedPRMBench is the first fine-grained benchmark for process reward models in medical reasoning, featuring 6500 questions, 13000 chains, 113910 step labels, and a baseline that improves downstream QA accuracy by 3.2-6....
EDGE-OPD: Internalizing Privileged Context with Evidence Guided On-Policy Distillation
cs.AI 2026-05 unverdicted novelty 7.0

EDGE-OPD adds guided rollouts and evidence masking to on-policy self-distillation, enabling successful learning of target identities where standard OPSD and RLSD fail.
Argus: Evidence Assembly for Scalable Deep Research Agents
cs.CL 2026-05 unverdicted novelty 7.0

Argus coordinates a Navigator and multiple Searchers via an evidence graph to assemble complete, source-traced answers, yielding benchmark gains up to 12.7 points with 8 parallel agents and 86.2 on BrowseComp with 64 agents.
AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation
cs.SE 2026-05 conditional novelty 7.0

10.7% of passing SWE-agent trajectories are Lucky Passes with chaotic behaviors, and a quality score based on process references changes model rankings across eight backends.
Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why
cs.LG 2026-05 unverdicted novelty 7.0

Distillation signals align better with ideal updates on incorrect student rollouts than correct ones, with optimal teacher context depending on student capacity and task.
The Last Word Often Wins: A Format Confound in Chain-of-Thought Corruption Studies
cs.LG 2026-05 accept novelty 7.0

Corruption studies on CoT chains detect the position of explicit answer statements rather than computational steps, as evidenced by format ablations collapsing suffix sensitivity 19x and models following conflicting a...
Unsupervised Process Reward Models
cs.LG 2026-05 unverdicted novelty 7.0

Unsupervised PRMs derived from LLM probabilities achieve up to 15% better error detection than LLM judges and match supervised PRMs in verification and RL tasks.
Distributional Process Reward Models: Calibrated Prediction of Future Rewards via Conditional Optimal Transport
cs.LG 2026-05 unverdicted novelty 7.0

Conditional optimal transport calibrates PRMs by learning monotonic conditional quantile functions over success probabilities conditioned on hidden states, yielding improved calibration and downstream Best-of-N perfor...
Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost
cs.AI 2026-05 conditional novelty 7.0

Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.
Logic-Regularized Verifier Elicits Reasoning from LLMs
cs.CL 2026-05 unverdicted novelty 7.0

LOVER creates an unsupervised logic-regularized verifier that reaches 95% of supervised verifier performance on reasoning tasks across 10 datasets.
Maximizing Rollout Informativeness under a Fixed Budget: A Submodular View of Tree Search for Tool-Use Agentic Reinforcement Learning
stat.ML 2026-05 unverdicted novelty 7.0

InfoTree casts intermediate state selection in tree search as monotone submodular maximization under fixed rollout budgets, yielding closed-form UUCB terms and lifting mixed-outcome ratios while outperforming flat GRP...
Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards
cs.AI 2026-05 unverdicted novelty 7.0

TraceLift trains reasoning planners with executor-grounded rewards that multiply a rubric-based reasoning quality score by measured uplift on a frozen executor, outperforming execution-only training on math and code b...
Generalized Distributional Alignment Games for Unbiased Answer-Level Fine-Tuning
cs.LG 2026-05 unverdicted novelty 7.0

Generalized Bregman alignment games plus U-statistics and optimal minimax polynomial estimators remove Jensen bias and achieve optimal statistical rates for unbiased answer-level fine-tuning.
Decoding-Time Debiasing via Process Reward Models: From Controlled Fill-in to Open-Ended Generation
cs.CL 2026-05 unverdicted novelty 7.0

Decoding-time use of process reward models for bias mitigation raises fairness scores by up to 0.40 on a bilingual benchmark while preserving fluency across four LLMs and extends to open-ended generation with low overhead.
AgentEval: DAG-Structured Step-Level Evaluation for Agentic Workflows with Error Propagation Tracking
cs.SE 2026-04 conditional novelty 7.0

AgentEval evaluates agentic workflows via DAGs with step metrics, a 21-category failure taxonomy, and error propagation tracking, yielding 2.17x higher failure recall than end-to-end methods and strong human agreement.
Fine-Tuning Small Reasoning Models for Quantum Field Theory
cs.LG 2026-04 unverdicted novelty 7.0

Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.
Navigating the Conceptual Multiverse
cs.HC 2026-04 unverdicted novelty 7.0

The conceptual multiverse system with a verification framework for decision structures helps users in philosophy, AI alignment, and poetry build clearer working maps of open-ended problems by making implicit LLM choic...
Does RL Expand the Capability Boundary of LLM Agents? A PASS@(k,T) Analysis
cs.LG 2026-04 unverdicted novelty 7.0

RL expands the capability boundary of LLM agents on compositional tool-use tasks, shown by non-converging pass curves at large k with increasing T, while SFT regresses it and the effect is absent on simpler tasks.
AI Achieves a Perfect LSAT Score
cs.AI 2026-04 unverdicted novelty 7.0

Language models achieve a perfect LSAT score, with experiments showing that internal thinking phases and a fine-tuned process reward model are key to high performance on logical reasoning questions.
Structural Evaluation Metrics for SVG Generation via Leave-One-Out Analysis
cs.LG 2026-04 unverdicted novelty 7.0

Element-level leave-one-out analysis yields per-element quality scores and four structural metrics (purity, coverage, compactness, locality) that quantify SVG modularity and enable artifact detection.
Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 7.0

This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
Sampling for Quality: Training-Free Reward-Guided LLM Decoding via Sequential Monte Carlo
cs.LG 2026-04 unverdicted novelty 7.0

Sequential Monte Carlo sampling from a reward-augmented sequence distribution improves LLM performance on HumanEval by up to 54.9% and MATH500 by up to 8.8%, outperforming standard sampling and GRPO.
WMF-AM: Probing LLM Working Memory via Depth-Parameterized Cumulative State Tracking
cs.AI 2026-03 unverdicted novelty 7.0

WMF-AM is a depth-parameterized benchmark that measures LLMs' cumulative state tracking ability without scratchpads, validated on 28 models across arithmetic and non-arithmetic tasks with ablations confirming the construct.
ToolPRM: Fine-Grained Inference Scaling of Structured Outputs for Function Calling
cs.AI 2025-10 unverdicted novelty 7.0

ToolPRM provides fine-grained intra-call process supervision via a new dataset and reward model, outperforming outcome and coarse-grained alternatives on function-calling benchmarks.
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation
cs.AI 2025-03 conditional novelty 7.0

Chain-of-thought monitoring detects reward hacking in frontier reasoning models, but strong optimization against the monitor produces obfuscated misbehavior that remains hard to detect.
Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement
cs.CV 2025-03 unverdicted novelty 7.0

Seg-Zero uses cognitive reinforcement learning on a decoupled reasoning-plus-segmentation architecture to produce explicit reasoning chains and reach 57.5 zero-shot accuracy on ReasonSeg, beating prior supervised LISA...
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs
cs.CL 2024-12 unverdicted novelty 7.0

o1-like models overthink easy tasks; self-training reduces compute use without accuracy loss on GSM8K, MATH500, GPQA, and AIME.
GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
cs.LG 2024-10 accept novelty 7.0

LLMs display high variance and major accuracy drops on GSM-Symbolic variants of grade-school math problems, indicating they replicate training patterns rather than execute logical reasoning.
Let's Verify Step by Step
cs.LG 2023-05 accept novelty 7.0

Process supervision significantly outperforms outcome supervision for training models on the MATH dataset, achieving 78% accuracy on a representative test subset with active learning and a released 800k step-label dataset.
Tailoring Teaching to Aptitude: Direction-Adaptive Self-Distillation for LLM Reasoning
cs.LG 2026-05 unverdicted novelty 6.0

DASD improves math reasoning in LLMs by adaptively directing self-distillation based on per-token entropy to balance exploration and step accuracy, outperforming prior self-distillation and RLVR baselines on six benchmarks.
Manifold-Guided Attention Steering
cs.LG 2026-05 unverdicted novelty 6.0

MAGS learns low-dimensional subspaces from correct versus incorrect reasoning traces and applies targeted projection corrections to attention heads when they deviate from the correctness manifold during inference.
Rewarding Beliefs, Not Actions: Consistency-Guided Credit Assignment for Long-Horizon Agents
cs.CL 2026-05 unverdicted novelty 6.0

ReBel uses belief-consistency supervision and belief-aware grouping to improve credit assignment in long-horizon RL for LLM agents, achieving up to 20.4 percentage points higher success and 2.1x better sample efficien...
Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution
cs.CL 2026-05 unverdicted novelty 6.0

SCA framework applies Information Bottleneck to assign step-level confidence in black-box LLM reasoning traces, flagging errors and boosting self-correction success by up to 13.5% on math and QA tasks.
SAPO: Step-Aligned Policy Optimization for Reasoning-Based Generative Recommendation
cs.AI 2026-05 unverdicted novelty 6.0

SAPO computes per-reasoning-step group-relative advantages in RL to improve credit assignment for structured generation of semantic identifiers in recommendation systems.
Argus: Evidence Assembly for Scalable Deep Research Agents
cs.CL 2026-05 unverdicted novelty 6.0

Argus coordinates a Navigator and multiple Searchers via an evidence graph for deep research, reporting average gains of 5.5 points with one Searcher and 12.7 points with eight parallel Searchers across eight benchmar...
Process Rewards with Learned Reliability
cs.CL 2026-05 unverdicted novelty 6.0

BetaPRM learns distributional step rewards with explicit reliability via Beta-Binomial modeling, enabling ACA that cuts token use by up to 33.57% while raising final-answer accuracy on reasoning benchmarks.
LEAP: Trajectory-Level Evaluation of LLMs in Iterative Scientific Design
cs.LG 2026-05 unverdicted novelty 6.0

LEAPBench shows trajectory scoring changes best-model rankings on 53% of tasks, LLMs do not beat Bayesian optimization, and domain-aware prompting underperforms domain-agnostic on biology tasks aligned with published ...
Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards
cs.CL 2026-05 unverdicted novelty 6.0

CIPO jointly optimizes standard RLVR rewards with correction samples derived from the model's own failed attempts, yielding better reasoning and self-correction on math and code benchmarks.
LLMs Know When They Know, but Do Not Act on It: A Metacognitive Harness for Test-time Scaling
cs.LG 2026-05 conditional novelty 6.0

A metacognitive harness uses LLMs' pre- and post-solution self-monitoring signals to control test-time reasoning, raising pooled accuracy from 48.3% to 56.9% on text, code, and multimodal benchmarks.
Distribution Corrected Offline Data Distillation for Large Language Models
cs.CL 2026-05 unverdicted novelty 6.0

A distribution-correction framework for offline LLM reasoning distillation improves accuracy on math benchmarks by adaptively aligning teacher supervision with the student's inference-time distribution.
STRIDE: Learnable Stepwise Language Feedback for LLM Reasoning
cs.LG 2026-05 unverdicted novelty 6.0

STRIDE co-trains generator and verifier on outcome rewards alone to deliver learnable stepwise language feedback that redirects LLM reasoning trajectories and outperforms scalar-reward baselines.
When Reasoning Traces Become Performative: Step-Level Evidence that Chain-of-Thought Is an Imperfect Oversight Channel
cs.AI 2026-05 unverdicted novelty 6.0

CoT traces align with internal answer commitment in only 61.9% of steps on average, dominated by confabulated continuations after commitment has stabilized.
Verifiable Process Rewards for Agentic Reasoning
cs.AI 2026-05 unverdicted novelty 6.0

Verifiable Process Rewards (VPR) converts symbolic oracles into dense turn-level supervision for reinforcement learning in agentic reasoning, outperforming outcome-only rewards and transferring to general benchmarks.
Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning
cs.AI 2026-05 unverdicted novelty 6.0

Rubric-grounded RL with LLM judges on document-derived criteria raises Llama-3.1-8B normalized reward to 71.7% on held-out rubrics and improves performance on GSM8K, MATH, and GPQA benchmarks.
Distributional Process Reward Models: Calibrated Prediction of Future Rewards via Conditional Optimal Transport
cs.LG 2026-05 unverdicted novelty 6.0

Conditional optimal transport is used to turn raw PRM outputs into monotonic quantile functions that improve calibration and downstream Best-of-N performance on MATH-500 and AIME.
RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization
cs.CL 2026-05 unverdicted novelty 6.0

RLearner-LLM's Hybrid-DPO fuses DeBERTa NLI and LLM verifier scores to deliver up to 6x higher NLI entailment than standard SFT while preserving answer coverage across academic domains.
RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization
cs.CL 2026-05 unverdicted novelty 6.0

RLearner-LLM achieves up to 6x gains in NLI entailment over standard fine-tuning by using an automated hybrid DPO pipeline that balances logic and fluency across multiple model sizes and domains.
Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards
cs.AI 2026-05 unverdicted novelty 6.0

TraceLift trains reasoning planners using rewards that credit traces for both rubric quality and actual performance gains on a frozen executor, outperforming final-answer-only training on math and code tasks.
DGPO: Distribution Guided Policy Optimization for Fine Grained Credit Assignment
cs.LG 2026-05 unverdicted novelty 6.0

DGPO reinterprets distribution deviation as a guiding signal in a critic-free policy optimization framework to enable fine-grained credit assignment for LLM chain-of-thought reasoning.
DGPO: Distribution Guided Policy Optimization for Fine Grained Credit Assignment
cs.LG 2026-05 unverdicted novelty 6.0

DGPO is a critic-free RL framework that uses bounded Hellinger distance and entropy-gated advantage redistribution to enable fine-grained token-level credit assignment in long CoT generations for LLM alignment, report...
Controllable and Verifiable Process Data Synthesis for Process Reward Models
cs.AI 2026-05 unverdicted novelty 6.0

A controllable synthesis method creates prefix-invalid yet trajectory-consistent process supervision data for training and evaluating process reward models by injecting verifiable errors into symbolic reasoning chains.
Distilling Long-CoT Reasoning through Collaborative Step-wise Multi-Teacher Decoding
cs.AI 2026-05 unverdicted novelty 6.0

CoRD uses collaborative multi-teacher step-wise decoding with perplexity-guided beam search to generate higher-quality Long-CoT data that lets smaller models reach near-teacher performance with less supervision.
Adaptive Test-Time Compute Allocation with Evolving In-Context Demonstrations
cs.AI 2026-04 unverdicted novelty 6.0

An adaptive test-time framework uses a warm-up phase on the test set to build evolving in-context examples, then concentrates compute on unresolved queries to outperform static baselines on math, coding, and reasoning...
TPS-CalcBench: A Benchmark and Diagnostic Evaluation Framework for LLM Analytical Calculation Competence in Hypersonic Thermal Protection System Engineering
cs.AI 2026-04 unverdicted novelty 6.0

TPS-CalcBench is a new benchmark and evaluation framework that tests LLMs on analytical calculations in hypersonic aerodynamics and gas dynamics, using dual-track scoring and interventions to detect physically invalid...
Process Reward Models Meet Planning: Generating Precise and Scalable Datasets for Step-Level Rewards
cs.CL 2026-04 unverdicted novelty 6.0

PDDL planning problems are used to generate about one million precise reasoning steps for training Process Reward Models, and adding this data to existing datasets improves LLM performance on both mathematical and non...
Adaptive Test-Time Compute Allocation for Reasoning LLMs via Constrained Policy Optimization
cs.LG 2026-04 unverdicted novelty 6.0

A Lagrangian-relaxation plus imitation-learning pipeline adaptively allocates test-time compute to LLMs, outperforming uniform baselines by up to 12.8% relative accuracy on MATH while staying within a fixed average budget.
The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping
cs.LG 2026-04 unverdicted novelty 6.0

MEDS improves LLM RL performance by up to 4.13 pass@1 and 4.37 pass@128 points by dynamically penalizing rollouts matching prevalent historical error clusters identified via memory-stored representations and density c...
PRISM-MCTS: Learning from Reasoning Trajectories with Metacognitive Reflection
cs.AI 2026-04 unverdicted novelty 6.0

PRISM-MCTS improves MCTS-based reasoning efficiency by maintaining a shared memory of heuristics and fallacies reinforced by a process reward model, halving required trajectories on GPQA while outperforming prior methods.
Relative Density Ratio Optimization for Stable and Statistically Consistent Model Alignment
cs.LG 2026-04 unverdicted novelty 6.0

Relative density ratio optimization stabilizes direct density ratio estimation for language model alignment while preserving statistical consistency without assuming a Bradley-Terry preference model.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · cited by 80 Pith papers · 29 internal anchors

[1]

Maximum a Posteriori Policy Optimisation

A. Abdolmaleki, J. T. Springenberg, Y. Tassa, R. Munos, N. Heess, and M. Riedmiller. Maximum a posteriori policy optimisation. arXiv preprint arXiv:1806.06920,
[2]

MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms

A. Amini, S. Gabriel, P. Lin, R. Koncel-Kedziorski, Y. Choi, and H. Hajishirzi. MathQA: Towards interpretable math word problem solving with operation-based formalisms. arXiv preprint arXiv:1905.13319,
[3]

Concrete Problems in AI Safety

D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565,
[4]

Language Models are Few-Shot Learners

T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901,
[5]

J. Cai, R. Shin, and D. Song. Making neural programming architectures generalize via recursion. arXiv preprint arXiv:1704.06611,

[7]

A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311,
[8]

Supervising strong learners by amplifying weak experts

P. Christiano, B. Shlegeris, and D. Amodei. Supervising strong learners by amplifying weak experts. arXiv preprint arXiv:1810.08575,

[9]

Training Verifiers to Solve Math Word Problems

K. Cobbe, V. Kosaraju, M. Bavarian, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168,
[10]

Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning

A. Creswell, M. Shanahan, and I. Higgins. Selection-inference: Exploiting large language models for interpretable logical reasoning. arXiv preprint arXiv:2205.09712,
[11]

Explaining Answers with Entailment Trees

B. Dalvi, P. A. Jansen, O. Tafjord, Z. Xie, H. Smith, L. Pipatanangkura, and P. Clark. Explaining answers with entailment trees. ArXiv, abs/2104.08661,
[12]

Universal Transformers

M. Dehghani, S. Gouws, O. Vinyals, J. Uszkoreit, and Ł. Kaiser. Universal transformers. arXiv preprint arXiv:1807.03819,
[13]

Language Model Cascades

D. Dohan, W. Xu, A. Lewkowycz, J. Austin, D. Bieber, R. G. Lopes, Y. Wu, H. Michalewski, R. A. Saurous, J. Sohl-Dickstein, et al. Language model cascades. arXiv preprint arXiv:2207.10342,
[14]

Reinforcement Learning with a Corrupted Reward Channel

T. Everitt, V. Krakovna, L. Orseau, M. Hutter, and S. Legg. Reinforcement learning with a corrupted reward channel. arXiv preprint arXiv:1705.08417,
[15]

Weakly-supervised Semantic Parsing with Abstract Examples

O. Goldman, V. Latcinnik, U. Naveh, A. Globerson, and J. Berant. Weakly-supervised semantic parsing with abstract examples. arXiv preprint arXiv:1711.05240,
[16]

Adaptive Computation Time for Recurrent Neural Networks

A. Graves. Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983,
[17]

Neural Turing Machines

A. Graves, G. Wayne, and I. Danihelka. Neural Turing machines. arXiv preprint arXiv:1410.5401,
[18]

Measuring Mathematical Problem Solving With the MATH Dataset

D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874,
[19]

Training Compute-Optimal Large Language Models

J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. v. d. Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, J. W. Rae, O. Vinyals, and L. Sifre. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556,
[20]

AI safety via debate

G. Irving, P. Christiano, and D. Amodei. AI safety via debate. arXiv preprint arXiv:1805.00899,
[21]

Alignment of Language Agents

Z. Kenton, T. Everitt, L. Weidinger, I. Gabriel, V. Mikulik, and G. Irving. Alignment of language agents. arXiv preprint arXiv:2103.14659,
[22]

Large Language Models are Zero-Shot Reasoners

T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916,
[23]

R. Kumar, J. Uesato, R. Ngo, T. Everitt, V. Krakovna, and S. Legg. REALab: An embedded perspective on tampering. arXiv preprint arXiv:2011.08820,
[24]

C. Li, D. Tarlow, A. L. Gaunt, M. Brockschmidt, and N. Kushman. Neural program lattices
[25]

Y. Li, Z. Lin, S. Zhang, Q. Fu, B. Chen, J.-G. Lou, and W. Chen. On the advance of making language models better reasoners. arXiv preprint arXiv:2206.02336,
[26]

W. Ling, D. Yogatama, C. Dyer, and P. Blunsom. Program induction by rationale generation: Learning to solve and explain algebraic word problems. arXiv preprint arXiv:1705.04146,
[27]

Decoupled Weight Decay Regularization

I. Loshchilov and F. Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101,
[28]

Teaching language models to support answers with verified quotes

J. Menick, M. Trebacz, V. Mikulik, J. Aslanides, F. Song, M. Chadwick, M. Glaese, S. Young, L. Campbell-Gillingham, G. Irving, et al. Teaching language models to support answers with verified quotes. arXiv preprint arXiv:2203.11147,
[29]

WebGPT: Browser-assisted question-answering with human feedback

R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, C. Hesse, S. Jain, V. Kosaraju, W. Saunders, et al. WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332,
[30]

M. Nye, A. J. Andreassen, G. Gur-Ari, H. Michalewski, J. Austin, D. Bieber, D. Dohan, A. Lewkowycz, M. Bosma, D. Luan, et al. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114,
[31]

E. Perez, P. Lewis, W.-t. Yih, K. Cho, and D. Kiela. Unsupervised question decomposition for question answering. arXiv preprint arXiv:2002.09758,
[32]

M. Rauh, J. Mellor, J. Uesato, P.-S. Huang, J. Welbl, L. Weidinger, S. Dathathri, A. Glaese, G. Irving, I. Gabriel, et al. Characteristics of harmful text: Towards rigorous benchmarking of language models. arXiv preprint arXiv:2206.08325,
[33]

Neural Programmer-Interpreters

S. Reed and N. De Freitas. Neural programmer-interpreters. arXiv preprint arXiv:1511.06279,
[34]

Unsupervised Commonsense Question Answering with Self-Talk

V. Shwartz, P. West, R. L. Bras, C. Bhagavatula, and Y. Choi. Unsupervised commonsense question answering with self-talk. arXiv preprint arXiv:2004.05483,
[35]

I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. Advances in neural information processing systems, 27,
[36]

ProofWriter: Generating Implications, Proofs, and Abductive Statements over Natural Language

O. Tafjord, B. D. Mishra, and P. Clark. ProofWriter: Generating implications, proofs, and abductive statements over natural language. arXiv preprint arXiv:2012.13048,
[37]

Avoiding Tampering Incentives in Deep RL via Decoupled Approval

J. Uesato, R. Kumar, V. Krakovna, T. Everitt, R. Ngo, and S. Legg. Avoiding tampering incentives in deep RL via decoupled approval. arXiv preprint arXiv:2011.08827,
[38]

X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, and D. Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171,
[39]

J. Wu, L. Ouyang, D. M. Ziegler, N. Stiennon, R. Lowe, J. Leike, and P. Christiano. Recursively summarizing books with human feedback. arXiv preprint arXiv:2109.10862,
[40]

Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. W. Cohen, R. Salakhutdinov, and C. D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600,
[41]

STaR: Bootstrapping Reasoning With Reasoning

E. Zelikman, Y. Wu, and N. D. Goodman. STaR: Bootstrapping reasoning with reasoning. arXiv preprint arXiv:2203.14465,
[42]

D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, and G. Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593,
[43]

A. Example GSM8K problems and solutions

We include several examples to provide a qualitative sense of the task and learned model behavior. Table 4 contains 10 randomly sampled problems, and the output of the SFT+ORM-RL model with ORM reranking. Table 5 contains 5 trace errors, where th...
[44]

Table 6 | An example question and answer from the Pre-algebra split of the MATH dataset (Hendrycks et al., 2021).

Model                       Final-answer error
Few-shot+Final-Answer RL    65.7
SFT                         67.6
SFT, ORM reranking          65.4
SFT, PRM reranking          67.7
SFT+Final-Answer RL         63.3
SFT+ORM-RL                  63.2

Table 7 | Final-answer error on MATH Pre-algebra. Unless RM reranking is specified, all number...
