GRPO's group-mean baseline assigns identical advantages to all tokens under output-only rewards, inducing gradient sparsity and an intrinsic rank-2 structure proven from the zero-sum constraint and confirmed by SVD on Nemotron-4B gradients.
Let's Verify Step by Step
Canonical reference. 81% of citing Pith papers cite this work as background.
abstract
In recent years, large language models have greatly improved in their ability to perform complex multi-step reasoning. However, even state-of-the-art models still regularly produce logical mistakes. To train more reliable models, we can turn either to outcome supervision, which provides feedback for a final result, or process supervision, which provides feedback for each intermediate reasoning step. Given the importance of training reliable models, and given the high cost of human feedback, it is important to carefully compare both methods. Recent work has already begun this comparison, but many questions still remain. We conduct our own investigation, finding that process supervision significantly outperforms outcome supervision for training models to solve problems from the challenging MATH dataset. Our process-supervised model solves 78% of problems from a representative subset of the MATH test set. Additionally, we show that active learning significantly improves the efficacy of process supervision. To support related research, we also release PRM800K, the complete dataset of 800,000 step-level human feedback labels used to train our best reward model.
representative citing papers
Shared token budgets between visible chain-of-thought and answers create a coupling tax that makes non-thinking competitive on math benchmarks, with a truncation decomposition predicting the crossover and split budgets improving results.
MedPRMBench is the first fine-grained benchmark for process reward models in medical reasoning, featuring 6500 questions, 13000 chains, 113910 step labels, and a baseline that improves downstream QA accuracy by 3.2-6.7 points.
A Lean-verified multi-agent system produces a catalogue of 14,116 quantum codes with transversal diagonal gates for small parameters, extracts infinite families, and resolves specific distance-3 cases with constructions and no-go proofs.
ExCyTIn-Bench is the first benchmark of 7542 questions from Microsoft Sentinel threat investigation graphs, where the best LLM agent achieves a reward of 0.606.
ELDR reduces median TPOT by 5.9-13.9% in PD-disaggregated MoE serving via expert signatures from prefill, K-means partitioning, and locality-band routing with KV-co-indexed signature cache.
Flow models reach 99.2% Sudoku accuracy in 7 passes and 96.1% on out-of-distribution Sudoku-Extreme by selecting dynamically stable candidates and training with self-conditioning plus DPO to avoid failed outputs.
GILP trains a parameterized backbone for valid actions and state predictions, then uses a consistency gate with LLM drafts to reduce hallucinated-state rate from 0.176 to 0.035 on GPT-4o-mini while raising success from 0.668 to 0.838.
VCT abstracts non-linear LLM operations into authenticated state transitions via atomic Q&A hash chains, session Merkle roots, and account-level roots with joint signatures, plus protocols for deletions and concurrency detection.
Verifiable search procedures cannot be learned as forward chain-of-thought by language models; they instead learn memorization, verification, or require precomputed catalogs.
ICT framework applies JS divergence to token logits to select critical tokens for selective RLVR updates, claiming 4.58% average pass@4 gains on Qwen2.5 models across seven reasoning benchmarks.
DivInit improves agentic search breadth scaling by selecting diverse first-turn queries from a single model generation, delivering 5-7 point gains on multi-hop QA across five models and eight benchmarks at matched compute.
This paper introduces a taxonomy of four LLM failure modes on research math proofs and empirically shows premise smuggling in all eight audited Gemini outputs, with a new audit instrument achieving 100% precision.
SWITCH uses explicit <swi> and </swi> boundary tokens to make latent chain-of-thought compatible with on-policy RL (GRPO) and open to causal mechanistic probing, outperforming prior hidden-state recurrence methods.
EBA clusters sampled LLM generations in representation space to estimate agreement, outperforming random selection with stable scaling and showing that central positions correlate with higher generation quality.
Establishes a quadratic lower bound on query complexity for sampling from large classes of distributions given approximate density oracles, answers an open question on optimality of random walks, and shows circumvention for bounded classes as an abstraction of TTT.
AR-OPD disentangles privileged supervision via anchored residual guidance to reduce hindsight leakage in on-policy distillation, reporting gains of 2.3 points over full privileged OPD and 7.9 over SFT on reasoning tasks.
A paired-image benchmark reveals that many MLLMs fail to update predictions when task-critical visual evidence changes, even when they answer individual images correctly.
Introduces CHARM framework that detects cascading hallucinations in agentic RAG at 89.4% rate with 5.3% false positives and reduces error propagation by 82.1% on multi-hop QA benchmarks.
ChemCoTBench-V2 is a new rule-verifiable benchmark with 5,620 samples across 18 tasks that evaluates LLM chemical reasoning traces using deterministic chemistry rules and reference traces rather than final answers alone.
ResMerge improves merging of RL expert LLMs via a stable residual consensus backbone plus gated head correction, outperforming task-vector and spectral baselines in capability preservation.
Chunk-Level Guided Generation uses off-the-shelf large LLMs to score fixed-length chunks from small models via likelihoods, matching trained PRM performance on math benchmarks without reward-model training.
PEFT-Arena reveals distinct stability-plasticity profiles across PEFT methods, with orthogonal finetuning achieving the best Pareto frontier under comparable parameter budgets, supported by weight-space spectral and activation-space retention analyses.
ARBITER models reasoning trajectory basins in test-time sampling and uses model-internal signals to correct majority-vote failures, recovering part of the oracle gap on math benchmarks.
citing papers explorer
-
Llemma: An Open Language Model For Mathematics
Continued pretraining of Code Llama on Proof-Pile-2 yields Llemma, an open math-specialized LLM that beats known open base models on MATH and supports tool use plus formal proving out of the box.
-
Large Language Models Cannot Self-Correct Reasoning Yet
LLMs cannot reliably self-correct reasoning mistakes using only their internal capabilities and often degrade in performance without external feedback.
-
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving
ToRA trains language models on interactive tool-use trajectories with imitation learning and output shaping to integrate reasoning and external tools, yielding 13-19% gains on math datasets and new highs like 44.6% on MATH for a 7B model.
-
Chain-of-Verification Reduces Hallucination in Large Language Models
Chain-of-Verification reduces hallucinations in large language models by drafting responses, planning independent verification questions, answering them separately, and generating a final verified output.
-
Scaling Relationship on Learning Mathematical Reasoning with Large Language Models
Pre-training loss predicts LLM math reasoning better than parameter count; rejection sampling fine-tuning with diverse paths raises LLaMA-7B accuracy on GSM8K from 35.9% with SFT to 49.3%.
-
Predict, Reuse, and Repair: Accelerating Dynamic Sparse Attention for Long-Context LLM Decoding
PRR accelerates dynamic sparse attention decoding in long-context LLMs via EMA-based prediction, speculative attention, and FlashAttention repair, achieving up to 40% latency reduction.
-
Prefix-Guided On-Policy Distillation: Mining Golden Trajectories from Rollouts
PG-OPD uses early prefix overlap to selectively continue only high-compatibility rollouts in on-policy distillation, reporting up to 4.8 accuracy points gained and 2.46x less training time on math benchmarks.
-
Sakana Fugu Technical Report
Sakana Fugu trains LLM orchestrators using fine-tuning, evolutionary algorithms, and RL to build query-adaptive multi-agent scaffolds, claiming SOTA results on benchmarks including SWE-Bench Pro and GPQA-Diamond.
-
Nothing from Something: Can a Language Model Discover 0?
Language models require explicit examples to learn zero in arithmetic but language pretraining halves the examples needed.
-
Order Is Not Control
Order is distinct from control, where control is defined as a local receiver-gated response law demonstrated across biological circuits and LLM response panels with reported prediction accuracies of 72-84%.
-
Evaluating Research-Level Math Proofs via Strict Step-Level Verification
A step-level verification framework for LLMs on research-level proofs from the FirstProof benchmark outperforms global methods by enforcing per-step context and theorem constraints, shifting errors from hallucinations to pedantic rejections.
-
Ishigaki-IDS: An Open-Weight Verifier-Aware Model for Information Delivery Specification Drafting in Building Information Modeling
Ishigaki-IDS is a verifier-aware LLM for generating validator-passing IDS files in BIM, reaching IDSAuditPass scores of 0.651-0.753 on a 166-case benchmark and cutting practitioner work time by 54.7%.
-
When Does Delegation Beat Majority? A Delegation-Based Aggregator for Multi-Sample LLM Inference
PPV delegation using letter entropy and per-question embedding cosine beats majority voting by 1.5 pp overall on MMLU-Pro in an unsupervised setting.
-
Selective-Advantage Entropy-Adaptive Horizon GRPO: Asymmetric Token-Level Discounting for Efficient Reinforcement Learning of Language Models
SA-AH-GRPO applies asymmetric entropy-based discounting only to negative-advantage trajectories in GRPO, yielding similar peak Pass@1 accuracy with 3.6x lower training variance on GSM8K for Qwen 2.5 models.
-
Evaluating Reasoning Fidelity in Visual Text Generation
T2I models frequently exhibit semantic errors, logical inconsistencies, and incorrect reasoning steps in visual text generation tasks, unlike text-only models.
-
AXIOM: A Trust-First Neuro-Symbolic Execution Architecture for Verifiable Mathematical Reasoning
AXIOM routes math problems via LLM canonicalization to 3100+ deterministic CAS handlers, reporting 94.36% correctness at 100% trust on parseable MATH benchmark items with no confident-wrong answers.
-
MESA: Improving MoE Safety Alignment via Decentralized Expertise
MESA decentralizes safety duties in MoE LLMs via expert capacity reallocation and dynamic routing refinement based on optimal transport theory, yielding robust defense on harmful benchmarks while preserving helpfulness.
-
Critic-R: Improving Agentic Search using Instruction-tuned Retrievers with Natural Language Introspective Feedback
Critic-R uses a critic model for natural-language introspective feedback to refine queries at inference time and optimize retrievers from successful/failed trajectories on multi-hop QA tasks.
-
Capability Self-Assessment: Teaching LLMs to Know Their Limits
Reinforcement learning teaches LLMs to assess their own capabilities more effectively than supervised fine-tuning, preserves original skills, generalizes out of distribution, and aids local-cloud routing and data selection.
-
CAST: Non-Privileged Clipped Asymmetric Self-Teaching with Advantage Flipping for GRPO
CAST adds non-privileged self-teacher scoring and bidirectional advantage flipping to GRPO so that zero-variance groups still produce verifier-signed token gradients.
-
Automating Formal Verification with Reinforcement Learning and Recursive Inference
RLVR training raises verified Dafny pass rates from 9.7% to 31.1% on a filtered benchmark while a Lean proof scaffold lifts success from 46.2% to 69.2% on a pilot set and solves 7 of 42 prior unsolved tasks.
-
Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs
Sample difficulty in RLVR shows non-monotonic effects on LLM reasoning, with easy/medium problems strengthening computation and reasoning features while hard problems often yield weak or harmful signals.
-
Prefix-Safe Bayesian Belief Tracking for LLM Reasoning Reliability: Separating Calibration from Ranking
SBBT separates Brier-score calibration gains from AUROC ranking gains in prefix-conditioned success estimation for LLM math reasoning, with structure-aware signals yielding up to +0.110 AUROC over baselines.
-
Less is More: Early Stopping Rollout for On-Policy Distillation
Early Stopping Rollout limits on-policy distillation to initial tokens to counter teacher decay, outperforming full rollouts in accuracy, efficiency, and stability while revealing cascading alignment effects.
-
Stabilizing Recurrent Dynamics for Test-Time Scalable Latent Reasoning in Looped Language Models
STARS trains looped language models with Jacobian spectral radius regularization and random loop sampling to drive latent states toward asymptotically stable fixed points, yielding reliable test-time scaling on arithmetic and mathematical reasoning tasks.
-
Cassandra: Enabling Reasoning LLMs at Edge via Self-Speculative Decoding
Cassandra is a self-speculative decoding system that builds a draft model via fine-grained data selection and optimized pruning/mantissa truncation, achieving up to 2.41x speedup over BF16 and 1.81x more tokens than Eagle-3 on Llama 3 8B without training.
-
Hide to Guide: Learning via Semantic Masking
SMEPO applies fine-grained semantic masking to expert guidance in RLVR, turning hard problems into fill-in-the-blank tasks while preserving structure, yielding up to 3.2 point accuracy gains and 4.2x faster training.
-
StepGap: A Hybrid NLI-LLM Checker for Step-Level Evidence-Gap Detection in Multi-Hop Question Answering
StepGap hybrid checker detects typed evidence gaps in multi-hop QA steps with sF1 72.0 and boosts downstream model EM by 3.3 points when used as GRPO reward.
-
CaptchaMind: Training CAPTCHA Solvers via Reinforcement Learning with Explicit Reasoning Supervision
Presents CaptchaBench benchmark and CaptchaMind RL solver achieving 82.9% success on benchmark tasks and 71% on real-world CAPTCHAs via explicit reasoning process supervision.
-
LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models
LambdaPO introduces pairwise preference-based advantage estimation and a semantic density reward to extract more optimization signal from trajectory groups than GRPO's monolithic baseline.
-
R2V Agent: Teaching SLMs When to Ask for Help
R2V-Agent combines an SLM policy trained via BC and DPO with a step-level risk-calibrated router using Brier scores and CVaR to escalate to LLM only on high residual failure risk, improving success-cost tradeoffs on HumanEval+, TextWorld, and TerminalBench.
-
YFPO: A Preliminary Study of Yoked Feature Preference Optimization with Neuron-Guided Rewards for Mathematical Reasoning
YFPO augments standard preference optimization with neuron-level activation margins from math-related features to improve LLM reasoning on math tasks.
-
Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation
On-policy distillation gains efficiency from early foresight in module allocation and update directions, which the proposed EffOPD method exploits for 3x faster training with comparable performance.
-
Safe Multi-Agent Behavior Must Be Maintained, Not Merely Asserted: Constraint Drift in LLM-Based Multi-Agent Systems
Safety constraints in LLM-based multi-agent systems commonly weaken during execution through memory, communication, and tool use, requiring them to be maintained as explicit state rather than asserted once.
-
Trajectory Supervision for Continual Tool-Use Learning in LLMs
Retaining tool-use trajectories during sequential fine-tuning on API domains improves next-call prediction accuracy by 17.7 points over stripped-history training.
-
On-Policy Distillation with Best-of-N Teacher Rollout Selection
BRTS improves on-policy distillation by sampling multiple teacher rollouts and selecting the best one via a correctness-first then alignment priority rule, yielding gains on AIME and AMC math benchmarks.
-
Robust Multi-Agent LLMs under Byzantine Faults
SAC is a decentralized iterative filter-and-refine protocol that achieves (F+1)-robustness in LLM multi-agent systems, suppressing Byzantine influence and improving performance on reasoning benchmarks where prior methods fail.
-
Matrix-Decoupled Concentration for Autoregressive Sequences: Dimension-Free Guarantees for Sparse Long-Context Rewards
Develops a McDiarmid-type concentration inequality for causal autoregressive processes that preserves sparsity to achieve O(1) variance proxies instead of O(N).
-
Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes
Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.
-
DeepImagine: Learning Biomedical Reasoning via Successive Counterfactual Imagining
DeepImagine trains LLMs on counterfactual pairs from clinical trials using supervised fine-tuning and reinforcement learning to improve outcome prediction by approximating causal mechanisms.
-
Language as a Latent Variable for Reasoning Optimization
Treating language as a latent variable via polyGRPO RL improves Qwen2.5-7B-Instruct by 6.72% on English reasoning benchmarks and 6.89% on multilingual ones, with cross-task gains on commonsense reasoning from math-only training.
-
Prioritizing the Best: Incentivizing Reliable Multimodal Reasoning by Rewarding Beyond Answer Correctness
Groupwise Ranking Reward reduces reasoning-answer inconsistency in multimodal models and raises reliability-conditioned accuracy from 47.4% to 54.7% over standard RLVR.
-
Reducing Credit Assignment Variance via Counterfactual Reasoning Paths
Introduces IBPO, a counterfactual credit assignment method that turns sparse terminal rewards into process-level advantage estimates for more stable LLM reasoning training.
-
Targeted Exploration via Unified Entropy Control for Reinforcement Learning
UEC-RL improves RL reasoning performance in LLMs and VLMs by activating exploration on hard prompts and stabilizing entropy, delivering a 37.9% relative gain over GRPO on Geometry3K.
-
Correct Prediction, Wrong Steps? Consensus Reasoning Knowledge Graph for Robust Chain-of-Thought Synthesis
CRAFT constructs a consensus Reasoning Knowledge Graph from multiple LLM reasoning traces and synthesizes improved chain-of-thought via topological generation, raising label-prediction accuracy by over 10% on logic and math benchmarks.
-
From Pixels to Digital Agents: An Empirical Study on the Taxonomy and Technological Trends of Reinforcement Learning Environments
An empirical literature analysis reveals a bifurcation in RL environments into Semantic Prior (LLM-dominated) and Domain-Specific Generalization ecosystems with distinct cognitive fingerprints.
-
Goldilocks RL: Tuning Task Difficulty to Escape Sparse Rewards for Reasoning
A teacher-driven sampling method selects appropriately difficult questions for student models in GRPO-based RL to improve reasoning performance under fixed compute on OpenMathReasoning.
-
Hard Negative Sample-Augmented DPO Post-Training for Small Language Models
A six-dimensional MathVerifier supplies hard negatives and per-sample weights that improve DPO performance on math reasoning for a 1.5B Qwen2.5 model over standard SFT and unweighted DPO.
-
Rethinking Expert Trajectory Utilization in LLM Post-training for Mathematical Reasoning
Sequential SFT followed by RL, guided by the Plasticity-Ceiling Framework, achieves higher performance ceilings in LLM mathematical reasoning than synchronized methods by optimizing data scale and transition timing.
-
DiffAdapt: Difficulty-Adaptive Reasoning for Token-Efficient LLM Inference
DiffAdapt detects problem difficulty via entropy in reasoning traces and applies one of three fixed inference strategies per question, cutting token usage up to 22.4% with comparable or better accuracy across five models and eight benchmarks.