Shared token budgets between visible chain-of-thought and answers create a coupling tax that makes non-thinking competitive on math benchmarks, with a truncation decomposition predicting the crossover and split budgets improving results.
super hub Canonical reference
Let's Verify Step by Step
Canonical reference. 81% of citing Pith papers cite this work as background.
abstract
In recent years, large language models have greatly improved in their ability to perform complex multi-step reasoning. However, even state-of-the-art models still regularly produce logical mistakes. To train more reliable models, we can turn either to outcome supervision, which provides feedback for a final result, or process supervision, which provides feedback for each intermediate reasoning step. Given the importance of training reliable models, and given the high cost of human feedback, it is important to carefully compare the both methods. Recent work has already begun this comparison, but many questions still remain. We conduct our own investigation, finding that process supervision significantly outperforms outcome supervision for training models to solve problems from the challenging MATH dataset. Our process-supervised model solves 78% of problems from a representative subset of the MATH test set. Additionally, we show that active learning significantly improves the efficacy of process supervision. To support related research, we also release PRM800K, the complete dataset of 800,000 step-level human feedback labels used to train our best reward model.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract In recent years, large language models have greatly improved in their ability to perform complex multi-step reasoning. However, even state-of-the-art models still regularly produce logical mistakes. To train more reliable models, we can turn either to outcome supervision, which provides feedback for a final result, or process supervision, which provides feedback for each intermediate reasoning step. Given the importance of training reliable models, and given the high cost of human feedback, it is important to carefully compare the both methods. Recent work has already begun this comparison, bu
authors
co-cited works
representative citing papers
MedPRMBench is the first fine-grained benchmark for process reward models in medical reasoning, featuring 6500 questions, 13000 chains, 113910 step labels, and a baseline that improves downstream QA accuracy by 3.2-6.7 points.
A Lean-verified multi-agent system produces a catalogue of 14,116 quantum codes with transversal diagonal gates for small parameters, extracts infinite families, and resolves specific distance-3 cases with constructions and no-go proofs.
ExCyTIn-Bench is the first benchmark of 7542 questions from Microsoft Sentinel threat investigation graphs, where the best LLM agent achieves a reward of 0.606.
PEFT-Arena reveals distinct stability-plasticity profiles across PEFT methods, with orthogonal finetuning achieving the best Pareto frontier under comparable parameter budgets, supported by weight-space spectral and activation-space retention analyses.
EDGE-OPD adds guided rollouts and evidence masking to on-policy self-distillation, enabling successful learning of target identities where standard OPSD and RLSD fail.
GS-QA is a new benchmark of 2,800 QA pairs on 28 templates using OSM and Wikipedia data to evaluate LLMs on spatial predicates, multi-source reasoning, and diverse answer types including distances and counts.
Causal diagnosis identifies the routing module as bottleneck in LLM agents but prompt patching there degrades results due to linguistic co-adaptation, while upstream patching improves them.
In 1-3B instruction-tuned LMs on GSM8K, arithmetic CoT readout is dominated by positional copying of the trailing number before the answer delimiter, accounting for 54-92 percentage points of accuracy.
LinAlg-Bench shows LLMs switch from execution errors to computational abandonment and structured fabrication at 4x4 matrix scale, indicating a working memory limit rather than knowledge gaps.
DCDM replaces positional blocks with learnable semantic chunks via differentiable Chunking Attention, yielding consistent gains over block and unstructured diffusion baselines up to 1.5B parameters.
VPD frames language feedback learning as variational EM so the teacher policy refines itself via trust-region updates on outcomes while the student learns dense token distributions on its own rollouts, outperforming fixed-teacher baselines on reasoning and code tasks.
DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.
Hallucination is detected as a transport-cost excursion in hidden-state trajectories, localized via contrastive PCA in a teacher model and distilled to a BiLSTM student.
RDPO applies magnitude-aware quantile normalization and Mahalanobis whitening to decorrelate heterogeneous rewards in multi-objective RL, improving instruction following and writing quality on LongCat-Flash post-training while staying competitive on reasoning and coding.
Test-Time Hinting trains a hint generator to prepend contextual guidance to VLM prompts, improving accuracy on natural-image VQA benchmarks with generalization to unseen tasks and models.
Reward-Weighted On-Policy Distillation with an open property-equivalence verifier produces a 7B model that surpasses prior SOTA on NL-to-SVA generation across pass@1/5/10 metrics.
Distillation signals align better with ideal updates on incorrect student rollouts than correct ones, with optimal teacher context depending on student capacity and task.
Presents a likelihood-based benchmark for equation-suffix prediction in technical papers with controls to detect shortcut vulnerabilities in model forecasts.
LLMs rely on semantic cues for matrix-game equilibria but can acquire approximate computation via residual training on small instances, with a Lipschitz proof enabling transfer to larger anonymous games.
Frontier LLMs achieve 95-100% accuracy on AMC/AIME problems but recover far fewer distinct valid strategies than human references, while collectively generating 50 novel strategies.
Massive activations first appear in a single ME Layer due to RMSNorm and FFN, remain invariant thereafter, and a simple softening method raises LLM performance while reducing attention sinks.
vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensive full-vocabulary methods.
CIKA uses LLM-based interventions to probe causal effects of concepts on math reasoning success, achieving competitive results on benchmarks like Omni-MATH and GSM8K with a frozen 7B model.
citing papers explorer
-
EDGE-OPD: Internalizing Privileged Context with Evidence Guided On-Policy Distillation
EDGE-OPD adds guided rollouts and evidence masking to on-policy self-distillation, enabling successful learning of target identities where standard OPSD and RLSD fail.
-
LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes in LLM Mathematical Reasoning
LinAlg-Bench shows LLMs switch from execution errors to computational abandonment and structured fabrication at 4x4 matrix scale, indicating a working memory limit rather than knowledge gaps.
-
Beyond Accuracy: Evaluating Strategy Diversity in LLM Mathematical Reasoning
Frontier LLMs achieve 95-100% accuracy on AMC/AIME problems but recover far fewer distinct valid strategies than human references, while collectively generating 50 novel strategies.
-
Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks
COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.
-
IG-Search: Step-Level Information Gain Rewards for Search-Augmented Reasoning
IG-Search computes step-level information gain rewards from policy probabilities to improve credit assignment in RL training for search-augmented QA, yielding 1.6-point gains over trajectory-level baselines on multi-hop tasks.
-
AI Achieves a Perfect LSAT Score
Language models achieve a perfect LSAT score, with experiments showing that internal thinking phases and a fine-tuned process reward model are key to high performance on logical reasoning questions.
-
WMF-AM: Probing LLM Working Memory via Depth-Parameterized Cumulative State Tracking
WMF-AM is a depth-parameterized benchmark that measures LLMs' cumulative state tracking ability without scratchpads, validated on 28 models across arithmetic and non-arithmetic tasks with ablations confirming the construct.
-
Pramana: Fine-Tuning Large Language Models for Epistemic Reasoning through Navya-Nyaya
Fine-tuning LLMs on Navya-Nyaya's six-phase reasoning structure yields 100% semantic correctness on held-out logical problems despite only 40% strict format adherence.
-
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation
Chain-of-thought monitoring detects reward hacking in frontier reasoning models, but strong optimization against the monitor produces obfuscated misbehavior that remains hard to detect.
-
Grounded Iterative Language Planning: How Parameterized World Models Reduce Hallucination Propagation in LLM Agents
GILP combines a small parameterized world model with LLM agent reasoning via a consistency gate, reducing hallucinated-state rate from 0.176 to 0.035 and raising success from 0.668 to 0.838 on graph planning benchmarks.
-
FALAT: Tracing Failures in LLM Agent Trajectories via Dependency-Guided Search
FALAT improves failure attribution in LLM agent trajectories via dependency-guided search, achieving 46.0% step-level accuracy on algorithm-generated and 29.1% on hand-crafted trajectories in the Who&When benchmark.
-
Latent Reward Steering: An Adaptive Inference-Time Framework that Implicitly Promotes Cognitive Behaviors in Reasoning LLMs
LRS trains a latent reward model on final-answer correctness to steer SAE states during inference, improving reasoning performance and implicitly encouraging better cognitive behaviors.
-
Rubric-Guided Process Reward for Stepwise Model Routing
RoRo uses alternating optimization of a Rubricor and Judge to create process rewards from outcome-cost-process preference pairs, then combines them with outcome rewards via GRPO to train stepwise model routers that outperform baselines on five reasoning benchmarks.
-
HRBench: Benchmarking and Understanding Thinking-Mode Switch Strategies in Hybrid-Reasoning LLMs
HRBench evaluates 12 switching strategy settings across 6 LLMs and 5 benchmarks, showing prompt-based, routing, and speculative methods occupy distinct effectiveness-efficiency trade-off regions.
-
Verifiable Benchmarking of Long-Horizon Spatial Biology
Introduces SpatialBench-Long benchmark with 24 evaluations on spatial biology datasets from PDAC, glioblastoma, lung adenocarcinoma and optic nerve systems, reporting top model performance at 8/72 runs (11.1%).
-
Beyond Query Memorization: Large Language Model Routing with Query Decomposition and Historical Matching
DecoR routes LLM queries by decomposing them into capability dimensions and matching to historical examples, yielding higher accuracy and lower inference costs than direct-mapping routers on both in-distribution and OOD data.
-
Credit Assignment with Resets in Language Model Reasoning
The paper introduces Random-Reset Policy Optimization (RRPO) and Self-Reset Policy Optimization (SRPO) that use resets to enable more precise credit assignment in RL for language model reasoning, with SRPO outperforming GRPO and RRPO across benchmarks.
-
GRACE: Gradient-aligned Reasoning Data Curation for Efficient Post-training
GRACE scores reasoning steps via gradient alignment and trajectory consistency to select data subsets that match full performance with 5% of the data on Qwen3-VL-2B-Instruct.
-
Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning
BET reduces reasoning tokens by about 55% on average while improving performance across benchmarks by learning to short-solve easy queries, fold early on unsolvable ones, and preserve budget for hard solvable queries.
-
Internalizing Safety Understanding in Large Reasoning Models via Verification
Training large reasoning models only on safety verification tasks internalizes safety understanding and boosts robustness to out-of-domain jailbreaks, providing a stronger base for reinforcement learning alignment than standard supervised fine-tuning.
-
Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning
Rubric-grounded RL with LLM judges on document-derived criteria raises Llama-3.1-8B normalized reward to 71.7% on held-out rubrics and improves performance on GSM8K, MATH, and GPQA benchmarks.
-
Implicit Compression Regularization: Concise Reasoning via Internal Shorter Distributions in RL Post-Training
ICR creates a virtual shorter distribution from shortest correct on-policy responses to regularize RL post-training toward concise yet accurate reasoning, improving the accuracy-length Pareto frontier on math and knowledge benchmarks.
-
NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles
NeuroState-Bench supplies human-calibrated tasks and probes that measure commitment integrity in LLM agents and shows this measure diverges from ordinary task success.
-
HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering
HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
-
TPS-CalcBench: A Benchmark and Diagnostic Evaluation Framework for LLM Analytical Calculation Competence in Hypersonic Thermal Protection System Engineering
TPS-CalcBench is a new benchmark and evaluation framework that tests LLMs on analytical calculations in hypersonic aerodynamics and gas dynamics, using dual-track scoring and interventions to detect physically invalid reasoning.
-
ContraPrompt: Contrastive Prompt Optimization via Dyadic Reasoning Trace Analysis
ContraPrompt extracts optimization rules from dyadic differences in reasoning traces on identical inputs and organizes them into input-aware decision trees, outperforming GEPA on four benchmarks with gains up to 8.29 pp.
-
Beyond Static Snapshots: A Grounded Evaluation Framework for Language Models at the Agentic Frontier
ISOPro replaces learned reward models with deterministic verifiers in a continuous evaluation setup for LLMs, delivering larger average capability gains than GRPO-LoRA across small models in scheduling and MBPP domains while characterizing a buffer-skew failure mode.
-
Structured Abductive-Deductive-Inductive Reasoning for LLMs via Algebraic Invariants
A symbolic protocol operationalizes Peirce's tripartite reasoning for LLMs using five algebraic invariants including a Weakest Link bound to enforce logical consistency and prevent weak premises from supporting strong conclusions.
-
HintMR: Eliciting Stronger Mathematical Reasoning in Small Language Models
A cooperative system with one SLM distilling stepwise hints from a large model to guide another SLM's math reasoning yields consistent accuracy gains on benchmarks.
-
SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks
SPPO enables stable, sample-efficient alignment of LLMs on long-horizon reasoning tasks by using a decoupled scalar value function for low-variance advantages without multi-sampling.
-
FoE: Forest of Errors Makes the First Solution the Best in Large Reasoning Models
Errors in large reasoning models form a forest structure that grows with more steps, making the first solution best; RED refines the first and prunes the rest for higher performance with less compute.
-
Beyond Content Safety: Real-Time Monitoring for Reasoning Vulnerabilities in Large Language Models
An external zero-shot monitor detects nine unsafe reasoning behaviors in LLMs at 87% step-level accuracy with low false positives and low latency.
-
Token-Level LLM Collaboration via FusionRoute
FusionRoute augments token-level expert routing with a trainable complementary logit generator to expand the policy class and recover optimal decoding under mild conditions, outperforming prior collaboration and merging methods on reasoning and generation benchmarks.
-
SCALER:Synthetic Scalable Adaptive Learning Environment for Reasoning
SCALER creates adaptive synthetic environments for RL-based LLM reasoning training that outperforms fixed-dataset baselines with more stable long-term progress.
-
The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.
-
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
-
The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
LRMs exhibit complete accuracy collapse beyond certain puzzle complexities, with reasoning effort rising then declining, outperforming standard LLMs only on medium-complexity tasks.
-
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
Math-Shepherd is an automatically trained process reward model that scores solution steps to verify and reinforce LLMs, lifting Mistral-7B from 77.9% to 89.1% on GSM8K and 28.6% to 43.5% on MATH.
-
Nothing from Something: Can a Language Model Discover 0?
Language models require explicit examples to learn zero in arithmetic but language pretraining halves the examples needed.
-
AXIOM: A Trust-First Neuro-Symbolic Execution Architecture for Verifiable Mathematical Reasoning
AXIOM routes math problems via LLM canonicalization to 3100+ deterministic CAS handlers, reporting 94.36% correctness at 100% trust on parseable MATH benchmark items with no confident-wrong answers.
-
Capability Self-Assessment: Teaching LLMs to Know Their Limits
Reinforcement learning teaches LLMs to assess their own capabilities more effectively than supervised fine-tuning, preserves original skills, generalizes out of distribution, and aids local-cloud routing and data selection.
-
Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs
Sample difficulty in RLVR shows non-monotonic effects on LLM reasoning, with easy/medium problems strengthening computation and reasoning features while hard problems often yield weak or harmful signals.
-
Prefix-Safe Bayesian Belief Tracking for LLM Reasoning Reliability:Separating Calibration from Ranking
SBBT separates Brier-score calibration gains from AUROC ranking gains in prefix-conditioned success estimation for LLM math reasoning, with structure-aware signals yielding up to +0.110 AUROC over baselines.
-
Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes
Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.
-
Targeted Exploration via Unified Entropy Control for Reinforcement Learning
UEC-RL improves RL reasoning performance in LLMs and VLMs by activating exploration on hard prompts and stabilizing entropy, delivering a 37.9% relative gain over GRPO on Geometry3K.
-
From Pixels to Digital Agents: An Empirical Study on the Taxonomy and Technological Trends of Reinforcement Learning Environments
An empirical literature analysis reveals a bifurcation in RL environments into Semantic Prior (LLM-dominated) and Domain-Specific Generalization ecosystems with distinct cognitive fingerprints.
-
PRL: Prompts from Reinforcement Learning
PRL is a reinforcement learning method that generates novel prompts and achieves state-of-the-art results on text classification, simplification, and summarization benchmarks, outperforming APE and EvoPrompt.
-
Intelligence as Managed Autonomy: Failure, Escalation, and Governance for Agentic AI Systems
Introduces the SMARt four-layer model with timed guarded Petri nets to formalize detection of epistemic drift, recovery, and controlled surrender of autonomy in AI agents.
-
Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models
The paper surveys reinforced reasoning techniques for LLMs, covering automated data construction, learning-to-reason methods, and test-time scaling as steps toward Large Reasoning Models.
- Search-E1: Self-Distillation Drives Self-Evolution in Search-Augmented Reasoning