hub

Process reward models that think

BioSimSpace: An interoperable Python framework for biomolecular simulation · 2015 · arXiv 2504.16828

27 Pith papers cite this work. Polarity classification is still indexing.

27 Pith papers citing it

read on arXiv browse 27 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 3 method 1

citation-polarity summary

background 3 use method 1

representative citing papers

SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology

cs.AI · 2026-03-30 · conditional · novelty 8.0

SARL rewards reasoning topology to improve label-free RL, outperforming baselines with gains up to 44.7% on math and 34.6% on open-ended tasks while maintaining more stable training.

Evidence-State Rewards for Long-Context Reasoning

cs.AI · 2026-07-02 · unverdicted · novelty 7.0

Maven is an RL method using answer-conditioned evidence-state values to assign rewards to add, link, and drop actions on evidence memory, outperforming outcome-only baselines on LongBench v2, LongReason, and RULER.

The Hidden Bias of Process Reward Models:PRISM for Rewarding the Right Reasoning

cs.LG · 2026-06-08 · unverdicted · novelty 7.0

PRISM is a contrastive, policy-aware training framework for process reward models that reduces false positives by 22% on PRMBench and boosts downstream accuracy up to 33% in Best-of-N selection by learning reliable relative comparisons instead of pointwise labels.

Verifiable Counterfactual Supervision for Process Reward Models

cs.AI · 2026-05-04 · unverdicted · novelty 7.0

Presents verifiable counterfactual process supervision that generates annotated trajectories via template-aware error injection on symbolic chains, improving Best-of-8 reranking on logical reasoning benchmarks with preliminary math transfer.

Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis

cs.CL · 2026-04-27 · unverdicted · novelty 7.0 · 2 refs

DataPRM is an environment-aware generative process reward model that improves LLM data analysis agents by 7-11% on benchmarks via active verification and reflection-aware ternary rewards.

GSAR: Typed Grounding for Hallucination Detection and Recovery in Multi-Agent LLMs

cs.AI · 2026-04-25 · unverdicted · novelty 7.0

GSAR is a grounding-evaluation framework for multi-agent LLMs that uses a four-way claim typology, evidence-weighted asymmetric scoring, and tiered recovery decisions to detect and mitigate hallucinations.

Bringing Value Models Back: Generative Critics for Value Modeling in LLM Reinforcement Learning

cs.LG · 2026-04-12 · unverdicted · novelty 7.0

GenAC introduces generative critics with chain-of-thought reasoning and in-context conditioning to improve value approximation and downstream RL performance in LLMs compared to value-based and value-free baselines.

MDForge: Agentic Molecular Dynamics Pipeline Design under Sparse Simulator Feedback

cs.AI · 2026-06-11 · unverdicted · novelty 6.0

MDForge uses an LLM agent with multi-agent debate to densify sparse simulator feedback for automatic MD pipeline design, matching human experts on SAMPL benchmarks and identifying a lab-confirmed picomolar CB[7] binder.

Latent Reward Steering: An Adaptive Inference-Time Framework that Implicitly Promotes Cognitive Behaviors in Reasoning LLMs

cs.AI · 2026-05-30 · unverdicted · novelty 6.0

LRS trains a latent reward model on final-answer correctness to steer SAE states during inference, improving reasoning performance and implicitly encouraging better cognitive behaviors.

SDR: Set-Distance Rewards for Radiology Report Generation

cs.AI · 2026-05-30 · unverdicted · novelty 6.0

Set-to-set distances on sentence embeddings provide a permutation-invariant reward signal that improves GRPO training and enables efficient test-time scaling for vision-language models generating chest X-ray reports.

ExpGraph: Model-Agnostic Experience Learning with Graph-Structured Memory for LLM Agents

cs.CL · 2026-05-29 · unverdicted · novelty 6.0

ExpGraph builds a graph of summarized agent experiences and uses graph diffusion plus an RL-trained retrieval copilot to improve frozen LLM executors on QA, math, code, and agentic tasks without parameter updates.

VeriGate: Verifier-Gated Step-Level Supervision for GRPO

cs.LG · 2026-05-28 · unverdicted · novelty 6.0

VeriGate adds verifier-gated step-level supervision to GRPO via cumulated PRM rewards and group-normalized token advantages, raising accuracy 20% and 12% on 1.5B and 7B models on MATH and six benchmarks.

In LLM Reasoning, there is Irrationality on top of Value Misalignment

cs.AI · 2026-05-26 · unverdicted · novelty 6.0

LLMs display widespread rational value risk in reasoning that value alignment reduces but does not remove, with risk sensitive to inference strategy and showing diminishing returns from longer reasoning.

Process Rewards with Learned Reliability

cs.CL · 2026-05-15 · unverdicted · novelty 6.0

BetaPRM learns distributional step rewards with explicit reliability via Beta-Binomial modeling, enabling ACA that cuts token use by up to 33.57% while raising final-answer accuracy on reasoning benchmarks.

STRIDE: Learnable Stepwise Language Feedback for LLM Reasoning

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

STRIDE co-trains generator and verifier on outcome rewards alone to deliver learnable stepwise language feedback that redirects LLM reasoning trajectories and outperforms scalar-reward baselines.

AutoPyVerifier: Learning Compact Executable Verifiers for Large Language Model Outputs

cs.CL · 2026-04-24 · unverdicted · novelty 6.0

AutoPyVerifier learns compact sets of executable Python verifiers from labeled LLM outputs via LLM synthesis and DAG search, improving objective prediction by up to 55 F1 points and downstream LLM accuracy by up to 17 points.

Pause or Fabricate? Training Language Models for Grounded Reasoning

cs.CL · 2026-04-21 · conditional · novelty 6.0

GRIL uses stage-specific RL rewards to train LLMs to detect missing premises, pause proactively, and resume grounded reasoning after clarification, yielding up to 45% better premise detection and 30% higher task success on insufficient math datasets.

Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning

cs.CL · 2025-12-08 · unverdicted · novelty 6.0

NPR trains LLMs to reason in parallel via self-distilled RL, delivering up to 24.5% performance gains and 4.6x speedups with 100% genuine parallel execution on reasoning benchmarks.

Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training

cs.LG · 2025-09-03 · unverdicted · novelty 6.0

PROF curates RL training data via PRM-ORM consistency to improve both final-answer accuracy and intermediate reasoning quality while reducing reliance on strong process reward models.

Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models

cs.CL · 2025-08-21 · unverdicted · novelty 6.0

Fin-PRM is a domain-specialized process reward model that supplies binary step-level and trajectory-level supervision signals for financial reasoning in LLMs and outperforms general PRMs on CFLUE and FinQA benchmarks.

Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains

cs.LG · 2025-07-23 · unverdicted · novelty 6.0

RaR uses aggregated rubric feedback as rewards in on-policy RL, delivering up to 31% relative gains on HealthBench and 7% on GPQA-Diamond versus direct Likert LLM-as-judge baselines.

TRACE: Toulmin-based Reasoning Assessment through Constructive Elements for LLM CoT Evaluation

cs.AI · 2026-05-28 · unverdicted · novelty 5.0

TRACE is a new metric for assessing LLM CoT reasoning structure via Toulmin and Flavell frameworks, showing r=0.74 correlation with accuracy on 26.3K samples and utility as an RL reward.

DenoiseRL: Bootstrapping Reasoning Models to Recover from Noisy Prefixes

cs.AI · 2026-05-27 · unverdicted · novelty 4.0

DenoiseRL optimizes recovery from noisy prefixes in weak-model reasoning failures to improve performance and self-correction on math and general reasoning benchmarks without external supervision.

A Survey of Reinforcement Learning for Large Reasoning Models

cs.CL · 2025-09-10 · accept · novelty 3.0

A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.

citing papers explorer

Showing 22 of 22 citing papers after filters.

SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology cs.AI · 2026-03-30 · conditional · none · ref 7
SARL rewards reasoning topology to improve label-free RL, outperforming baselines with gains up to 44.7% on math and 34.6% on open-ended tasks while maintaining more stable training.
Evidence-State Rewards for Long-Context Reasoning cs.AI · 2026-07-02 · unverdicted · none · ref 30
Maven is an RL method using answer-conditioned evidence-state values to assign rewards to add, link, and drop actions on evidence memory, outperforming outcome-only baselines on LongBench v2, LongReason, and RULER.
The Hidden Bias of Process Reward Models:PRISM for Rewarding the Right Reasoning cs.LG · 2026-06-08 · unverdicted · none · ref 39
PRISM is a contrastive, policy-aware training framework for process reward models that reduces false positives by 22% on PRMBench and boosts downstream accuracy up to 33% in Best-of-N selection by learning reliable relative comparisons instead of pointwise labels.
Verifiable Counterfactual Supervision for Process Reward Models cs.AI · 2026-05-04 · unverdicted · none · ref 7
Presents verifiable counterfactual process supervision that generates annotated trajectories via template-aware error injection on symbolic chains, improving Best-of-8 reranking on logical reasoning benchmarks with preliminary math transfer.
Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis cs.CL · 2026-04-27 · unverdicted · none · ref 21 · 2 links
DataPRM is an environment-aware generative process reward model that improves LLM data analysis agents by 7-11% on benchmarks via active verification and reflection-aware ternary rewards.
GSAR: Typed Grounding for Hallucination Detection and Recovery in Multi-Agent LLMs cs.AI · 2026-04-25 · unverdicted · none · ref 8
GSAR is a grounding-evaluation framework for multi-agent LLMs that uses a four-way claim typology, evidence-weighted asymmetric scoring, and tiered recovery decisions to detect and mitigate hallucinations.
Bringing Value Models Back: Generative Critics for Value Modeling in LLM Reinforcement Learning cs.LG · 2026-04-12 · unverdicted · none · ref 37
GenAC introduces generative critics with chain-of-thought reasoning and in-context conditioning to improve value approximation and downstream RL performance in LLMs compared to value-based and value-free baselines.
MDForge: Agentic Molecular Dynamics Pipeline Design under Sparse Simulator Feedback cs.AI · 2026-06-11 · unverdicted · none · ref 7
MDForge uses an LLM agent with multi-agent debate to densify sparse simulator feedback for automatic MD pipeline design, matching human experts on SAMPL benchmarks and identifying a lab-confirmed picomolar CB[7] binder.
Latent Reward Steering: An Adaptive Inference-Time Framework that Implicitly Promotes Cognitive Behaviors in Reasoning LLMs cs.AI · 2026-05-30 · unverdicted · none · ref 48
LRS trains a latent reward model on final-answer correctness to steer SAE states during inference, improving reasoning performance and implicitly encouraging better cognitive behaviors.
SDR: Set-Distance Rewards for Radiology Report Generation cs.AI · 2026-05-30 · unverdicted · none · ref 13
Set-to-set distances on sentence embeddings provide a permutation-invariant reward signal that improves GRPO training and enables efficient test-time scaling for vision-language models generating chest X-ray reports.
ExpGraph: Model-Agnostic Experience Learning with Graph-Structured Memory for LLM Agents cs.CL · 2026-05-29 · unverdicted · none · ref 20
ExpGraph builds a graph of summarized agent experiences and uses graph diffusion plus an RL-trained retrieval copilot to improve frozen LLM executors on QA, math, code, and agentic tasks without parameter updates.
VeriGate: Verifier-Gated Step-Level Supervision for GRPO cs.LG · 2026-05-28 · unverdicted · none · ref 8
VeriGate adds verifier-gated step-level supervision to GRPO via cumulated PRM rewards and group-normalized token advantages, raising accuracy 20% and 12% on 1.5B and 7B models on MATH and six benchmarks.
In LLM Reasoning, there is Irrationality on top of Value Misalignment cs.AI · 2026-05-26 · unverdicted · none · ref 136
LLMs display widespread rational value risk in reasoning that value alignment reduces but does not remove, with risk sensitive to inference strategy and showing diminishing returns from longer reasoning.
Process Rewards with Learned Reliability cs.CL · 2026-05-15 · unverdicted · none · ref 29
BetaPRM learns distributional step rewards with explicit reliability via Beta-Binomial modeling, enabling ACA that cuts token use by up to 33.57% while raising final-answer accuracy on reasoning benchmarks.
STRIDE: Learnable Stepwise Language Feedback for LLM Reasoning cs.LG · 2026-05-13 · unverdicted · none · ref 18
STRIDE co-trains generator and verifier on outcome rewards alone to deliver learnable stepwise language feedback that redirects LLM reasoning trajectories and outperforms scalar-reward baselines.
AutoPyVerifier: Learning Compact Executable Verifiers for Large Language Model Outputs cs.CL · 2026-04-24 · unverdicted · none · ref 11
AutoPyVerifier learns compact sets of executable Python verifiers from labeled LLM outputs via LLM synthesis and DAG search, improving objective prediction by up to 55 F1 points and downstream LLM accuracy by up to 17 points.
Pause or Fabricate? Training Language Models for Grounded Reasoning cs.CL · 2026-04-21 · conditional · none · ref 17
GRIL uses stage-specific RL rewards to train LLMs to detect missing premises, pause proactively, and resume grounded reasoning after clarification, yielding up to 45% better premise detection and 30% higher task success on insufficient math datasets.
TRACE: Toulmin-based Reasoning Assessment through Constructive Elements for LLM CoT Evaluation cs.AI · 2026-05-28 · unverdicted · none · ref 5
TRACE is a new metric for assessing LLM CoT reasoning structure via Toulmin and Flavell frameworks, showing r=0.74 correlation with accuracy on 26.3K samples and utility as an RL reward.
DenoiseRL: Bootstrapping Reasoning Models to Recover from Noisy Prefixes cs.AI · 2026-05-27 · unverdicted · none · ref 13
DenoiseRL optimizes recovery from noisy prefixes in weak-model reasoning failures to improve performance and self-correction on math and general reasoning benchmarks without external supervision.
Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning cs.LG · 2026-04-19 · unreviewed · ref 6
Cut Your Losses! Learning to Prune Paths Early for Efficient Parallel Reasoning cs.CL · 2026-04-17 · unreviewed · ref 12
Unleashing Implicit Rewards: Prefix-Value Learning for Distribution-Level Optimization cs.CL · 2026-04-14 · unreviewed · ref 6

Process reward models that think

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer