Unsupervised PRMs derived from LLM probabilities achieve up to 15% better error detection than LLM judges and match supervised PRMs in verification and RL tasks.
GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning
9 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
DataPRM is a new process reward model for data analysis agents that detects silent errors via environment interaction and ternary rewards, yielding 7-11% gains on benchmarks and further RL improvements.
This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
OpenClaw-RL recovers evaluative and directive signals from next-state interactions to enable online RL training of agents across terminal, GUI, SWE, and tool environments via a server-client architecture and hybrid objective.
PROF curates RL training data via PRM-ORM consistency to improve both final-answer accuracy and intermediate reasoning quality while reducing reliance on strong process reward models.
Archer introduces response-level entropy normalization and differentiated clipping/KL regularization in RLVR to encourage exploration on reasoning tokens while stabilizing knowledge tokens, yielding gains in pass@1 and pass@K on reasoning benchmarks.
CoLD mitigates length bias in process reward models for mathematical reasoning via counterfactual guidance, length penalties, bias estimation, and joint training, improving step selection accuracy and conciseness on MATH500 and GSM-Plus while boosting downstream RL performance.
citing papers explorer
-
Unsupervised Process Reward Models
Unsupervised PRMs derived from LLM probabilities achieve up to 15% better error detection than LLM judges and match supervised PRMs in verification and RL tasks.
-
Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis
DataPRM is a new process reward model for data analysis agents that detects silent errors via environment interaction and ternary rewards, yielding 7-11% gains on benchmarks and further RL improvements.
-
Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
-
OpenClaw-RL: Train Any Agent Simply by Talking
OpenClaw-RL recovers evaluative and directive signals from next-state interactions to enable online RL training of agents across terminal, GUI, SWE, and tool environments via a server-client architecture and hybrid objective.
-
Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training
PROF curates RL training data via PRM-ORM consistency to improve both final-answer accuracy and intermediate reasoning quality while reducing reliance on strong process reward models.
-
Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR
Archer introduces response-level entropy normalization and differentiated clipping/KL regularization in RLVR to encourage exploration on reasoning tokens while stabilizing knowledge tokens, yielding gains in pass@1 and pass@K on reasoning benchmarks.
-
CoLD: Counterfactually-Guided Length Debiasing for Process Reward Models in Mathematical Reasoning
CoLD mitigates length bias in process reward models for mathematical reasoning via counterfactual guidance, length penalties, bias estimation, and joint training, improving step selection accuracy and conciseness on MATH500 and GSM-Plus while boosting downstream RL performance.
- Cut Your Losses! Learning to Prune Paths Early for Efficient Parallel Reasoning
- VRPRM: Process Reward Modeling via Visual Reasoning