{"total":22,"items":[{"citing_arxiv_id":"2606.00726","ref_index":48,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Latent Reward Steering: An Adaptive Inference-Time Framework that Implicitly Promotes Cognitive Behaviors in Reasoning LLMs","primary_cat":"cs.AI","submitted_at":"2026-05-30T13:38:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LRS trains a latent reward model on final-answer correctness to steer SAE states during inference, improving reasoning performance and implicitly encouraging better cognitive behaviors.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00440","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SDR: Set-Distance Rewards for Radiology Report Generation","primary_cat":"cs.AI","submitted_at":"2026-05-30T00:10:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Set-to-set distances on sentence embeddings provide a permutation-invariant reward signal that improves GRPO training and enables efficient test-time scaling for vision-language models generating chest X-ray reports.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30712","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ExpGraph: Model-Agnostic Experience Learning with Graph-Structured Memory for LLM Agents","primary_cat":"cs.CL","submitted_at":"2026-05-29T01:04:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ExpGraph builds a graph of summarized agent experiences and uses graph diffusion plus an RL-trained retrieval copilot to improve frozen LLM executors on QA, math, code, and agentic tasks without 
parameter updates.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30451","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VeriGate: Verifier-Gated Step-Level Supervision for GRPO","primary_cat":"cs.LG","submitted_at":"2026-05-28T18:20:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VeriGate adds verifier-gated step-level supervision to GRPO via cumulated PRM rewards and group-normalized token advantages, raising accuracy 20% and 12% on 1.5B and 7B models on MATH and six benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29656","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TRACE: Toulmin-based Reasoning Assessment through Constructive Elements for LLM CoT Evaluation","primary_cat":"cs.AI","submitted_at":"2026-05-28T09:19:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TRACE is a new metric for assessing LLM CoT reasoning structure via Toulmin and Flavell frameworks, showing r=0.74 correlation with accuracy on 26.3K samples and utility as an RL reward.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15529","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Process Rewards with Learned Reliability","primary_cat":"cs.CL","submitted_at":"2026-05-15T01:57:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"BetaPRM learns distributional step rewards with explicit reliability via Beta-Binomial modeling, enabling ACA that cuts token use by up to 33.57% while raising final-answer accuracy on reasoning 
benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18851","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"STRIDE: Learnable Stepwise Language Feedback for LLM Reasoning","primary_cat":"cs.LG","submitted_at":"2026-05-13T11:04:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"STRIDE co-trains generator and verifier on outcome rewards alone to deliver learnable stepwise language feedback that redirects LLM reasoning trajectories and outperforms scalar-reward baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02395","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Verifiable Counterfactual Supervision for Process Reward Models","primary_cat":"cs.AI","submitted_at":"2026-05-04T09:36:57+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24198","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis","primary_cat":"cs.CL","submitted_at":"2026-04-27T09:00:30+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"vailing approaches focus only on outcome supervision, overlooking the multi-step rigor of data analysis. In scientific research, where the process must be error-free, this outcome-centric paradigm risks propagating hallucinated logic, yielding seemingly plausible but invalid discoveries. 
Conversely, Process Reward Models (PRMs) have exhibited remarkable success in domains such as mathematical reasoning [21, 33, 54, 74, 75, 83] and code generation [23, 65, 68]. By providing step-level supervision and fine-grained verification during both training and inference time, PRMs can significantly boost the models' reasoning reliability and performance boundary [17, 28, 50, 77]. Despite their proven efficacy, the application of step-level supervision in the domain of data analysis remains largely unexplored."},{"citing_arxiv_id":"2604.23366","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GSAR: Typed Grounding for Hallucination Detection and Recovery in Multi-Agent LLMs","primary_cat":"cs.AI","submitted_at":"2026-04-25T16:20:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GSAR is a grounding-evaluation framework for multi-agent LLMs that uses a four-way claim typology, evidence-weighted asymmetric scoring, and tiered recovery decisions to detect and mitigate hallucinations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.22937","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AutoPyVerifier: Learning Compact Executable Verifiers for Large Language Model Outputs","primary_cat":"cs.CL","submitted_at":"2026-04-24T18:22:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AutoPyVerifier learns compact sets of executable Python verifiers from labeled LLM outputs via LLM synthesis and DAG search, improving objective prediction by up to 55 F1 points and downstream LLM accuracy by up to 17 
points.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19656","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Pause or Fabricate? Training Language Models for Grounded Reasoning","primary_cat":"cs.CL","submitted_at":"2026-04-21T16:45:29+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GRIL uses stage-specific RL rewards to train LLMs to detect missing premises, pause proactively, and resume grounded reasoning after clarification, yielding up to 45% better premise detection and 30% higher task success on insufficient math datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05226","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning","primary_cat":"cs.LG","submitted_at":"2026-04-19T10:33:19+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"i =1 indicates passing the audit (no reward hacking) and $h_i^{(j)} = 0$ indicates detected reward hacking. Candidates are jointly scored by the audit gate, correctness, and normalized edit distance $\\bar{\\Delta}_{\\mathrm{edit}}$, favoring repairs that pass the audit, are correct, and involve minimal edits: $s_{\\mathrm{rep}}(x, y_i, a, \\tilde{y}_i^{(j)}) = h_i^{(j)} \\cdot \\big( r_{\\mathrm{task}}(x, \\tilde{y}_i^{(j)}) - \\lambda_{\\mathrm{edit}} \\bar{\\Delta}_{\\mathrm{edit}}(y_i, \\tilde{y}_i^{(j)}) \\big)$. (6) Selecting the best repair. The highest-scoring candidate is selected (ties broken in favor of $r_{\\mathrm{task}} = 1$): $\\tilde{y}_i = \\arg\\max_{\\tilde{y}_i^{(j)} \\in \\tilde{G}_{\\mathrm{rep}}(x, y_i, a)} s_{\\mathrm{rep}}(x, y_i, a, \\tilde{y}_i^{(j)})$. 
(7) If the best repair has $r_{\\mathrm{task}} = 0$, the prompt is deferred; otherwise, the repair candidate group is written to $B_{\\mathrm{rep}}$, and the best repair paired with the original error forms a two-element"},{"citing_arxiv_id":"2604.16029","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Cut Your Losses! Learning to Prune Paths Early for Efficient Parallel Reasoning","primary_cat":"cs.CL","submitted_at":"2026-04-17T13:00:22+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13197","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Unleashing Implicit Rewards: Prefix-Value Learning for Distribution-Level Optimization","primary_cat":"cs.CL","submitted_at":"2026-04-14T18:19:54+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10701","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Bringing Value Models Back: Generative Critics for Value Modeling in LLM Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-04-12T15:54:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GenAC introduces generative critics with chain-of-thought reasoning and in-context conditioning to improve value approximation and downstream RL performance in LLMs compared to value-based and value-free baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.27977","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SARL: Label-Free Reinforcement 
Learning by Rewarding Reasoning Topology","primary_cat":"cs.AI","submitted_at":"2026-03-30T02:54:48+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"SARL rewards reasoning topology to improve label-free RL, outperforming baselines with gains up to 44.7% on math and 34.6% on open-ended tasks while maintaining more stable training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.07461","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning","primary_cat":"cs.CL","submitted_at":"2025-12-08T11:39:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"NPR trains LLMs to reason in parallel via self-distilled RL, delivering up to 24.5% performance gains and 4.6x speedups with 100% genuine parallel execution on reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.08827","ref_index":253,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey of Reinforcement Learning for Large Reasoning Models","primary_cat":"cs.CL","submitted_at":"2025-09-10T17:59:43+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.03403","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Beyond Correctness: Harmonizing Process and Outcome Rewards through RL 
Training","primary_cat":"cs.LG","submitted_at":"2025-09-03T15:28:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PROF curates RL training data via PRM-ORM consistency to improve both final-answer accuracy and intermediate reasoning quality while reducing reliance on strong process reward models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.15202","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models","primary_cat":"cs.CL","submitted_at":"2025-08-21T03:31:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Fin-PRM is a domain-specialized process reward model that supplies binary step-level and trajectory-level supervision signals for financial reasoning in LLMs and outperforms general PRMs on CFLUE and FinQA benchmarks.","context_count":0,"top_context_role":"method","top_context_polarity":"use_method","context_text":"using the mean value of their weights as the threshold value, we give the ability to change the hard label to each reward signal. 4.5 Training Objective To train Fin-PRM effectively, we formulate a joint objective to train model through binary cross-entropy (BCE), learning to predict the correctness of both individual steps and entire trajectories. The total loss Ltotal: Ltotal = Lstep + λ · Ltraj (10) where λ are hyperparameters that balance the contribution of each supervision signal. The step-level loss , Lstep, is the average loss over all steps in a reasoning trace. 
It measures the discrepancy between the model's prediction and the ground-truth step label, Lstep(st): Lstep = (1/T) Σ_{t=1}^{T} LBCE(Rϕ(st | x, s<t, a), Rstep) (11) The trajectory-level loss, Ltraj, follows the same princi-"},{"citing_arxiv_id":"2507.17746","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains","primary_cat":"cs.LG","submitted_at":"2025-07-23T17:57:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RaR uses aggregated rubric feedback as rewards in on-policy RL, delivering up to 31% relative gains on HealthBench and 7% on GPQA-Diamond versus direct Likert LLM-as-judge baselines.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"checks, from exact matches on GSM8K and MATH to mixed-domain verifiers in GENERAL-REASONER and CROSS- DOMAINRLVR [23, 31], although signals can be sparse. Process supervision [21] provides denser guidance via step-level labels, and MCTS-generated annotations or generative reward models such as THINKPRM improve performance, but with high annotation cost [16, 20]. Rubric-based RL finds a middle ground by turning multiple rubric criteria into structured verifiers and using their scalar scores as denser rewards. 8. Conclusion We introduced Rubrics as Rewards (RaR), a framework for post-training language models using structured, checklist-style rubrics as reward signals. By decomposing response evaluation into transparent, multi-criteria"}],"limit":50,"offset":0}