Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains
24 Pith papers cite this work. Polarity classification is still indexing.
abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for complex reasoning tasks with clear correctness signals such as math and coding. However, extending it to real-world reasoning tasks is challenging, as evaluation depends on nuanced, multi-criteria judgments rather than binary correctness. Instance-specific rubrics have recently been used in evaluation benchmarks to capture such judgments, but their potential as reward signals for on-policy post-training remains underexplored. We introduce $\textbf{Rubrics as Rewards}$ (RaR), an on-policy reinforcement learning method that extends RLVR beyond verifiable domains by using rubric-based feedback. Across both medical and science domains, we evaluate multiple strategies for aggregating rubric feedback into rewards. The best RaR variant achieves relative improvements of up to $31\%$ on HealthBench and $7\%$ on GPQA-Diamond over popular LLM-as-judge baselines that rely on direct Likert-based rewards. These results demonstrate that RaR-trained policies adapt well to diverse evaluation formats, performing strongly on both rubric-based and multiple-choice tasks. Moreover, we find that using rubrics as structured reward signals yields better alignment for smaller judges and reduces performance variance across judge scales.
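A minimal sketch of the idea described in the abstract, not the authors' exact recipe: each instance-specific rubric criterion is checked by an LLM judge and the per-criterion verdicts are aggregated into a scalar reward. The weighted-sum aggregation, the Criterion.weight field, and the judge_satisfies stub below are illustrative assumptions; a Likert-style baseline reward is shown for contrast.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Criterion:
    """One instance-specific rubric item, e.g. "mentions key contraindications"."""
    text: str
    weight: float = 1.0  # relative importance; an assumption here, not the paper's scheme


def rubric_reward(
    prompt: str,
    response: str,
    rubric: List[Criterion],
    judge_satisfies: Callable[[str, str, str], bool],
) -> float:
    """Aggregate per-criterion judge verdicts into a scalar reward in [0, 1].

    judge_satisfies(prompt, response, criterion_text) stands in for an
    LLM-as-judge call that returns True when the response meets the criterion.
    """
    total_weight = sum(c.weight for c in rubric)
    if total_weight == 0:
        return 0.0
    earned = sum(
        c.weight
        for c in rubric
        if judge_satisfies(prompt, response, c.text)
    )
    return earned / total_weight


def likert_reward(rating_1_to_5: int) -> float:
    """Baseline-style reward: a single direct Likert rating rescaled to [0, 1]."""
    return (rating_1_to_5 - 1) / 4.0
```

The resulting scalar would then stand in for a binary verifier's signal inside an otherwise standard on-policy RL loop; the Likert variant mirrors the direct-rating LLM-as-judge baseline the abstract compares against.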
years
2026: 24
roles
background: 1
polarities
background: 1
citing papers explorer
-
From Context to Skills: Can Language Models Learn from Context Skillfully?
Ctx2Skill lets language models autonomously evolve context-specific skills via multi-agent self-play, improving performance on context learning tasks without human supervision.
-
Think-with-Rubrics: From External Evaluator to Internal Reasoning Guidance
Think-with-Rubrics has LLMs generate rubrics internally before responding, outperforming external rubric-as-reward baselines by 3.87 points on average across benchmarks.
-
Rubric-based On-policy Distillation
Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance and up to 10x better sample efficiency than logit-based approaches.
-
Visual Preference Optimization with Rubric Rewards
rDPO uses offline-built rubrics to generate on-policy preference data for DPO, raising benchmark scores in visual tasks over outcome-based filtering and style baselines.
-
Self-Preference Bias in Rubric-Based Evaluation of Large Language Models
Rubric-based LLM judges show self-preference bias, incorrectly marking their own failed outputs as satisfied up to 50% more often on verifiable benchmarks and skewing scores by 10 points on subjective ones.
-
Revisiting DAgger in the Era of LLM-Agents
DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.
-
Reward Hacking in Rubric-Based Reinforcement Learning
Rubric-based RL verifiers can be gamed via partial criterion satisfaction and implicit-to-explicit tricks, yielding proxy gains that do not improve quality under rubric-free judges; stronger verifiers reduce but do not eliminate the mismatch.
-
SAGE: Scalable Automated Robustness Augmentation for LLM Knowledge Evaluation
SAGE trains a rubric-based verifier and an RL-optimized generator on seed human data to scalably augment LLM knowledge benchmarks, matching human-annotated quality on HellaSwag at lower cost and generalizing to MMLU.
-
RubricRefine: Improving Tool-Use Agent Reliability with Training-Free Pre-Execution Refinement
RubricRefine raises average tool-use reliability to 0.86 on M3ToolEval across seven models by scoring candidate code against generated contract rubrics before execution, beating prior inference-time methods at 2.6X lower latency.
-
CLR-voyance: Reinforcing Open-Ended Reasoning for Inpatient Clinical Decision Support with Outcome-Aware Rubrics
CLR-voyance reformulates inpatient reasoning as a POMDP with clinician-validated outcome rubrics, yielding an 8B model that outperforms larger frontier models on the authors' new benchmark.
-
DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification
DeltaRubric decomposes multimodal preference evaluation into self-generated planning and verification steps within a single model, producing large accuracy improvements on VL-RewardBench via multi-role reinforcement learning.
-
Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning
Rubric-grounded RL with LLM judges on document-derived criteria raises Llama-3.1-8B normalized reward to 71.7% on held-out rubrics and improves performance on GSM8K, MATH, and GPQA benchmarks.
-
SHARP: A Self-Evolving Human-Auditable Rubric Policy for Financial Trading Agents
SHARP is a neuro-symbolic method that evolves bounded, auditable rule rubrics for LLM trading agents via cross-sample attribution and walk-forward validation, raising compact-model performance by 10-20 percentage points across equity sectors.
-
RVPO: Risk-Sensitive Alignment via Variance Regularization
RVPO penalizes variance across multiple reward signals during RLHF advantage aggregation, using a LogSumExp operator as a smooth variance penalty to reduce constraint neglect in LLM alignment (a generic sketch of the LogSumExp mechanism follows the list below).
-
Leveraging Verifier-Based Reinforcement Learning in Image Editing
Edit-R1 trains a CoT-based reasoning reward model with GCPO and uses it to boost image editing performance over VLMs and models like FLUX.1-kontext via GRPO.
-
Bootstrapping Post-training Signals for Open-ended Tasks via Rubric-based Self-play on Pre-training Text
POP bootstraps post-training signals for open-ended LLM tasks by synthesizing rubrics during self-play on a pretraining corpus, yielding performance gains on Qwen-2.5-7B across healthcare QA, creative writing, and instruction following.
-
C2: Scalable Rubric-Augmented Reward Modeling from Binary Preferences
C2 synthesizes contrastive helpful/misleading rubric pairs from binary preferences to train cooperative generators and critical verifiers, yielding up to 6.5-point gains on RM-Bench and enabling smaller models to match larger rubric-augmented ones.
-
ReflectRM: Boosting Generative Reward Models via Self-Reflection within a Unified Judgment Framework
ReflectRM improves generative reward models by adding self-reflection on analysis quality within a unified training setup for response and analysis preferences, yielding accuracy gains and reduced positional bias on benchmarks.
-
Delay, Plateau, or Collapse: Evaluating the Impact of Systematic Verification Error on RLVR
Systematic false positives in verifiers can cause RLVR training to reach suboptimal plateaus or collapse, with outcomes driven by error patterns rather than overall error rate.
-
Quantifying the Utility of User Simulators for Building Collaborative LLM Assistants
Fine-tuned simulators grounded in real human data produce LLM assistants that win more often against real users than those trained against role-playing simulators.
-
SCPRM: A Schema-aware Cumulative Process Reward Model for Knowledge Graph Question Answering
SCPRM adds prefix conditioning and schema distance to process reward models so that Monte Carlo Tree Search can explore knowledge-graph reasoning paths with both cumulative and future guidance, yielding a 1.18% average Hits@k gain on medical, legal, and CWQ tasks.
-
LegalDrill: Diagnosis-Driven Synthesis for Legal Reasoning in Small Language Models
LegalDrill uses diagnosis-driven synthesis and self-reflective verification to create high-quality training data that improves small language models' legal reasoning without expert annotations.
-
PubSwap: Public-Data Off-Policy Coordination for Federated RLVR
PubSwap uses a small public dataset for selective off-policy response swapping in federated RLVR to improve coordination and performance over standard baselines on math and medical reasoning tasks.
-
SPARD: Self-Paced Curriculum for RL Alignment via Integrating Reward Dynamics and Data Utility
SPARD dynamically tunes multi-objective reward weights and data importance in LLM reinforcement learning alignment using a self-paced curriculum driven by reward dynamics and data utility.
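A generic illustration of the LogSumExp mechanism mentioned in the RVPO entry above: LogSumExp is a standard smooth upper bound on the maximum, so a scaled LogSumExp of the negated rewards acts as a soft penalty on the worst-served objective. The combined reward with coefficient lambda below is an assumption for the sketch, not necessarily RVPO's exact objective.

```latex
% Smooth worst-case penalty via LogSumExp (illustrative; not RVPO's exact form).
% For per-objective rewards r_1, ..., r_K and temperature \beta > 0:
\[
  \operatorname{LSE}_\beta(-r) = \frac{1}{\beta}\log\sum_{k=1}^{K} e^{-\beta r_k},
  \qquad
  \max_k(-r_k) \le \operatorname{LSE}_\beta(-r) \le \max_k(-r_k) + \frac{\log K}{\beta}.
\]
% Subtracting a scaled LSE term from the averaged reward softly penalizes the
% worst-served objective, discouraging a large spread across reward signals:
\[
  \tilde r = \frac{1}{K}\sum_{k=1}^{K} r_k - \lambda\,\operatorname{LSE}_\beta(-r),
  \qquad \lambda > 0.
\]
```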