pith. machine review for the scientific record.


RewardBench 2: Advancing Reward Model Evaluation

12 Pith papers cite this work. Polarity classification is still indexing.

abstract

Reward models are used throughout the post-training of language models to capture nuanced signals from preference data and provide a training target for optimization across instruction following, reasoning, safety, and other domains. The community has begun establishing best practices for evaluating reward models, from the development of benchmarks that test capabilities in specific skill areas to others that test agreement with human preferences. At the same time, progress in evaluation has not been mirrored by the effectiveness of reward models in downstream tasks -- simpler direct alignment algorithms are reported to work better in many cases. This paper introduces RewardBench 2, a new multi-skill reward modeling benchmark designed to bring new, challenging data for accuracy-based reward model evaluation -- models score, on average, about 20 points lower on RewardBench 2 than on the first RewardBench -- while being highly correlated with downstream performance. Unlike most other benchmarks, RewardBench 2 sources new human prompts rather than reusing prompts from downstream evaluations, facilitating more rigorous evaluation practices. We describe our benchmark construction process and report how existing models perform on it, while quantifying how performance on the benchmark correlates with downstream use of the models in both inference-time scaling algorithms, such as best-of-N sampling, and RLHF training algorithms, such as proximal policy optimization.
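To make the downstream-correlation claim concrete, here is a minimal sketch of best-of-N sampling with a reward model. The generate() and score() callables are hypothetical stand-ins for any sampler and any scalar-output reward model; the abstract does not specify an interface.

```python
# Minimal best-of-N sampling sketch. generate() and score() are hypothetical
# placeholders, not the paper's API: any function that returns candidate
# completions and any reward model that returns a scalar score will do.

def best_of_n(prompt, generate, score, n=16):
    """Sample n completions and return the one the reward model scores highest."""
    candidates = [generate(prompt) for _ in range(n)]       # n independent samples
    return max(candidates, key=lambda c: score(prompt, c))  # reward-ranked pick
```

Under this selection rule, a more accurate reward model picks better completions from the same candidate pool, which is why accuracy on the benchmark can track best-of-N gains.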


representative citing papers

Reflective Context Learning: Studying the Optimization Primitives of Context Space

cs.LG · 2026-04-03 · unverdicted · novelty 6.0

Reflective Context Learning unifies context optimization for agents by recasting prior methods as instances of a shared learning problem and extending them with classical primitives such as batching, failure replay, and grouped rollouts, yielding improvements on AppWorld, BrowseComp+, and RewardBench.
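One of the primitives named above, failure replay, can be pictured as a small buffer of failed episodes that get mixed back into later updates. The class below is a toy sketch under that reading; the episode shape and sampling policy are assumptions, not the paper's interface.

```python
import random

class FailureReplayBuffer:
    """Toy failure-replay buffer: keep recent failed rollouts and
    resample them into later context-optimization batches."""

    def __init__(self, capacity=256):
        self.failures = []
        self.capacity = capacity

    def add(self, episode, success):
        if not success:                                  # store only failures
            self.failures.append(episode)
            self.failures = self.failures[-self.capacity:]

    def sample(self, k=4):
        """Draw up to k past failures for the next update."""
        return random.sample(self.failures, min(k, len(self.failures)))
```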

Mitigating Reward Hacking in RLHF via Advantage Sign Robustness

cs.LG · 2026-04-03 · unverdicted · novelty 6.0

SignCert-PO mitigates reward hacking in RLHF by down-weighting completions whose advantage signs are not robust to small reward-model perturbations, using a certified preservation radius derived at the policy optimization stage.
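A simplified reading of that idea: if perturbing the reward model's output by at most eps shifts a completion's advantage by at most eps (treating the baseline as fixed), then the advantage sign is guaranteed stable whenever |A| > eps. The NumPy sketch below implements that simplified down-weighting rule; it is an assumption-laden illustration, not SignCert-PO's actual certificate.

```python
import numpy as np

def sign_robust_weights(advantages, eps=0.1, down_weight=0.0):
    """Weight 1.0 for completions whose advantage sign cannot flip under a
    reward perturbation of magnitude <= eps; down-weight the rest.
    Simplification: assumes the perturbation moves the advantage by at most
    eps, i.e. the baseline is held fixed."""
    adv = np.asarray(advantages, dtype=float)
    certified = np.abs(adv) > eps            # sign is stable within the radius
    return np.where(certified, 1.0, down_weight)

# e.g. sign_robust_weights([0.5, -0.02, 0.3], eps=0.1) -> array([1., 0., 1.])
```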

On Cost-Effective LLM-as-a-Judge Improvement Techniques

cs.CL · 2026-04-15 · conditional · novelty 5.0

Ensemble scoring plus task-specific criteria injection raises LLM judge accuracy to 85.8 percent on RewardBench 2, a 13.5-point gain over baseline, with small models gaining the most.
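The two ingredients named in that summary, ensemble scoring and criteria injection, might look roughly like the sketch below. The judge callables, the prompt template, and majority-vote aggregation are all assumptions for illustration, not the paper's setup.

```python
from collections import Counter

def ensemble_judge(prompt, answer_a, answer_b, judges, criteria):
    """Majority vote over several LLM judges, each prompted with
    task-specific evaluation criteria."""
    template = (
        f"Evaluation criteria: {criteria}\n"
        f"Prompt: {prompt}\nAnswer A: {answer_a}\nAnswer B: {answer_b}\n"
        "Which answer better satisfies the criteria? Reply with 'A' or 'B'."
    )
    votes = [judge(template) for judge in judges]   # each judge returns 'A' or 'B'
    return Counter(votes).most_common(1)[0][0]      # majority verdict
```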
