Mixed-objective reward models underperform single-objective ones because shared neurons support one objective while negatively affecting the other, creating alignment tension.
12 Haoxiang Wang, Wei Xiong, Tengyang Xie, Han Zhao, and Tong Zhang
4 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.LG (4)
years: 2026 (4)
verdicts: UNVERDICTED (4)
representative citing papers
citing papers explorer
- Understanding helpfulness and harmless tension in reward models
  Mixed-objective reward models underperform single-objective ones because shared neurons support one objective while negatively affecting the other, creating alignment tension.
- Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill
  Skill-RM unifies heterogeneous reward criteria by modeling reward computation as dynamic execution of a reusable Reward-Evaluation Skill within an agent framework.
- DynaCF: Mitigating Shortcut Learning in Reward Models via Dynamic Counterfactual Sensitivity
  DynaCF dynamically downweights shortcut-sensitive samples in reward model training by tracking margin shifts under online counterfactual perturbations within the Bradley-Terry loss.
- Focal Reward: Balanced Reinforcement Learning under Rubric-Based Rewards
  Focal Reward balances rubric-based RL by saturation-aware reweighting derived from inverse reward projection, outperforming static aggregation on 18 model-benchmark pairs.