Mixed-objective reward models underperform single-objective ones because shared neurons support one objective while negatively affecting the other, creating alignment tension.
Large language model as clinical decision support system augments medication safety in 16 clinical specialties
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.LG 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Understanding helpfulness and harmless tension in reward models
Mixed-objective reward models underperform single-objective ones because shared neurons support one objective while negatively affecting the other, creating alignment tension.