Pith · machine review for the scientific record

arXiv: 1811.07871 · v1 · submitted 2018-11-19 · 💻 cs.LG · cs.AI · cs.NE · stat.ML

Recognition: unknown

Scalable agent alignment via reward modeling: a research direction

David Krueger, Jan Leike, Miljan Martic, Shane Legg, Tom Everitt, Vishal Maini

Authors on Pith: no claims yet
classification: 💻 cs.LG · cs.AI · cs.NE · stat.ML
keywords: reward · agent · alignment · learning · modeling · user · agents · challenges
Original abstract

One obstacle to applying reinforcement learning algorithms to real-world problems is the lack of suitable reward functions. Designing such reward functions is difficult in part because the user only has an implicit understanding of the task objective. This gives rise to the agent alignment problem: how do we create agents that behave in accordance with the user's intentions? We outline a high-level research direction to solve the agent alignment problem centered around reward modeling: learning a reward function from interaction with the user and optimizing the learned reward function with reinforcement learning. We discuss the key challenges we expect to face when scaling reward modeling to complex and general domains, concrete approaches to mitigate these challenges, and ways to establish trust in the resulting agents.
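A minimal, hypothetical sketch of the two-stage loop the abstract describes: a reward model is fit to pairwise user preferences over trajectory segments, and a policy is then trained by reinforcement learning against that learned reward rather than a hand-designed one. The PyTorch modules, toy dimensions, and synthetic preference data below are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of reward modeling: (1) fit a reward model to pairwise
# user preferences, (2) optimize a policy against the learned reward with RL.
# All names, architectures, and the synthetic data are illustrative only.
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM = 4, 2  # toy observation/action sizes

class RewardModel(nn.Module):
    """Maps a (state, action) pair to a scalar reward estimate."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM + ACT_DIM, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

class Policy(nn.Module):
    """Simple categorical policy over discrete actions."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM, 64), nn.ReLU(), nn.Linear(64, ACT_DIM))

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

def preference_loss(reward_model, seg_a, seg_b, prefer_a):
    """Bradley-Terry loss: the user-preferred segment should score higher."""
    r_a = reward_model(*seg_a).sum(dim=-1)  # summed reward over segment A
    r_b = reward_model(*seg_b).sum(dim=-1)  # summed reward over segment B
    return nn.functional.binary_cross_entropy_with_logits(r_a - r_b, prefer_a)

def rl_step(policy, reward_model, optimizer, obs):
    """One REINFORCE update using the *learned* reward as the reward signal."""
    dist = policy(obs)
    act = dist.sample()
    act_onehot = nn.functional.one_hot(act, ACT_DIM).float()
    with torch.no_grad():
        r = reward_model(obs, act_onehot)  # learned reward replaces env reward
    loss = -(dist.log_prob(act) * r).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()

if __name__ == "__main__":
    rm, pi = RewardModel(), Policy()
    rm_opt = torch.optim.Adam(rm.parameters(), lr=1e-3)
    pi_opt = torch.optim.Adam(pi.parameters(), lr=1e-3)
    T = 8  # segment length
    for step in range(200):
        # --- reward learning from (synthetic) pairwise preferences ---
        seg = lambda: (torch.randn(16, T, OBS_DIM),
                       nn.functional.one_hot(
                           torch.randint(ACT_DIM, (16, T)), ACT_DIM).float())
        seg_a, seg_b = seg(), seg()
        prefer_a = torch.randint(2, (16,)).float()  # stand-in user labels
        rm_loss = preference_loss(rm, seg_a, seg_b, prefer_a)
        rm_opt.zero_grad(); rm_loss.backward(); rm_opt.step()
        # --- policy optimization against the learned reward ---
        rl_step(pi, rm, pi_opt, torch.randn(32, OBS_DIM))
```

In the paper's framing, the preference labels would come from ongoing interaction with the user rather than being synthetic, and the two updates are interleaved so the reward model keeps pace with the improving policy.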

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 13 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Fine-Tuning Language Models from Human Preferences

    cs.CL 2019-09 unverdicted novelty 7.0

    Language models fine-tuned via RL on 5k-60k human preference comparisons produce stylistically better text continuations and human-preferred summaries that sometimes copy input sentences.

  2. Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

    cs.LG 2026-05 unverdicted novelty 6.0

    A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

  3. Automated alignment is harder than you think

    cs.AI 2026-05 unverdicted novelty 6.0

    Automating alignment research with AI agents risks undetected systematic errors in fuzzy tasks, producing overconfident but misleading safety evaluations that could enable deployment of misaligned AI.

  4. Automated alignment is harder than you think

    cs.AI 2026-05 unverdicted novelty 6.0

    Automating alignment research with AI agents risks generating hard-to-detect errors in fuzzy tasks, producing misleading safety evaluations even without deliberate sabotage.

  5. AI Alignment via Incentives and Correction

    cs.LG 2026-05 unverdicted novelty 6.0

    AI alignment is reframed as a fixed-point incentive problem in a solver-auditor pipeline, solved via bilevel optimization and bandit search over reward profiles to maintain monitoring and reduce hallucinations in LLM ...

  6. AI Alignment via Incentives and Correction

    cs.LG 2026-05 unverdicted novelty 6.0

    AI alignment is framed as inducing equilibrium behavior in a solver-auditor interaction via adaptive rewards found by bandit optimization, yielding improved oversight and reduced errors in LLM coding experiments.

  7. Uncertainty-Aware Reward Discounting for Mitigating Reward Hacking

    cs.LG 2026-04 unverdicted novelty 6.0

    Uncertainty-aware RL framework using ensemble disagreement and annotation variability reduces reward-hacking trap visits by 93.7% across grid and continuous control tasks while remaining robust to 30% label noise.

  8. The Alignment Target Problem: Divergent Moral Judgments of Humans, AI Systems, and Their Designers

    cs.CY 2026-04 unverdicted novelty 6.0

    Moral judgments become more deontological when human design of AI is visible, and designers are judged more strictly than the AI or unaided humans, creating plural and non-converging targets for value alignment.

  9. Process Reinforcement through Implicit Rewards

    cs.LG 2025-02 conditional novelty 6.0

    PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 1...

  10. AI Safety as Control of Irreversibility: A Systems Framework for Decision-Energy and Sovereignty Boundaries

    cs.AI 2026-05 unverdicted novelty 5.0

    AI safety requires stabilizing sovereignty boundaries to stop irreversible decision authority from concentrating in the most efficient AI nodes.

  11. The Alignment Target Problem: Divergent Moral Judgments of Humans, AI Systems, and Their Designers

    cs.CY 2026-04 conditional novelty 5.0

    People judge AI systems and their human designers with markedly more deontological constraints than they apply to humans or standalone robots in the same ethical scenario.

  12. Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

    cs.LG 2026-04 unverdicted novelty 5.0

    The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under op...

  13. Brainrot: Deskilling and Addiction are Overlooked AI Risks

    cs.CY 2026-05 unverdicted novelty 3.0

    AI safety literature overlooks cognitive deskilling and addiction risks from generative AI despite public concern about them.