Inference-time reward hacking in large language models.arXiv preprint arXiv:2506.19248

Hadi Khalaf et al · 2025 · arXiv 2506.19248

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

read on arXiv browse 4 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Likelihood scoring for continuations of mathematical text: a self-supervised benchmark with tests for shortcut vulnerabilities

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Presents a likelihood-based benchmark for equation-suffix prediction in technical papers with controls to detect shortcut vulnerabilities in model forecasts.

Epistemic Uncertainty for Test-Time Discovery

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

UG-TTT adds epistemic uncertainty measured by adapter disagreement as an exploration bonus in RL for LLMs, raising maximum reward and diversity on scientific discovery benchmarks.

Can Revealed Preferences Clarify LLM Alignment and Steering?

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

LLMs show partial internal coherence in medical decisions but frequently fail to accurately report their preferences or adopt user-directed ones via prompting.

Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

cs.LG · 2026-04-15 · unverdicted · novelty 5.0

The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.

citing papers explorer

Showing 4 of 4 citing papers.

Likelihood scoring for continuations of mathematical text: a self-supervised benchmark with tests for shortcut vulnerabilities cs.LG · 2026-05-11 · unverdicted · none · ref 50
Presents a likelihood-based benchmark for equation-suffix prediction in technical papers with controls to detect shortcut vulnerabilities in model forecasts.
Epistemic Uncertainty for Test-Time Discovery cs.LG · 2026-05-11 · unverdicted · none · ref 15
UG-TTT adds epistemic uncertainty measured by adapter disagreement as an exploration bonus in RL for LLMs, raising maximum reward and diversity on scientific discovery benchmarks.
Can Revealed Preferences Clarify LLM Alignment and Steering? cs.LG · 2026-05-08 · unverdicted · none · ref 7
LLMs show partial internal coherence in medical decisions but frequently fail to accurately report their preferences or adopt user-directed ones via prompting.
Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges cs.LG · 2026-04-15 · unverdicted · none · ref 17
The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.

Inference-time reward hacking in large language models.arXiv preprint arXiv:2506.19248

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer