The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.
Preprint, arXiv:2503.04793
2 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
method 1
citation-polarity summary
years: 2026 (2)
verdicts: UNVERDICTED (2)
roles: method (1)
polarities: use method (1)
representative citing papers
Inserting sentence-boundary delimiters into LLM inputs yields consistent gains of up to 7.7% on GSM8K and 12.5% on DROP across 7B to 600B models via in-context learning and supervised fine-tuning.
citing papers explorer
- Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.
- Think in Sentences: Explicit Sentence Boundaries Enhance Language Model's Capabilities
Inserting sentence-boundary delimiters into LLM inputs yields consistent gains of up to 7.7% on GSM8K and 12.5% on DROP across 7B to 600B models via in-context learning and supervised fine-tuning.