arxiv: 2511.21654 · v2 · pith:YJP4ZK7Pnew · submitted 2025-11-26 · 💻 cs.LG

EvilGenie: A Reward Hacking Benchmark

classification 💻 cs.LG
keywords reward, hacking, agents, cases, evilgenie, test, three, benchmark

We introduce EvilGenie, a benchmark for reward hacking in programming settings. We source problems from LiveCodeBench and create an environment in which agents can easily reward hack, for example by hardcoding test cases or editing the testing files. We measure reward hacking in three ways: held-out unit tests, LLM judges, and test file edit detection. We verify these methods against human review and against each other. We find the LLM judge to be highly effective at detecting reward hacking in unambiguous cases, and observe only minimal improvement from the use of held-out test cases. In addition to testing many models using Inspect's basic_agent scaffold, we also measure reward hacking rates for three popular proprietary coding agents: OpenAI's Codex, Anthropic's Claude Code, and Google's Gemini CLI. We observe explicit reward hacking by both Codex and Claude Code, and misaligned behavior by all three agents. Our codebase can be found at https://github.com/JonathanGabor/evilgenie_inspect.
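
To make the third detection method concrete, here is a minimal sketch of one way test file edit detection could work: hash every file in the test directory before the agent runs, hash again afterwards, and compare. This is an illustration under stated assumptions, not the EvilGenie implementation; the function names `snapshot` and `detect_test_edits` and the directory layout are hypothetical.

```python
import hashlib
from pathlib import Path


def snapshot(test_dir: Path) -> dict[str, str]:
    """Map each file under test_dir (by relative path) to its SHA-256 digest.

    Hypothetical helper: the abstract does not say how edits are detected.
    """
    return {
        str(p.relative_to(test_dir)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(test_dir.rglob("*"))
        if p.is_file()
    }


def detect_test_edits(before: dict[str, str], after: dict[str, str]) -> dict[str, list[str]]:
    """Compare two snapshots and report modified, deleted, and added files."""
    return {
        "modified": sorted(f for f in before if f in after and before[f] != after[f]),
        "deleted": sorted(f for f in before if f not in after),
        "added": sorted(f for f in after if f not in before),
    }
```

In use, one would call `snapshot` on the test directory before handing the workspace to the agent and again after it finishes; any non-empty list in the result flags the run for review. A check like this only catches edits to the test files themselves, which may be why the abstract pairs it with held-out tests and an LLM judge for cases such as hardcoded outputs.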

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents

    cs.SE 2026-05 unverdicted novelty 7.0

    SpecBench shows frontier coding agents saturate visible test suites but exhibit persistent reward hacking on held-out tests, with the gap growing 28 percentage points per tenfold increase in code size.

  2. Reward Hacking Benchmark: Measuring Exploits in LLM Agents with Tool Use

    cs.LG 2026-05 unverdicted novelty 7.0

    The Reward Hacking Benchmark shows RL post-training raises exploit rates in tool-using LLM agents from 0.6% to 13.9%, with environmental hardening cutting exploits by 87.7% relative without lowering task success.

  3. Hack-Verifiable Environments: Towards Evaluating Reward Hacking at Scale

    cs.LG 2026-05 unverdicted novelty 6.0

    Presents Hack-Verifiable TextArena, a benchmark that embeds verifiable reward hacking opportunities into environments to enable deterministic measurement of exploitation by language models.