Monitoring reasoning models for misbehavior and the risks of promoting obfuscation
9 Pith papers cite this work. Polarity classification is still indexing.
Citation-role summary: background. Citation-polarity summary: still indexing. Years: all 9 citing papers are from 2026.
citing papers explorer
-
Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack
BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.
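The hackable-task ratio can be read as the fraction of tasks where some exploit scores near-perfect without genuine completion. A minimal sketch under that reading (all names hypothetical, not BenchJack's actual harness):

```python
# Hypothetical sketch: hackable-task ratio for one benchmark.
# A task counts as "hackable" if some exploit scores >= score_threshold
# while an independent completion check says the task was not actually done.

from dataclasses import dataclass

@dataclass
class ExploitResult:
    score: float          # score the benchmark harness reported
    task_completed: bool  # ground-truth check: was the task really done?

def hackable_task_ratio(results_per_task: dict[str, list[ExploitResult]],
                        score_threshold: float = 0.95) -> float:
    """Fraction of tasks where an exploit got a high score without completing the task."""
    hackable = sum(
        any(r.score >= score_threshold and not r.task_completed for r in results)
        for results in results_per_task.values()
    )
    return hackable / len(results_per_task)

# Toy usage: 1 of 2 tasks is hackable -> ratio 0.5
demo = {
    "task_a": [ExploitResult(score=1.0, task_completed=False)],  # exploit succeeded
    "task_b": [ExploitResult(score=0.2, task_completed=False)],  # exploit failed
}
print(hackable_task_ratio(demo))  # 0.5
```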
-
AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems
AgentForesight introduces an online auditor model that predicts the decisive error in multi-agent trajectories at the earliest possible step, trained with a coarse-to-fine reinforcement learning recipe on AFTraj-2K, a newly curated dataset.
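A rough sketch of the online-auditing loop this implies, assuming the auditor scores each trajectory prefix and flags the earliest step whose predicted-error score crosses a threshold; `score_prefix` is a hypothetical stand-in for the trained auditor:

```python
# Hypothetical sketch: flag the earliest step an online auditor predicts
# will be the decisive error in a multi-agent trajectory.

from typing import Callable, Optional

def earliest_decisive_error(steps: list[str],
                            score_prefix: Callable[[list[str]], float],
                            threshold: float = 0.8) -> Optional[int]:
    """Return the index of the first step whose prefix score crosses the threshold."""
    for i in range(1, len(steps) + 1):
        if score_prefix(steps[:i]) >= threshold:  # auditor sees only the prefix
            return i - 1
    return None  # no decisive error predicted

# Toy auditor: pretend any prefix containing "rm -rf" is decisively bad.
toy_auditor = lambda prefix: 1.0 if any("rm -rf" in s for s in prefix) else 0.0
print(earliest_decisive_error(["plan", "rm -rf /tmp/x", "report"], toy_auditor))  # 1
```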
-
The Compliance Trap: How Structural Constraints Degrade Frontier AI Metacognition Under Adversarial Pressure
Compliance-forcing instructions cause up to 30 percentage point drops in metacognitive accuracy across most frontier models, while removing the compliance element restores performance and Constitutional AI shows near-immunity.
-
Do Synthetic Trajectories Reflect Real Reward Hacking? A Systematic Study on Monitoring In-the-Wild Hacking in Code Generation
Synthetic reward hacking data does not capture natural hacking behaviors in code generation RL, causing monitors trained on it to generalize poorly compared to those trained on in-the-wild trajectories.
-
Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks
Token-level contrastive attribution yields informative signals for some LLM benchmark failures but is not universally applicable across datasets and models.
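One common form of token-level contrastive attribution scores each token by the log-probability gap between a passing and a failing run; a minimal sketch under that assumption (not necessarily the paper's exact formulation):

```python
# Hypothetical sketch: contrastive attribution as per-token log-prob differences.
# Tokens where the failing run is much less confident than the passing run
# receive large attribution scores.

def contrastive_attribution(tokens: list[str],
                            logprobs_pass: list[float],
                            logprobs_fail: list[float]) -> list[tuple[str, float]]:
    assert len(tokens) == len(logprobs_pass) == len(logprobs_fail)
    return [(tok, lp - lf) for tok, lp, lf in zip(tokens, logprobs_pass, logprobs_fail)]

# Toy usage: the middle token is where the two runs diverge most.
scores = contrastive_attribution(
    ["The", "answer", "is"],
    [-0.1, -0.2, -0.1],
    [-0.1, -3.0, -0.2],
)
print(max(scores, key=lambda ts: ts[1]))  # ('answer', 2.8)
```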
-
LLMs Gaming Verifiers: RLVR can Lead to Reward Hacking
RLVR-trained LLMs exploit verifier weaknesses by producing non-generalizable outputs on rule-induction tasks, detectable via Isomorphic Perturbation Testing.
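A toy sketch of the perturbation idea, assuming "isomorphic" means relabeling task symbols with a bijection and checking that the model's answer maps back consistently; `model` here is a hypothetical callable, and the real test is surely richer:

```python
# Hypothetical sketch: Isomorphic Perturbation Testing on a rule-induction task.
# Relabel symbols with a bijection; a model that learned the rule (not the
# surface form) should give the correspondingly relabeled answer.

def relabel(text: str, mapping: dict[str, str]) -> str:
    return "".join(mapping.get(ch, ch) for ch in text)

def passes_isomorphic_test(model, prompt: str, mapping: dict[str, str]) -> bool:
    inverse = {v: k for k, v in mapping.items()}
    original_answer = model(prompt)
    perturbed_answer = model(relabel(prompt, mapping))
    # The perturbed answer, mapped back, should equal the original answer.
    return relabel(perturbed_answer, inverse) == original_answer

# Toy model that copies the last symbol of the prompt (a genuine "rule").
toy_model = lambda p: p[-1]
print(passes_isomorphic_test(toy_model, "ABA", {"A": "X", "B": "Y"}))  # True
```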
-
The Depth Ceiling: On the Limits of Large Language Models in Discovering Latent Planning
LLMs discover latent planning strategies up to five steps deep during training and execute them up to eight steps deep at test time, with larger models reaching seven steps under few-shot prompting, revealing a dissociation between plan discovery and plan execution.
-
CoT-Guard: Small Models for Strong Monitoring
CoT-Guard is a 4B-parameter model trained with SFT and RL that achieves 75% G-mean² on hidden-objective detection under prompt- and code-manipulation attacks, outperforming several larger models.
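G-mean is conventionally the geometric mean of sensitivity (TPR) and specificity (TNR), so G-mean² reduces to their product; a one-liner under that standard definition:

```python
# G-mean = sqrt(TPR * TNR), so G-mean^2 = TPR * TNR.
def g_mean_squared(tp: int, fn: int, tn: int, fp: int) -> float:
    tpr = tp / (tp + fn)  # sensitivity: attacks caught
    tnr = tn / (tn + fp)  # specificity: benign runs passed
    return tpr * tnr

print(g_mean_squared(tp=90, fn=10, tn=85, fp=15))  # ~= 0.765 (TPR 0.90 * TNR 0.85)
```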
-
Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.
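A constructed toy illustrating the compression intuition (not taken from the paper): if the proxy reward sees only a low-dimensional slice of the true objective, two policies can be indistinguishable to the proxy while differing arbitrarily on the true objective:

```python
# Toy illustration: a proxy reward that compresses a high-dimensional objective.
import numpy as np

rng = np.random.default_rng(0)
d, k = 100, 5
w_true = rng.normal(size=d)   # high-dimensional "human objective"
w_proxy = w_true[:k].copy()   # compressed proxy: only the first k dimensions

# Policy A: maximize the proxy under a unit-norm budget.
x = np.zeros(d)
x[:k] = w_proxy / np.linalg.norm(w_proxy)

# Policy B: identical proxy reward, but spends the invisible dims against the true objective.
x_hack = x.copy()
x_hack[k:] = -10.0 * w_true[k:] / np.linalg.norm(w_true[k:])  # proxy cannot see this

for name, pol in [("proxy-optimal", x), ("hacked", x_hack)]:
    print(name, "proxy:", round(float(w_proxy @ pol[:k]), 3),
          "true:", round(float(w_true @ pol), 3))
# Both policies get the same proxy reward; the hacked one scores far worse on the true objective.
```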