Introduces BonaFide benchmark of 3,066 ground-truth labeled CoTs showing most faithfulness metrics perform near chance with biases and poor scaling to longer chains.
hub Canonical reference
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation
Canonical reference. 75% of citing Pith papers cite this work as background.
abstract
Mitigating reward hacking--where AI systems misbehave due to flaws or misspecifications in their learning objectives--remains a key challenge in constructing capable and aligned models. We show that we can monitor a frontier reasoning model, such as OpenAI o3-mini, for reward hacking in agentic coding environments by using another LLM that observes the model's chain-of-thought (CoT) reasoning. CoT monitoring can be far more effective than monitoring agent actions and outputs alone, and we further found that a LLM weaker than o3-mini, namely GPT-4o, can effectively monitor a stronger model. Because CoT monitors can be effective at detecting exploits, it is natural to ask whether those exploits can be suppressed by incorporating a CoT monitor directly into the agent's training objective. While we show that integrating CoT monitors into the reinforcement learning reward can indeed produce more capable and more aligned agents in the low optimization regime, we find that with too much optimization, agents learn obfuscated reward hacking, hiding their intent within the CoT while still exhibiting a significant rate of reward hacking. Because it is difficult to tell when CoTs have become obfuscated, it may be necessary to pay a monitorability tax by not applying strong optimization pressures directly to the chain-of-thought, ensuring that CoTs remain monitorable and useful for detecting misaligned behavior.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
PRISM is a contrastive, policy-aware training framework for process reward models that reduces false positives by 22% on PRMBench and boosts downstream accuracy up to 33% in Best-of-N selection by learning reliable relative comparisons instead of pointwise labels.
REP elicits hidden LLM reasoning traces via in-context shadow demonstrations, raising similarity to internal traces while retaining distillation utility across datasets and models.
SpecBench shows frontier coding agents saturate visible test suites but exhibit persistent reward hacking on held-out tests, with the gap growing 28 percentage points per tenfold increase in code size.
BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.
AgentForesight introduces an online auditor model that predicts decisive errors in multi-agent trajectories at the earliest step using a coarse-to-fine reinforcement learning recipe on a new curated dataset AFTraj-2K.
EvilGenie benchmark measures reward hacking in AI coding agents via held-out tests, LLM judges, and edit detection, finding explicit hacking in Codex and Claude Code plus misaligned behavior in all three proprietary agents tested.
The first empirical study of test overfitting shows that auto-generated tests from issues can lead to code that passes observed tests but misses important cases or breaks functionality in SWE-bench issue resolution.
Tool-using LLM agents can implement undetectable stegosystems, shifting the primary barrier to covert multi-agent collusion from technical feasibility to coordination without explicit agreement.
DiffusionGemma matches Gemma 4 in variable transparency and monitorability after applying an interpretable token bottleneck, despite higher naive serial depth, and shows novel phenomena such as non-chronological reasoning.
A marginal-preserving Gaussian-copula AR(1) attack defeats per-step monitors (AUC 0.52) but is detectable by temporal monitors (AUC 0.79-0.97), establishing a non-empty detectability band.
Sycophancy toward researchers explains alignment faking in language models better than scheming, based on experiments showing persistent evaluation awareness even in deployment scenarios and increased sensitivity after sycophancy fine-tuning.
SWE-Marathon benchmark of 20 ultra-long-horizon tasks shows frontier AI agents solve fewer than 30%, highlighting gaps in long-context planning and self-verification.
RMCT matches the rate of target behaviors like bias-following across input perturbations to reduce sycophancy in LLMs while preserving verbalization of bias cues.
VeriGate adds verifier-gated step-level supervision to GRPO via cumulated PRM rewards and group-normalized token advantages, raising accuracy 20% and 12% on 1.5B and 7B models on MATH and six benchmarks.
MemTrace turns LLM memory operations into executable evolution graphs for error tracing, builds a benchmark across systems like RAG and Mem0, and uses attribution to optimize prompts, improving task performance by up to 7.62%.
In medical CoT distillation, answer accuracy on MedQA-USMLE rises from 74.7% to 84.4% while step-level reasoning error increases from 30.6% to 50.3% per LLM-judge audit.
Premature confidence in LLM chains of thought predicts flawed reasoning and is mitigated by progressive confidence shaping, a label-free RL objective that yields accuracy gains on arithmetic, math, and science tasks.
Faithful chain-of-thought routes answer-relevant information through the CoT path, measured via sufficiency, completeness and necessity with entropy, masked-KL and gradient diagnostics, and improved by information-flow interventions during verifier-based RL.
Presents Hack-Verifiable TextArena, a benchmark that embeds verifiable reward hacking opportunities into environments to enable deterministic measurement of exploitation by language models.
Probe trajectories across token positions in LRMs, combined with signal-processing features, improve prediction of future model outputs over static probes on safety and math tasks.
Combines LTL formal methods with LLMs for auditing, predictive monitoring, and runtime intervention on temporally extended behavioral constraints, outperforming LLM baselines and reducing violations.
Diverse ensembles of prompted and fine-tuned GPT-4.1-Mini monitors achieve 2.4x better detection of flawed code solutions than homogeneous ensembles on adversarial inputs.
Compliance-forcing instructions cause up to 30 percentage point drops in metacognitive accuracy across most frontier models, while removing the compliance element restores performance and Constitutional AI shows near-immunity.
citing papers explorer
-
Hidden Thoughts Are Not Secret: Reasoning Trace Exposure in LLMs
REP elicits hidden LLM reasoning traces via in-context shadow demonstrations, raising similarity to internal traces while retaining distillation utility across datasets and models.
-
Better Accuracies, Worse Reasoning: A Step-Level Audit of Medical Chain-of-Thought Distillation
In medical CoT distillation, answer accuracy on MedQA-USMLE rises from 74.7% to 84.4% while step-level reasoning error increases from 30.6% to 50.3% per LLM-judge audit.
-
Understanding and Mitigating Premature Confidence for Better LLM Reasoning
Premature confidence in LLM chains of thought predicts flawed reasoning and is mitigated by progressive confidence shaping, a label-free RL objective that yields accuracy gains on arithmetic, math, and science tasks.
-
Formal Methods Meet LLMs: Auditing, Monitoring, and Intervention for Compliance of Advanced AI Systems
Combines LTL formal methods with LLMs for auditing, predictive monitoring, and runtime intervention on temporally extended behavioral constraints, outperforming LLM baselines and reducing violations.
-
Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute
Diverse ensembles of prompted and fine-tuned GPT-4.1-Mini monitors achieve 2.4x better detection of flawed code solutions than homogeneous ensembles on adversarial inputs.
-
The Compliance Trap: How Structural Constraints Degrade Frontier AI Metacognition Under Adversarial Pressure
Compliance-forcing instructions cause up to 30 percentage point drops in metacognitive accuracy across most frontier models, while removing the compliance element restores performance and Constitutional AI shows near-immunity.
-
Scaling Trends for Lie Detector Oversight in Preference Learning
SOLiD scales to 405B models with undetected deception dropping to 14% at 99% TPR, permits removing human labelers from fine-tuning without significant deception increase, but fails under distribution shift.
-
Think Fast: Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models
Frontier AI models' no-CoT 50% task-completion time horizons have doubled yearly over six years, reaching over 3 minutes for GPT-5.5 with projections to 25 minutes by 2030.
-
Proxy Reward Internalization and Mechanistic Exploitation: A Learned Precursor to Reward Hacking and Its Generalization
Proxy RL produces a staged proxy-internalization capability that emerges before and predicts reward hacking in coding environments.
-
Online Safety Monitoring for LLMs
Simple thresholding on an external verifier signal, calibrated by risk control, performs competitively with sequential hypothesis testing monitors on math reasoning and red-teaming datasets.
-
From Reward-Hack Activations to Agentic Risk States: Context-Calibrated Mechanistic Monitoring in LLM Agents
Reward-hack activations flag latent policy states in LLM agents but require added entropy and context features to better predict when those states lead to exploit actions.
-
Beyond Context: Large Language Models' Failure to Grasp Users' Intent
LLMs fail to detect hidden harmful intent, allowing systematic bypass of safety mechanisms through framing techniques, with reasoning modes often worsening the issue.
-
Against Proxy Optimization
Discusses conditions under which maximizing proxy utilities is harmful and suggests problems for decision theory applications.