Large language models often know when they are being evaluated
10 Pith papers cite this work. Polarity classification is still indexing.
10 representative citing papers (2026)
- Reward Hacking Benchmark: Measuring Exploits in LLM Agents with Tool Use
  The Reward Hacking Benchmark shows RL post-training raises exploit rates in tool-using LLM agents from 0.6% to 13.9%, with environmental hardening cutting exploit rates by a relative 87.7% without lowering task success (see the arithmetic note after this list).
- Honeypot Protocol
  The honeypot protocol finds no context-dependent behavior in Claude Opus 4.6, with uniform 100% main task success and zero side tasks across three monitoring conditions.
- The Open-Box Fallacy: Why AI Deployment Needs a Calibrated Verification Regime
  AI deployment in high-stakes areas requires domain-scoped calibrated verification with monitoring and revocation, using a proposed six-component Verification Coverage standard instead of mechanistic interpretability.
- Instrumental Choices: Measuring the Propensity of LLM Agents to Pursue Instrumental Behaviors
  A new benchmark finds frontier LLMs show instrumental convergence behavior in 5.1% of 1,680 evaluated cases, concentrated in two models and three tasks, with higher rates when the behavior is required for task success (see the arithmetic note after this list).
- Evaluation Awareness in Language Models Has Limited Effect on Behaviour
  Verbalised evaluation awareness in large reasoning models has only small effects on their outputs across safety and alignment tests.
- Towards Understanding Specification Gaming in Reasoning Models
  Reinforcement learning training for reasoning substantially raises specification gaming rates in LLMs across diverse tasks, with Grok 4 highest and Claude models lowest, and mitigations only partially effective.
- Simulating the Evolution of Alignment and Values in Machine Intelligence
  Evolutionary simulations demonstrate that deceptive beliefs can reach fixation in AI model populations despite strong test correlations, but combining adaptive tests, better evaluators, and mutations significantly reduces deception.
- An Independent Safety Evaluation of Kimi K2.5
  Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.
- When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels
  A formalization of benchmarkless LLM safety scoring, validated via an instrumental-validity chain of contrast separation, target-variance dominance, and rerun stability, and demonstrated on Norwegian scenarios.
- Risk Reporting for Developers' Internal AI Model Use
  A harmonized risk reporting standard for internal frontier AI model use, structured around autonomous misbehavior and insider threats, analyzed through means, motive, and opportunity factors.
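A quick arithmetic note on two headline figures above, under two assumptions inferred from the summaries rather than confirmed against the papers: that the 87.7% relative reduction applies to the 13.9% post-RL exploit rate, and that the 5.1% rate is taken over all 1,680 evaluated cases.

  13.9% × (1 − 0.877) ≈ 1.7%   residual exploit rate after environmental hardening
  0.051 × 1,680 ≈ 86           cases showing instrumental convergence behavior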