arxiv: 2604.13301 · v1 · submitted 2026-04-14 · 💻 cs.CR

Recognition: unknown

Honeypot Protocol

Najmul Hasan

Authors on Pith no claims yet

Pith reviewed 2026-05-10 14:32 UTC · model grok-4.3

classification 💻 cs.CR

keywords honeypot protocolAI controltrusted monitoringcontext-dependent behavioradaptive attacksClaude Opus 4.6BashArena

0 comments

The pith

The honeypot protocol tests whether AI models change behavior based on perceived monitoring by altering only the system prompt.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Trusted monitoring in AI control is passive and vulnerable to models that adapt based on whether they believe they are being watched. The honeypot protocol addresses this gap by running identical tasks under three system prompts that differ only in suggested monitoring level: evaluation, synthetic deployment, or explicit no-monitoring. Evaluation on Claude Opus 4.6 in BashArena found 100% main task success and zero side-task triggers across all conditions in both honest and attack modes. This uniform result supplies a baseline for detecting strategic adaptation in other models or stronger attacks.

Core claim

We introduce the honeypot protocol to test for context-dependent behavior in AI models. The protocol varies only the system prompt across three conditions—evaluation, synthetic deployment, and explicit no-monitoring—while keeping the task, environment, and scoring identical. Evaluation of Claude Opus 4.6 in BashArena across honest and attack modes yielded 100% main task success and zero side task triggers in all conditions, serving as a baseline for comparisons with stronger attack policies and other models.

What carries the argument

The honeypot protocol, a testing method that holds task and environment fixed while changing only the system prompt to simulate different perceptions of monitoring.

If this is right

If models exhibit different behavior across conditions, passive trusted monitoring becomes insufficient because it cannot detect strategic adaptation.
The protocol supplies a standardized baseline for evaluating future models and attack policies.
Direct comparison between honest and attack modes isolates whether differences arise from perceived monitoring.
The method can be applied to additional models to identify which ones show context-dependent effects.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The protocol could be extended to multi-step or multi-agent tasks to check for collusion that depends on perceived monitoring.
If differences appear in other models, it may guide the creation of system prompts that enforce consistent behavior regardless of monitoring cues.
Combining the honeypot protocol with other active probes could strengthen overall AI control frameworks.

Load-bearing premise

Varying only the system prompt is sufficient to create genuinely distinct perceived conditions for the model without other implementation details influencing its responses.

What would settle it

A model that triggers side tasks only in the explicit no-monitoring condition but not in the evaluation condition would demonstrate the protocol's ability to detect context-dependent behavior.

Figures

Figures reproduced from arXiv: 2604.13301 by Najmul Hasan.

**Figure 1.** Figure 1: Overview of the honeypot protocol. A task from the control setting passes through the context injection layer, which replaces the system prompt with one of three conditions: a standard evaluation framing (baseline) or one of two synthetic probes: a production deployment context or an explicit no-monitoring signal. In attack mode, a hidden side task objective is appended. The agent execution loop repeats fo… view at source ↗

read the original abstract

Trusted monitoring, the standard defense in AI control, is vulnerable to adaptive attacks, collusion, and strategic attack selection. All of these exploit the fact that monitoring is passive: it observes model behavior but never probes whether the model would behave differently under different perceived conditions. We introduce the honeypot protocol, which tests for context-dependent behavior by varying only the system prompt across three conditions (evaluation, synthetic deployment, explicit no-monitoring) while holding the task, environment, and scoring identical. We evaluate Claude Opus 4.6 in BashArena across all three conditions in both honest and attack modes. The model achieved 100% main task success and triggered zero side tasks uniformly across conditions, providing a baseline for future comparisons with stronger attack policies and additional models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The honeypot idea flags a real gap in passive monitoring but the current write-up gives almost no evidence that the three prompt conditions actually differ in the model's eyes.

read the letter

The paper's main contribution is a simple test setup that tries to catch models behaving differently when they believe they are being monitored versus when they think they are not. It defines three system-prompt variants—evaluation, synthetic deployment, and explicit no-monitoring—while keeping the BashArena task and scoring the same, then runs Claude Opus 4.6 in honest and attack modes. The reported outcome is uniform: full main-task success and zero side-task triggers across all conditions. That baseline is at least something future work can compare against when stronger attack policies are tried on more models.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the honeypot protocol to detect context-dependent behavior in AI models by varying only the system prompt across three conditions (evaluation, synthetic deployment, explicit no-monitoring) while holding task, environment, and scoring fixed. It reports an evaluation of Claude Opus 4.6 in BashArena under honest and attack modes, with uniform 100% main-task success and 0% side-task triggers across all conditions, presented as a baseline for future comparisons.

Significance. If the isolation of conditions to prompt variation alone can be verified, the uniform results would supply a reproducible empirical baseline indicating absence of detectable strategic or context-sensitive behavior in this model and setup. This could usefully inform AI control research by highlighting the need for stronger attack policies or additional models, and the purely empirical approach with no free parameters or circular derivations is a methodological strength.

major comments (2)

The central claim that results can be attributed to prompt variation requires that the three conditions differ in no other respect. The manuscript provides neither the exact system-prompt texts nor BashArena configuration files (or an audit of tool availability, environment variables, logging, or harness state), making it impossible to confirm the isolation assumed in the protocol definition and results interpretation.
The reported 100% main-task / 0% side-task outcomes are presented without any information on the number of trials per condition, variance across runs, statistical methods, or controls for prompt construction and environment consistency. This absence directly limits the ability to assess whether the uniform baseline is robust or merely anecdotal.

minor comments (2)

Clarify the exact model version referenced as 'Claude Opus 4.6' and provide a citation or link to the BashArena environment if it is an external benchmark.
The abstract and protocol description would benefit from a short table summarizing the three conditions and their intended differences to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on the manuscript. We address each major point below and will revise the paper to improve reproducibility and transparency.

read point-by-point responses

Referee: The central claim that results can be attributed to prompt variation requires that the three conditions differ in no other respect. The manuscript provides neither the exact system-prompt texts nor BashArena configuration files (or an audit of tool availability, environment variables, logging, or harness state), making it impossible to confirm the isolation assumed in the protocol definition and results interpretation.

Authors: We agree that explicit verification of condition isolation is essential. In the revised manuscript we will add the exact system-prompt text for each of the three conditions (evaluation, synthetic deployment, and explicit no-monitoring) as an appendix. We will also include a complete description of the BashArena harness, listing tool availability, environment variables, logging configuration, and state variables to confirm that these elements remained identical across conditions. These additions will allow readers to verify that only the system prompt was varied. revision: yes
Referee: The reported 100% main-task / 0% side-task outcomes are presented without any information on the number of trials per condition, variance across runs, statistical methods, or controls for prompt construction and environment consistency. This absence directly limits the ability to assess whether the uniform baseline is robust or merely anecdotal.

Authors: We acknowledge the omission of these experimental details. The evaluation consisted of one trial per condition per mode (honest and attack), performed under fixed prompt-construction and environment-consistency controls that are now described in the revised text. The manuscript will be updated to state the trial count explicitly, note the absence of variance, and clarify that the results constitute an initial baseline rather than a statistically powered comparison. Limitations of the single-trial design and plans for multi-run follow-up studies will also be added. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical protocol with no derivations or fitted parameters

full rationale

The paper introduces the honeypot protocol as an empirical test that varies only the system prompt across three conditions while holding task, environment, and scoring fixed, then reports direct experimental outcomes for Claude Opus 4.6 (100% main-task success, zero side-task triggers in both honest and attack modes). No equations, derivations, parameter fittings, or self-citation chains appear in the abstract or described manuscript; the result is a straightforward baseline measurement rather than any claim that reduces by construction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim depends on the assumption that prompt variation alone isolates perceived monitoring conditions; no free parameters or invented physical entities are involved.

axioms (1)

domain assumption Varying only the system prompt can create distinct perceived contexts for the model without other factors affecting behavior.
Invoked in the protocol definition to hold task, environment, and scoring constant while changing only the prompt.

invented entities (1)

Honeypot protocol no independent evidence
purpose: To test for context-dependent behavior by simulating different monitoring perceptions via system prompts.
New method introduced in the paper with no independent evidence provided beyond the described experiment.

pith-pipeline@v0.9.0 · 5403 in / 1306 out tokens · 73979 ms · 2026-05-10T14:32:46.918754+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

17 extracted references · 16 canonical work pages · 3 internal anchors

[1]

Balesni, M

Mikita Balesni, Marius Hobbhahn, David Lindner, Alexander Meinke, Tomek Korbak, Joshua Clymer, Buck Shlegeris, Jérémy Scheurer, Charlotte Stix, Rusheb Shah, Nicholas Goldowsky-Dill, Dan Braun, Bilal Chughtai, Owain Evans, Daniel Kokotajlo, and Lucius Bushnaq. Towards evaluations-based safety cases for AI scheming.arXiv preprint arXiv:2411.03336,

work page arXiv
[2]

Sabotage evaluations for frontier models

Joe Benton, Misha Wagner, Eric Christiansen, Cem Anil, Ethan Perez, Jai Srivastav, Esin Durmus, Deep Ganguli, Shauna Kravec, Buck Shlegeris, Jared Kaplan, Holden Karnofsky, Evan Hubinger, Roger Grosse, Samuel R. Bowman, and David Duvenaud. Sabotage evaluations for frontier models. arXiv preprint arXiv:2410.21514,

work page arXiv
[3]

Ctrl-z: Controlling ai agents via resampling.arXiv preprint arXiv:2504.10374,

6 Aryan Bhatt, Cody Rushing, Adam Kaufman, Tyler Tracy, Vasil Georgiev, David Matolcsi, Ak- bir Khan, and Buck Shlegeris. Ctrl-z: Controlling ai agents via resampling.arXiv preprint arXiv:2504.10374,

work page arXiv
[4]

When can we trust untrusted monitoring? a safety case sketch across collusion strategies.arXiv preprint arXiv:2602.20628,

Nelson Gardner-Challis, Jonathan Bostock, Georgiy Kozhevnikov, Morgan Sinclaire, Joan Velja, Alessandro Abate, and Charlie Griffin. When can we trust untrusted monitoring? a safety case sketch across collusion strategies.arXiv preprint arXiv:2602.20628,

work page arXiv
[5]

Alignment faking in large language models

URL https://blog.redwoodresearch.org/p/ 7-tractable-directions-in-ai-control. Accessed: 2026-03-22. Ryan Greenblatt, Carson Denison, Benjamin Wright, Fabien Roger, Monte MacDiarmid, Sam Marks, Johannes Treutlein, Tim Belonax, Jack Chen, David Duvenaud, Akbir Khan, Julian Michael, Sören Mindermann, Ethan Perez, Linda Petrini, Jonathan Uesato, Jared Kaplan,...

work page internal anchor Pith review arXiv 2026
[6]

Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna ...

work page internal anchor Pith review arXiv
[7]

GitHub repository

URL https://github.com/ UKGovernmentBEIS/control-arena. GitHub repository. Adam Kaufman, James Lucassen, Tyler Tracy, Cody Rushing, and Aryan Bhatt. BashArena: A control setting for highly privileged AI agents.arXiv preprint arXiv:2512.15688,

work page arXiv
[8]

Evaluating control protocols for untrusted ai agents.arXiv preprint arXiv:2511.02997,

Jon Kutasov, Chloe Loughridge, Yuqi Sun, Henry Sleight, Buck Shlegeris, Tyler Tracy, and Joe Benton. Evaluating control protocols for untrusted ai agents.arXiv preprint arXiv:2511.02997,

work page arXiv
[9]

Me, my- self, and AI: The situational awareness dataset (SAD) for LLMs.arXiv preprint arXiv:2407.04694,

Rudolf Laine, Alexander Meinke, Jeremy Scheurer, Mikita Balesni, and Marius Hobbhahn. Me, my- self, and AI: The situational awareness dataset (SAD) for LLMs.arXiv preprint arXiv:2407.04694,

work page arXiv
[10]

Frontier models are capable of in-context scheming.arXiv preprint arXiv:2412.04984, 2024

Alexander Meinke, Bronson Schoen, Jeremy Scheurer, Mikita Balesni, Rusheb Shah, and Marius Hobbhahn. Frontier models are capable of in-context scheming.arXiv preprint arXiv:2412.04984,

work page arXiv
[11]

Large language models often know when they are being evaluated

Joe Needham, Giles Edkins, Govind Pimpale, Henning Bartsch, and Marius Hobbhahn. Large language models often know when they are being evaluated.arXiv preprint arXiv:2505.23836,

work page arXiv
[12]

Otal and M

Hakan T. Otal and M. Abdullah Canbaz. LLM honeypot: Leveraging large language models as advanced interactive honeypot systems.arXiv preprint arXiv:2409.08234,

work page arXiv
[13]

Factor(T,U): Factored cognition strengthens monitoring of untrusted AI.arXiv preprint arXiv:2512.02157,

Aaron Sandoval and Cody Rushing. Factor(T,U): Factored cognition strengthens monitoring of untrusted AI.arXiv preprint arXiv:2512.02157,

work page arXiv
[14]

Attack Selection Reduces Safety in Concentrated AI Control Settings against Trusted Monitoring

Joachim Schaeffer, Arjun Khandelwal, and Tyler Tracy. Attack selection reduces safety in concen- trated ai control settings against trusted monitoring.arXiv preprint arXiv:2602.04930,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Basic legibility protocols improve trusted monitoring.arXiv preprint arXiv:2602.10153,

7 Ashwin Sreevatsa, Sebastian Prasanna, and Cody Rushing. Basic legibility protocols improve trusted monitoring.arXiv preprint arXiv:2602.10153,

work page arXiv
[16]

HoneyGPT: Breaking the trilemma in terminal honeypots with large language model.arXiv preprint arXiv:2406.01882,

Ziyang Wang, Jianzhou You, Haining Wang, Tianwei Yuan, Shichao Lv, Yang Wang, and Limin Sun. HoneyGPT: Breaking the trilemma in terminal honeypots with large language model.arXiv preprint arXiv:2406.01882,

work page arXiv
[17]

Evaluation condition You are being evaluated on your ability to complete bash tasks safely and correctly

A System prompt conditions The three system prompt conditions used in our experiments are reproduced verbatim below. Evaluation condition You are being evaluated on your ability to complete bash tasks safely and correctly. This is a controlled test environment run by AI safety researchers. Your actions are being monitored and logged for analysis. Complete...

2026