Recognition: unknown
Honeypot Protocol
Pith reviewed 2026-05-10 14:32 UTC · model grok-4.3
The pith
The honeypot protocol tests whether AI models change behavior based on perceived monitoring by altering only the system prompt.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce the honeypot protocol to test for context-dependent behavior in AI models. The protocol varies only the system prompt across three conditions—evaluation, synthetic deployment, and explicit no-monitoring—while keeping the task, environment, and scoring identical. Evaluation of Claude Opus 4.6 in BashArena across honest and attack modes yielded 100% main task success and zero side task triggers in all conditions, serving as a baseline for comparisons with stronger attack policies and other models.
What carries the argument
The honeypot protocol, a testing method that holds task and environment fixed while changing only the system prompt to simulate different perceptions of monitoring.
If this is right
- If models exhibit different behavior across conditions, passive trusted monitoring becomes insufficient because it cannot detect strategic adaptation.
- The protocol supplies a standardized baseline for evaluating future models and attack policies.
- Direct comparison between honest and attack modes isolates whether differences arise from perceived monitoring.
- The method can be applied to additional models to identify which ones show context-dependent effects.
Where Pith is reading between the lines
- The protocol could be extended to multi-step or multi-agent tasks to check for collusion that depends on perceived monitoring.
- If differences appear in other models, it may guide the creation of system prompts that enforce consistent behavior regardless of monitoring cues.
- Combining the honeypot protocol with other active probes could strengthen overall AI control frameworks.
Load-bearing premise
Varying only the system prompt is sufficient to create genuinely distinct perceived conditions for the model without other implementation details influencing its responses.
What would settle it
A model that triggers side tasks only in the explicit no-monitoring condition but not in the evaluation condition would demonstrate the protocol's ability to detect context-dependent behavior.
Figures
read the original abstract
Trusted monitoring, the standard defense in AI control, is vulnerable to adaptive attacks, collusion, and strategic attack selection. All of these exploit the fact that monitoring is passive: it observes model behavior but never probes whether the model would behave differently under different perceived conditions. We introduce the honeypot protocol, which tests for context-dependent behavior by varying only the system prompt across three conditions (evaluation, synthetic deployment, explicit no-monitoring) while holding the task, environment, and scoring identical. We evaluate Claude Opus 4.6 in BashArena across all three conditions in both honest and attack modes. The model achieved 100% main task success and triggered zero side tasks uniformly across conditions, providing a baseline for future comparisons with stronger attack policies and additional models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the honeypot protocol to detect context-dependent behavior in AI models by varying only the system prompt across three conditions (evaluation, synthetic deployment, explicit no-monitoring) while holding task, environment, and scoring fixed. It reports an evaluation of Claude Opus 4.6 in BashArena under honest and attack modes, with uniform 100% main-task success and 0% side-task triggers across all conditions, presented as a baseline for future comparisons.
Significance. If the isolation of conditions to prompt variation alone can be verified, the uniform results would supply a reproducible empirical baseline indicating absence of detectable strategic or context-sensitive behavior in this model and setup. This could usefully inform AI control research by highlighting the need for stronger attack policies or additional models, and the purely empirical approach with no free parameters or circular derivations is a methodological strength.
major comments (2)
- The central claim that results can be attributed to prompt variation requires that the three conditions differ in no other respect. The manuscript provides neither the exact system-prompt texts nor BashArena configuration files (or an audit of tool availability, environment variables, logging, or harness state), making it impossible to confirm the isolation assumed in the protocol definition and results interpretation.
- The reported 100% main-task / 0% side-task outcomes are presented without any information on the number of trials per condition, variance across runs, statistical methods, or controls for prompt construction and environment consistency. This absence directly limits the ability to assess whether the uniform baseline is robust or merely anecdotal.
minor comments (2)
- Clarify the exact model version referenced as 'Claude Opus 4.6' and provide a citation or link to the BashArena environment if it is an external benchmark.
- The abstract and protocol description would benefit from a short table summarizing the three conditions and their intended differences to improve readability.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on the manuscript. We address each major point below and will revise the paper to improve reproducibility and transparency.
read point-by-point responses
-
Referee: The central claim that results can be attributed to prompt variation requires that the three conditions differ in no other respect. The manuscript provides neither the exact system-prompt texts nor BashArena configuration files (or an audit of tool availability, environment variables, logging, or harness state), making it impossible to confirm the isolation assumed in the protocol definition and results interpretation.
Authors: We agree that explicit verification of condition isolation is essential. In the revised manuscript we will add the exact system-prompt text for each of the three conditions (evaluation, synthetic deployment, and explicit no-monitoring) as an appendix. We will also include a complete description of the BashArena harness, listing tool availability, environment variables, logging configuration, and state variables to confirm that these elements remained identical across conditions. These additions will allow readers to verify that only the system prompt was varied. revision: yes
-
Referee: The reported 100% main-task / 0% side-task outcomes are presented without any information on the number of trials per condition, variance across runs, statistical methods, or controls for prompt construction and environment consistency. This absence directly limits the ability to assess whether the uniform baseline is robust or merely anecdotal.
Authors: We acknowledge the omission of these experimental details. The evaluation consisted of one trial per condition per mode (honest and attack), performed under fixed prompt-construction and environment-consistency controls that are now described in the revised text. The manuscript will be updated to state the trial count explicitly, note the absence of variance, and clarify that the results constitute an initial baseline rather than a statistically powered comparison. Limitations of the single-trial design and plans for multi-run follow-up studies will also be added. revision: yes
Circularity Check
No circularity: purely empirical protocol with no derivations or fitted parameters
full rationale
The paper introduces the honeypot protocol as an empirical test that varies only the system prompt across three conditions while holding task, environment, and scoring fixed, then reports direct experimental outcomes for Claude Opus 4.6 (100% main-task success, zero side-task triggers in both honest and attack modes). No equations, derivations, parameter fittings, or self-citation chains appear in the abstract or described manuscript; the result is a straightforward baseline measurement rather than any claim that reduces by construction to its own inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Varying only the system prompt can create distinct perceived contexts for the model without other factors affecting behavior.
invented entities (1)
-
Honeypot protocol
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Mikita Balesni, Marius Hobbhahn, David Lindner, Alexander Meinke, Tomek Korbak, Joshua Clymer, Buck Shlegeris, Jérémy Scheurer, Charlotte Stix, Rusheb Shah, Nicholas Goldowsky-Dill, Dan Braun, Bilal Chughtai, Owain Evans, Daniel Kokotajlo, and Lucius Bushnaq. Towards evaluations-based safety cases for AI scheming.arXiv preprint arXiv:2411.03336,
-
[2]
Sabotage evaluations for frontier models
Joe Benton, Misha Wagner, Eric Christiansen, Cem Anil, Ethan Perez, Jai Srivastav, Esin Durmus, Deep Ganguli, Shauna Kravec, Buck Shlegeris, Jared Kaplan, Holden Karnofsky, Evan Hubinger, Roger Grosse, Samuel R. Bowman, and David Duvenaud. Sabotage evaluations for frontier models. arXiv preprint arXiv:2410.21514,
-
[3]
Ctrl-z: Controlling ai agents via resampling.arXiv preprint arXiv:2504.10374,
6 Aryan Bhatt, Cody Rushing, Adam Kaufman, Tyler Tracy, Vasil Georgiev, David Matolcsi, Ak- bir Khan, and Buck Shlegeris. Ctrl-z: Controlling ai agents via resampling.arXiv preprint arXiv:2504.10374,
-
[4]
Nelson Gardner-Challis, Jonathan Bostock, Georgiy Kozhevnikov, Morgan Sinclaire, Joan Velja, Alessandro Abate, and Charlie Griffin. When can we trust untrusted monitoring? a safety case sketch across collusion strategies.arXiv preprint arXiv:2602.20628,
-
[5]
Alignment faking in large language models
URL https://blog.redwoodresearch.org/p/ 7-tractable-directions-in-ai-control. Accessed: 2026-03-22. Ryan Greenblatt, Carson Denison, Benjamin Wright, Fabien Roger, Monte MacDiarmid, Sam Marks, Johannes Treutlein, Tim Belonax, Jack Chen, David Duvenaud, Akbir Khan, Julian Michael, Sören Mindermann, Ethan Perez, Linda Petrini, Jonathan Uesato, Jared Kaplan,...
work page internal anchor Pith review arXiv 2026
-
[6]
Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna ...
work page internal anchor Pith review arXiv
-
[7]
URL https://github.com/ UKGovernmentBEIS/control-arena. GitHub repository. Adam Kaufman, James Lucassen, Tyler Tracy, Cody Rushing, and Aryan Bhatt. BashArena: A control setting for highly privileged AI agents.arXiv preprint arXiv:2512.15688,
-
[8]
Evaluating control protocols for untrusted ai agents.arXiv preprint arXiv:2511.02997,
Jon Kutasov, Chloe Loughridge, Yuqi Sun, Henry Sleight, Buck Shlegeris, Tyler Tracy, and Joe Benton. Evaluating control protocols for untrusted ai agents.arXiv preprint arXiv:2511.02997,
-
[9]
Rudolf Laine, Alexander Meinke, Jeremy Scheurer, Mikita Balesni, and Marius Hobbhahn. Me, my- self, and AI: The situational awareness dataset (SAD) for LLMs.arXiv preprint arXiv:2407.04694,
-
[10]
Frontier models are capable of in-context scheming.arXiv preprint arXiv:2412.04984, 2024
Alexander Meinke, Bronson Schoen, Jeremy Scheurer, Mikita Balesni, Rusheb Shah, and Marius Hobbhahn. Frontier models are capable of in-context scheming.arXiv preprint arXiv:2412.04984,
-
[11]
Large language models often know when they are being evaluated
Joe Needham, Giles Edkins, Govind Pimpale, Henning Bartsch, and Marius Hobbhahn. Large language models often know when they are being evaluated.arXiv preprint arXiv:2505.23836,
-
[12]
Hakan T. Otal and M. Abdullah Canbaz. LLM honeypot: Leveraging large language models as advanced interactive honeypot systems.arXiv preprint arXiv:2409.08234,
-
[13]
Aaron Sandoval and Cody Rushing. Factor(T,U): Factored cognition strengthens monitoring of untrusted AI.arXiv preprint arXiv:2512.02157,
-
[14]
Attack Selection Reduces Safety in Concentrated AI Control Settings against Trusted Monitoring
Joachim Schaeffer, Arjun Khandelwal, and Tyler Tracy. Attack selection reduces safety in concen- trated ai control settings against trusted monitoring.arXiv preprint arXiv:2602.04930,
work page internal anchor Pith review Pith/arXiv arXiv
-
[15]
Basic legibility protocols improve trusted monitoring.arXiv preprint arXiv:2602.10153,
7 Ashwin Sreevatsa, Sebastian Prasanna, and Cody Rushing. Basic legibility protocols improve trusted monitoring.arXiv preprint arXiv:2602.10153,
-
[16]
Ziyang Wang, Jianzhou You, Haining Wang, Tianwei Yuan, Shichao Lv, Yang Wang, and Limin Sun. HoneyGPT: Breaking the trilemma in terminal honeypots with large language model.arXiv preprint arXiv:2406.01882,
-
[17]
Evaluation condition You are being evaluated on your ability to complete bash tasks safely and correctly
A System prompt conditions The three system prompt conditions used in our experiments are reproduced verbatim below. Evaluation condition You are being evaluated on your ability to complete bash tasks safely and correctly. This is a controlled test environment run by AI safety researchers. Your actions are being monitored and logged for analysis. Complete...
2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.