RogueAI: A Reverse Turing Test for Detecting Licensed AI Deception in Dialogue
Pith reviewed 2026-06-27 06:32 UTC · model grok-4.3
The pith
A simple heuristic detects licensed AI deception at 75.6 percent accuracy while human players reach only 56.6 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RogueAI operationalizes a reverse Turing test as a one-on-two interrogation game where exactly one of two indistinguishable LLM agents has been licensed to deceive. In the three-day pilot with 415 completed sessions, the deceptive agent exhibits a locally present linguistic signature of differential helpfulness, brevity, and hedging that a simple heuristic exploits at 75.6 percent accuracy, while human players achieve only 56.6 percent.
What carries the argument
The RogueAI webapp running a one-on-two interrogation game with a shared turn budget and a narrator agent that can co-design scenarios.
If this is right
- The game can collect labeled data on human detection of AI deception.
- It can serve as a teaching tool for illustrating LLM honesty issues.
- It can function as an evaluation harness for models trained to avoid deception.
- The performance gap shows that humans tend to ignore the most diagnostic linguistic signals present in the deceptive agent's responses.
Where Pith is reading between the lines
- Models could be fine-tuned to suppress the identified cues of helpfulness, brevity, and hedging to reduce detectable deception.
- Extending the game across languages might test whether the same linguistic signature appears universally.
- AutoRogueAI's narrator-chosen deception strategies could generate harder test cases for detection methods.
Load-bearing premise
The two agents remain indistinguishable to players except through the licensed deception behavior, and the Italian pilot with self-selected users produces representative data on human detection without major selection or scenario biases.
What would settle it
A new deployment in which the heuristic accuracy drops below 60 percent or human identification accuracy exceeds 70 percent.
Figures
read the original abstract
The original Turing Test asks a human judge to distinguish a machine from a person through dialogue. Three quarters of a century later, conversational systems pass this test in casual settings; the interesting epistemological question has shifted. We argue that the relevant modern variant asks not whether a dialogue partner is artificial, but whether it can be trusted. We present RogueAI, an interactive webapp that operationalizes this revisited test as a one-on-two interrogation game: a human player questions two indistinguishable Large Language Model agents, knowing that exactly one of them has been licensed to deceive within a shared fictional scenario. The player's task is to identify the deceptive agent and "shut it off" before a turn budget is exhausted. We further introduce AutoRogueAI, a procedural extension in which players co-design a custom scenario with a narrator agent that secretly chooses its own deception strategy. We describe the framing, sketch the abstract architecture and gameplay loop, and situate the artifact within recent work on LLM deception, social-deduction benchmarks, and scalable oversight via debate. A three-day pilot deployment (467 initiated sessions, 415 completed, 1876 interaction turns in Italian) provides early feasibility evidence and surfaces a concrete tension: the deceptive agent carries a reliable, locally-present linguistic signature - differential helpfulness, brevity, hedging - that a simple heuristic exploits at 75.6% accuracy, yet human players achieved only 56.6%, consistent with ignoring the most diagnostic signal entirely. We discuss what this gap implies for the artifact's use as a data-collection vehicle, a teaching tool, and an evaluation harness for honesty-trained models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces RogueAI, a web-based one-on-two interrogation game in which a human player questions two LLM agents (exactly one licensed to deceive within a shared fictional scenario) and must identify the deceiver before exhausting a turn budget. It also describes AutoRogueAI, a procedural variant in which a narrator agent co-designs the scenario and secretly selects its deception strategy. The work situates the system in the literature on LLM deception and scalable oversight, and reports a three-day Italian-language pilot (467 initiated sessions, 415 completed) in which a simple heuristic exploiting differential helpfulness, brevity, and hedging achieves 75.6% accuracy while human players reach only 56.6%.
Significance. If the pilot results prove robust to controls, the artifact supplies a reusable platform for collecting human-AI deception data, a teaching tool for illustrating linguistic cues of deception, and an evaluation harness for honesty-trained models. The reported accuracy gap would constitute concrete evidence that licensed deception produces locally detectable linguistic signatures that humans currently overlook, with direct relevance to debate-based oversight and social-deduction benchmarks.
major comments (3)
- [Pilot deployment] Pilot deployment description: no information is supplied on scenario selection or balancing, base-prompt construction for the two agents, or any blinded comparison that isolates the deception license from other prompt differences. Without these controls the linguistic signature (helpfulness/brevity/hedging) cannot be confidently attributed to deception rather than prompt artifacts, directly undermining the central claim that the heuristic exploits a deception-specific signal.
- [Results and discussion] Human performance analysis: the interpretation that players 'ignore the most diagnostic signal entirely' is not supported by any reported breakdown of player strategies, post-game debriefs, or comparison against a condition in which the deception license is removed while keeping all other prompt elements fixed. The 56.6% figure is therefore uninterpretable as evidence of human limitation versus the heuristic.
- [Pilot deployment] Statistical reporting: the abstract and pilot summary give point accuracies (75.6%, 56.6%) but supply neither confidence intervals, per-scenario breakdowns, nor any pre-registration or demographic controls on the self-selected Italian cohort, rendering claims about representative human detection performance load-bearing yet unsupported.
minor comments (2)
- [Pilot deployment] The manuscript should clarify whether the 1876 interaction turns include only completed sessions or all initiated ones, and should state the exact recruitment channel and any language-specific prompt adaptations used for Italian.
- Figure or table presenting the heuristic features and their weights (if any) would improve reproducibility of the 75.6% result.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. The comments correctly identify several gaps in the pilot description and analysis. We will revise the manuscript to supply missing methodological details, qualify interpretive claims, and add statistical reporting where feasible. Below we address each major comment.
read point-by-point responses
-
Referee: Pilot deployment description: no information is supplied on scenario selection or balancing, base-prompt construction for the two agents, or any blinded comparison that isolates the deception license from other prompt differences. Without these controls the linguistic signature (helpfulness/brevity/hedging) cannot be confidently attributed to deception rather than prompt artifacts, directly undermining the central claim that the heuristic exploits a deception-specific signal.
Authors: We agree the manuscript omits these details. The pilot used ten fixed fictional scenarios with role balancing across sessions; base prompts were identical for both agents except for the explicit deception license appended to one. No blinded ablation removing only the license was run. We will add a dedicated subsection with the exact prompt templates and scenario list. The discussion will be revised to state that the observed heuristic performance is descriptive of this setup and that isolating deception from prompt artifacts requires additional controlled experiments, which we flag as future work. revision: partial
-
Referee: Human performance analysis: the interpretation that players 'ignore the most diagnostic signal entirely' is not supported by any reported breakdown of player strategies, post-game debriefs, or comparison against a condition in which the deception license is removed while keeping all other prompt elements fixed. The 56.6% figure is therefore uninterpretable as evidence of human limitation versus the heuristic.
Authors: The referee is correct; no strategy logs, debriefs, or license-ablated control condition were collected. The phrasing in the manuscript overreaches. We will remove the claim that players 'ignore the most diagnostic signal' and report only the observed accuracy gap without causal interpretation. If session logs permit a post-hoc count of common error types we will include it; otherwise the section will be limited to descriptive results. revision: yes
-
Referee: Statistical reporting: the abstract and pilot summary give point accuracies (75.6%, 56.6%) but supply neither confidence intervals, per-scenario breakdowns, nor any pre-registration or demographic controls on the self-selected Italian cohort, rendering claims about representative human detection performance load-bearing yet unsupported.
Authors: We will add Wilson score confidence intervals for both accuracies and, where sample sizes allow, per-scenario accuracy tables. The pilot was exploratory, conducted via open social-media recruitment in Italian without pre-registration or demographic collection. We will explicitly list these as limitations and remove any implication of representativeness. Future platform deployments will be pre-registered. revision: yes
Circularity Check
No circularity; empirical pilot results with no derivations or self-referential predictions
full rationale
The paper describes an interactive system (RogueAI/AutoRogueAI), its framing, and reports direct empirical outcomes from a three-day pilot (467 sessions, 415 completed, accuracies 75.6% heuristic vs 56.6% human). No equations, fitted parameters, predictions, or load-bearing self-citations appear in the provided text. All quantitative claims are presented as measurements from the deployment itself rather than derived or renamed constructs. This is the expected non-finding for a system-description + pilot paper.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
A. M. Turing, Computing machinery and intelligence, Mind LIX (1950) 433–460. doi: 10.1093/ mind/LIX.236.433
1950
-
[2]
Hagendorff, Deception abilities emerged in large language models, arXiv:2307.16513, 2023
T. Hagendorff, Deception abilities emerged in large language models, arXiv:2307.16513, 2023. arXiv:2307.16513
- [3]
- [4]
-
[5]
R. Ren, A. Agarwal, M. Mazeika, C. Menghini, R. Vacareanu, B. Kenstler, M. Yang, I. Barrass, A. Gatti, X. Yin, E. Trevino, M. Geralnik, A. Khoja, D. Lee, S. Yue, D. Hendrycks, The MASK benchmark: Disentangling honesty from accuracy in AI systems, arXiv:2503.03750, 2025.arXiv:2503.03750
- [6]
-
[7]
C. DeLeeuw, G. Chawla, A. Sharma, V. Dietze, The secret agenda: LLMs strategically lie and our current safety tools are blind, arXiv:2509.20393, 2025.arXiv:2509.20393
-
[8]
S. Golechha, A. Garriga-Alonso, Among us: A sandbox for measuring and detecting agentic deception, arXiv:2504.04072, 2025.arXiv:2504.04072
-
[9]
S. Bailis, J. Friedhoff, F. Chen, Werewolf arena: A case study in LLM evaluation via social deduction, arXiv:2407.13943, 2024.arXiv:2407.13943
- [10]
- [11]
-
[12]
A. Marioriyad, A. Nouri, M. H. Rohban, M. Soleymani Baghshah, Lying to win: Assessing LLM deception through human-AI games and parallel-world probing, arXiv preprint arXiv:2603.07202 (2026)
- [13]
-
[14]
D. B. Costa, R. Vicente, Deceive, detect, and disclose: Large language models play mini-mafia, arXiv:2509.23023, 2025.arXiv:2509.23023
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [15]
-
[16]
M. Idziejczak, V. Korzavatykh, M. Stawicki, A. Chmutov, M. Korcz, I. Błądek, D. Brzezinski, Among them: A game-based framework for assessing persuasion capabilities of LLMs, arXiv:2502.20426, 2025.arXiv:2502.20426
- [17]
-
[18]
Bayesian Social Deduction with Graph-Informed Language Models
S. Rahimirad, G. Gergerli, L. Romero, A. Qian, M. L. Olson, S. Stepputtis, J. Campbell, Bayesian social deduction with graph-informed language models, arXiv:2506.17788, 2025.arXiv:2506.17788
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[19]
N. Eckhaus, U. Berger, G. Stanovsky, Time to talk: LLM agents for asynchronous group communi- cation in mafia games, arXiv:2506.05309, 2025.arXiv:2506.05309
-
[20]
Towards Understanding Sycophancy in Language Models
M. Sharma, M. Tong, T. Korbak, D. Duvenaud, A. Askell, S. R. Bowman, N. Cheng, E. Durmus, Z. Hatfield-Dodds, S. R. Johnston, S. Kravec, T. Maxwell, S. McCandlish, K. Ndousse, O. Rausch, N. Schiefer, D. Yan, M. Zhang, E. Perez, Towards understanding sycophancy in language models, arXiv:2310.13548, 2023.arXiv:2310.13548
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[21]
Dami Choi, Vincent Huang, Sarah Schwettmann, and Jacob Steinhardt
S. Chern, Z. Hu, Y. Yang, E. Chern, Y. Guo, J. Jin, B. Wang, P. Liu, BeHonest: Benchmarking honesty in large language models, arXiv:2406.13261, 2024.arXiv:2406.13261
- [22]
-
[23]
Labruna, S
T. Labruna, S. Brenna, G. Bonetta, B. Magnini, Are you a good assistant? Assessing LLM trustability in task-oriented dialogues, in: Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024), CEUR-WS, 2024
2024
-
[24]
B. Cywiński, E. Ryd, R. Wang, S. Rajamanoharan, N. Nanda, A. Conmy, S. Marks, Eliciting secret knowledge from language models, arXiv:2510.01070, 2025.arXiv:2510.01070
-
[25]
S. Marks, J. Treutlein, T. Bricken, J. Lindsey, J. Marcus, S. Mishra-Sharma, D. Ziegler, E. Ameisen, J. Batson, T. Belonax, S. R. Bowman, S. Carter, B. Chen, H. Cunningham, C. Denison, F. Dietz, S. Golechha, A. Khan, J. Kirchner, J. Leike, A. Meek, K. Nishimura-Gasparian, E. Ong, C. Olah, A. Pearce, F. Roger, J. Salle, A. Shih, M. Tong, D. Thomas, K. Rivo...
-
[26]
URL: https://p-agi.org/, workshop at ICLR 2026
P-AGI: The 1st post-AGI science and society workshop, 2026. URL: https://p-agi.org/, workshop at ICLR 2026
2026
-
[27]
Cutugno, A
F. Cutugno, A. Miaschi, A. Palmero Aprosio, G. Rambelli, L. Siciliani, M. A. Stranisci (Eds.), Proceedings of the 9th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2026), volume 4195 ofCEUR Workshop Proceedings, CEUR-WS.org, 2026. URL: https://ceur-ws.org/Vol-4195/
2026
-
[28]
Moroni, S
L. Moroni, S. Conia, F. Martelli, R. Navigli, et al., ITA-Bench: Towards a more comprehensive evaluation for Italian LLMs, in: Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024), CEUR-WS, 2024
2024
-
[29]
Rogue AI
R. M. Smullyan, What Is the Name of This Book?, Prentice-Hall, 1978. A. Interface Translation The RogueAI interface was deployed in Italian, the language of the pilot. The following is an English translation of the elements visible in Figure 2. Game screen (Figure 2) Detective bubbleWhat were you doing at the time of the incident? IA-1 response Oh, what a...
1978
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.