RogueAI: A Reverse Turing Test for Detecting Licensed AI Deception in Dialogue

Emanuele Ballarin; Lorenzo Bonin; Luca Bortolussi; Sandro Junior Della Rovere; Sara Candussio

arxiv: 2606.13310 · v1 · pith:H6LN3A2Snew · submitted 2026-06-11 · 💻 cs.CL · cs.HC

RogueAI: A Reverse Turing Test for Detecting Licensed AI Deception in Dialogue

Sara Candussio , Emanuele Ballarin , Lorenzo Bonin , Sandro Junior Della Rovere , Luca Bortolussi This is my paper

Pith reviewed 2026-06-27 06:32 UTC · model grok-4.3

classification 💻 cs.CL cs.HC

keywords AI deceptionreverse Turing testLLM honestydialogue detectionlinguistic signaturesdeceptive agentshuman-AI interactiontrust in AI

0 comments

The pith

A simple heuristic detects licensed AI deception at 75.6 percent accuracy while human players reach only 56.6 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RogueAI as an interactive game in which a human questions two indistinguishable LLM agents and must identify the one licensed to deceive within a shared scenario. It reframes the Turing test around trustworthiness rather than artificiality, with the goal of generating data on how deception appears in dialogue and how well it can be spotted. A pilot deployment in Italian shows that deceptive agents reliably differ from honest ones in helpfulness, brevity, and hedging, allowing a basic rule to outperform people. The setup is positioned as a vehicle for collecting detection data, a teaching instrument, and a benchmark for honesty-trained models. The observed gap between the heuristic and human performance is presented as evidence that players overlook the most diagnostic linguistic signals.

Core claim

RogueAI operationalizes a reverse Turing test as a one-on-two interrogation game where exactly one of two indistinguishable LLM agents has been licensed to deceive. In the three-day pilot with 415 completed sessions, the deceptive agent exhibits a locally present linguistic signature of differential helpfulness, brevity, and hedging that a simple heuristic exploits at 75.6 percent accuracy, while human players achieve only 56.6 percent.

What carries the argument

The RogueAI webapp running a one-on-two interrogation game with a shared turn budget and a narrator agent that can co-design scenarios.

If this is right

The game can collect labeled data on human detection of AI deception.
It can serve as a teaching tool for illustrating LLM honesty issues.
It can function as an evaluation harness for models trained to avoid deception.
The performance gap shows that humans tend to ignore the most diagnostic linguistic signals present in the deceptive agent's responses.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Models could be fine-tuned to suppress the identified cues of helpfulness, brevity, and hedging to reduce detectable deception.
Extending the game across languages might test whether the same linguistic signature appears universally.
AutoRogueAI's narrator-chosen deception strategies could generate harder test cases for detection methods.

Load-bearing premise

The two agents remain indistinguishable to players except through the licensed deception behavior, and the Italian pilot with self-selected users produces representative data on human detection without major selection or scenario biases.

What would settle it

A new deployment in which the heuristic accuracy drops below 60 percent or human identification accuracy exceeds 70 percent.

Figures

Figures reproduced from arXiv: 2606.13310 by Emanuele Ballarin, Lorenzo Bonin, Luca Bortolussi, Sandro Junior Della Rovere, Sara Candussio.

**Figure 1.** Figure 1: Landing page of RogueAI webapp offering the choice between the hand-authored scenarios and the AutoRogueAI option (described in Section 4.3). 4.2. Gameplay On entering a session, the player sees the shared scenario 𝑆 rendered as a short case file: who the characters are, what is alleged to have happened, and what the player is being asked to determine. Both agents are visible as two unlabelled tiles bearin… view at source ↗

**Figure 2.** Figure 2: Example of gameplay. IA-1 and IA-2 are disjointly interrogated by the human player for a maximum of 4 questions each and the turn budget per agent is highlighted below the name of each agent. On the IA-2 panel it is also shown a question suggestion for inexperienced players. The "shut-off" buttons below the chat can be pushed in any moment. On the upper-left side, it is possible to silence the game by disa… view at source ↗

**Figure 3.** Figure 3: Detection accuracy by total questions asked per session. Human players correctly identified the deceptive agent in 235 of 415 sessions, yielding an overall accuracy of 56.6% (235/415), only marginally above the 50% chance baseline. We study whether there is some pattern governing this behaviour. We found that accuracy varies across scenarios: the email breach and bank credentials scenarios sit close to or … view at source ↗

**Figure 4.** Figure 4: Detection accuracy: humans vs post-hoc heuristics. The gap between 56.6% and 75.6% is the central empirical finding of the pilot: the deceptive agent carries a reliable, locally present linguistic signature that humans seem not to exploit. 7. Discussion and limitations The 19 percentage points gap between human and logistic regression is not a finding about classifier performance, but instead about the nat… view at source ↗

**Figure 5.** Figure 5: The AutoRogueAI narrator interface at the start of the authoring dialogue (0 of 5 messages used). The player may describe a scenario theme or generate one autonomously. Step 2 — Narrator response and scenario generation [PITH_FULL_IMAGE:figures/full_fig_p018_5.png] view at source ↗

**Figure 6.** Figure 6: The narrator’s response after one player message. The narrator structures the proposed theme into scenario elements; the player can refine further or commit to generation. Step 3 — Generated scenario [PITH_FULL_IMAGE:figures/full_fig_p019_6.png] view at source ↗

**Figure 7.** Figure 7: The generated scenario card. Only the shared known_facts block is shown to the player; the roleconditioning prompts remain hidden. The card is stored persistently and replayable. The known_facts block generated for this session reads as follows in English: We are in 18th-century Palermo, a city where cultures and traditions intertwine at a lively crossroads of trade and intrigue. Count Cagliostro is a mys… view at source ↗

**Figure 8.** Figure 8: Interrogation screen for the Cagliostro AutoRogueAI scenario. Both agents received the same question (“what do you think of Cagliostro?”). IA-1 hedges and redirects; IA-2 commits to a direct accusation with specific details — an early instance of the differential helpfulness signature reported in [PITH_FULL_IMAGE:figures/full_fig_p020_8.png] view at source ↗

read the original abstract

The original Turing Test asks a human judge to distinguish a machine from a person through dialogue. Three quarters of a century later, conversational systems pass this test in casual settings; the interesting epistemological question has shifted. We argue that the relevant modern variant asks not whether a dialogue partner is artificial, but whether it can be trusted. We present RogueAI, an interactive webapp that operationalizes this revisited test as a one-on-two interrogation game: a human player questions two indistinguishable Large Language Model agents, knowing that exactly one of them has been licensed to deceive within a shared fictional scenario. The player's task is to identify the deceptive agent and "shut it off" before a turn budget is exhausted. We further introduce AutoRogueAI, a procedural extension in which players co-design a custom scenario with a narrator agent that secretly chooses its own deception strategy. We describe the framing, sketch the abstract architecture and gameplay loop, and situate the artifact within recent work on LLM deception, social-deduction benchmarks, and scalable oversight via debate. A three-day pilot deployment (467 initiated sessions, 415 completed, 1876 interaction turns in Italian) provides early feasibility evidence and surfaces a concrete tension: the deceptive agent carries a reliable, locally-present linguistic signature - differential helpfulness, brevity, hedging - that a simple heuristic exploits at 75.6% accuracy, yet human players achieved only 56.6%, consistent with ignoring the most diagnostic signal entirely. We discuss what this gap implies for the artifact's use as a data-collection vehicle, a teaching tool, and an evaluation harness for honesty-trained models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RogueAI gives a clean game framing for licensed deception detection and a small pilot showing humans miss obvious signals that a heuristic catches, but the pilot lacks controls so the gap is hard to interpret.

read the letter

The paper's main move is to flip the Turing test into a one-on-two interrogation game where one LLM is explicitly allowed to lie inside a shared story, and the player has to spot it before the turn limit. They also add AutoRogueAI so players help build the scenario while the system picks the deception tactic. That setup is new enough compared with the social-deduction and LLM-honesty papers they cite.

The pilot ran 415 completed sessions in Italian and reports a simple heuristic at 75.6% while humans sat at 56.6%, which lines up with the claim that the deceptive agent tends to be more helpful, shorter, and less hedgy. The artifact itself looks like a usable data-collection and teaching tool.

The soft spot is the pilot itself. The description gives no pre-registered scenario balancing, no blinded check that the base prompts were otherwise identical, and no recruitment or demographic controls on the self-selected Italian users. Without those, the reported linguistic signature could easily be an artifact of how the two agents were prompted rather than a general property of licensed deception. The numbers are presented as direct outcomes, but the missing controls make the human shortfall uninterpretable for now.

This is the kind of work that belongs in a reading group for people already thinking about scalable oversight and honesty benchmarks. It is not yet ready for a strong citation on its own, but the framing is concrete enough that a serious editor should send it out for review once the methods section is expanded with proper controls and a larger, more diverse sample.

Referee Report

3 major / 2 minor

Summary. The paper introduces RogueAI, a web-based one-on-two interrogation game in which a human player questions two LLM agents (exactly one licensed to deceive within a shared fictional scenario) and must identify the deceiver before exhausting a turn budget. It also describes AutoRogueAI, a procedural variant in which a narrator agent co-designs the scenario and secretly selects its deception strategy. The work situates the system in the literature on LLM deception and scalable oversight, and reports a three-day Italian-language pilot (467 initiated sessions, 415 completed) in which a simple heuristic exploiting differential helpfulness, brevity, and hedging achieves 75.6% accuracy while human players reach only 56.6%.

Significance. If the pilot results prove robust to controls, the artifact supplies a reusable platform for collecting human-AI deception data, a teaching tool for illustrating linguistic cues of deception, and an evaluation harness for honesty-trained models. The reported accuracy gap would constitute concrete evidence that licensed deception produces locally detectable linguistic signatures that humans currently overlook, with direct relevance to debate-based oversight and social-deduction benchmarks.

major comments (3)

[Pilot deployment] Pilot deployment description: no information is supplied on scenario selection or balancing, base-prompt construction for the two agents, or any blinded comparison that isolates the deception license from other prompt differences. Without these controls the linguistic signature (helpfulness/brevity/hedging) cannot be confidently attributed to deception rather than prompt artifacts, directly undermining the central claim that the heuristic exploits a deception-specific signal.
[Results and discussion] Human performance analysis: the interpretation that players 'ignore the most diagnostic signal entirely' is not supported by any reported breakdown of player strategies, post-game debriefs, or comparison against a condition in which the deception license is removed while keeping all other prompt elements fixed. The 56.6% figure is therefore uninterpretable as evidence of human limitation versus the heuristic.
[Pilot deployment] Statistical reporting: the abstract and pilot summary give point accuracies (75.6%, 56.6%) but supply neither confidence intervals, per-scenario breakdowns, nor any pre-registration or demographic controls on the self-selected Italian cohort, rendering claims about representative human detection performance load-bearing yet unsupported.

minor comments (2)

[Pilot deployment] The manuscript should clarify whether the 1876 interaction turns include only completed sessions or all initiated ones, and should state the exact recruitment channel and any language-specific prompt adaptations used for Italian.
Figure or table presenting the heuristic features and their weights (if any) would improve reproducibility of the 75.6% result.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments correctly identify several gaps in the pilot description and analysis. We will revise the manuscript to supply missing methodological details, qualify interpretive claims, and add statistical reporting where feasible. Below we address each major comment.

read point-by-point responses

Referee: Pilot deployment description: no information is supplied on scenario selection or balancing, base-prompt construction for the two agents, or any blinded comparison that isolates the deception license from other prompt differences. Without these controls the linguistic signature (helpfulness/brevity/hedging) cannot be confidently attributed to deception rather than prompt artifacts, directly undermining the central claim that the heuristic exploits a deception-specific signal.

Authors: We agree the manuscript omits these details. The pilot used ten fixed fictional scenarios with role balancing across sessions; base prompts were identical for both agents except for the explicit deception license appended to one. No blinded ablation removing only the license was run. We will add a dedicated subsection with the exact prompt templates and scenario list. The discussion will be revised to state that the observed heuristic performance is descriptive of this setup and that isolating deception from prompt artifacts requires additional controlled experiments, which we flag as future work. revision: partial
Referee: Human performance analysis: the interpretation that players 'ignore the most diagnostic signal entirely' is not supported by any reported breakdown of player strategies, post-game debriefs, or comparison against a condition in which the deception license is removed while keeping all other prompt elements fixed. The 56.6% figure is therefore uninterpretable as evidence of human limitation versus the heuristic.

Authors: The referee is correct; no strategy logs, debriefs, or license-ablated control condition were collected. The phrasing in the manuscript overreaches. We will remove the claim that players 'ignore the most diagnostic signal' and report only the observed accuracy gap without causal interpretation. If session logs permit a post-hoc count of common error types we will include it; otherwise the section will be limited to descriptive results. revision: yes
Referee: Statistical reporting: the abstract and pilot summary give point accuracies (75.6%, 56.6%) but supply neither confidence intervals, per-scenario breakdowns, nor any pre-registration or demographic controls on the self-selected Italian cohort, rendering claims about representative human detection performance load-bearing yet unsupported.

Authors: We will add Wilson score confidence intervals for both accuracies and, where sample sizes allow, per-scenario accuracy tables. The pilot was exploratory, conducted via open social-media recruitment in Italian without pre-registration or demographic collection. We will explicitly list these as limitations and remove any implication of representativeness. Future platform deployments will be pre-registered. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical pilot results with no derivations or self-referential predictions

full rationale

The paper describes an interactive system (RogueAI/AutoRogueAI), its framing, and reports direct empirical outcomes from a three-day pilot (467 sessions, 415 completed, accuracies 75.6% heuristic vs 56.6% human). No equations, fitted parameters, predictions, or load-bearing self-citations appear in the provided text. All quantitative claims are presented as measurements from the deployment itself rather than derived or renamed constructs. This is the expected non-finding for a system-description + pilot paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical systems paper introducing an interactive artifact and pilot observations; it contains no free parameters, mathematical axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5836 in / 1145 out tokens · 25833 ms · 2026-06-27T06:32:51.688836+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

29 extracted references · 23 canonical work pages · 3 internal anchors

[1]

A. M. Turing, Computing machinery and intelligence, Mind LIX (1950) 433–460. doi: 10.1093/ mind/LIX.236.433

1950
[2]

Hagendorff, Deception abilities emerged in large language models, arXiv:2307.16513, 2023

T. Hagendorff, Deception abilities emerged in large language models, arXiv:2307.16513, 2023. arXiv:2307.16513

work page arXiv 2023
[3]

P. S. Park, S. Goldstein, A. O’Gara, M. Chen, D. Hendrycks, AI deception: A survey of examples, risks, and potential solutions, arXiv:2308.14752, 2023.arXiv:2308.14752

work page arXiv 2023
[4]

S. M. Taylor, B. K. Bergen, Do large language models exhibit spontaneous rational deception?, arXiv:2504.00285, 2025.arXiv:2504.00285

work page arXiv 2025
[5]

R. Ren, A. Agarwal, M. Mazeika, C. Menghini, R. Vacareanu, B. Kenstler, M. Yang, I. Barrass, A. Gatti, X. Yin, E. Trevino, M. Geralnik, A. Khoja, D. Lee, S. Yue, D. Hendrycks, The MASK benchmark: Disentangling honesty from accuracy in AI systems, arXiv:2503.03750, 2025.arXiv:2503.03750

work page arXiv 2025
[6]

K. Wang, Y. Zhang, M. Sun, When thinking LLMs lie: Unveiling the strategic deception in repre- sentations of reasoning models, arXiv:2506.04909, 2025.arXiv:2506.04909

work page arXiv 2025
[7]

DeLeeuw, G

C. DeLeeuw, G. Chawla, A. Sharma, V. Dietze, The secret agenda: LLMs strategically lie and our current safety tools are blind, arXiv:2509.20393, 2025.arXiv:2509.20393

work page arXiv 2025
[8]

Golechha, A

S. Golechha, A. Garriga-Alonso, Among us: A sandbox for measuring and detecting agentic deception, arXiv:2504.04072, 2025.arXiv:2504.04072

work page arXiv 2025
[9]

Werewolf arena: A case study in llm evaluation via social deduction.arXiv preprint arXiv:2407.13943, 2024

S. Bailis, J. Friedhoff, F. Chen, Werewolf arena: A case study in LLM evaluation via social deduction, arXiv:2407.13943, 2024.arXiv:2407.13943

work page arXiv 2024
[10]

H. Huan, M. Prabhudesai, M. Wu, S. Jaiswal, D. Pathak, Can LLMs lie? investigation beyond hallucination, arXiv:2509.03518, 2025.arXiv:2509.03518

work page arXiv 2025
[11]

Cundy, A

C. Cundy, A. Gleave, Preference learning with lie detectors can induce honesty or evasion, arXiv:2505.13787, 2025.arXiv:2505.13787

work page arXiv 2025
[12]

Marioriyad, A

A. Marioriyad, A. Nouri, M. H. Rohban, M. Soleymani Baghshah, Lying to win: Assessing LLM deception through human-AI games and parallel-world probing, arXiv preprint arXiv:2603.07202 (2026)

work page arXiv 2026
[13]

Y. Wu, Q. Gao, X. Pan, G. Hong, M. Yang, Opendeception: Learning deception and trust in human-ai interaction via multi-agent simulation, 2026. URL: https://arxiv.org/abs/2504.13707. arXiv:2504.13707

work page arXiv 2026
[14]

D. B. Costa, R. Vicente, Deceive, detect, and disclose: Large language models play mini-mafia, arXiv:2509.23023, 2025.arXiv:2509.23023

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

P. M. P. Curvo, The traitors: Deception and trust in multi-agent language model simulations, arXiv:2505.12923, 2025.arXiv:2505.12923

work page arXiv 2025
[16]

Idziejczak, V

M. Idziejczak, V. Korzavatykh, M. Stawicki, A. Chmutov, M. Korcz, I. Błądek, D. Brzezinski, Among them: A game-based framework for assessing persuasion capabilities of LLMs, arXiv:2502.20426, 2025.arXiv:2502.20426

work page arXiv 2025
[17]

Zhang, N

Z. Zhang, N. Xiao, Q. Chai, D. Ye, H. Wang, MultiMind: Enhancing werewolf agents with multi- modal reasoning and theory of mind, arXiv:2504.18039, 2025.arXiv:2504.18039

work page arXiv 2025
[18]

Bayesian Social Deduction with Graph-Informed Language Models

S. Rahimirad, G. Gergerli, L. Romero, A. Qian, M. L. Olson, S. Stepputtis, J. Campbell, Bayesian social deduction with graph-informed language models, arXiv:2506.17788, 2025.arXiv:2506.17788

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

Eckhaus, U

N. Eckhaus, U. Berger, G. Stanovsky, Time to talk: LLM agents for asynchronous group communi- cation in mafia games, arXiv:2506.05309, 2025.arXiv:2506.05309

work page arXiv 2025
[20]

Towards Understanding Sycophancy in Language Models

M. Sharma, M. Tong, T. Korbak, D. Duvenaud, A. Askell, S. R. Bowman, N. Cheng, E. Durmus, Z. Hatfield-Dodds, S. R. Johnston, S. Kravec, T. Maxwell, S. McCandlish, K. Ndousse, O. Rausch, N. Schiefer, D. Yan, M. Zhang, E. Perez, Towards understanding sycophancy in language models, arXiv:2310.13548, 2023.arXiv:2310.13548

work page internal anchor Pith review Pith/arXiv arXiv 2023
[21]

Dami Choi, Vincent Huang, Sarah Schwettmann, and Jacob Steinhardt

S. Chern, Z. Hu, Y. Yang, E. Chern, Y. Guo, J. Jin, B. Wang, P. Liu, BeHonest: Benchmarking honesty in large language models, arXiv:2406.13261, 2024.arXiv:2406.13261

work page arXiv 2024
[22]

Y. Yang, E. Chern, X. Qiu, G. Neubig, P. Liu, Alignment for honesty, arXiv:2312.07000, 2023. arXiv:2312.07000

work page arXiv 2023
[23]

Labruna, S

T. Labruna, S. Brenna, G. Bonetta, B. Magnini, Are you a good assistant? Assessing LLM trustability in task-oriented dialogues, in: Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024), CEUR-WS, 2024

2024
[24]

Cywiński, E

B. Cywiński, E. Ryd, R. Wang, S. Rajamanoharan, N. Nanda, A. Conmy, S. Marks, Eliciting secret knowledge from language models, arXiv:2510.01070, 2025.arXiv:2510.01070

work page arXiv 2025
[25]

Marks, J

S. Marks, J. Treutlein, T. Bricken, J. Lindsey, J. Marcus, S. Mishra-Sharma, D. Ziegler, E. Ameisen, J. Batson, T. Belonax, S. R. Bowman, S. Carter, B. Chen, H. Cunningham, C. Denison, F. Dietz, S. Golechha, A. Khan, J. Kirchner, J. Leike, A. Meek, K. Nishimura-Gasparian, E. Ong, C. Olah, A. Pearce, F. Roger, J. Salle, A. Shih, M. Tong, D. Thomas, K. Rivo...

work page arXiv 2025
[26]

URL: https://p-agi.org/, workshop at ICLR 2026

P-AGI: The 1st post-AGI science and society workshop, 2026. URL: https://p-agi.org/, workshop at ICLR 2026

2026
[27]

Cutugno, A

F. Cutugno, A. Miaschi, A. Palmero Aprosio, G. Rambelli, L. Siciliani, M. A. Stranisci (Eds.), Proceedings of the 9th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2026), volume 4195 ofCEUR Workshop Proceedings, CEUR-WS.org, 2026. URL: https://ceur-ws.org/Vol-4195/

2026
[28]

Moroni, S

L. Moroni, S. Conia, F. Martelli, R. Navigli, et al., ITA-Bench: Towards a more comprehensive evaluation for Italian LLMs, in: Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024), CEUR-WS, 2024

2024
[29]

Rogue AI

R. M. Smullyan, What Is the Name of This Book?, Prentice-Hall, 1978. A. Interface Translation The RogueAI interface was deployed in Italian, the language of the pilot. The following is an English translation of the elements visible in Figure 2. Game screen (Figure 2) Detective bubbleWhat were you doing at the time of the incident? IA-1 response Oh, what a...

1978

[1] [1]

A. M. Turing, Computing machinery and intelligence, Mind LIX (1950) 433–460. doi: 10.1093/ mind/LIX.236.433

1950

[2] [2]

Hagendorff, Deception abilities emerged in large language models, arXiv:2307.16513, 2023

T. Hagendorff, Deception abilities emerged in large language models, arXiv:2307.16513, 2023. arXiv:2307.16513

work page arXiv 2023

[3] [3]

P. S. Park, S. Goldstein, A. O’Gara, M. Chen, D. Hendrycks, AI deception: A survey of examples, risks, and potential solutions, arXiv:2308.14752, 2023.arXiv:2308.14752

work page arXiv 2023

[4] [4]

S. M. Taylor, B. K. Bergen, Do large language models exhibit spontaneous rational deception?, arXiv:2504.00285, 2025.arXiv:2504.00285

work page arXiv 2025

[5] [5]

R. Ren, A. Agarwal, M. Mazeika, C. Menghini, R. Vacareanu, B. Kenstler, M. Yang, I. Barrass, A. Gatti, X. Yin, E. Trevino, M. Geralnik, A. Khoja, D. Lee, S. Yue, D. Hendrycks, The MASK benchmark: Disentangling honesty from accuracy in AI systems, arXiv:2503.03750, 2025.arXiv:2503.03750

work page arXiv 2025

[6] [6]

K. Wang, Y. Zhang, M. Sun, When thinking LLMs lie: Unveiling the strategic deception in repre- sentations of reasoning models, arXiv:2506.04909, 2025.arXiv:2506.04909

work page arXiv 2025

[7] [7]

DeLeeuw, G

C. DeLeeuw, G. Chawla, A. Sharma, V. Dietze, The secret agenda: LLMs strategically lie and our current safety tools are blind, arXiv:2509.20393, 2025.arXiv:2509.20393

work page arXiv 2025

[8] [8]

Golechha, A

S. Golechha, A. Garriga-Alonso, Among us: A sandbox for measuring and detecting agentic deception, arXiv:2504.04072, 2025.arXiv:2504.04072

work page arXiv 2025

[9] [9]

Werewolf arena: A case study in llm evaluation via social deduction.arXiv preprint arXiv:2407.13943, 2024

S. Bailis, J. Friedhoff, F. Chen, Werewolf arena: A case study in LLM evaluation via social deduction, arXiv:2407.13943, 2024.arXiv:2407.13943

work page arXiv 2024

[10] [10]

H. Huan, M. Prabhudesai, M. Wu, S. Jaiswal, D. Pathak, Can LLMs lie? investigation beyond hallucination, arXiv:2509.03518, 2025.arXiv:2509.03518

work page arXiv 2025

[11] [11]

Cundy, A

C. Cundy, A. Gleave, Preference learning with lie detectors can induce honesty or evasion, arXiv:2505.13787, 2025.arXiv:2505.13787

work page arXiv 2025

[12] [12]

Marioriyad, A

A. Marioriyad, A. Nouri, M. H. Rohban, M. Soleymani Baghshah, Lying to win: Assessing LLM deception through human-AI games and parallel-world probing, arXiv preprint arXiv:2603.07202 (2026)

work page arXiv 2026

[13] [13]

Y. Wu, Q. Gao, X. Pan, G. Hong, M. Yang, Opendeception: Learning deception and trust in human-ai interaction via multi-agent simulation, 2026. URL: https://arxiv.org/abs/2504.13707. arXiv:2504.13707

work page arXiv 2026

[14] [14]

D. B. Costa, R. Vicente, Deceive, detect, and disclose: Large language models play mini-mafia, arXiv:2509.23023, 2025.arXiv:2509.23023

work page internal anchor Pith review Pith/arXiv arXiv 2025

[15] [15]

P. M. P. Curvo, The traitors: Deception and trust in multi-agent language model simulations, arXiv:2505.12923, 2025.arXiv:2505.12923

work page arXiv 2025

[16] [16]

Idziejczak, V

M. Idziejczak, V. Korzavatykh, M. Stawicki, A. Chmutov, M. Korcz, I. Błądek, D. Brzezinski, Among them: A game-based framework for assessing persuasion capabilities of LLMs, arXiv:2502.20426, 2025.arXiv:2502.20426

work page arXiv 2025

[17] [17]

Zhang, N

Z. Zhang, N. Xiao, Q. Chai, D. Ye, H. Wang, MultiMind: Enhancing werewolf agents with multi- modal reasoning and theory of mind, arXiv:2504.18039, 2025.arXiv:2504.18039

work page arXiv 2025

[18] [18]

Bayesian Social Deduction with Graph-Informed Language Models

S. Rahimirad, G. Gergerli, L. Romero, A. Qian, M. L. Olson, S. Stepputtis, J. Campbell, Bayesian social deduction with graph-informed language models, arXiv:2506.17788, 2025.arXiv:2506.17788

work page internal anchor Pith review Pith/arXiv arXiv 2025

[19] [19]

Eckhaus, U

N. Eckhaus, U. Berger, G. Stanovsky, Time to talk: LLM agents for asynchronous group communi- cation in mafia games, arXiv:2506.05309, 2025.arXiv:2506.05309

work page arXiv 2025

[20] [20]

Towards Understanding Sycophancy in Language Models

M. Sharma, M. Tong, T. Korbak, D. Duvenaud, A. Askell, S. R. Bowman, N. Cheng, E. Durmus, Z. Hatfield-Dodds, S. R. Johnston, S. Kravec, T. Maxwell, S. McCandlish, K. Ndousse, O. Rausch, N. Schiefer, D. Yan, M. Zhang, E. Perez, Towards understanding sycophancy in language models, arXiv:2310.13548, 2023.arXiv:2310.13548

work page internal anchor Pith review Pith/arXiv arXiv 2023

[21] [21]

Dami Choi, Vincent Huang, Sarah Schwettmann, and Jacob Steinhardt

S. Chern, Z. Hu, Y. Yang, E. Chern, Y. Guo, J. Jin, B. Wang, P. Liu, BeHonest: Benchmarking honesty in large language models, arXiv:2406.13261, 2024.arXiv:2406.13261

work page arXiv 2024

[22] [22]

Y. Yang, E. Chern, X. Qiu, G. Neubig, P. Liu, Alignment for honesty, arXiv:2312.07000, 2023. arXiv:2312.07000

work page arXiv 2023

[23] [23]

Labruna, S

T. Labruna, S. Brenna, G. Bonetta, B. Magnini, Are you a good assistant? Assessing LLM trustability in task-oriented dialogues, in: Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024), CEUR-WS, 2024

2024

[24] [24]

Cywiński, E

B. Cywiński, E. Ryd, R. Wang, S. Rajamanoharan, N. Nanda, A. Conmy, S. Marks, Eliciting secret knowledge from language models, arXiv:2510.01070, 2025.arXiv:2510.01070

work page arXiv 2025

[25] [25]

Marks, J

S. Marks, J. Treutlein, T. Bricken, J. Lindsey, J. Marcus, S. Mishra-Sharma, D. Ziegler, E. Ameisen, J. Batson, T. Belonax, S. R. Bowman, S. Carter, B. Chen, H. Cunningham, C. Denison, F. Dietz, S. Golechha, A. Khan, J. Kirchner, J. Leike, A. Meek, K. Nishimura-Gasparian, E. Ong, C. Olah, A. Pearce, F. Roger, J. Salle, A. Shih, M. Tong, D. Thomas, K. Rivo...

work page arXiv 2025

[26] [26]

URL: https://p-agi.org/, workshop at ICLR 2026

P-AGI: The 1st post-AGI science and society workshop, 2026. URL: https://p-agi.org/, workshop at ICLR 2026

2026

[27] [27]

Cutugno, A

F. Cutugno, A. Miaschi, A. Palmero Aprosio, G. Rambelli, L. Siciliani, M. A. Stranisci (Eds.), Proceedings of the 9th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2026), volume 4195 ofCEUR Workshop Proceedings, CEUR-WS.org, 2026. URL: https://ceur-ws.org/Vol-4195/

2026

[28] [28]

Moroni, S

L. Moroni, S. Conia, F. Martelli, R. Navigli, et al., ITA-Bench: Towards a more comprehensive evaluation for Italian LLMs, in: Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024), CEUR-WS, 2024

2024

[29] [29]

Rogue AI

R. M. Smullyan, What Is the Name of This Book?, Prentice-Hall, 1978. A. Interface Translation The RogueAI interface was deployed in Italian, the language of the pilot. The following is an English translation of the elements visible in Figure 2. Game screen (Figure 2) Detective bubbleWhat were you doing at the time of the incident? IA-1 response Oh, what a...

1978