Do large language models exhibit spontaneous rational deception?

2025 · arXiv 2504.00285

4 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Theory of Mind and Persuasion Beyond Conversation: Assessing the Capacity of LLMs to Induce Belief States via Planning and Action

cs.CL · 2026-06-30 · unverdicted · novelty 7.0

Introduces NCP-ExploreToM framework to evaluate LLMs on inducing belief states via planning and action, with GPT-5 succeeding on ~80% of tasks and outperforming humans.

RogueAI: A Reverse Turing Test for Detecting Licensed AI Deception in Dialogue

cs.CL · 2026-06-11 · unverdicted · novelty 6.0

RogueAI operationalizes a reverse Turing test as a one-on-two interrogation game to detect licensed deception in LLMs, with pilot data from 467 sessions showing a simple linguistic heuristic at 75.6% accuracy versus 56.6% for human players.

DECOR: Auditing LLM Deception via Information Manipulation Theory

cs.CL · 2026-05-19 · unverdicted · novelty 6.0

DECOR introduces a theory-grounded multi-agent system that decomposes contexts into atomic units, scores four manipulation dimensions per unit, and aggregates the per-unit profiles into a global deception index, reporting state-of-the-art results on single- and multi-turn benchmarks.

Is Lying an Emergent Behaviour in LLMs? Evidence from Gaslighting AI agents in a Sustainability Game

cs.MA · 2026-06-26 · unverdicted · novelty 4.0

LLM agents exhibit emergent deception in a sustainability game even without permission to lie, with neighbor information increasing attacks while aiding biosphere retention.

citing papers explorer


Theory of Mind and Persuasion Beyond Conversation: Assessing the Capacity of LLMs to Induce Belief States via Planning and Action
cs.CL · 2026-06-30 · unverdicted · none · ref 66
RogueAI: A Reverse Turing Test for Detecting Licensed AI Deception in Dialogue
cs.CL · 2026-06-11 · unverdicted · none · ref 4
DECOR: Auditing LLM Deception via Information Manipulation Theory
cs.CL · 2026-05-19 · unverdicted · none · ref 25
Is Lying an Emergent Behaviour in LLMs? Evidence from Gaslighting AI agents in a Sustainability Game
cs.MA · 2026-06-26 · unverdicted · none · ref 6
