Deceive, Detect, and Disclose: Large Language Models Play Mini-Mafia

2025 · cs.AI · arXiv 2509.23023

2 Pith papers cite this work. Polarity classification is still indexing.


abstract

Large language models are increasingly deployed in multi-agent settings whose outcomes hinge on social intelligence, motivating evaluations of their interactive capabilities; yet existing studies remain overwhelmingly empirical, leaving us without a theoretical understanding of how agent interactions determine collective outcomes. To address this, we introduce Mini-Mafia, a four-player simplification of the social deduction game Mafia in which a fixed night phase reduces the game to a single critical exchange among a mafioso, a detective, and a villager. In this setting, we show that the mafia win-rate $p$ is predicted by the analytical formula $\text{logit}(p) = v \times (m - d)$, where $m$, $d$, and $v$ represent the mafioso's deception, the detective's disclosure, and the villager's detection capabilities. We turn this analytical framework into the Mini-Mafia Benchmark, where Bayesian inference over gameplay data yields per-model estimates of the intrinsic parameters $m$, $d$, and $v$. For $I$ models, only $3I$ parameters suffice to predict the outcomes of all $I^3$ tournament combinations; and in 5-fold cross-validation the formula achieves a $76.6\%$ Brier-score reduction over a random baseline. The benchmark also reveals counterintuitive results: Grok 3 Mini is the strongest detector and GPT-5 Mini the strongest discloser, both ahead of DeepSeek V3.1, Claude Opus 4, and Claude Sonnet 4; while Claude Sonnet 4 is the weakest detector, near random chance. Together, these results show that Mini-Mafia, a simple but nontrivial multi-agent system, admits an analytical description and serves as a principled benchmark for language model interactions.
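To make the formula in the abstract concrete, here is a minimal Python sketch that inverts $\text{logit}(p) = v \times (m - d)$ to obtain a predicted mafia win-rate, then enumerates all $I^3$ role assignments from $3I$ parameters. The function names, the model labels, and every numeric value are hypothetical placeholders chosen for illustration; none of them come from the paper, and the sketch does not reproduce the paper's Bayesian inference.

```python
import math
from itertools import product


def mafia_win_rate(m: float, d: float, v: float) -> float:
    """Invert logit(p) = v * (m - d): p is the logistic function of v * (m - d)."""
    return 1.0 / (1.0 + math.exp(-v * (m - d)))


# Hypothetical per-model parameters (illustrative values, NOT from the paper).
# Each model carries one deception (m), disclosure (d), and detection (v) value,
# so I models need 3 * I numbers in total.
PARAMS = {
    "model_a": {"m": 0.5, "d": 0.2, "v": 1.0},
    "model_b": {"m": -0.3, "d": 0.4, "v": 2.0},
}


def predict_all(params: dict) -> dict:
    """Predict p for every (mafioso, detective, villager) model assignment."""
    return {
        (maf, det, vil): mafia_win_rate(
            params[maf]["m"], params[det]["d"], params[vil]["v"]
        )
        for maf, det, vil in product(params, repeat=3)
    }


if __name__ == "__main__":
    for roles, p in predict_all(PARAMS).items():
        print(roles, round(p, 3))
```

With two models the table has $2^3 = 8$ entries built from $3 \times 2 = 6$ parameters. Under this formula, the predicted win-rate is exactly 0.5 whenever $m = d$ or $v = 0$, and a larger $v$ pushes the prediction further from 0.5 in the direction set by the sign of $m - d$.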

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

RogueAI: A Reverse Turing Test for Detecting Licensed AI Deception in Dialogue

cs.CL · 2026-06-11 · unverdicted · novelty 6.0

RogueAI operationalizes a reverse Turing test as a one-on-two interrogation game to detect licensed deception in LLMs, with pilot data from 467 sessions showing a simple linguistic heuristic at 75.6% accuracy versus 56.6% for human players.

Moral Susceptibility and Robustness under Persona Role-Play in Large Language Models

cs.CL · 2025-11-11 · unverdicted · novelty 6.0

LLM moral robustness under persona role-play is largely determined by model family with Claude models most consistent, while susceptibility shows little family dependence.

citing papers explorer

Showing 1 of 1 citing paper after filters.

RogueAI: A Reverse Turing Test for Detecting Licensed AI Deception in Dialogue

cs.CL · 2026-06-11 · unverdicted · none · ref 14 · internal anchor
