pith. sign in

Deceive, Detect, and Disclose: Large Language Models Play Mini-Mafia

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it
abstract

Large language models are increasingly deployed in multi-agent settings whose outcomes hinge on social intelligence, motivating evaluations of their interactive capabilities; yet existing studies remain overwhelmingly empirical, leaving us without a theoretical understanding of how agent interactions determine collective outcomes. To address this, we introduce \textit{Mini-Mafia}, a four-player simplification of the social deduction game Mafia in which a fixed night phase reduces the game to a single critical exchange among a mafioso, a detective, and a villager. In this setting, we show that the mafia win-rate $p$ is predicted by the analytical formula $\text{logit}(p) = v \times (m - d)$, where $m$, $d$, and $v$ represent the mafioso's deception, the detective's disclosure, and the villager's detection capabilities. We turn this analytical framework into the \textit{Mini-Mafia Benchmark}, where Bayesian inference over gameplay data yields per-model estimates of the intrinsic parameters $m$, $d$, and $v$. For $I$ models, only $3I$ parameters suffice to predict the outcomes of all $I^3$ tournament combinations; and in 5-fold cross-validation the formula achieves a $76.6\%$ Brier-score reduction over a random baseline. The benchmark also reveals counterintuitive results: Grok 3 Mini is the strongest detector and GPT-5 Mini the strongest discloser, both ahead of DeepSeek V3.1, Claude Opus 4, and Claude Sonnet 4; while Claude Sonnet 4 is the weakest detector, near random chance. Together, these results show that Mini-Mafia, a simple but nontrivial multi-agent system, admits an analytical description and serves as a principled benchmark for language model interactions.

citation-role summary

background 1

citation-polarity summary

fields

cs.CL 2

years

2026 1 2025 1

verdicts

UNVERDICTED 2

roles

background 1

polarities

background 1

clear filters

representative citing papers

citing papers explorer

Showing 1 of 1 citing paper after filters.

  • RogueAI: A Reverse Turing Test for Detecting Licensed AI Deception in Dialogue cs.CL · 2026-06-11 · unverdicted · none · ref 14 · internal anchor

    RogueAI operationalizes a reverse Turing test as a one-on-two interrogation game to detect licensed deception in LLMs, with pilot data from 467 sessions showing a simple linguistic heuristic at 75.6% accuracy versus 56.6% for human players.