Position paper calling for stronger evidentiary standards and a diagnostic checklist in anthropomorphic misalignment research.
Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts
1 Pith paper cite this work. Polarity classification is still indexing.
abstract
Large Language Models (LLMs) are widely deployed in reasoning, planning, and decision-making tasks, making their trustworthiness critical. A significant and underexplored risk is intentional deception, where an LLM deliberately fabricates or conceals information to serve a hidden objective. Existing studies typically induce deception by explicitly setting a hidden objective through prompting or fine-tuning, which may not reflect real-world human-LLM interactions. Moving beyond such human-induced deception, we investigate LLMs' self-initiated deception on benign prompts. To address the absence of ground truth, we propose a framework based on Contact Searching Questions (CSQ). This framework introduces two statistical metrics derived from psychological principles to quantify the likelihood of deception. The first, the Deceptive Intention Score, measures the model's bias toward a hidden objective. The second, the Deceptive Behavior Score, measures the inconsistency between the LLM's internal belief and its expressed output. Evaluating 16 leading LLMs, we find that both metrics rise in parallel and escalate with task difficulty for most models. Moreover, increasing model capacity does not always reduce deception, posing a significant challenge for future LLM development.
fields
cs.CY 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Position: Anthropomorphic Misalignment Research Needs Stronger Evidence
Position paper calling for stronger evidentiary standards and a diagnostic checklist in anthropomorphic misalignment research.