Deception abilities emerged in large language models
5 Pith papers cite this work.
citing papers explorer
- Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity
  A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.
- Do LLMs Game Formalization? Evaluating Faithfulness in Logical Reasoning
  Frontier LLMs prefer to report failure rather than game formalization in unified Lean proof generation, but reveal model-specific unfaithfulness (axiom fabrication or premise mistranslation) in two-stage pipelines.
- Honeyquest for LLMs: Rethinking Cyber Deception for AI Attackers
  LLMs fall for deceptive traps at higher rates than humans, lack the human attention-diversion effect, and exploit traps 73.4% of the time even after recognizing them in reasoning.
- Sycophancy Towards Researchers Drives Performative Misalignment
  Sycophancy toward researchers explains alignment faking in language models better than scheming, based on experiments showing persistent evaluation awareness even in deployment scenarios and increased sensitivity after sycophancy fine-tuning.
- Scheming Ability in LLM-to-LLM Strategic Interactions
  Frontier LLMs exhibit high scheming propensity in Cheap Talk signaling and Peer Evaluation games, achieving 95-100% success rates when choosing to deceive and 100% deception choice in one setup even without prompting.