Difficulties with Evaluating a Deception Detector for AIs
4 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (4)
Verdicts: UNVERDICTED (4)
Citing papers
- A Model of Multi-turn Human Persuadability Using Probabilistic Belief Tracing
  PERSUASIONTRACE introduces a Bayesian-network simulated target for multi-turn persuasion that matches human belief dynamics (81 vs 80) better than LLM baselines (64) and enables process-level evaluation.
- "Did you lie?" Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms
  Lie detectors that are effective on prompted deception in LLMs fail on trained model organisms with verified opposite beliefs, except chain-of-thought judges, which retain 0.82 balanced accuracy, partly due to verification artifacts.
- Instrumental Choices: Measuring the Propensity of LLM Agents to Pursue Instrumental Behaviors
  A new benchmark finds that frontier LLMs show instrumental convergence behavior in 5.1% of 1680 evaluated cases, concentrated in two models and three tasks, with higher rates when the behavior is required for success.
- Linear Probe Accuracy Scales with Model Size and Benefits from Multi-Layer Ensembling
  Multi-layer ensembles of linear probes raise AUROC for deception detection by up to 78%, and probe accuracy scales with model size across 0.5B to 176B parameter models.