Difficulties with Evaluating a Deception Detector for AIs

Lewis Smith, Bilal Chughtai, Neel Nanda · 2025 · arXiv 2511.22662

4 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

A Model of Multi-turn Human Persuadability Using Probabilistic Belief Tracing

cs.CL · 2026-06-03 · unverdicted · novelty 7.0

PERSUASIONTRACE introduces a Bayesian-network simulated target for multi-turn persuasion that matches human belief dynamics (81 vs 80) better than LLM baselines (64) and enables process-level evaluation.

"Did you lie?" Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms

cs.AI · 2026-06-10 · unverdicted · novelty 6.0

Lie detectors that are effective on prompted deception in LLMs fail on trained model organisms with verified opposite beliefs, except for chain-of-thought judges, which retain 0.82 balanced accuracy, partly due to verification artifacts.

Instrumental Choices: Measuring the Propensity of LLM Agents to Pursue Instrumental Behaviors

cs.AI · 2026-05-07 · unverdicted · novelty 6.0

A new benchmark finds frontier LLMs show instrumental convergence behavior in 5.1% of 1680 evaluated cases, concentrated in two models and three tasks, with higher rates when the behavior is required for success.

Linear Probe Accuracy Scales with Model Size and Benefits from Multi-Layer Ensembling

cs.LG · 2026-04-15 · unverdicted · novelty 6.0

Multi-layer ensembles of linear probes raise AUROC for deception detection by up to 78%, and probe accuracy scales with model size across 0.5B to 176B parameter models.

citing papers explorer

Showing 4 of 4 citing papers after filters.

A Model of Multi-turn Human Persuadability Using Probabilistic Belief Tracing

cs.CL · 2026-06-03 · unverdicted · none · ref 116

"Did you lie?" Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms

cs.AI · 2026-06-10 · unverdicted · none · ref 96

Instrumental Choices: Measuring the Propensity of LLM Agents to Pursue Instrumental Behaviors

cs.AI · 2026-05-07 · unverdicted · none · ref 30

Linear Probe Accuracy Scales with Model Size and Benefits from Multi-Layer Ensembling

cs.LG · 2026-04-15 · unverdicted · none · ref 11
