Pressure-Testing Deception Probes in LLMs: Scaling, Robustness, and the Geometry of Deceptive Representations
Pith reviewed 2026-06-29 12:37 UTC · model grok-4.3
The pith
Deception probes in LLMs fail under stylistic shifts because training data lacks diversity rather than because larger models hide the signal.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Linear probes reach AUROC of at least 0.998 on clean data but degrade under stylistic shifts. Training on style-augmented data yields mean AUROC of 0.979-0.983 on styles held out from training. The single-direction hypothesis is rejected because k=1 captures only 0.61-0.80 AUROC, and cross-domain transfer failure is geometric rather than driven by layer mismatch. The entropy-proxy hypothesis is rejected: the maximum correlation is |rho|=0.454, and residualizing for entropy changes AUROC by at most 0.004. Deception does not occupy a significant linear subspace, yet multi-dimensional probes (k>=5) recover the signal from distributed sub-threshold features. The inverse scaling pattern disappears once the training distribution is broadened.
What carries the argument
Style-augmented linear probes trained on activations that recover distributed sub-threshold features of deception across stylistic shifts.
If this is right
- Style-augmented probes maintain high detection accuracy on styles not present in the original training set.
- The apparent worsening of probe performance at larger scales is eliminated by broadening the training distribution.
- Single linear directions capture only a minority of the deception signal.
- Residualizing activations for entropy leaves probe performance essentially unchanged.
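The entropy-residualization control in the last bullet can be illustrated with a minimal sketch. It assumes the control is a per-dimension ordinary least-squares fit of each activation dimension on a scalar entropy value, which the summary does not confirm; the function name `residualize` and the toy numbers are hypothetical.

```python
def residualize(activations, entropy):
    """Remove the least-squares linear fit on entropy from every activation dimension.

    Assumption: residualization is linear and per-dimension; the source does not
    specify the exact procedure. Returns a new list of rows, centered per dimension.
    """
    n = len(entropy)
    mean_h = sum(entropy) / n
    var_h = sum((h - mean_h) ** 2 for h in entropy)
    if var_h == 0:
        raise ValueError("entropy must vary across examples")
    residual = [list(row) for row in activations]
    for d in range(len(activations[0])):
        column = [row[d] for row in activations]
        mean_x = sum(column) / n
        # OLS slope of this dimension regressed on entropy.
        slope = sum((h - mean_h) * (x - mean_x)
                    for h, x in zip(entropy, column)) / var_h
        for i in range(n):
            residual[i][d] = column[i] - mean_x - slope * (entropy[i] - mean_h)
    return residual


# Toy data: dimension 0 is an exact linear function of entropy (2*h + 1),
# dimension 1 is uncorrelated with entropy and has zero mean.
toy_entropy = [0.0, 1.0, 2.0, 3.0]
toy_activations = [[1.0, 1.0], [3.0, -1.0], [5.0, -1.0], [7.0, 1.0]]
toy_residual = residualize(toy_activations, toy_entropy)
print(toy_residual)
```

Under this reading, a probe would be refit on the residualized activations and its AUROC compared with the original; a difference near zero is what the bullet describes.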
Where Pith is reading between the lines
- Deployment of deception probes would require training data that spans the stylistic range expected in actual use.
- The same augmentation strategy could be tested on other activation-based detectors such as those for hallucination.
- Future probes might benefit from explicitly targeting the distributed sub-threshold features identified by the multi-dimensional analysis.
Load-bearing premise
The eight stylistic shifts used for testing are representative of the distributional changes that would occur in real deployment settings where deception probes might be applied.
What would settle it
Apply the style-augmented probes without retraining to a ninth stylistic shift or domain outside the original eight and measure whether AUROC stays above 0.95.
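A minimal sketch of the check described above, assuming probe scores for the new shift have already been computed with a frozen probe. The function names and the example scores are hypothetical; only the 0.95 threshold comes from the text.

```python
def auroc(scores, labels):
    """Rank-based AUROC: probability that a deceptive example (label 1) scores
    higher than an honest one (label 0), with ties counted as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def survives_new_shift(scores, labels, threshold=0.95):
    """True if the frozen probe's AUROC on the new shift stays above the threshold."""
    return auroc(scores, labels) > threshold


# Hypothetical probe scores on a ninth style (1 = deceptive, 0 = honest).
ninth_style_scores = [0.91, 0.84, 0.40, 0.77, 0.12, 0.35]
ninth_style_labels = [1, 1, 1, 0, 0, 0]
print(auroc(ninth_style_scores, ninth_style_labels))
print(survives_new_shift(ninth_style_scores, ninth_style_labels))
```

The pairwise loop is quadratic in the number of examples, which is fine for a sketch; a library routine would be the natural choice at scale.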
Original abstract
Linear probes trained on LLM activations are increasingly proposed as deception-detection metrics, yet report AUROC exceeding 0.96 on clean benchmarks while collapsing under distributional shift. This paper systematically pressure-tests probe-based metrics across the Gemma 3 model family (1B-27B parameters), diagnosing why they fail rather than merely documenting that they fail. We test four hypotheses about deception encoding: (1) single linear direction, (2) multi-dimensional subspace, (3) convex conic hull, (4) entropy proxy. Our design includes cross-domain transfer matrices, multi-dimensional probe analysis with permutation null baselines, entropy-residualization tests, and distractor evaluations across 8 stylistic shifts. We find that: (a) probes achieve near-perfect AUROC (>=0.998) on clean data but collapse under stylistic shifts; style-augmented probes recover near-perfect detection (mean AUROC 0.979-0.983) on unseen styles; (b) the single-direction hypothesis is rejected (k=1 captures only 0.61-0.80 AUROC), with cross-domain transfer failure confirmed as geometric rather than layer-mismatch-driven; (c) the entropy-proxy hypothesis is rejected (max |rho|=0.454, max Delta-AUROC after residualization=0.004); and (d) deception does not form a significant linear subspace (per-domain k*=0), yet multi-dimensional probes (k>=5) recover the signal through distributed sub-threshold features. Probe fragility reflects distributional narrowness rather than an architectural limitation: style-augmented probes recover near-perfect detection at both 4B and 27B, establishing that the inverse scaling pattern is a training-distribution artifact rather than a genuine scale-dependent phenomenon.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper systematically evaluates linear probes for deception detection in the Gemma 3 model family (1B–27B). It reports near-perfect AUROC (≥0.998) on clean data that collapses under 8 stylistic shifts, with style-augmented training recovering mean AUROC 0.979–0.983 on unseen styles. Four hypotheses are tested via cross-domain transfer matrices, multi-dimensional probes with permutation null baselines, entropy-residualization, and distractor evaluations: single linear direction is rejected (k=1 yields only 0.61–0.80 AUROC); entropy proxy is rejected (max |ρ|=0.454, residualization Δ-AUROC≤0.004); deception does not occupy a significant linear subspace (per-domain k*=0) yet multi-dimensional probes (k≥5) recover signal via distributed sub-threshold features; inverse scaling is attributed to training-distribution narrowness rather than architecture.
Significance. If the empirical results hold, the work advances interpretability by providing concrete hypothesis tests with explicit null baselines and residualization controls that allow rejection of single-direction and entropy accounts. The demonstration that style augmentation restores performance at both 4B and 27B scales supplies evidence that apparent inverse scaling can be a distributional artifact. Credit is due for the permutation null baselines and entropy-residualization protocol, which strengthen falsifiability of the geometric claims.
major comments (2)
- [Abstract] Abstract: the claim that 'style-augmented probes recover near-perfect detection at both 4B and 27B, establishing that the inverse scaling pattern is a training-distribution artifact rather than a genuine scale-dependent phenomenon' is load-bearing for the central conclusion yet rests on the untested assumption that the eight stylistic shifts are representative of distributional changes arising in deployment (topic drift, context length, adversarial rephrasing, or domain semantics are not examined).
- [Results] Results on multi-dimensional probes: the statement that 'deception does not form a significant linear subspace (per-domain k*=0), yet multi-dimensional probes (k≥5) recover the signal through distributed sub-threshold features' can only be load-bearing once the paper gives the precise definition of k* and a quantitative comparison against the permutation null baseline; without those details the geometric interpretation cannot be verified from the reported AUROC numbers alone.
minor comments (2)
- [Abstract] The abstract and methods should explicitly list the eight stylistic shifts and the exact train/test split sizes used for the style-augmented condition.
- [Methods] Notation for the cross-domain transfer matrices should be defined before the first numerical results are presented.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on scope and clarity. We address each major point below with proposed revisions to improve precision without altering the core empirical results.
- Referee: [Abstract] Abstract: the claim that 'style-augmented probes recover near-perfect detection at both 4B and 27B, establishing that the inverse scaling pattern is a training-distribution artifact rather than a genuine scale-dependent phenomenon' is load-bearing for the central conclusion yet rests on the untested assumption that the eight stylistic shifts are representative of distributional changes arising in deployment (topic drift, context length, adversarial rephrasing, or domain semantics are not examined).
  Authors: We agree the eight stylistic shifts do not exhaust all deployment-relevant distributional changes. The central empirical result is that style augmentation restores high AUROC at both 4B and 27B scales for the tested shifts, showing the inverse scaling is not an intrinsic scale effect. To address the concern, we will revise the abstract to qualify the claim as applying to stylistic shifts and add a limitations paragraph discussing untested shifts such as topic drift. This clarifies scope while preserving the reported findings on the examined conditions.
  Revision: yes
- Referee: [Results] Results on multi-dimensional probes: the statement that 'deception does not form a significant linear subspace (per-domain k*=0), yet multi-dimensional probes (k≥5) recover the signal through distributed sub-threshold features' can only be load-bearing once the paper gives the precise definition of k* and a quantitative comparison against the permutation null baseline; without those details the geometric interpretation cannot be verified from the reported AUROC numbers alone.
  Authors: We apologize for the lack of explicitness. k* is defined as the minimal dimensionality where probe AUROC exceeds the 95th percentile of the permutation null (computed via 1000 shuffles of labels). Per-domain results show k*=0 because k=1 yields 0.61–0.80 AUROC (null ~0.50), while k≥5 reaches ≥0.95 and surpasses the null. We will add the formal definition, per-domain AUROC-vs-k tables with null baselines, and bootstrap p-values to the results section so the geometric claims are directly verifiable from the numbers.
  Revision: yes
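The k* definition quoted in the rebuttal can be sketched as a label-permutation test. This is an illustration under stated assumptions, not the authors' procedure: `auroc_at_k` is a hypothetical callable that returns the AUROC of a k-dimensional probe for a given label vector, returning 0 when no tested k clears the null is an assumed convention, and the toy features are invented. A faithful version would probably refit the probe for every shuffle, which the text does not specify.

```python
import math
import random


def auroc(scores, labels):
    """Rank-based AUROC with ties counted as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def nearest_rank_percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def k_star(auroc_at_k, labels, ks, n_shuffles=1000, q=95, seed=0):
    """Smallest k whose observed AUROC exceeds the q-th percentile of the
    label-permutation null. Returns 0 if no tested k does (assumed convention)."""
    rng = random.Random(seed)
    for k in sorted(ks):
        observed = auroc_at_k(k, labels)
        null = []
        for _ in range(n_shuffles):
            shuffled = list(labels)
            rng.shuffle(shuffled)  # break the link between scores and labels
            null.append(auroc_at_k(k, shuffled))
        if observed > nearest_rank_percentile(null, q):
            return k
    return 0


# Toy data (hypothetical): feature 0 carries no label signal, feature 1 separates
# the classes perfectly, so only the 2-dimensional score clears the null.
true_labels = [0] * 10 + [1] * 10
features = [[i % 2, 10 * y] for i, y in enumerate(true_labels)]


def toy_auroc_at_k(k, labels):
    # Stand-in for a fitted k-dimensional probe: sum of the first k features.
    scores = [sum(row[:k]) for row in features]
    return auroc(scores, labels)


print(k_star(toy_auroc_at_k, true_labels, ks=[1, 2]))
```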
Circularity Check
No circularity: empirical recovery on held-out styles is a genuine test
The paper's core claim—that probe fragility under stylistic shift reflects training-distribution narrowness rather than architecture—is supported by explicit empirical splits: style-augmented probes are trained on a subset of the eight shifts and evaluated on unseen shifts, yielding AUROC 0.979-0.983. This is a standard held-out evaluation, not a quantity recovered by construction. The design includes permutation null baselines, entropy residualization (max Delta-AUROC=0.004), and multi-dimensional probe analysis with k* selection per domain; none of these reduce the reported recovery to a definitional tautology or self-citation chain. No load-bearing step matches any of the six enumerated circularity patterns.
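The held-out evaluation described in this rationale amounts to a leave-styles-out split. The sketch below assumes each example is a dict with a `style` key; the style names, the helper names, and the record layout are hypothetical, since the summary does not list the eight shifts or the split sizes.

```python
def leave_styles_out(examples, held_out_styles):
    """Split examples so that no held-out style appears in the training set."""
    held_out = set(held_out_styles)
    train = [ex for ex in examples if ex["style"] not in held_out]
    test = [ex for ex in examples if ex["style"] in held_out]
    if not train or not test:
        raise ValueError("both splits must be non-empty")
    return train, test


def group_by_style(examples):
    """Group examples per style so a metric can be computed per style and averaged."""
    groups = {}
    for ex in examples:
        groups.setdefault(ex["style"], []).append(ex)
    return groups


# Hypothetical records: label 1 = deceptive, 0 = honest.
records = [
    {"style": "formal", "label": 1}, {"style": "formal", "label": 0},
    {"style": "casual", "label": 1}, {"style": "casual", "label": 0},
    {"style": "terse", "label": 1}, {"style": "terse", "label": 0},
]
train_set, test_set = leave_styles_out(records, held_out_styles=["terse"])
print(sorted(group_by_style(train_set)), sorted(group_by_style(test_set)))
```

Splitting on the style field rather than on individual examples is what keeps the test styles unseen during training.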
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Linear probes trained on activations can detect deception signals
- domain assumption: The eight stylistic shifts constitute a sufficient test of robustness