
arXiv: 2605.27958 · v1 · pith:7FNIQNELnew · submitted 2026-05-27 · 💻 cs.CL · cs.AI · cs.LG

Pressure-Testing Deception Probes in LLMs: Scaling, Robustness, and the Geometry of Deceptive Representations

Pith reviewed 2026-06-29 12:37 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords deception detection · linear probes · distributional shift · LLM activations · style augmentation · geometric representations · inverse scaling

The pith

Deception probes in LLMs fail under stylistic shifts because training data lacks diversity rather than because larger models hide the signal.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests why linear probes on LLM activations detect deception with high accuracy on standard benchmarks yet collapse when input style changes. It evaluates four geometric hypotheses about how deception is represented and shows that expanding the probe training set to include eight stylistic variants restores near-perfect detection on styles never seen during training. This recovery occurs at both 4B and 27B scales, indicating that the previously reported inverse scaling of probe performance is an artifact of narrow training distributions. A sympathetic reader would care because the result reframes probe fragility as a solvable data issue rather than an inherent barrier to reliable deception monitoring in deployed systems.

Core claim

Linear probes reach AUROC of at least 0.998 on clean data but collapse under stylistic shifts. Training on data augmented with eight styles yields a mean AUROC of 0.979-0.983 on unseen styles. The single-direction hypothesis is rejected: a k=1 probe reaches only 0.61-0.80 AUROC, and cross-domain transfer failure is geometric rather than driven by layer mismatch. The entropy-proxy hypothesis is also rejected, with a maximum |rho| of 0.454 and a negligible residualization effect (maximum Delta-AUROC of 0.004). Deception does not occupy a significant linear subspace (per-domain k*=0), yet multi-dimensional probes (k>=5) recover the signal from distributed sub-threshold features. The inverse scaling pattern disappears once the training distribution is broadened.
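To make the probe-and-AUROC vocabulary concrete, here is a minimal sketch on synthetic data. It is not the paper's pipeline: the difference-of-means probe is a simple stand-in for whatever linear probe the author trains, and `make_synthetic_activations`, the 32-dimensional "activations", and the signal strength are all hypothetical choices made for illustration.

```python
import numpy as np

def auroc(labels, scores):
    """AUROC as the probability that a positive outscores a negative (ties count half)."""
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    diff = scores[labels][:, None] - scores[~labels][None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

def fit_mean_difference_probe(acts, labels):
    """Unit direction pointing from the label-0 mean to the label-1 mean."""
    labels = np.asarray(labels).astype(bool)
    w = acts[labels].mean(axis=0) - acts[~labels].mean(axis=0)
    return w / np.linalg.norm(w)

def make_synthetic_activations(rng, n, dim=32, signal=1.5):
    """Hypothetical data: Gaussian noise plus a label-dependent shift along axis 0."""
    labels = rng.integers(0, 2, size=n)
    acts = rng.normal(size=(n, dim))
    acts[:, 0] += signal * (2 * labels - 1)
    return acts, labels

rng = np.random.default_rng(0)
train_acts, train_labels = make_synthetic_activations(rng, 400)
test_acts, test_labels = make_synthetic_activations(rng, 400)

probe = fit_mean_difference_probe(train_acts, train_labels)
clean_auroc = auroc(test_labels, test_acts @ probe)
print(f"held-out AUROC on synthetic data: {clean_auroc:.3f}")
```

Because AUROC depends only on the ranking of scores, a style shift that merely adds a constant offset to every activation would leave it unchanged. A collapse like the one reported requires the shift to alter the geometry the probe relies on.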

What carries the argument

Style-augmented linear probes, trained on LLM activations, that recover distributed sub-threshold features of deception across stylistic shifts.

If this is right

  • Style-augmented probes maintain high detection accuracy on styles not present in the original training set.
  • The apparent worsening of probe performance at larger scales is eliminated by broadening the training distribution.
  • A single linear direction captures only part of the deception signal (0.61-0.80 AUROC at k=1).
  • Residualizing activations for entropy leaves probe performance essentially unchanged (maximum Delta-AUROC of 0.004).
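The entropy-residualization idea in the last bullet can be sketched as follows. This is one plausible reading, not the paper's stated protocol: each activation dimension is regressed on a per-example entropy value by ordinary least squares and the fitted part is subtracted. The synthetic data, the `residualize` and `pearson` helpers, and the use of Pearson correlation are assumptions; the paper's rho may be a rank correlation.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation between two 1-D arrays."""
    a = np.asarray(a, dtype=float) - np.mean(a)
    b = np.asarray(b, dtype=float) - np.mean(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def residualize(acts, entropy):
    """Remove the best linear fit (intercept + slope on entropy) from every dimension."""
    entropy = np.asarray(entropy, dtype=float)
    design = np.column_stack([np.ones_like(entropy), entropy])
    coef, *_ = np.linalg.lstsq(design, acts, rcond=None)
    return acts - design @ coef

rng = np.random.default_rng(0)
n, dim = 500, 8
entropy = rng.uniform(0.5, 3.0, size=n)   # hypothetical per-example output entropy
acts = rng.normal(size=(n, dim))
acts[:, 0] += 0.8 * entropy                # one dimension leaks entropy

resid = residualize(acts, entropy)
before = pearson(acts[:, 0], entropy)
after = pearson(resid[:, 0], entropy)
print(f"correlation with entropy before: {before:.3f}, after: {after:.2e}")
```

If a probe refit on `resid` scores about as well as one fit on `acts`, the probe is not merely reading entropy. That is the logic behind the small Delta-AUROC the paper reports.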

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Deployment of deception probes would require training data that spans the stylistic range expected in actual use.
  • The same augmentation strategy could be tested on other activation-based detectors such as those for hallucination.
  • Future probes might benefit from explicitly targeting the distributed sub-threshold features identified by the multi-dimensional analysis.

Load-bearing premise

The eight stylistic shifts used for testing are representative of the distributional changes that would occur in real deployment settings where deception probes might be applied.

What would settle it

Apply the style-augmented probes without retraining to a ninth stylistic shift or domain outside the original eight and measure whether AUROC stays above 0.95.
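One way to make "stays above 0.95" operational is to compare the lower end of a bootstrap interval, rather than a point estimate, against the threshold. The sketch below assumes probe scores on the new style are already available; the stratified percentile bootstrap, the function names, and the synthetic scores are illustrative choices, not something the paper specifies.

```python
import numpy as np

def auroc(labels, scores):
    """AUROC as the probability that a positive outscores a negative (ties count half)."""
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    diff = scores[labels][:, None] - scores[~labels][None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

def bootstrap_auroc_interval(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling each class separately."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    stats = np.empty(n_boot)
    for b in range(n_boot):
        p = rng.choice(pos, size=len(pos), replace=True)
        q = rng.choice(neg, size=len(neg), replace=True)
        diff = p[:, None] - q[None, :]
        stats[b] = ((diff > 0) + 0.5 * (diff == 0)).mean()
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

def stays_above(labels, scores, threshold=0.95, **kwargs):
    """True when the lower bootstrap bound clears the threshold."""
    lo, _ = bootstrap_auroc_interval(labels, scores, **kwargs)
    return lo > threshold

# Hypothetical probe scores on a ninth, unseen style.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=300)
scores = rng.normal(size=300) + 3.0 * labels
print(auroc(labels, scores), bootstrap_auroc_interval(labels, scores))
```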

Figures

Figures reproduced from arXiv: 2605.27958 by Sachin Kumar.

Figure 1: Layer-wise probe AUROC on D-RepE for Gemma 3 1B, 4B, 12B, and 27B. All models achieve AUROC ...
Figure 2: Multi-dimensional probe AUROC as a func...
Original abstract

Linear probes trained on LLM activations are increasingly proposed as deception-detection metrics, yet report AUROC exceeding 0.96 on clean benchmarks while collapsing under distributional shift. This paper systematically pressure-tests probe-based metrics across the Gemma 3 model family (1B-27B parameters), diagnosing why they fail rather than merely documenting that they fail. We test four hypotheses about deception encoding: (1) single linear direction, (2) multi-dimensional subspace, (3) convex conic hull, (4) entropy proxy. Our design includes cross-domain transfer matrices, multi-dimensional probe analysis with permutation null baselines, entropy-residualization tests, and distractor evaluations across 8 stylistic shifts. We find that: (a) probes achieve near-perfect AUROC (>=0.998) on clean data but collapse under stylistic shifts; style-augmented probes recover near-perfect detection (mean AUROC 0.979-0.983) on unseen styles; (b) the single-direction hypothesis is rejected (k=1 captures only 0.61-0.80 AUROC), with cross-domain transfer failure confirmed as geometric rather than layer-mismatch-driven; (c) the entropy-proxy hypothesis is rejected (max |rho|=0.454, max Delta-AUROC after residualization=0.004); and (d) deception does not form a significant linear subspace (per-domain k*=0), yet multi-dimensional probes (k>=5) recover the signal through distributed sub-threshold features. Probe fragility reflects distributional narrowness rather than an architectural limitation: style-augmented probes recover near-perfect detection at both 4B and 27B, establishing that the inverse scaling pattern is a training-distribution artifact rather than a genuine scale-dependent phenomenon.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper systematically evaluates linear probes for deception detection in the Gemma 3 model family (1B–27B). It reports near-perfect AUROC (≥0.998) on clean data that collapses under 8 stylistic shifts, with style-augmented training recovering mean AUROC 0.979–0.983 on unseen styles. Four hypotheses are tested via cross-domain transfer matrices, multi-dimensional probes with permutation null baselines, entropy-residualization, and distractor evaluations: single linear direction is rejected (k=1 yields only 0.61–0.80 AUROC); entropy proxy is rejected (max |ρ|=0.454, residualization Δ-AUROC≤0.004); deception does not occupy a significant linear subspace (per-domain k*=0) yet multi-dimensional probes (k≥5) recover signal via distributed sub-threshold features; inverse scaling is attributed to training-distribution narrowness rather than architecture.

Significance. If the empirical results hold, the work advances interpretability by providing concrete hypothesis tests with explicit null baselines and residualization controls that allow rejection of single-direction and entropy accounts. The demonstration that style augmentation restores performance at both 4B and 27B scales supplies evidence that apparent inverse scaling can be a distributional artifact. Credit is due for the permutation null baselines and entropy-residualization protocol, which strengthen falsifiability of the geometric claims.

major comments (2)
  1. [Abstract] The claim that 'style-augmented probes recover near-perfect detection at both 4B and 27B, establishing that the inverse scaling pattern is a training-distribution artifact rather than a genuine scale-dependent phenomenon' is load-bearing for the central conclusion, yet it rests on the untested assumption that the eight stylistic shifts are representative of the distributional changes that arise in deployment; topic drift, context length, adversarial rephrasing, and domain semantics are not examined.
  2. [Results] Multi-dimensional probes: the statement that 'deception does not form a significant linear subspace (per-domain k*=0), yet multi-dimensional probes (k≥5) recover the signal through distributed sub-threshold features' can bear weight only if k* is precisely defined and the quantitative comparison against the permutation null baseline is reported; without those details the geometric interpretation cannot be verified from the reported AUROC numbers alone.
minor comments (2)
  1. [Abstract] The abstract and methods should explicitly list the eight stylistic shifts and the exact train/test split sizes used for the style-augmented condition.
  2. [Methods] Notation for the cross-domain transfer matrices should be defined before the first numerical results are presented.
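For readers unfamiliar with the term, one conventional reading of a cross-domain transfer matrix is M[i, j] = AUROC of the probe fit on domain i when scored on domain j. The sketch below illustrates that reading on synthetic data. The paper's own notation is not given here, so the definition, the difference-of-means probe, and the toy domains (each encoding the label along a different axis) are all assumptions.

```python
import numpy as np

def auroc(labels, scores):
    """AUROC as the probability that a positive outscores a negative (ties count half)."""
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    diff = scores[labels][:, None] - scores[~labels][None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

def fit_mean_difference_probe(acts, labels):
    """Unit direction pointing from the label-0 mean to the label-1 mean."""
    labels = np.asarray(labels).astype(bool)
    w = acts[labels].mean(axis=0) - acts[~labels].mean(axis=0)
    return w / np.linalg.norm(w)

def transfer_matrix(train_sets, test_sets):
    """M[i, j] = AUROC of the probe fit on domain i, scored on domain j."""
    probes = [fit_mean_difference_probe(a, y) for a, y in train_sets]
    M = np.empty((len(train_sets), len(test_sets)))
    for i, w in enumerate(probes):
        for j, (acts, labels) in enumerate(test_sets):
            M[i, j] = auroc(labels, acts @ w)
    return M

def make_domain(rng, axis, n=300, dim=16, signal=1.5):
    """Hypothetical domain: the label shifts activations along one chosen axis."""
    labels = rng.integers(0, 2, size=n)
    acts = rng.normal(size=(n, dim))
    acts[:, axis] += signal * (2 * labels - 1)
    return acts, labels

rng = np.random.default_rng(0)
axes = [0, 1, 2]  # each toy domain uses its own, orthogonal direction
train_sets = [make_domain(rng, a) for a in axes]
test_sets = [make_domain(rng, a) for a in axes]
M = transfer_matrix(train_sets, test_sets)
print(np.round(M, 2))
```

In this toy setup the diagonal is high and the off-diagonal entries sit near chance, which is the pattern a purely geometric transfer failure would produce.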

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on scope and clarity. We address each major point below with proposed revisions to improve precision without altering the core empirical results.

Point-by-point responses
  1. Referee: [Abstract] The claim that 'style-augmented probes recover near-perfect detection at both 4B and 27B, establishing that the inverse scaling pattern is a training-distribution artifact rather than a genuine scale-dependent phenomenon' is load-bearing for the central conclusion, yet it rests on the untested assumption that the eight stylistic shifts are representative of the distributional changes that arise in deployment; topic drift, context length, adversarial rephrasing, and domain semantics are not examined.

    Authors: We agree the eight stylistic shifts do not exhaust all deployment-relevant distributional changes. The central empirical result is that style augmentation restores high AUROC at both 4B and 27B scales for the tested shifts, showing the inverse scaling is not an intrinsic scale effect. To address the concern, we will revise the abstract to qualify the claim as applying to stylistic shifts and add a limitations paragraph discussing untested shifts such as topic drift. This clarifies scope while preserving the reported findings on the examined conditions.
    Revision: yes

  2. Referee: [Results] Multi-dimensional probes: the statement that 'deception does not form a significant linear subspace (per-domain k*=0), yet multi-dimensional probes (k≥5) recover the signal through distributed sub-threshold features' can bear weight only if k* is precisely defined and the quantitative comparison against the permutation null baseline is reported; without those details the geometric interpretation cannot be verified from the reported AUROC numbers alone.

    Authors: We apologize for insufficient explicitness. k* is defined as the minimal dimensionality where probe AUROC exceeds the 95th percentile of the permutation null (computed via 1000 shuffles of labels). Per-domain results show k*=0 because k=1 yields 0.61–0.80 AUROC (null ~0.50), while k≥5 reaches ≥0.95 and surpasses the null. We will add the formal definition, per-domain AUROC-vs-k tables with null baselines, and bootstrap p-values to the results section so the geometric claims are directly verifiable from the numbers.
    Revision: yes
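The permutation null named in this response can be sketched as below: shuffle the labels, refit the probe, record the AUROC, and take the 95th percentile over 1000 shuffles as the bar an observed AUROC must clear. Only the shuffle count and the percentile come from the text; the difference-of-means probe, the fixed train/test split, and the synthetic data are assumptions.

```python
import numpy as np

def auroc(labels, scores):
    """AUROC as the probability that a positive outscores a negative (ties count half)."""
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    diff = scores[labels][:, None] - scores[~labels][None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

def fit_mean_difference_probe(acts, labels):
    """Unit direction pointing from the label-0 mean to the label-1 mean."""
    labels = np.asarray(labels).astype(bool)
    w = acts[labels].mean(axis=0) - acts[~labels].mean(axis=0)
    return w / np.linalg.norm(w)

def permutation_null_threshold(train_acts, train_labels, test_acts, test_labels,
                               n_shuffles=1000, percentile=95, seed=0):
    """Percentile of held-out AUROC when labels carry no information."""
    rng = np.random.default_rng(seed)
    n_train = len(train_labels)
    all_labels = np.concatenate([train_labels, test_labels])
    null = np.empty(n_shuffles)
    for s in range(n_shuffles):
        shuffled = rng.permutation(all_labels)          # break the link to activations
        w = fit_mean_difference_probe(train_acts, shuffled[:n_train])
        null[s] = auroc(shuffled[n_train:], test_acts @ w)
    return float(np.percentile(null, percentile))

def make_synthetic_activations(rng, n, dim=32, signal=1.5):
    """Hypothetical data: Gaussian noise plus a label-dependent shift along axis 0."""
    labels = rng.integers(0, 2, size=n)
    acts = rng.normal(size=(n, dim))
    acts[:, 0] += signal * (2 * labels - 1)
    return acts, labels

rng = np.random.default_rng(0)
train_acts, train_labels = make_synthetic_activations(rng, 200)
test_acts, test_labels = make_synthetic_activations(rng, 200)

probe = fit_mean_difference_probe(train_acts, train_labels)
observed = auroc(test_labels, test_acts @ probe)
threshold = permutation_null_threshold(train_acts, train_labels, test_acts, test_labels)
print(f"observed AUROC {observed:.3f} vs null 95th percentile {threshold:.3f}")
```

Repeating this comparison at each probe dimensionality k gives the per-k significance calls that a k* selection rule would build on.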

Circularity Check

0 steps flagged

No circularity: empirical recovery on held-out styles is a genuine test


The paper's core claim—that probe fragility under stylistic shift reflects training-distribution narrowness rather than architecture—is supported by explicit empirical splits: style-augmented probes are trained on a subset of the eight shifts and evaluated on unseen shifts, yielding AUROC 0.979-0.983. This is a standard held-out evaluation, not a quantity recovered by construction. The design includes permutation null baselines, entropy residualization (max Delta-AUROC=0.004), and multi-dimensional probe analysis with k* selection per domain; none of these reduce the reported recovery to a definitional tautology or self-citation chain. No load-bearing step matches any of the six enumerated circularity patterns.
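The held-out evaluation described above can be organized as a leave-one-style-out loop. The sketch assumes exactly one style is held out per fold, which the text does not confirm (it says only that training uses a subset of the eight shifts). The style names are placeholders, since the eight shifts are not listed here, and `train_and_score` is a hypothetical callback standing in for probe training and AUROC scoring.

```python
def leave_one_style_out(styles):
    """Yield (training styles, held-out style) pairs, one fold per style."""
    for held_out in styles:
        train = [s for s in styles if s != held_out]
        yield train, held_out

def mean_held_out_auroc(styles, train_and_score):
    """Average the AUROC returned by train_and_score(train_styles, held_out_style)."""
    scores = [train_and_score(train, held_out)
              for train, held_out in leave_one_style_out(styles)]
    return sum(scores) / len(scores)

# Placeholder names: the eight stylistic shifts are not enumerated in this summary.
STYLES = [f"style_{i}" for i in range(8)]

for train, held_out in leave_one_style_out(STYLES):
    # The held-out style never appears in the training styles for its fold.
    print(f"train on {len(train)} styles, evaluate on {held_out}")
```

What keeps the evaluation non-circular is that no example from the held-out style influences the probe that is scored on it. The loop enforces that at the level of style names.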

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard assumptions in mechanistic interpretability that linear probes are appropriate tools and that the chosen stylistic shifts adequately sample relevant distribution shifts.

axioms (2)
  • domain assumption: Linear probes trained on activations can detect deception signals.
    Core method used throughout the reported experiments.
  • domain assumption: The eight stylistic shifts constitute a sufficient test of robustness.
    Used to diagnose why probes fail and to claim recovery via augmentation.

pith-pipeline@v0.9.1-grok · 5856 in / 1330 out tokens · 54505 ms · 2026-06-29T12:37:19.398724+00:00 · methodology

