pith. sign in

arxiv: 2605.27671 · v1 · pith:6IRIUQTTnew · submitted 2026-05-26 · 📊 stat.ML · cs.LG

Evolving and Detecting Multi-Turn Deception using Geometric Signatures

Pith reviewed 2026-06-29 15:15 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords multi-turn deceptiongeometric signaturesLLM safetygenetic prompt optimizationembedding space detectiondeception detectionlightweight classifiermulti-objective evolution
0
0 comments X

The pith

Multi-turn deceptive intent in LLMs leaves a stable geometric footprint in embedding space that a small classifier can detect with 0.89 recall.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds an evolutionary system that uses multi-objective genetic prompt optimization with co-evolving mutation operators to create realistic multi-turn deceptive question sets aimed at extracting prohibited information. A human study validates the prompts and shows that earlier generations produce the most convincing deception while revealing practical constraints like ordering effects. These prompts are screened with three geometric measures in embedding space—angular coverage, distance ratio, and linearity—plus pairwise similarity statistics, fed into a lightweight feed-forward classifier. The classifier maintains high recall across base, reworded, and shortened versions of the conversations without any end-to-end retraining of the underlying LLM. The central result is that deceptive multi-turn behavior produces a detectable geometric pattern usable for transparent safety screening.

Core claim

Using multi-objective genetic prompt optimization with co-evolving mutation operators, the paper generates realistic multi-turn deceptive question sets. These are validated in a human study showing that early generations are most convincing. Deceptive attempts are then detected using three geometric features—angular coverage, distance ratio, and linearity—in embedding space, augmented with pairwise similarity statistics, via a lightweight feed-forward classifier achieving 0.89 recall consistently across base, reworded, and truncated (three-turn) scenarios. The results support the hypothesis that multi-turn deceptive intent leaves a stable geometric footprint enabling lightweight, transparent

What carries the argument

Multi-objective genetic prompt optimization with co-evolving mutation operators for generating deceptive sets, combined with three geometric features (angular coverage, distance ratio, and linearity) in embedding space plus pairwise similarity statistics that feed a lightweight feed-forward classifier.

If this is right

  • Lightweight screening becomes possible without retraining the base LLM safety model.
  • High recall (0.89) holds on reworded prompts and on conversations truncated to three turns.
  • The approach supplies explainable signals rather than black-box decisions.
  • The evolutionary generator can produce larger test sets for safety evaluation.
  • The method targets deployment inside LLM safety infrastructure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same geometric features could be tested on single-turn jailbreak attempts to check whether the footprint is specific to multi-turn structure.
  • Safety pipelines might combine the geometric detector with existing content filters to reduce false negatives on indirect probing.
  • Running the pipeline on open-source models of varying sizes would test whether the features are architecture-agnostic.
  • The genetic optimization could be adapted to target specific prohibited domains such as chemical synthesis instructions.

Load-bearing premise

The three geometric features remain stable and discriminative across different base LLMs, prompt rewordings, and truncation levels beyond the specific scenarios tested.

What would settle it

Apply the same three-feature classifier to deceptive multi-turn conversations generated against an unseen LLM family and measure whether recall stays above 0.8.

Figures

Figures reproduced from arXiv: 2605.27671 by Mary L. Cummings, Surender Suresh Kumar.

Figure 1
Figure 1. Figure 1: Iterative steps for generating and optimizing synthetic data via multi-objective selection [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Survey interface for rating individual questions on perceived deceptiveness. Partici￾pants used a seven-point Likert scale to assess the likelihood of deceptive intent [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗
Figure 4
Figure 4. Figure 4: Boxplots for the average of 5 individual questions from phase 1 and participant ratings [PITH_FULL_IMAGE:figures/full_fig_p011_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Means, modes and variance of human and LLM average responses per phase. [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗
read the original abstract

Safety defenses for large language models (LLMs) are typically trained and evaluated on single-turn prompts, yet real attacks often unfold as indirect, multi-turn probing. To defend against this more nuanced form of deception, we present a unified pipeline that generates realistic multi-turn deceptive question sets via multi-objective genetic prompt optimization with co-evolving mutation operators. We validate this dataset through a human study, which also revealed that early generations yielded the most convincing deception and practical constraints such as adherence filtering and ordering effects. Using this data, we were able to detect deceptive attempts to access prohibited information using simple, explainable geometric signals in embedding space coupled with a lightweight feed-forward classifier. Three geometric features (angular coverage, distance ratio, and linearity) augmented with pairwise similarity statistics led to a compact predictive model that achieved consistently high recall (0.89) across base, reworded, and truncated (three-turn) scenarios, with test-time F1 ranging from 0.74-0.86. The results support a central hypothesis that multi-turn deceptive intent leaves a stable geometric footprint that enables lightweight, transparent screening without expensive end-to-end training. We further discuss responsible uses, limitations, and paths toward larger, more diverse human-evaluated datasets. The primary contribution to artificial intelligence is the multi-objective evolutionary framework for prompt generation, and the engineering application is the deployment of a lightweight geometric detection system for LLM safety infrastructure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces a multi-objective genetic prompt optimization pipeline to generate realistic multi-turn deceptive question sets for LLMs, validates the outputs via a human study (noting that early generations were most convincing and highlighting adherence filtering and ordering effects), and detects deceptive attempts to access prohibited information via three geometric features in embedding space (angular coverage, distance ratio, linearity) plus pairwise similarity statistics fed to a lightweight feed-forward classifier. The classifier reports recall of 0.89 and F1 scores of 0.74-0.86 on base, reworded, and three-turn truncated scenarios, supporting the hypothesis that multi-turn deceptive intent produces a stable geometric footprint amenable to lightweight, explainable screening without end-to-end retraining.

Significance. If the geometric signatures are shown to be intrinsic to deceptive intent rather than tied to the specific generation pipeline, the work would offer a meaningful engineering contribution to LLM safety infrastructure by providing transparent, low-cost detection that complements heavier models. The multi-objective evolutionary framework for prompt generation is also a clear methodological strength that could be reused for other safety-related data synthesis tasks.

major comments (2)
  1. [Abstract and dataset generation] Abstract and dataset generation pipeline: all training, reworded, and truncated test data are produced by the same multi-objective evolutionary process with co-evolving mutation operators and adherence filtering. No control set of independently generated (non-evolved) deceptive multi-turn dialogues is evaluated, so the reported geometric patterns and 0.89 recall may reflect optimizer-specific artifacts rather than a method-independent signature of deceptive intent. This directly undermines the central hypothesis that the three features form a stable, intrinsic footprint.
  2. [Classifier evaluation and human study] Classifier evaluation and human study: concrete metrics (recall 0.89, F1 0.74-0.86) are reported without error bars, dataset sizes, exclusion criteria, inter-rater agreement statistics, or the full classifier architecture and training details. The human study is summarized without quantitative agreement measures, making it impossible to assess whether the geometric features remain discriminative when the base LLM, prompt wording, or truncation level changes beyond the specific evolved scenarios tested.
minor comments (1)
  1. [Abstract] The abstract states that 'early generations yielded the most convincing deception' but does not quantify how this observation was used to select or filter the final dataset for the classifier experiments.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important considerations for strengthening claims about intrinsic geometric signatures. We address each major comment below and outline revisions to improve the manuscript.

read point-by-point responses
  1. Referee: [Abstract and dataset generation] Abstract and dataset generation pipeline: all training, reworded, and truncated test data are produced by the same multi-objective evolutionary process with co-evolving mutation operators and adherence filtering. No control set of independently generated (non-evolved) deceptive multi-turn dialogues is evaluated, so the reported geometric patterns and 0.89 recall may reflect optimizer-specific artifacts rather than a method-independent signature of deceptive intent. This directly undermines the central hypothesis that the three features form a stable, intrinsic footprint.

    Authors: We agree this is a substantive limitation. While the reworded and truncated variants introduce variation in surface form and length, they remain downstream of the same evolutionary process, so they do not fully rule out pipeline-specific artifacts. The central hypothesis would be more robust with an independent control set. We will add such a control set (deceptive multi-turn dialogues generated via standard non-evolutionary prompting on the same base models) and report comparative geometric statistics and classifier performance in the revised manuscript. revision: yes

  2. Referee: [Classifier evaluation and human study] Classifier evaluation and human study: concrete metrics (recall 0.89, F1 0.74-0.86) are reported without error bars, dataset sizes, exclusion criteria, inter-rater agreement statistics, or the full classifier architecture and training details. The human study is summarized without quantitative agreement measures, making it impossible to assess whether the geometric features remain discriminative when the base LLM, prompt wording, or truncation level changes beyond the specific evolved scenarios tested.

    Authors: We acknowledge the need for greater transparency. The revised manuscript will include: (i) error bars and exact dataset sizes for all reported metrics, (ii) exclusion criteria and inter-rater agreement statistics (e.g., Fleiss' kappa) for the human study, and (iii) the complete classifier architecture, training hyperparameters, and evaluation protocol. These additions will allow readers to evaluate robustness across the tested scenarios. revision: yes

Circularity Check

0 steps flagged

Empirical pipeline with held-out evaluation; no reduction to inputs by construction

full rationale

The paper describes a data-generation pipeline (multi-objective genetic prompt optimization) followed by human validation and a separate classifier trained on geometric features extracted from embeddings. Reported metrics (recall 0.89, F1 0.74-0.86) are obtained on held-out base, reworded, and truncated scenarios. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text that would make the detection result equivalent to its inputs by definition. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. Full paper may introduce genetic algorithm hyperparameters or embedding model assumptions.

pith-pipeline@v0.9.1-grok · 5783 in / 1060 out tokens · 37838 ms · 2026-06-29T15:15:15.575905+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

27 extracted references · 17 canonical work pages · 7 internal anchors

  1. [1]

    Training language models to follow instructions with human feedback

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe, “Training language models to follow instructions with human feedback,” 2022. [Online]. Available: https://arxiv.org/abs/2203.02155

  2. [2]

    arXiv preprint arXiv:2404.01295 , year=

    Y .-L. Tuan, X. Chen, E. M. Smith, L. Martin, S. Batra, A. Celikyilmaz, W. Y . Wang, and D. M. Bikel, “Towards safety and helpfulness balanced responses via controllable large language models,” 2024. [Online]. Available: https://arxiv.org/abs/2404.01295

  3. [3]

    Jailbreaking and Mitigation of Vulnerabilities in Large Language Models

    B. Peng, Z. Bi, Q. Niu, M. Liu, P. Feng, T. Wang, L. K. Q. Yan, Y . Wen, Y . Zhang, and C. H. Yin, “Jailbreaking and mitigation of vulnerabilities in large language models,” 2024. [Online]. Available: https://arxiv.org/abs/2410.15236

  4. [4]

    arXiv preprint arXiv:2408.15221 , year=

    N. Li, Z. Han, I. Steneker, W. Primack, R. Goodside, H. Zhang, Z. Wang, C. Menghini, and S. Yue, “Llm defenses are not robust to multi-turn human jailbreaks yet,” 2024. [Online]. Available: https://arxiv.org/abs/2408.15221

  5. [5]

    Does safety training of llms generalize to semantically related natural prompts?, 2025

    S. Addepalli, Y . Varun, A. Suggala, K. Shanmugam, and P. Jain, “Does safety training of llms generalize to semantically related natural prompts?” 2025. [Online]. Available: https://arxiv.org/abs/2412.03235

  6. [6]

    Ignore this title and hackAPrompt: Exposing systemic vulnerabilities of LLMs through a global prompt hacking competition,

    S. V . Schulhoff, J. Pinto, A. Khan, L.-F. Bouchard, C. Si, S. Anati, V . Tagliabue, A. L. Kost, C. R. Carnahan, and J. L. Boyd-Graber, “Ignore this title and hackAPrompt: Exposing systemic vulnerabilities of LLMs through a global prompt hacking competition,” inThe 2023 Conference on Empirical Methods in Natural Language Processing, 2023. [Online]. Availa...

  7. [7]

    AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts,

    T. Shin, Y . Razeghi, R. L. Logan IV , E. Wallace, and S. Singh, “AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts,” Nov. 2020, arXiv:2010.15980 [cs]. [Online]. Available: http://arxiv.org/abs/2010.15980

  8. [8]

    Ryan, Josh Purtell, David Broman, Christopher Potts, Matei Zaharia, and Omar Khattab

    K. Opsahl-Ong, M. J. Ryan, J. Purtell, D. Broman, C. Potts, M. Zaharia, and O. Khattab, “Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs,” Jun. 2024. [Online]. Available: https://arxiv.org/abs/2406.11695v1

  9. [9]

    Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

    C. Fernando, D. Banarse, H. Michalewski, S. Osindero, and T. Rockt¨aschel, “Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution,” Sep. 2023, arXiv:2309.16797 [cs]. [Online]. Available: http://arxiv.org/abs/2309.16797

  10. [10]

    Generating Faithful Synthetic Data with Large Language Models: A Case Study in Computational Social Science,

    V . Veselovsky, M. H. Ribeiro, A. Arora, M. Josifoski, A. Anderson, and R. West, “Generating Faithful Synthetic Data with Large Language Models: A Case Study in Computational Social Science,” May 2023. [Online]. Available: https://arxiv.org/abs/2305.15041v1

  11. [11]

    TarGEN: Targeted Data Generation with Large Language Models,

    H. Gupta, K. Scaria, U. Anantheswaran, S. Verma, M. Parmar, S. A. Sawant, C. Baral, and S. Mishra, “TarGEN: Targeted Data Generation with Large Language Models,” Aug. 2024, arXiv:2310.17876 [cs]. [Online]. Available: http://arxiv.org/abs/2310.17876 16

  12. [12]

    Synthetic Data Generation with Large Language Models for Text Classification: Potential and Limitations,

    Z. Li, H. Zhu, Z. Lu, and M. Yin, “Synthetic Data Generation with Large Language Models for Text Classification: Potential and Limitations,” inProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H. Bouamor, J. Pino, and K. Bali, Eds. Singapore: Association for Computational Linguistics, Dec. 2023, pp. 10 443–10 461. [On...

  13. [13]

    On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey,

    L. Long, R. Wang, R. Xiao, J. Zhao, X. Ding, G. Chen, and H. Wang, “On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey,” in Findings of the Association for Computational Linguistics ACL 2024, L.-W. Ku, A. Martins, and V . Srikumar, Eds. Bangkok, Thailand and virtual meeting: Association for Computational Linguistics, Aug. 2024, p...

  14. [14]

    A fast and elitist multiobjective genetic algorithm: NSGA-II,

    K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, “A fast and elitist multiobjective genetic algorithm: NSGA-II,”IEEE Transactions on Evolutionary Computation, vol. 6, no. 2, pp. 182–197, Apr. 2002, conference Name: IEEE Transactions on Evolutionary Computation. [Online]. Available: https://ieeexplore.ieee.org/document/996017

  15. [15]

    Termination criteria in evolutionary algorithms: A survey,

    S. N. Ghoreishi, A. Clausen, and B. N. Jørgensen, “Termination criteria in evolutionary algorithms: A survey,” inInternational Joint Conference on Computational Intelligence,

  16. [16]

    Available: https://api.semanticscholar.org/CorpusID:13490784

    [Online]. Available: https://api.semanticscholar.org/CorpusID:13490784

  17. [17]

    Subjectivity in unsupervised machine learning model selection,

    W. Chen and M. L. Cummings, “Subjectivity in unsupervised machine learning model selection,” inAAAI Spring Symposia, 2023. [Online]. Available: https: //api.semanticscholar.org/CorpusID:261494324

  18. [18]

    Synthetic replacements for human survey data? the perils of large language models,

    J. Bisbee, J. D. Clinton, C. Dorff, B. Kenkel, and J. M. Larson, “Synthetic replacements for human survey data? the perils of large language models,”Political Analysis, vol. 32, no. 4, p. 401–416, 2024

  19. [19]

    Efficient Estimation of Word Representations in Vector Space

    T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” 2013. [Online]. Available: https://arxiv.org/abs/1301.3781

  20. [20]

    Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

    N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings using siamese bert-networks,” inProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 11 2019. [Online]. Available: https://arxiv.org/abs/1908.10084

  21. [21]

    Embedding Projector: Interactive Visualization and Interpretation of Embeddings

    D. Smilkov, N. Thorat, C. Nicholson, E. Reif, F. B. Vi´egas, and M. Wattenberg, “Embedding projector: Interactive visualization and interpretation of embeddings,” 2016. [Online]. Available: https://arxiv.org/abs/1611.05469

  22. [22]

    Exploring dimensionality reduction techniques in multilingual transformers,

    ´Alvaro Huertas-Garc ´ıa, A. Mart ´ın, J. Huertas-Tato, and D. Camacho, “Exploring dimensionality reduction techniques in multilingual transformers,” 2022. [Online]. Available: https://arxiv.org/abs/2204.08415

  23. [23]

    Rethinking coherence modeling: Synthetic vs. downstream tasks,

    T. Mohiuddin, P. Jwalapuram, X. Lin, and S. Joty, “Rethinking coherence modeling: Synthetic vs. downstream tasks,” inProceedings of the 16th Conference of the European 17 Chapter of the Association for Computational Linguistics: Main V olume, P. Merlo, J. Tiedemann, and R. Tsarfaty, Eds. Online: Association for Computational Linguistics, Apr. 2021, pp. 35...

  24. [24]

    Content moderation by llm: From accuracy to legitimacy,

    T. Huang, “Content moderation by llm: From accuracy to legitimacy,” 2024. [Online]. Available: https://arxiv.org/abs/2409.03219

  25. [25]

    Multichallenge: A realistic multi-turn conversation evaluation benchmark challenging to frontier llms,

    V . Sirdeshmukh, K. Deshpande, J. Mols, L. Jin, E.-Y . Cardona, D. Lee, J. Kritz, W. Primack, S. Yue, and C. Xing, “Multichallenge: A realistic multi-turn conversation evaluation benchmark challenging to frontier llms,” 2025. [Online]. Available: https: //arxiv.org/abs/2501.17399

  26. [26]

    Multichallenge github repository,

    V . Ekwinox, “Multichallenge github repository,” https://github.com/ekwinox117/ multi-challenge, 2025, accessed: 2025-03-30

  27. [27]

    Efficient Guided Generation for Large Language Models

    B. T. Willard and R. Louf, “Efficient guided generation for large language models,”arXiv preprint arXiv:2307.09702, 2023. 18 Dataset Split TN FN FP TP TNR Precision Recall Accuracy F1 Base Test 41 2 4 16 0.911 0.800 0.889 0.905 0.842 Reworded Test 40 4 5 14 0.889 0.737 0.778 0.857 0.757 Turn Constrained Test 36 2 9 16 0.800 0.640 0.889 0.825 0.744 Both Te...