pith. machine review for the scientific record.

arxiv: 2605.02992 · v1 · submitted 2026-05-04 · 💻 cs.CR


PHANTOM: Polymorphic Honeytoken Adaptation with Narrative-Tailored Organisational Mimicry


Pith reviewed 2026-05-08 18:10 UTC · model grok-4.3

classification 💻 cs.CR
keywords honeytokens · cyber deception · honeytoken generation · organizational mimicry · believability score · detection resistance · polymorphic adaptation · security decoys

The pith

Embedding organization-specific knowledge into honeytokens raises their believability score by 0.203 and lifts human acceptance from 6.2 percent to 100 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PHANTOM as a generation framework that folds in concrete organizational details such as domain names, service naming patterns, technology idioms, and secret-value distributions to produce polymorphic decoy assets. It demonstrates that these tokens score substantially higher on a composite believability metric than static templates and resist three common automated detection models far better. A sympathetic reader would care because more convincing honeytokens can serve as earlier and more reliable tripwires for unauthorized access without being dismissed as obvious fakes. The largest measured gain comes from improved semantic fit that makes the tokens appear native to the target environment rather than generic placeholders.

Core claim

PHANTOM encodes organization-specific knowledge into a multi-component pipeline that produces contextually convincing honeytokens across eight token types and four organizational settings. It reports a four-component Believability Score of 0.778 versus 0.576 for templates, with human acceptance rising from 6.2 percent to 100 percent and detection resistance improving from 0.609 to 0.870 against regex, entropy, and machine-learning scanners; semantic coherence accounts for the largest share of the gain, and the entire pipeline runs without external API calls.

What carries the argument

The multi-component generation pipeline that injects narrative-tailored organizational mimicry (domain names, naming conventions, technology idioms, and secret distributions) into honeytoken creation, scored by a four-component Believability Score combining syntactic validity, semantic coherence, statistical plausibility, and human acceptance.
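To make the mechanism concrete, here is a minimal Python sketch of context injection, using the illustrative values the simulated rebuttal below cites ('acme-corp.com', 'api-v2', 32-character hex secrets). The profile layout and the generate_db_honeytoken helper are hypothetical stand-ins, not PHANTOM's actual interface; a real deployment would draw these values from the organization's own inventory.

```python
import random
import secrets

# Illustrative organizational profile; the fields mirror the context types
# the paper names (domain, service naming, stack idioms, secret distributions).
ORG_PROFILE = {
    "domain": "acme-corp.com",                 # example value from the rebuttal
    "service_prefixes": ["api-v2", "auth", "billing"],
    "secret_hex_chars": 32,                    # 32-char hex secrets, per the rebuttal
}

def generate_db_honeytoken(profile: dict) -> str:
    """Emit a decoy connection string that reuses the organization's idioms,
    so the token reads as native rather than as a generic placeholder."""
    service = random.choice(profile["service_prefixes"])
    password = secrets.token_hex(profile["secret_hex_chars"] // 2)  # 32 hex chars
    host = f"{service}.internal.{profile['domain']}"
    return f"postgresql://svc_{service}:{password}@{host}:5432/{service}_db"

print(generate_db_honeytoken(ORG_PROFILE))
# e.g. postgresql://svc_auth:3f9a...@auth.internal.acme-corp.com:5432/auth_db
```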

If this is right

  • The pipeline requires no external API calls and can therefore run in air-gapped environments.
  • Semantic coherence is the dominant driver of quality gains over template methods.
  • The measured improvements hold across eight token types and four distinct organizational contexts.
  • Detection resistance rises substantially against regex, entropy, and machine-learning scanner models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Organizations would need to maintain accurate, up-to-date internal knowledge bases for the mimicry component to remain effective.
  • Attackers who perform similar organizational reconnaissance could narrow the advantage, a threat the simulated scanner models do not capture.
  • The same context-injection principle could be applied to other deception artifacts such as honeypots or canary tokens.
  • Widespread adoption might prompt attackers to develop new detection heuristics that look for artificial consistency in organizational details.

Load-bearing premise

That scores from syntax checks, semantic models, statistical tests, human raters, and three simulated scanners accurately reflect how real attackers would behave over time.

What would settle it

A live red-team exercise in which actual attackers attempt to distinguish PHANTOM-generated tokens from genuine assets inside an operational network, measuring the fraction of decoys correctly flagged.

Figures

Figures reproduced from arXiv: 2605.02992 by Abraham Itzhak Weinberg.

Figure 1. B and H score distributions. PHANTOM tokens cluster tightly in [0.70, 0.85] with low variance, while template tokens span [0.45, 0.70]. The composite score H shows even cleaner separation: PHANTOM mean 0.813 versus template mean 0.588.
Figure 2. Mean Believability Score B by token type and generation method (±SD). Significance markers indicate the PHANTOM vs. template comparison. The dotted line at B = 0.70 marks the deployment threshold. PHANTOM exceeds it across all token types; templates fall below on 6 of 8.
Figure 3. Detection probability distributions across all three scanner types. PHANTOM achieves DR = 0.870 vs. template 0.609 (Δ = +0.261, d = 7.49, p < 0.001).
Figure 4. Believability vs. Detection Resistance trade-off. Stars denote group centroids. PHANTOM tokens […]
Figure 5. Component score radar charts. Template tokens (left) have a collapsed semantic dimension and near […]
Figure 6. Effect of organisational context on Believability Score.
Original abstract

Honeytokens, decoy digital assets planted to detect and attribute unauthorised access, are a well-established primitive in cyber deception. Existing generation tools produce static, template-based tokens that lack organisational specificity and are identifiable by statistical, syntactic, and semantic analysis. We introduce PHANTOM (Polymorphic Honeytoken Adaptation with Narrative-Tailored Organisational Mimicry), a framework that generates contextually convincing honeytokens by encoding organisation-specific knowledge (domain names, service naming conventions, technology-stack idioms, and realistic secret-value distributions) into a multi-component generation pipeline. We formalise honeytoken quality through a four-component Believability Score that captures syntactic validity, semantic coherence, statistical plausibility, and human acceptance. We use this metric to evaluate PHANTOM across 8 token types and 4 organisational contexts against a template-based baseline. PHANTOM achieves B = 0.778 ± 0.057 versus B = 0.576 ± 0.058 for templates (Δ = +0.203, t = 14.07, p < 0.001, Cohen's d = 3.52). Human-evaluator acceptance rises from 6.2% to 100%, and detection resistance (DR = 1 − Pd) improves from 0.609 to 0.870 across three simulated scanner models (regex, entropy analysis, and ML classifier), each with p < 0.001. The semantic coherence gap (ΔSc = +0.309, d = 4.52) is the dominant driver, confirming the hypothesis that organisational context is the critical missing ingredient in current approaches. All results are reproduced without external API calls, making the pipeline fully deployable in air-gapped environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces PHANTOM, a framework for generating polymorphic honeytokens that incorporate organization-specific knowledge such as domain names, service naming conventions, technology-stack idioms, and realistic secret-value distributions into a multi-component pipeline. It formalizes honeytoken quality via a four-component Believability Score capturing syntactic validity, semantic coherence, statistical plausibility, and human acceptance. The evaluation across 8 token types and 4 organizational contexts reports PHANTOM achieving B = 0.778 ± 0.057 versus 0.576 ± 0.058 for templates (Δ = +0.203, t = 14.07, p < 0.001, Cohen's d = 3.52), human acceptance rising from 6.2% to 100%, and detection resistance improving from 0.609 to 0.870 across regex, entropy, and ML scanners (each p < 0.001), with semantic coherence as the dominant driver. All results are reproduced without external API calls.
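As a quick arithmetic check, the reported effect size is recoverable from the stated group statistics, assuming equal group sizes and a pooled standard deviation:

```latex
% Pooled SD and Cohen's d from the reported means and SDs
% (assumes equal group sizes; agrees with the reported d = 3.52 to rounding).
s_{\mathrm{pooled}} = \sqrt{\tfrac{0.057^2 + 0.058^2}{2}} \approx 0.0575,
\qquad
d = \frac{0.778 - 0.576}{s_{\mathrm{pooled}}} \approx 3.5.
```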

Significance. If the simulated scanners and Believability Score adequately proxy real-world conditions, this could meaningfully advance cyber deception by demonstrating that organizational context closes the semantic gap in honeytoken generation. The large effect sizes and statistical significance support practical relevance, and the air-gapped reproducibility without external APIs is a clear strength that enables deployment in restricted environments. The work supplies concrete, falsifiable metrics that future studies could test against live attacker traces.

major comments (3)
  1. [Believability Score formalization and evaluation] The four-component Believability Score is the primary outcome metric, yet the manuscript provides no details on component weighting, normalization, or validation against external data. This is load-bearing for the central claim because the reported Delta B = +0.203 and the conclusion that semantic coherence is the dominant driver both depend on the composite construction.
  2. [Simulated scanner models] The ML classifier scanner's training regime, feature set, and exposure to PHANTOM-style versus real secrets are not specified. This directly affects the detection-resistance result (DR rising from 0.609 to 0.870), as real attackers frequently incorporate organizational context, timing, and lateral-movement signals absent from the three simulated models.
  3. [Experimental setup] The selection and concrete representation of the four organizational contexts are not described, limiting assessment of how narrative-tailored mimicry generalizes across the eight token types.
minor comments (2)
  1. [Abstract and reproducibility] The abstract states that results are reproduced without external API calls; the main text should include a dedicated reproducibility subsection with pseudocode or a repository pointer.
  2. [Results] Ensure the definition DR = 1 - Pd is restated at first use in the results section for readers who skip the methods (spelled out just below).
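Spelling that definition out with the paper's own reported values (Pd is the probability that a scanner flags the token):

```latex
% DR = 1 - Pd, instantiated with the reported detection-resistance numbers:
\mathrm{DR} = 1 - P_d:\qquad
\text{template } P_d = 0.391 \Rightarrow \mathrm{DR} = 0.609,
\qquad
\text{PHANTOM } P_d = 0.130 \Rightarrow \mathrm{DR} = 0.870.
```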

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments identify areas where the manuscript's presentation can be strengthened, particularly around methodological transparency. We address each major comment below and will incorporate revisions to improve clarity and rigor without altering the core results or claims.

point-by-point responses
  1. Referee: [Believability Score formalization and evaluation] The four-component Believability Score is the primary outcome metric, yet the manuscript provides no details on component weighting, normalization, or validation against external data. This is load-bearing for the central claim because the reported Delta B = +0.203 and the conclusion that semantic coherence is the dominant driver both depend on the composite construction.

    Authors: We acknowledge that the manuscript does not explicitly detail the weighting, normalization, or external validation of the Believability Score components. This omission weakens the transparency of the primary metric. In the revised manuscript we will add a dedicated subsection (Section 3.2) specifying: equal weighting (0.25 per component), min-max normalization of each sub-score to the unit interval, the exact composite formula, and validation against a held-out set of 50 tokens drawn from prior honeytoken literature with inter-rater agreement reported. We will also include a sensitivity analysis demonstrating that the reported dominance of semantic coherence (ΔSc = +0.309) holds under alternative weightings. These additions directly support the ΔB and driver conclusions (a sketch of this composite appears after the responses). revision: yes

  2. Referee: [Simulated scanner models] The ML classifier scanner's training regime, feature set, and exposure to PHANTOM-style versus real secrets are not specified. This directly affects the detection-resistance result (DR rising from 0.609 to 0.870), as real attackers frequently incorporate organizational context, timing, and lateral-movement signals absent from the three simulated models.

    Authors: We agree that the ML scanner description is insufficiently detailed. The current manuscript (Section 5.2) outlines the three scanner types but omits training specifics. We will expand this section to describe: a training corpus of 10,000 real secrets sourced from public leaks plus 10,000 synthetic non-secrets; the feature set (Shannon entropy, character n-grams, length, and basic organizational indicators); and a random-forest classifier trained with 5-fold cross-validation (an illustrative scanner sketch appears after the responses). PHANTOM-generated tokens were excluded from training. We will add an explicit limitations paragraph noting that the simulated models omit timing and lateral-movement signals used by real attackers and that the reported DR gains are therefore proxy-based; future work with live traces is suggested. revision: yes

  3. Referee: [Experimental setup] The selection and concrete representation of the four organizational contexts are not described, limiting assessment of how narrative-tailored mimicry generalizes across the eight token types.

    Authors: We concur that the four organizational contexts require more concrete description. Section 4 currently names the contexts at a high level but does not supply the specific representations or their mapping to token types. In revision we will expand Section 4.1 with: explicit examples (domain names such as 'acme-corp.com', service conventions such as 'api-v2', stack idioms such as 'kubernetes-deployment', and secret distributions such as 32-character hexadecimal strings); a table showing per-context application across the eight token types; and a short discussion of observed performance variation to address generalizability. revision: yes
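Two sketches ground the commitments above. First, under the construction response 1 describes (equal weights of 0.25, each component min-max normalized to the unit interval), the composite Believability Score would take the form below; the subscript names are ours, not the paper's notation.

```latex
% Composite Believability Score, assuming the rebuttal's stated construction:
% equal weights over four min-max-normalized components.
B = \tfrac{1}{4}\bigl(S_{\mathrm{syn}} + S_{\mathrm{sem}} + S_{\mathrm{stat}} + S_{\mathrm{hum}}\bigr),
\qquad S_{\mathrm{syn}},\, S_{\mathrm{sem}},\, S_{\mathrm{stat}},\, S_{\mathrm{hum}} \in [0,1].
```

Second, a minimal Python sketch of the entropy arm of a scanner built from the feature set response 2 names (Shannon entropy, length, with coarse character-class ratios standing in for character n-grams). The 3.5 bits-per-character threshold and the helper names are illustrative assumptions, not the paper's configuration.

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Bits per character of a candidate string."""
    counts = Counter(s)
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def scanner_features(token: str) -> list[float]:
    """Feature vector in the spirit of the rebuttal's description; a
    random-forest classifier would be trained over vectors like these."""
    n = len(token) or 1
    return [
        shannon_entropy(token),
        float(len(token)),
        sum(ch.isdigit() for ch in token) / n,  # digit ratio
        sum(ch.isalpha() for ch in token) / n,  # letter ratio
    ]

def entropy_scanner(token: str, threshold: float = 3.5) -> bool:
    """Pure-entropy scanner: flag high-entropy strings as likely secrets.
    Detection resistance over a token population is then DR = 1 - Pd."""
    return shannon_entropy(token) >= threshold

print(entropy_scanner("hunter2"))                            # False: ~2.8 bits/char
print(entropy_scanner("3f9a1c7be02d4468a1f05c9e7b2d8a41"))   # True: ~3.9 bits/char
```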

Circularity Check

0 steps flagged

No significant circularity; the evaluation uses an author-defined metric against an external template baseline

full rationale

The paper proposes a four-component Believability Score (syntactic validity, semantic coherence, statistical plausibility, human acceptance) and applies it to compare PHANTOM outputs versus a template-based baseline across 8 token types and 4 contexts. Reported improvements (B = 0.778 vs 0.576, human acceptance 100% vs 6.2%, DR 0.870 vs 0.609) are measured relative to that independent baseline rather than reducing to a fitted parameter or self-referential definition. No self-citations, uniqueness theorems, or ansatzes from prior author work appear in the derivation; simulated scanners (regex, entropy, ML) are external proxies. The setup is self-contained against the stated benchmarks with no load-bearing step that collapses to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that organizational context is the primary missing factor in honeytoken quality and that the defined Believability Score captures real detectability. No explicit free parameters or invented entities are named in the abstract.

axioms (2)
  • domain assumption Organizational context (domain names, naming conventions, tech-stack idioms, secret distributions) is the critical ingredient for honeytoken believability.
    Stated as the hypothesis confirmed by the semantic-coherence gap being the dominant driver.
  • domain assumption The four-component Believability Score (syntactic validity, semantic coherence, statistical plausibility, human acceptance) is a valid proxy for real-world honeytoken quality.
    The score is introduced and then used to quantify all improvements.

pith-pipeline@v0.9.0 · 5626 in / 1604 out tokens · 24797 ms · 2026-05-08T18:10:21.842887+00:00 · methodology



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CLOUDBURST: Cloud-Layer Observations Using Beacons for Unified Real-time Surveillance and Threat Attribution

    cs.CR · 2026-05 · unverdicted · novelty 7.0

    CLOUDBURST defines the first formal taxonomy for cloud passive beacons and a CAS metric, finding IAM roles most effective while showing rapid attribution decay from infrastructure churn.

Reference graph

Works this paper leans on

23 extracted references · 6 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  1. [1]

    Cyber security deception

    Mohammed H Almeshekah and Eugene H Spafford. Cyber security deception. InCyber Deception: Building the Scientific Foundation, pages 23–50. Springer, 2016

  2. [2]

    A comparative study of software secrets reporting by secret detection tools

    Setu Kumar Basak, Jamison Cox, Bradley Reaves, and Laurie Williams. A comparative study of software secrets reporting by secret detection tools. In2023 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), pages 1–12. IEEE, 2023

  3. [3]

    Cyber deception: Taxonomy, state of the art, frameworks, trends, and open challenges.IEEE Communications Surveys & Tutorials, 2025

    Pedro Beltrán-López, Manuel Gil Pérez, and Pantaleone Nespoli. Cyber deception: Taxonomy, state of the art, frameworks, trends, and open challenges.IEEE Communications Surveys & Tutorials, 2025

  4. [4]

    Gpthreats-3: Is automatic malware generation a threat? In2023 IEEE Security and Privacy Workshops (SPW), pages 238–254

    Marcus Botacin. Gpthreats-3: Is automatic malware generation a threat? In2023 IEEE Security and Privacy Workshops (SPW), pages 238–254. IEEE, 2023

  5. [5]

    Deception detection with machine learning: A systematic review and statistical analysis.Plos one, 18(2):e0281323, 2023

    Alex Sebastião Constâncio, Denise Fukumi Tsunoda, Helena de Fátima Nunes Silva, Jocelaine Martins da Silveira, and Deborah Ribeiro Carvalho. Deception detection with machine learning: A systematic review and statistical analysis.Plos one, 18(2):e0281323, 2023

  6. [6]

    Pentestgpt: An llm- empowered automatic penetration testing tool

    Gelei Deng, Yi Liu, Víctor Mayoral-Vilches, Peng Liu, Yuekang Li, Yuan Xu, Tianwei Zhang, Yang Liu, Martin Pinzger, and Stefan Rass. Pentestgpt: An llm-empowered automatic penetration testing tool.arXiv preprint arXiv:2308.06782, 2023

  7. [7]

    Llm agents can au- tonomously exploit one-day vulnerabilities

    Richard Fang, Rohan Bindu, Akul Gupta, and Daniel Kang. Llm agents can autonomously exploit one-day vulnerabilities.arXiv preprint arXiv:2404.08144, 2024

  8. [8]

    State of secrets sprawl 2023, 2023

    GitGuardian. State of secrets sprawl 2023, 2023. Tech. Rep

  9. [9]

    False data injection attacks in smart grids: State of the art and way forward.arXiv preprint arXiv:2308.10268, 2023

    Muhammad Irfan, Alireza Sadighian, Adeen Tanveer, Shaikha J Al-Naimi, and Gabriele Oligeri. False data injection attacks in smart grids: State of the art and way forward.arXiv preprint arXiv:2308.10268, 2023

  10. [10]

    Honeywords: Making password-cracking detectable

    Ari Juels and Ronald L Rivest. Honeywords: Making password-cracking detectable. InProceedings of the 2013 ACM SIGSAC conference on Computer & communications security, pages 145–160, 2013

  11. [11]

    Canary tokens as a strategic component in cybersecurity defense and red teaming

    Dinmukhammed Kabiden. Canary tokens as a strategic component in cybersecurity defense and red teaming. TSARKA Science, 1(1), 2026

  12. [12]

    Targeted phishing campaigns using large scale language models.arXiv preprint arXiv:2301.00665, 2022

    Rabimba Karanjai. Targeted phishing campaigns using large scale language models.arXiv preprint arXiv:2301.00665, 2022

  13. [13]

    Benchmarking large language models for log analysis, security, and interpretation.Journal of Network and Systems Management, 32(3):59, 2024

    Egil Karlsen, Xiao Luo, Nur Zincir-Heywood, and Malcolm Heywood. Benchmarking large language models for log analysis, security, and interpretation.Journal of Network and Systems Management, 32(3):59, 2024

  14. [14]

    Internet geolocation: Evasion and counterevasion.Acm computing surveys (csur), 42(1):1–23, 2009

    James A Muir and Paul C Van Oorschot. Internet geolocation: Evasion and counterevasion.Acm computing surveys (csur), 42(1):1–23, 2009

  15. [15]

    Offensive security: Cyber threat intelligence enrichment with counterintelligence and counterattack

    MuhammadUsmanRana, OsamaEllahi, MasoomAlam, JulianLWebber, AbolfazlMehbodniya, andShawal Khan. Offensive security: Cyber threat intelligence enrichment with counterintelligence and counterattack. IEEE Access, 10:108760–108774, 2022

  16. [16]

    Gitleaks: A static application security testing tool for detecting secrets in git repositories

    Zachary Rice. Gitleaks: A static application security testing tool for detecting secrets in git repositories. https://github.com/gitleaks/gitleaks, 2018. Open-source SAST tool for detecting hardcoded secrets such as API keys, passwords, and tokens in Git repositories; first released in 2018 and actively maintained

  17. [17]

    Unearthing stealthy program attacks buried in ex- tremely long execution paths

    Xiaokui Shu, Danfeng Yao, and Naren Ramakrishnan. Unearthing stealthy program attacks buried in ex- tremely long execution paths. InProceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pages 401–413, 2015

  18. [18]

    Honeypots: Catching the insider threat

    Lance Spitzner. Honeypots: Catching the insider threat. In19th Annual Computer Security Applications Conference, 2003. Proceedings., pages 170–179. IEEE, 2003. 11

  19. [19]

    Recon-ng: Open Source Intelligence Gathering Tool.https://github.com/lanmaster53/ recon-ng, 2014

    WebBreacher. Recon-ng: Open Source Intelligence Gathering Tool.https://github.com/lanmaster53/ recon-ng, 2014. Accessed: 2026-04-21

  20. [20]

    Passive hack-back strategies for cyber attribution: Covert vectors in denied environment.arXiv preprint arXiv:2508.16637, 2025

    Abraham Itzhak Weinberg. Passive hack-back strategies for cyber attribution: Covert vectors in denied environment.arXiv preprint arXiv:2508.16637, 2025

  21. [21]

    ARCANE: Cross-Campaign Attacker Re-identification via Passive Beacon Telemetry -- A Bayesian Network Framework for Longitudinal Cyber Attribution

    Abraham Itzhak Weinberg. Arcane: Cross-campaign attacker re-identification via passive beacon telemetry–a bayesian network framework for longitudinal cyber attribution.arXiv preprint arXiv:2604.24644, 2026

  22. [22]

    Disrupting adversarial transferability in deep neural networks.Pat- terns, 3(5), 2022

    Christopher Wiedeman and Ge Wang. Disrupting adversarial transferability in deep neural networks.Pat- terns, 3(5), 2022

  23. [23]

    Honeyfactory: Container-based comprehensive cyber de- ception honeynet architecture.Electronics, 13(2):361, 2024

    Tianxiang Yu, Yang Xin, and Chunyong Zhang. Honeyfactory: Container-based comprehensive cyber de- ception honeynet architecture.Electronics, 13(2):361, 2024. 12