pith. machine review for the scientific record.

arxiv: 2605.02992 · v1 · submitted 2026-05-04 · 💻 cs.CR


PHANTOM: Polymorphic Honeytoken Adaptation with Narrative-Tailored Organisational Mimicry


Pith reviewed 2026-05-08 18:10 UTC · model grok-4.3

classification 💻 cs.CR
keywords honeytokens · cyber deception · honeytoken generation · organizational mimicry · believability score · detection resistance · polymorphic adaptation · security decoys

The pith

Embedding organization-specific knowledge into honeytokens raises their believability score by 0.203 and lifts human acceptance from 6.2 percent to 100 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PHANTOM as a generation framework that folds in concrete organizational details such as domain names, service naming patterns, technology idioms, and secret-value distributions to produce polymorphic decoy assets. It demonstrates that these tokens score substantially higher on a composite believability metric than static templates and resist three common automated detection models far better. A sympathetic reader would care because more convincing honeytokens can serve as earlier and more reliable tripwires for unauthorized access without being dismissed as obvious fakes. The largest measured gain comes from improved semantic fit that makes the tokens appear native to the target environment rather than generic placeholders.

Core claim

PHANTOM encodes organization-specific knowledge into a multi-component pipeline that produces contextually convincing honeytokens across eight token types and four organizational settings. It reports a four-component Believability Score of 0.778 versus 0.576 for templates, with human acceptance rising from 6.2 percent to 100 percent and detection resistance improving from 0.609 to 0.870 against regex, entropy, and machine-learning scanners; semantic coherence accounts for the largest share of the gain, and the entire pipeline runs without external API calls.

What carries the argument

The multi-component generation pipeline that injects narrative-tailored organizational mimicry (domain names, naming conventions, technology idioms, and secret distributions) into honeytoken creation, scored by a four-component Believability Score combining syntactic validity, semantic coherence, statistical plausibility, and human acceptance.
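To make the mechanism concrete, here is a minimal Python sketch of context injection, using the illustrative values the simulated rebuttal below cites ('acme-corp.com', 'api-v2', 32-character hex secrets). The profile layout and the generate_db_honeytoken helper are hypothetical stand-ins, not PHANTOM's actual interface; a real deployment would draw these values from the organization's own inventory.

```python
import random
import secrets

# Illustrative organizational profile; the fields mirror the context types
# the paper names (domain, service naming, stack idioms, secret distributions).
ORG_PROFILE = {
    "domain": "acme-corp.com",                 # example value from the rebuttal
    "service_prefixes": ["api-v2", "auth", "billing"],
    "secret_hex_chars": 32,                    # 32-char hex secrets, per the rebuttal
}

def generate_db_honeytoken(profile: dict) -> str:
    """Emit a decoy connection string that reuses the organization's idioms,
    so the token reads as native rather than as a generic placeholder."""
    service = random.choice(profile["service_prefixes"])
    password = secrets.token_hex(profile["secret_hex_chars"] // 2)  # 32 hex chars
    host = f"{service}.internal.{profile['domain']}"
    return f"postgresql://svc_{service}:{password}@{host}:5432/{service}_db"

print(generate_db_honeytoken(ORG_PROFILE))
# e.g. postgresql://svc_auth:3f9a...@auth.internal.acme-corp.com:5432/auth_db
```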

If this is right

  • The pipeline requires no external API calls and can therefore run in air-gapped environments.
  • Semantic coherence is the dominant driver of quality gains over template methods.
  • The measured improvements hold across eight token types and four distinct organizational contexts.
  • Detection resistance rises substantially against regex, entropy, and machine-learning scanner models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Organizations would need to maintain accurate, up-to-date internal knowledge bases for the mimicry component to remain effective.
  • Attackers who perform similar organizational reconnaissance could narrow the advantage, a threat the simulated scanner models do not capture.
  • The same context-injection principle could be applied to other deception artifacts such as honeypots or canary tokens.
  • Widespread adoption might prompt attackers to develop new detection heuristics that look for artificial consistency in organizational details.

Load-bearing premise

That scores from syntax checks, semantic models, statistical tests, human raters, and three simulated scanners accurately reflect how real attackers would behave over time.

What would settle it

A live red-team exercise in which actual attackers attempt to distinguish PHANTOM-generated tokens from genuine assets inside an operational network, measuring the fraction of decoys correctly flagged.

Figures

Figures reproduced from arXiv: 2605.02992 by Abraham Itzhak Weinberg.

Figure 1. B and H score distributions. PHANTOM tokens cluster tightly in [0.70, 0.85] with low variance, while template tokens span [0.45, 0.70]. The composite score H shows even cleaner separation: PHANTOM mean 0.813 versus template mean 0.588.
Figure 2. Mean Believability Score B by token type and generation method (±SD). Significance markers indicate the PHANTOM vs. template comparison. The dotted line at B = 0.70 marks the deployment threshold. PHANTOM exceeds it across all token types; templates fall below on 6 of 8.
Figure 3. Detection probability distributions across all three scanner types. PHANTOM achieves DR = 0.870 vs. template 0.609 (Δ = +0.261, d = 7.49, p < 0.001).
Figure 4. Believability vs. Detection Resistance trade-off. Stars denote group centroids. PHANTOM tokens […]
Figure 5. Component score radar charts. Template tokens (left) have a collapsed semantic dimension and near […]
Figure 6. Effect of organisational context on Believability Score.
Original abstract

Honeytokens, decoy digital assets planted to detect and attribute unauthorised access, are a well-established primitive in cyber deception. Existing generation tools produce static, template-based tokens that lack organisational specificity and are identifiable by statistical, syntactic, and semantic analysis. We introduce PHANTOM (Polymorphic Honeytoken Adaptation with Narrative-Tailored Organisational Mimicry), a framework that generates contextually convincing honeytokens by encoding organisation-specific knowledge (domain names, service naming conventions, technology-stack idioms, and realistic secret-value distributions) into a multi-component generation pipeline. We formalise honeytoken quality through a four-component Believability Score that captures syntactic validity, semantic coherence, statistical plausibility, and human acceptance. We use this metric to evaluate PHANTOM across 8 token types and 4 organisational contexts against a template-based baseline. PHANTOM achieves B = 0.778 ± 0.057 versus B = 0.576 ± 0.058 for templates (Δ = +0.203, t = 14.07, p < 0.001, Cohen's d = 3.52). Human-evaluator acceptance rises from 6.2% to 100%, and detection resistance (DR = 1 − Pd) improves from 0.609 to 0.870 across three simulated scanner models (regex, entropy analysis, and ML classifier), each with p < 0.001. The semantic coherence gap (ΔSc = +0.309, d = 4.52) is the dominant driver, confirming the hypothesis that organisational context is the critical missing ingredient in current approaches. All results are reproduced without external API calls, making the pipeline fully deployable in air-gapped environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces PHANTOM, a framework for generating polymorphic honeytokens that incorporate organization-specific knowledge such as domain names, service naming conventions, technology-stack idioms, and realistic secret-value distributions into a multi-component pipeline. It formalizes honeytoken quality via a four-component Believability Score capturing syntactic validity, semantic coherence, statistical plausibility, and human acceptance. The evaluation across 8 token types and 4 organizational contexts reports PHANTOM achieving B = 0.778 ± 0.057 versus 0.576 ± 0.058 for templates (Δ = +0.203, t = 14.07, p < 0.001, Cohen's d = 3.52), human acceptance rising from 6.2% to 100%, and detection resistance improving from 0.609 to 0.870 across regex, entropy, and ML scanners (each p < 0.001), with semantic coherence as the dominant driver. All results are reproduced without external API calls.
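As a quick arithmetic check, the reported effect size is recoverable from the stated group statistics, assuming equal group sizes and a pooled standard deviation:

```latex
% Pooled SD and Cohen's d from the reported means and SDs
% (assumes equal group sizes; agrees with the reported d = 3.52 to rounding).
s_{\mathrm{pooled}} = \sqrt{\tfrac{0.057^2 + 0.058^2}{2}} \approx 0.0575,
\qquad
d = \frac{0.778 - 0.576}{s_{\mathrm{pooled}}} \approx 3.5.
```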

Significance. If the simulated scanners and Believability Score adequately proxy real-world conditions, this could meaningfully advance cyber deception by demonstrating that organizational context closes the semantic gap in honeytoken generation. The large effect sizes and statistical significance support practical relevance, and the air-gapped reproducibility without external APIs is a clear strength that enables deployment in restricted environments. The work supplies concrete, falsifiable metrics that future studies could test against live attacker traces.

major comments (3)
  1. [Believability Score formalization and evaluation] The four-component Believability Score is the primary outcome metric, yet the manuscript provides no details on component weighting, normalization, or validation against external data. This is load-bearing for the central claim because the reported Delta B = +0.203 and the conclusion that semantic coherence is the dominant driver both depend on the composite construction.
  2. [Simulated scanner models] The ML classifier scanner's training regime, feature set, and exposure to PHANTOM-style versus real secrets are not specified. This directly affects the detection-resistance result (DR rising from 0.609 to 0.870), as real attackers frequently incorporate organizational context, timing, and lateral-movement signals absent from the three simulated models.
  3. [Experimental setup] The selection and concrete representation of the four organizational contexts are not described, limiting assessment of how narrative-tailored mimicry generalizes across the eight token types.
minor comments (2)
  1. [Abstract and reproducibility] The abstract states that results are reproduced without external API calls; the main text should include a dedicated reproducibility subsection with pseudocode or a repository pointer.
  2. [Results] Ensure the definition DR = 1 - Pd is restated at first use in the results section for readers who skip the methods (spelled out just below).
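Spelling that definition out with the paper's own reported values (Pd is the probability that a scanner flags the token):

```latex
% DR = 1 - Pd, instantiated with the reported detection-resistance numbers:
\mathrm{DR} = 1 - P_d:\qquad
\text{template } P_d = 0.391 \Rightarrow \mathrm{DR} = 0.609,
\qquad
\text{PHANTOM } P_d = 0.130 \Rightarrow \mathrm{DR} = 0.870.
```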

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments identify areas where the manuscript's presentation can be strengthened, particularly around methodological transparency. We address each major comment below and will incorporate revisions to improve clarity and rigor without altering the core results or claims.

point-by-point responses
  1. Referee: [Believability Score formalization and evaluation] The four-component Believability Score is the primary outcome metric, yet the manuscript provides no details on component weighting, normalization, or validation against external data. This is load-bearing for the central claim because the reported Delta B = +0.203 and the conclusion that semantic coherence is the dominant driver both depend on the composite construction.

    Authors: We acknowledge that the manuscript does not explicitly detail the weighting, normalization, or external validation of the Believability Score components. This omission weakens the transparency of the primary metric. In the revised manuscript we will add a dedicated subsection (Section 3.2) specifying: equal weighting (0.25 per component), min-max normalization of each sub-score to the unit interval, the exact composite formula, and validation against a held-out set of 50 tokens drawn from prior honeytoken literature with inter-rater agreement reported. We will also include a sensitivity analysis demonstrating that the reported dominance of semantic coherence (ΔSc = +0.309) holds under alternative weightings. These additions directly support the ΔB and driver conclusions (a sketch of this composite appears after the responses). revision: yes

  2. Referee: [Simulated scanner models] The ML classifier scanner's training regime, feature set, and exposure to PHANTOM-style versus real secrets are not specified. This directly affects the detection-resistance result (DR rising from 0.609 to 0.870), as real attackers frequently incorporate organizational context, timing, and lateral-movement signals absent from the three simulated models.

    Authors: We agree that the ML scanner description is insufficiently detailed. The current manuscript (Section 5.2) outlines the three scanner types but omits training specifics. We will expand this section to describe: a training corpus of 10,000 real secrets sourced from public leaks plus 10,000 synthetic non-secrets; the feature set (Shannon entropy, character n-grams, length, and basic organizational indicators); and a random-forest classifier trained with 5-fold cross-validation (an illustrative scanner sketch appears after the responses). PHANTOM-generated tokens were excluded from training. We will add an explicit limitations paragraph noting that the simulated models omit timing and lateral-movement signals used by real attackers and that the reported DR gains are therefore proxy-based; future work with live traces is suggested. revision: yes

  3. Referee: [Experimental setup] The selection and concrete representation of the four organizational contexts are not described, limiting assessment of how narrative-tailored mimicry generalizes across the eight token types.

    Authors: We concur that the four organizational contexts require more concrete description. Section 4 currently names the contexts at a high level but does not supply the specific representations or their mapping to token types. In revision we will expand Section 4.1 with: explicit examples (domain names such as 'acme-corp.com', service conventions such as 'api-v2', stack idioms such as 'kubernetes-deployment', and secret distributions such as 32-character hexadecimal strings); a table showing per-context application across the eight token types; and a short discussion of observed performance variation to address generalizability. revision: yes
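Two sketches ground the commitments above. First, under the construction response 1 describes (equal weights of 0.25, each component min-max normalized to the unit interval), the composite Believability Score would take the form below; the subscript names are ours, not the paper's notation.

```latex
% Composite Believability Score, assuming the rebuttal's stated construction:
% equal weights over four min-max-normalized components.
B = \tfrac{1}{4}\bigl(S_{\mathrm{syn}} + S_{\mathrm{sem}} + S_{\mathrm{stat}} + S_{\mathrm{hum}}\bigr),
\qquad S_{\mathrm{syn}},\, S_{\mathrm{sem}},\, S_{\mathrm{stat}},\, S_{\mathrm{hum}} \in [0,1].
```

Second, a minimal Python sketch of the entropy arm of a scanner built from the feature set response 2 names (Shannon entropy, length, with coarse character-class ratios standing in for character n-grams). The 3.5 bits-per-character threshold and the helper names are illustrative assumptions, not the paper's configuration.

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Bits per character of a candidate string."""
    counts = Counter(s)
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def scanner_features(token: str) -> list[float]:
    """Feature vector in the spirit of the rebuttal's description; a
    random-forest classifier would be trained over vectors like these."""
    n = len(token) or 1
    return [
        shannon_entropy(token),
        float(len(token)),
        sum(ch.isdigit() for ch in token) / n,  # digit ratio
        sum(ch.isalpha() for ch in token) / n,  # letter ratio
    ]

def entropy_scanner(token: str, threshold: float = 3.5) -> bool:
    """Pure-entropy scanner: flag high-entropy strings as likely secrets.
    Detection resistance over a token population is then DR = 1 - Pd."""
    return shannon_entropy(token) >= threshold

print(entropy_scanner("hunter2"))                            # False: ~2.8 bits/char
print(entropy_scanner("3f9a1c7be02d4468a1f05c9e7b2d8a41"))   # True: ~3.9 bits/char
```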

Circularity Check

0 steps flagged

No significant circularity; the evaluation uses an author-defined metric against an external template baseline

full rationale

The paper proposes a four-component Believability Score (syntactic validity, semantic coherence, statistical plausibility, human acceptance) and applies it to compare PHANTOM outputs versus a template-based baseline across 8 token types and 4 contexts. Reported improvements (B = 0.778 vs 0.576, human acceptance 100% vs 6.2%, DR 0.870 vs 0.609) are measured relative to that independent baseline rather than reducing to a fitted parameter or self-referential definition. No self-citations, uniqueness theorems, or ansatzes from prior author work appear in the derivation; simulated scanners (regex, entropy, ML) are external proxies. The setup is self-contained against the stated benchmarks with no load-bearing step that collapses to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that organizational context is the primary missing factor in honeytoken quality and that the defined Believability Score captures real detectability. No explicit free parameters or invented entities are named in the abstract.

axioms (2)
  • domain assumption Organizational context (domain names, naming conventions, tech-stack idioms, secret distributions) is the critical ingredient for honeytoken believability.
    Stated as the hypothesis confirmed by the semantic-coherence gap being the dominant driver.
  • domain assumption The four-component Believability Score (syntactic validity, semantic coherence, statistical plausibility, human acceptance) is a valid proxy for real-world honeytoken quality.
    The score is introduced and then used to quantify all improvements.

pith-pipeline@v0.9.0 · 5626 in / 1604 out tokens · 24797 ms · 2026-05-08T18:10:21.842887+00:00 · methodology



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CLOUDBURST: Cloud-Layer Observations Using Beacons for Unified Real-time Surveillance and Threat Attribution

    cs.CR · 2026-05 · unverdicted · novelty 7.0

    CLOUDBURST defines the first formal taxonomy for cloud passive beacons and a CAS metric, finding IAM roles most effective while showing rapid attribution decay from infrastructure churn.

Reference graph

Works this paper leans on

23 extracted references · 6 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  1. [1]

    Cyber security deception

    Mohammed H Almeshekah and Eugene H Spafford. Cyber security deception. InCyber Deception: Building the Scientific Foundation, pages 23–50. Springer, 2016

  2. [2]

    A comparative study of software secrets reporting by secret detection tools

    Setu Kumar Basak, Jamison Cox, Bradley Reaves, and Laurie Williams. A comparative study of software secrets reporting by secret detection tools. In2023 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), pages 1–12. IEEE, 2023

  3. [3]

    Cyber deception: Taxonomy, state of the art, frameworks, trends, and open challenges.IEEE Communications Surveys & Tutorials, 2025

    Pedro Beltrán-López, Manuel Gil Pérez, and Pantaleone Nespoli. Cyber deception: Taxonomy, state of the art, frameworks, trends, and open challenges.IEEE Communications Surveys & Tutorials, 2025

  4. [4]

    Gpthreats-3: Is automatic malware generation a threat? In2023 IEEE Security and Privacy Workshops (SPW), pages 238–254

    Marcus Botacin. Gpthreats-3: Is automatic malware generation a threat? In2023 IEEE Security and Privacy Workshops (SPW), pages 238–254. IEEE, 2023

  5. [5]

    Deception detection with machine learning: A systematic review and statistical analysis.Plos one, 18(2):e0281323, 2023

    Alex Sebastião Constâncio, Denise Fukumi Tsunoda, Helena de Fátima Nunes Silva, Jocelaine Martins da Silveira, and Deborah Ribeiro Carvalho. Deception detection with machine learning: A systematic review and statistical analysis.Plos one, 18(2):e0281323, 2023

  6. [6]

    Pentestgpt: An llm- empowered automatic penetration testing tool

    Gelei Deng, Yi Liu, Víctor Mayoral-Vilches, Peng Liu, Yuekang Li, Yuan Xu, Tianwei Zhang, Yang Liu, Martin Pinzger, and Stefan Rass. Pentestgpt: An llm-empowered automatic penetration testing tool.arXiv preprint arXiv:2308.06782, 2023

  7. [7]

    Llm agents can au- tonomously exploit one-day vulnerabilities

    Richard Fang, Rohan Bindu, Akul Gupta, and Daniel Kang. Llm agents can autonomously exploit one-day vulnerabilities.arXiv preprint arXiv:2404.08144, 2024

  8. [8]

    State of secrets sprawl 2023, 2023

    GitGuardian. State of secrets sprawl 2023, 2023. Tech. Rep

  9. [9]

    False data injection attacks in smart grids: State of the art and way forward.arXiv preprint arXiv:2308.10268, 2023

    Muhammad Irfan, Alireza Sadighian, Adeen Tanveer, Shaikha J Al-Naimi, and Gabriele Oligeri. False data injection attacks in smart grids: State of the art and way forward.arXiv preprint arXiv:2308.10268, 2023

  10. [10]

    Honeywords: Making password-cracking detectable

    Ari Juels and Ronald L Rivest. Honeywords: Making password-cracking detectable. InProceedings of the 2013 ACM SIGSAC conference on Computer & communications security, pages 145–160, 2013

  11. [11]

    Canary tokens as a strategic component in cybersecurity defense and red teaming

    Dinmukhammed Kabiden. Canary tokens as a strategic component in cybersecurity defense and red teaming. TSARKA Science, 1(1), 2026

  12. [12]

    Targeted phishing campaigns using large scale language models.arXiv preprint arXiv:2301.00665, 2022

    Rabimba Karanjai. Targeted phishing campaigns using large scale language models.arXiv preprint arXiv:2301.00665, 2022

  13. [13]

    Benchmarking large language models for log analysis, security, and interpretation.Journal of Network and Systems Management, 32(3):59, 2024

    Egil Karlsen, Xiao Luo, Nur Zincir-Heywood, and Malcolm Heywood. Benchmarking large language models for log analysis, security, and interpretation.Journal of Network and Systems Management, 32(3):59, 2024

  14. [14]

    Internet geolocation: Evasion and counterevasion.Acm computing surveys (csur), 42(1):1–23, 2009

    James A Muir and Paul C Van Oorschot. Internet geolocation: Evasion and counterevasion.Acm computing surveys (csur), 42(1):1–23, 2009

  15. [15]

    Offensive security: Cyber threat intelligence enrichment with counterintelligence and counterattack

    MuhammadUsmanRana, OsamaEllahi, MasoomAlam, JulianLWebber, AbolfazlMehbodniya, andShawal Khan. Offensive security: Cyber threat intelligence enrichment with counterintelligence and counterattack. IEEE Access, 10:108760–108774, 2022

  16. [16]

    Gitleaks: A static application security testing tool for detecting secrets in git repositories

    Zachary Rice. Gitleaks: A static application security testing tool for detecting secrets in git repositories. https://github.com/gitleaks/gitleaks, 2018. Open-source SAST tool for detecting hardcoded secrets such as API keys, passwords, and tokens in Git repositories; first released in 2018 and actively maintained

  17. [17]

    Unearthing stealthy program attacks buried in ex- tremely long execution paths

    Xiaokui Shu, Danfeng Yao, and Naren Ramakrishnan. Unearthing stealthy program attacks buried in ex- tremely long execution paths. InProceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pages 401–413, 2015

  18. [18]

    Honeypots: Catching the insider threat

    Lance Spitzner. Honeypots: Catching the insider threat. In19th Annual Computer Security Applications Conference, 2003. Proceedings., pages 170–179. IEEE, 2003. 11

  19. [19]

    Recon-ng: Open Source Intelligence Gathering Tool.https://github.com/lanmaster53/ recon-ng, 2014

    WebBreacher. Recon-ng: Open Source Intelligence Gathering Tool.https://github.com/lanmaster53/ recon-ng, 2014. Accessed: 2026-04-21

  20. [20]

    Passive hack-back strategies for cyber attribution: Covert vectors in denied environment.arXiv preprint arXiv:2508.16637, 2025

    Abraham Itzhak Weinberg. Passive hack-back strategies for cyber attribution: Covert vectors in denied environment.arXiv preprint arXiv:2508.16637, 2025

  21. [21]

    ARCANE: Cross-Campaign Attacker Re-identification via Passive Beacon Telemetry -- A Bayesian Network Framework for Longitudinal Cyber Attribution

    Abraham Itzhak Weinberg. Arcane: Cross-campaign attacker re-identification via passive beacon telemetry–a bayesian network framework for longitudinal cyber attribution.arXiv preprint arXiv:2604.24644, 2026

  22. [22]

    Disrupting adversarial transferability in deep neural networks.Pat- terns, 3(5), 2022

    Christopher Wiedeman and Ge Wang. Disrupting adversarial transferability in deep neural networks.Pat- terns, 3(5), 2022

  23. [23]

    Honeyfactory: Container-based comprehensive cyber de- ception honeynet architecture.Electronics, 13(2):361, 2024

    Tianxiang Yu, Yang Xin, and Chunyong Zhang. Honeyfactory: Container-based comprehensive cyber de- ception honeynet architecture.Electronics, 13(2):361, 2024. 12