PHANTOM: Polymorphic Honeytoken Adaptation with Narrative-Tailored Organisational Mimicry
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-08 18:10 UTC · model grok-4.3
The pith
Embedding organization-specific knowledge into honeytokens raises their Believability Score by 0.203 and lifts human acceptance to 100 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PHANTOM encodes organization-specific knowledge into a multi-component pipeline that produces contextually convincing honeytokens across eight token types and four organizational settings. It reports a four-component Believability Score of 0.778 versus 0.576 for templates, with human acceptance rising from 6.2 percent to 100 percent and detection resistance improving from 0.609 to 0.870 against regex, entropy, and machine-learning scanners; semantic coherence accounts for the largest share of the gain, and the entire pipeline runs without external API calls.
What carries the argument
The multi-component generation pipeline that injects narrative-tailored organizational mimicry (domain names, naming conventions, technology idioms, and secret distributions) into honeytoken creation, scored by the four-component Believability Score of syntactic validity, semantic coherence, statistical plausibility, and human acceptance.
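The aggregation step behind the Believability Score can be sketched as a weighted sum of the four sub-scores. This is a minimal sketch, not the paper's implementation: the weights follow the default weighting (0.20, 0.30, 0.20, 0.30) quoted in the theorem-link excerpt below, and the `SubScores` container and function names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class SubScores:
    """Four believability components, each already normalized to [0, 1]."""
    syntactic: float    # Sv: does the token parse as a valid credential format?
    semantic: float     # Sc: does it cohere with the organizational narrative?
    statistical: float  # Sn: does its value distribution look plausible?
    human: float        # Sh: fraction of human raters who accept it as real

# Default weights (Sv, Sc, Sn, Sh) quoted in the theorem-link excerpt.
DEFAULT_WEIGHTS = (0.20, 0.30, 0.20, 0.30)

def believability(s: SubScores, weights=DEFAULT_WEIGHTS) -> float:
    """Composite B(c) = w1*Sv(c) + w2*Sc(c) + w3*Sn(c) + w4*Sh(c)."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    components = (s.syntactic, s.semantic, s.statistical, s.human)
    return sum(w * x for w, x in zip(weights, components))
```

Note that the author rebuttal below mentions equal weighting (0.25 per component) instead; only the aggregation shape, a convex combination of normalized sub-scores, is common to both descriptions.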
If this is right
- The pipeline requires no external API calls and can therefore run in air-gapped environments.
- Semantic coherence is the dominant driver of quality gains over template methods.
- The measured improvements hold across eight token types and four distinct organizational contexts.
- Detection resistance rises substantially against regex, entropy, and machine-learning scanner models.
Where Pith is reading between the lines
- Organizations would need to maintain accurate, up-to-date internal knowledge bases for the mimicry component to remain effective.
- Attackers who perform similar organizational reconnaissance could narrow PHANTOM's advantage, a scenario the simulated scanners do not capture.
- The same context-injection principle could be applied to other deception artifacts such as honeypots or canary tokens.
- Widespread adoption might prompt attackers to develop new detection heuristics that look for artificial consistency in organizational details.
Load-bearing premise
That scores from syntax checks, semantic models, statistical tests, human raters, and three simulated scanners accurately reflect how real attackers would behave over time.
What would settle it
A live red-team exercise in which actual attackers attempt to distinguish PHANTOM-generated tokens from genuine assets inside an operational network, measuring the fraction of tokens correctly flagged as decoys.
Original abstract
Honeytokens, decoy digital assets planted to detect and attribute unauthorised access, are a well-established primitive in cyber deception. Existing generation tools produce static, template-based tokens that lack organisational specificity and are identifiable by statistical, syntactic, and semantic analysis. We introduce PHANTOM (Polymorphic Honeytoken Adaptation with Narrative-Tailored Organisational Mimicry), a framework that generates contextually convincing honeytokens by encoding organisation-specific knowledge: domain names, service naming conventions, technology-stack idioms, and realistic secret-value distributions, into a multi-component generation pipeline. We formalise honeytoken quality through a four-component Believability Score that captures syntactic validity, semantic coherence, statistical plausibility, and human acceptance. We use this metric to evaluate PHANTOM across 8 token types and 4 organisational contexts against a template-based baseline. PHANTOM achieves B = 0.778 +/- 0.057 versus B = 0.576 +/- 0.058 for templates (Delta = +0.203, t = 14.07, p < 0.001, Cohen's d = 3.52). Human-evaluator acceptance rises from 6.2% to 100%, and detection resistance (DR = 1 - Pd) improves from 0.609 to 0.870 across three simulated scanner models (regex, entropy analysis, and ML classifier), each with p < 0.001. The semantic coherence gap (Delta Sc = +0.309, d = 4.52) is the dominant driver, confirming the hypothesis that organisational context is the critical missing ingredient in current approaches. All results are reproduced without external API calls, making the pipeline fully deployable in air-gapped environments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces PHANTOM, a framework for generating polymorphic honeytokens that incorporate organization-specific knowledge such as domain names, service naming conventions, technology-stack idioms, and realistic secret-value distributions into a multi-component pipeline. It formalizes honeytoken quality via a four-component Believability Score capturing syntactic validity, semantic coherence, statistical plausibility, and human acceptance. The evaluation across 8 token types and 4 organizational contexts reports PHANTOM achieving B = 0.778 +/- 0.057 versus 0.576 +/- 0.058 for templates (Delta = +0.203, t = 14.07, p < 0.001, Cohen's d = 3.52), human acceptance rising from 6.2% to 100%, and detection resistance improving from 0.609 to 0.870 across regex, entropy, and ML scanners (each p < 0.001), with semantic coherence as the dominant driver. All results are reproduced without external API calls.
Significance. If the simulated scanners and Believability Score adequately proxy real-world conditions, this could meaningfully advance cyber deception by demonstrating that organizational context closes the semantic gap in honeytoken generation. The large effect sizes and statistical significance support practical relevance, and the air-gapped reproducibility without external APIs is a clear strength that enables deployment in restricted environments. The work supplies concrete, falsifiable metrics that future studies could test against live attacker traces.
Major comments (3)
- [Believability Score formalization and evaluation] The four-component Believability Score is the primary outcome metric, yet the manuscript provides no details on component weighting, normalization, or validation against external data. This is load-bearing for the central claim because the reported Delta B = +0.203 and the conclusion that semantic coherence is the dominant driver both depend on the composite construction.
- [Simulated scanner models] The ML classifier scanner's training regime, feature set, and exposure to PHANTOM-style versus real secrets are not specified. This directly affects the detection-resistance result (DR rising from 0.609 to 0.870), as real attackers frequently incorporate organizational context, timing, and lateral-movement signals absent from the three simulated models.
- [Experimental setup] The selection and concrete representation of the four organizational contexts are not described, limiting assessment of how narrative-tailored mimicry generalizes across the eight token types.
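To make the detection-resistance metric DR = 1 − Pd concrete, here is a minimal sketch of an entropy-based scanner, the simplest of the three simulated models named in the review. The 3.5-bit threshold and function names are assumptions for illustration, not the paper's configuration.

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Bits per character of the string's empirical character distribution."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def entropy_scanner(token: str, threshold: float = 3.5) -> bool:
    """Flag a token as a likely planted secret if its entropy is high."""
    return shannon_entropy(token) >= threshold

def detection_resistance(tokens: list[str]) -> float:
    """DR = 1 - Pd, where Pd is the fraction of tokens the scanner flags."""
    if not tokens:
        return 1.0
    p_detect = sum(entropy_scanner(t) for t in tokens) / len(tokens)
    return 1.0 - p_detect
```

A low-entropy decoy like `"aaaa"` passes this scanner while a uniformly random 16-character hex string does not, which is the kind of trade-off the composite DR number summarizes across the token set.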
Minor comments (2)
- [Abstract and reproducibility] The abstract states that results are reproduced without external API calls; the main text should include a dedicated reproducibility subsection with pseudocode or repository pointer.
- [Results] Ensure the definition DR = 1 - Pd is restated at first use in the results section for readers who skip the methods.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments identify areas where the manuscript's presentation can be strengthened, particularly around methodological transparency. We address each major comment below and will incorporate revisions to improve clarity and rigor without altering the core results or claims.
Point-by-point responses
Referee: [Believability Score formalization and evaluation] The four-component Believability Score is the primary outcome metric, yet the manuscript provides no details on component weighting, normalization, or validation against external data. This is load-bearing for the central claim because the reported Delta B = +0.203 and the conclusion that semantic coherence is the dominant driver both depend on the composite construction.
Authors: We acknowledge that the manuscript does not explicitly detail the weighting, normalization, or external validation of the Believability Score components. This omission weakens the transparency of the primary metric. In the revised manuscript we will add a dedicated subsection (Section 3.2) specifying: equal weighting (0.25 per component), min-max normalization of each sub-score to the unit interval, the exact composite formula, and validation against a held-out set of 50 tokens drawn from prior honeytoken literature with inter-rater agreement reported. We will also include a sensitivity analysis demonstrating that the reported dominance of semantic coherence (Delta Sc = +0.309) holds under alternative weightings. These additions directly support the Delta B and driver conclusions. revision: yes
Referee: [Simulated scanner models] The ML classifier scanner's training regime, feature set, and exposure to PHANTOM-style versus real secrets are not specified. This directly affects the detection-resistance result (DR rising from 0.609 to 0.870), as real attackers frequently incorporate organizational context, timing, and lateral-movement signals absent from the three simulated models.
Authors: We agree that the ML scanner description is insufficiently detailed. The current manuscript (Section 5.2) outlines the three scanner types but omits training specifics. We will expand this section to describe: a training corpus of 10,000 real secrets sourced from public leaks plus 10,000 synthetic non-secrets; the feature set (Shannon entropy, character n-grams, length, and basic organizational indicators); and a random-forest classifier trained with 5-fold cross-validation. PHANTOM-generated tokens were excluded from training. We will add an explicit limitations paragraph noting that the simulated models omit timing and lateral-movement signals used by real attackers and that the reported DR gains are therefore proxy-based; future work with live traces is suggested. revision: yes
Referee: [Experimental setup] The selection and concrete representation of the four organizational contexts are not described, limiting assessment of how narrative-tailored mimicry generalizes across the eight token types.
Authors: We concur that the four organizational contexts require more concrete description. Section 4 currently names the contexts at a high level but does not supply the specific representations or their mapping to token types. In revision we will expand Section 4.1 with: explicit examples (domain names such as 'acme-corp.com', service conventions such as 'api-v2', stack idioms such as 'kubernetes-deployment', and secret distributions such as 32-character hexadecimal strings); a table showing per-context application across the eight token types; and a short discussion of observed performance variation to address generalizability. revision: yes
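The sensitivity analysis promised in the first rebuttal response could be sketched as follows: recompute each component's contribution to Delta B under alternative weightings and check that semantic coherence (Sc) still dominates. Only the semantic delta (+0.309) comes from the source; the other component deltas and the candidate weightings are illustrative placeholders.

```python
# Per-component gains of PHANTOM over templates (Sv, Sc, Sn, Sh).
# Only Sc (+0.309) is reported in the review; the rest are placeholders.
COMPONENT_DELTAS = {"Sv": 0.10, "Sc": 0.309, "Sn": 0.12, "Sh": 0.15}

ALTERNATIVE_WEIGHTINGS = [
    {"Sv": 0.25, "Sc": 0.25, "Sn": 0.25, "Sh": 0.25},  # equal weights (rebuttal)
    {"Sv": 0.20, "Sc": 0.30, "Sn": 0.20, "Sh": 0.30},  # default weights (excerpt)
    {"Sv": 0.30, "Sc": 0.20, "Sn": 0.30, "Sh": 0.20},  # syntax-heavy variant
]

def dominant_component(weights: dict) -> str:
    """Return the component contributing most to Delta B under `weights`."""
    contributions = {k: weights[k] * d for k, d in COMPONENT_DELTAS.items()}
    return max(contributions, key=contributions.get)

def semantic_dominates() -> bool:
    """True if Sc drives the composite gain under every candidate weighting."""
    return all(dominant_component(w) == "Sc" for w in ALTERNATIVE_WEIGHTINGS)
```

With these placeholder deltas the dominance claim survives all three weightings; the real check would substitute the paper's measured per-component deltas.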
Circularity Check
No significant circularity: the evaluation applies an author-defined metric against an external template baseline.
Full rationale
The paper proposes a four-component Believability Score (syntactic validity, semantic coherence, statistical plausibility, human acceptance) and applies it to compare PHANTOM outputs versus a template-based baseline across 8 token types and 4 contexts. Reported improvements (B = 0.778 vs 0.576, human acceptance 100% vs 6.2%, DR 0.870 vs 0.609) are measured relative to that independent baseline rather than reducing to a fitted parameter or self-referential definition. No self-citations, uniqueness theorems, or ansatzes from prior author work appear in the derivation; simulated scanners (regex, entropy, ML) are external proxies. The setup is self-contained against the stated benchmarks with no load-bearing step that collapses to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Organizational context (domain names, naming conventions, tech-stack idioms, secret distributions) is the critical ingredient for honeytoken believability.
- domain assumption The four-component Believability Score (syntactic validity, semantic coherence, statistical plausibility, human acceptance) is a valid proxy for real-world honeytoken quality.
Lean theorems connected to this paper
- Cost.FunctionalEquation (J(x) = ½(x + x⁻¹) − 1) · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem.
  Paper passage: B(c) = w1·Sv(c) + w2·Sc(c) + w3·Sn(c) + w4·Sh(c), default weights (0.20, 0.30, 0.20, 0.30)
- Foundation.AlphaCoordinateFixation · J_uniquely_calibrated_via_higher_derivative · tag: unclear
  Relation between the paper passage and the cited Recognition theorem.
  Paper passage: H(c) = B(c)^λ · DR(c)^μ, λ + μ = 1, with λ = 0.6, μ = 0.4
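The combined score in the second excerpt, H(c) = B(c)^λ · DR(c)^μ with λ + μ = 1, is a weighted geometric mean of believability and detection resistance. A minimal sketch, plugging in the review's headline numbers (B = 0.778, DR = 0.870 for PHANTOM; 0.576 and 0.609 for templates); the function name is illustrative:

```python
def combined_quality(b: float, dr: float, lam: float = 0.6, mu: float = 0.4) -> float:
    """H(c) = B(c)**lam * DR(c)**mu, a weighted geometric mean with lam + mu = 1."""
    assert abs(lam + mu - 1.0) < 1e-9, "exponents must sum to 1"
    return b ** lam * dr ** mu

# Headline values quoted in this review.
h_phantom = combined_quality(0.778, 0.870)
h_template = combined_quality(0.576, 0.609)
```

Because the geometric mean penalizes imbalance, a token that scores well on believability but poorly on detection resistance (or vice versa) is pulled down more than under an arithmetic average.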
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- CLOUDBURST: Cloud-Layer Observations Using Beacons for Unified Real-time Surveillance and Threat Attribution
  CLOUDBURST defines the first formal taxonomy for cloud passive beacons and a CAS metric, finding IAM roles most effective while showing rapid attribution decay from infrastructure churn.
Reference graph
Works this paper leans on
- [1] Mohammed H Almeshekah and Eugene H Spafford. Cyber security deception. In Cyber Deception: Building the Scientific Foundation, pages 23–50. Springer, 2016.
- [2] Setu Kumar Basak, Jamison Cox, Bradley Reaves, and Laurie Williams. A comparative study of software secrets reporting by secret detection tools. In 2023 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), pages 1–12. IEEE, 2023.
- [3] Pedro Beltrán-López, Manuel Gil Pérez, and Pantaleone Nespoli. Cyber deception: Taxonomy, state of the art, frameworks, trends, and open challenges. IEEE Communications Surveys & Tutorials, 2025.
- [4] Marcus Botacin. Gpthreats-3: Is automatic malware generation a threat? In 2023 IEEE Security and Privacy Workshops (SPW), pages 238–254. IEEE, 2023.
- [5] Alex Sebastião Constâncio, Denise Fukumi Tsunoda, Helena de Fátima Nunes Silva, Jocelaine Martins da Silveira, and Deborah Ribeiro Carvalho. Deception detection with machine learning: A systematic review and statistical analysis. PLOS ONE, 18(2):e0281323, 2023.
- [6] Gelei Deng, Yi Liu, Víctor Mayoral-Vilches, Peng Liu, Yuekang Li, Yuan Xu, Tianwei Zhang, Yang Liu, Martin Pinzger, and Stefan Rass. Pentestgpt: An LLM-empowered automatic penetration testing tool. arXiv preprint arXiv:2308.06782, 2023.
- [7] Richard Fang, Rohan Bindu, Akul Gupta, and Daniel Kang. LLM agents can autonomously exploit one-day vulnerabilities. arXiv preprint arXiv:2404.08144, 2024.
- [8] GitGuardian. State of secrets sprawl 2023. Technical report, 2023.
- [9] Muhammad Irfan, Alireza Sadighian, Adeen Tanveer, Shaikha J Al-Naimi, and Gabriele Oligeri. False data injection attacks in smart grids: State of the art and way forward. arXiv preprint arXiv:2308.10268, 2023.
- [10] Ari Juels and Ronald L Rivest. Honeywords: Making password-cracking detectable. In Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security, pages 145–160, 2013.
- [11] Dinmukhammed Kabiden. Canary tokens as a strategic component in cybersecurity defense and red teaming. TSARKA Science, 1(1), 2026.
- [12] Rabimba Karanjai. Targeted phishing campaigns using large scale language models. arXiv preprint arXiv:2301.00665, 2022.
- [13] Egil Karlsen, Xiao Luo, Nur Zincir-Heywood, and Malcolm Heywood. Benchmarking large language models for log analysis, security, and interpretation. Journal of Network and Systems Management, 32(3):59, 2024.
- [14] James A Muir and Paul C Van Oorschot. Internet geolocation: Evasion and counterevasion. ACM Computing Surveys (CSUR), 42(1):1–23, 2009.
- [15] Muhammad Usman Rana, Osama Ellahi, Masoom Alam, Julian L Webber, Abolfazl Mehbodniya, and Shawal Khan. Offensive security: Cyber threat intelligence enrichment with counterintelligence and counterattack. IEEE Access, 10:108760–108774, 2022.
- [16] Zachary Rice. Gitleaks: A static application security testing tool for detecting secrets in git repositories. https://github.com/gitleaks/gitleaks, 2018.
- [17] Xiaokui Shu, Danfeng Yao, and Naren Ramakrishnan. Unearthing stealthy program attacks buried in extremely long execution paths. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pages 401–413, 2015.
- [18] Lance Spitzner. Honeypots: Catching the insider threat. In 19th Annual Computer Security Applications Conference, pages 170–179. IEEE, 2003.
- [19] WebBreacher. Recon-ng: Open Source Intelligence Gathering Tool. https://github.com/lanmaster53/recon-ng, 2014. Accessed: 2026-04-21.
- [20] Abraham Itzhak Weinberg. Passive hack-back strategies for cyber attribution: Covert vectors in denied environment. arXiv preprint arXiv:2508.16637, 2025.
- [21] Abraham Itzhak Weinberg. Arcane: Cross-campaign attacker re-identification via passive beacon telemetry: A Bayesian network framework for longitudinal cyber attribution. arXiv preprint arXiv:2604.24644, 2026.
- [22] Christopher Wiedeman and Ge Wang. Disrupting adversarial transferability in deep neural networks. Patterns, 3(5), 2022.
- [23] Tianxiang Yu, Yang Xin, and Chunyong Zhang. Honeyfactory: Container-based comprehensive cyber deception honeynet architecture. Electronics, 13(2):361, 2024.