pith. machine review for the scientific record.

arxiv: 2605.03619 · v2 · submitted 2026-05-05 · 💻 cs.CR

Recognition: 3 theorem links

The Infinite Mutation Engine? Measuring Polymorphism in LLM-Generated Offensive Code

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 18:33 UTC · model grok-4.3

classification 💻 cs.CR
keywords LLM-generated malware · polymorphism · code diversity · malware evasion · signature-based detection · AST structural distance · embedding similarity · offensive AI

The pith

A commercial LLM can cheaply generate large populations of behaviorally equivalent yet structurally diverse offensive payloads.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to quantify how much structural variation emerges when a large language model is repeatedly tasked with building the same malicious software payload. It compares two prompting approaches: one that only states functional goals and another that adds prior attempts to push the model toward new implementations. In both cases the outputs differ substantially in their internal code structure while preserving the same overall actions such as file traversal, encryption, and exfiltration. This variation matters because it shows that attackers could use ordinary commercial models to automate the creation of many distinct-looking versions of malware, undermining tools that rely on fixed signatures or code similarity checks.

Core claim

Using a dual-agent pipeline to generate, test, and refine data-exfiltration code with Claude Opus, the authors demonstrate that default functional prompts already produce high pairwise distances when measured by abstract syntax tree structure, yet low distances when measured by embedding vectors that capture semantic behavior. Adding explicit history of previous outputs further increases structural diversity while keeping the payloads functionally correct and executable. The process requires only a modest increase in model calls and token usage, with per-payload costs remaining under one dollar.
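The claim rests on a generate-test-refine loop. A minimal sketch of that control flow, with the LLM generator and tester replaced by hypothetical stubs (`generate`, `run_tests`, and the pass-on-second-attempt behavior are placeholders for illustration, not the paper's implementation):

```python
# Sketch of a generate-test-refine loop in the spirit of the dual-agent
# pipeline. In the paper, both generate() and run_tests() are LLM calls
# against Claude Opus; here they are deterministic stand-ins.

def generate(spec, history):
    # Placeholder for the generator agent: emit a candidate payload stage.
    return f"-- candidate for {spec} (attempt {len(history) + 1})"

def run_tests(code):
    # Placeholder for the tester agent: run the candidate against
    # functional assertions and return a PASS/FAIL verdict.
    passed = "attempt 2" in code  # arbitrary stub behavior
    return {"verdict": "PASS" if passed else "FAIL", "errors": None}

def build_payload(spec, max_rounds=5):
    """Loop until the candidate passes its tests or rounds run out."""
    history = []
    for _ in range(max_rounds):
        code = generate(spec, history)
        verdict = run_tests(code)
        history.append((code, verdict))
        if verdict["verdict"] == "PASS":
            return code, history
    return None, history

code, history = build_payload("stage1-traversal")
```

The reported 4.2 vs. 4.5 calls per payload corresponds to how many iterations of a loop like this are needed on average before the verdict is PASS.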

What carries the argument

A dual-agent four-stage pipeline that generates, tests, and refines payloads, together with pairwise distance calculations along abstract syntax tree structural and embedding semantic axes, applied across two prompting regimes.
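The referee report below notes that the exact distance formulas are unspecified, so the following is only an illustration of the two measurement axes: a stand-in structural distance over AST node-type sequences (using Python's `ast` for concreteness, although the paper's payloads are Lua) and a cosine semantic distance over embedding vectors such as those a code-embedding model would supply. Neither function is the authors' metric.

```python
import ast
import difflib

def ast_distance(src_a, src_b):
    """Structural distance proxy: 1 - similarity of AST node-type
    sequences. Illustrative stand-in only; the rebuttal proposes
    normalized tree-edit distance as the real metric."""
    seq_a = [type(n).__name__ for n in ast.walk(ast.parse(src_a))]
    seq_b = [type(n).__name__ for n in ast.walk(ast.parse(src_b))]
    return 1.0 - difflib.SequenceMatcher(None, seq_a, seq_b).ratio()

def cosine_distance(u, v):
    """Semantic distance proxy: cosine distance between embedding
    vectors (which an external embedding model would produce)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sum(x * x for x in u) ** 0.5
    norm_v = sum(x * x for x in v) ** 0.5
    return 1.0 - dot / (norm_u * norm_v)

# Two behaviorally equivalent, structurally different variants:
v1 = "def f(xs):\n    return [x for x in xs if x > 0]"
v2 = ("def f(xs):\n    out = []\n    for x in xs:\n"
      "        if x > 0:\n            out.append(x)\n    return out")
d = ast_distance(v1, v2)
```

The paper's headline finding, in these terms: across repeated generations, `ast_distance` stays high while `cosine_distance` of the embeddings stays low.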

If this is right

  • Attackers gain an automated way to produce many variants of the same payload that can bypass fixed detection rules.
  • Similarity-based malware clustering becomes less reliable when inputs come from repeated LLM calls.
  • The added cost of forcing more diversity through history prompts remains small, at roughly five times the tokens but only a slight rise in model calls.
  • A single commercial model suffices to create large polymorphic populations without specialized training or fine-tuning.
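The "history prompts" mechanism above amounts to serializing previously validated outputs back into the next prompt. A hypothetical sketch of such an evolution-context builder, loosely modeled on the history.json structure described in the figure list (field names here are assumptions, not the paper's exact schema):

```python
import json

def build_evolution_context(prior_attempts):
    """Collect only validated (PASS) prior outputs into a JSON blob
    suitable for injection into the next explicit-mode prompt.
    Schema is illustrative, not the paper's actual history.json."""
    entries = []
    for i, attempt in enumerate(prior_attempts, start=1):
        if attempt["verdict"] == "PASS":  # cache only validated outputs
            entries.append({"host_id": i, "code": attempt["code"]})
    return json.dumps({"prior_implementations": entries}, indent=2)

history = [
    {"verdict": "PASS", "code": "-- recursive DFS traversal"},
    {"verdict": "FAIL", "code": "-- broken variant"},
    {"verdict": "PASS", "code": "-- iterative BFS traversal"},
]
context = build_evolution_context(history)
```

The roughly fivefold token increase reported for explicit mode comes from exactly this kind of context growing with each validated generation.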

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Defenders may need to move toward behavioral or runtime monitoring rather than relying solely on static code patterns.
  • Similar generation pipelines could be tested on other models to compare their polymorphic output ranges.
  • The same technique might be applied to generate diverse test cases for evaluating new detection methods.
  • Real-world deployment would require confirming that the structural diversity actually translates to evasion success in live environments.

Load-bearing premise

That the generated payloads are actually executable and perform the intended malicious actions correctly, and that differences in abstract syntax trees and embeddings reliably predict whether real signature-based and clustering detectors will miss them.

What would settle it

Submitting the generated payloads to commercial signature-based antivirus scanners and similarity-based malware clustering tools and checking whether a substantial fraction evade detection.
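Numerically, settling it reduces to a detection tally over the generated population: given per-sample verdicts from real detectors (signature engines, similarity clustering), compute the fraction that evades each. The verdicts below are hypothetical placeholders, not measurements from the paper.

```python
def evasion_rate(verdicts):
    """verdicts: list of booleans, True = detected by the tool.
    Returns the fraction of samples that evade detection."""
    if not verdicts:
        raise ValueError("empty population")
    return sum(1 for detected in verdicts if not detected) / len(verdicts)

# Hypothetical per-sample verdicts from two detector classes:
scanner_verdicts = {
    "signature_engine": [True, False, False, False],
    "similarity_clustering": [True, True, False, False],
}
rates = {name: evasion_rate(v) for name, v in scanner_verdicts.items()}
```

A substantial evasion rate under this kind of tally, against deployed engines, is the direct evidence the referee report asks for.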

Figures

Figures reproduced from arXiv: 2605.03619 by Gabriel Hortea, Juan Tapiador.

Figure 1. Overview of the methodology pipeline.
Figure 2. Cumulative diversity profiles using the structural (AST) and semantic (embedding) distances…
Figure 3. Evolution of marginal mean distances between each…
Figure 4. Comparative polymorphism distribution for the Inherent (blue) and Explicit (orange) modes…
Figure 5. Search effort per sample. Left plot represents inherent mode; right plot represents explicit mode.
Figure 6. Structure of history.json injected during explicit mode. The orchestrator extracts and caches the raw polymorphic comment blocks from successfully validated prior generations.
Figure 7. Structure of the verdict.json state object generated during LLM-driven testing (Stages 1–3). In this successful Stage 1 example, the orchestrator has parsed the test-harness output and verified that all 6 baseline assertions passed.
Figure 9. Original prompt template for the Stage 1 Traversal…
Figure 10. Original prompt template for the Stage 1 Traversal…
Figure 12. Original prompt template for the Stage 2 Encryption…
Figure 13. Original prompt template for the Stage 3 Exfiltration…
Figure 14. Original prompt template for the Stage 3 Exfiltration…
Figure 15. Original prompt template for the Stage 4 Integration…
Figure 16. UMAP 2-D projections of the 16 precomputed distance matrices, colored by DBSCAN cluster assignment…
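Figure 16 clusters precomputed distance matrices with DBSCAN. A compact pure-Python DBSCAN over a precomputed distance matrix shows the idea (a sketch only; the authors presumably use a standard implementation such as scikit-learn's `DBSCAN` with `metric="precomputed"`, which is an assumption):

```python
def dbscan_precomputed(D, eps, min_pts):
    """Minimal DBSCAN over a symmetric distance matrix D.
    Returns per-point labels: cluster index, or -1 for noise."""
    n = len(D)
    labels = [None] * n  # None = unvisited
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbors = [j for j in range(n) if D[i][j] <= eps]
        if len(neighbors) < min_pts:
            labels[i] = -1  # noise (may later become a border point)
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reclaimed as border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs_j = [k for k in range(n) if D[j][k] <= eps]
            if len(nbrs_j) >= min_pts:  # j is a core point: expand
                queue.extend(k for k in nbrs_j if labels[k] is None)
    return labels

# Toy pairwise-distance matrix: two tight pairs plus one outlier.
D = [
    [0.0, 0.1, 0.9, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.1, 0.9],
    [0.9, 0.9, 0.1, 0.0, 0.9],
    [0.9, 0.9, 0.9, 0.9, 0.0],
]
labels = dbscan_precomputed(D, eps=0.2, min_pts=2)
```

If the generated payloads are genuinely polymorphic, a clustering like this over the AST distance matrix should fragment rather than collapse into one family.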
original abstract

Malware authors have traditionally relied on polymorphic techniques to produce variants in the same malware family, complicating signature-based detection. Integrating generative AI into offensive toolchains enables attackers to synthesize structurally diverse payloads with identical behavior, raising the question of how much polymorphism LLMs provide. Recent work has assumed that LLMs can produce sufficiently polymorphic payloads, leaving unquantified the variation that emerges when an attacker repeatedly builds the same payload, or explicitly instructs the model to avoid prior implementations. In this work, we measure the polymorphic capacity of a commercial model (Claude Opus 4.6) as an automated malware generator. We build a dual-agent, four-stage pipeline that generates, tests, and refines a data-exfiltration payload comprising file traversal, encryption, exfiltration, and integration. We produce payloads in two settings: using prompts that specify only functional requirements, and using prompts that inject a structured history of prior outcomes to force divergence. We measure pairwise distances along structural (AST) and semantic (embedding) axes, finding that when polymorphism is not explicitly required, structural distances are high while semantic distances remain low; i.e., implementations diverge widely without changing high-level behavior. Explicit prompting substantially amplifies this structural diversity while preserving correctness, at the cost of roughly 5 times more tokens but only a small increase in LLM calls (from 4.2 to 4.5 per payload, with effective API costs of $0.41 and $0.73). These results show that a single commercial LLM can cheaply generate large populations of behaviorally equivalent yet structurally diverse payloads, facilitating the evasion of signature-based detection rules and similarity-based clustering.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents an empirical measurement study of polymorphism in offensive code generated by a commercial LLM (Claude Opus 4.6). It describes a dual-agent four-stage pipeline that generates, tests, and refines data-exfiltration payloads (file traversal + encryption + exfil) under two prompt regimes: functional requirements only, and prompts augmented with structured history of prior attempts to force divergence. The authors compute pairwise AST structural distances and embedding semantic distances, report high structural diversity with low semantic distances (amplified by explicit prompting), and provide token and API cost figures (4.2–4.5 calls and $0.41 vs. $0.73 per payload). They conclude that a single LLM can cheaply produce large populations of behaviorally equivalent yet structurally diverse payloads that facilitate evasion of signature-based detection and similarity-based clustering.

Significance. If the distance proxies are validated against real detectors, the work would be significant for providing the first concrete quantification of LLM polymorphism capacity in malware generation, together with practical cost metrics. This could inform both the offensive security community and the design of more resilient signature and clustering defenses.

major comments (2)
  1. [Abstract] Abstract: the central claim that the measured AST/embedding distances 'facilitate the evasion of signature-based detection rules and similarity-based clustering' is unsupported by direct evidence. The manuscript reports no experiments evaluating the generated payloads against any actual signature engines (YARA, ClamAV, etc.), behavioral sandboxes, or similarity-based clustering algorithms; the facilitation conclusion therefore rests entirely on unvalidated proxies.
  2. [Pipeline and results sections] Pipeline and results sections: the manuscript provides no details on the exact formulas or implementations used for the AST structural distance and embedding semantic distance, nor on the concrete procedure (test cases, oracles, or sandboxing) used to verify functional correctness and executability in the 'tests and refines' stage. These omissions are load-bearing because the claims of behavioral equivalence and the interpretation of the distance results depend on them.
minor comments (2)
  1. The paper would benefit from a table or figure summarizing mean, variance, and distribution of the pairwise distances across the two prompt regimes.
  2. Include the exact prompt templates (or representative excerpts) for both regimes to support reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our empirical study of polymorphism in LLM-generated offensive code. The comments identify areas where greater precision and transparency will strengthen the manuscript. We address each major comment below and will revise the paper accordingly.

point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that the measured AST/embedding distances 'facilitate the evasion of signature-based detection rules and similarity-based clustering' is unsupported by direct evidence. The manuscript reports no experiments evaluating the generated payloads against any actual signature engines (YARA, ClamAV, etc.), behavioral sandboxes, or similarity-based clustering algorithms; the facilitation conclusion therefore rests entirely on unvalidated proxies.

    Authors: We agree that the abstract's wording overstates the direct implications of our proxy-based measurements. The distances are intended as indicators of potential evasion capacity, consistent with prior malware polymorphism literature, but we did not conduct direct evaluations against deployed detectors. We will revise the abstract to state that the observed structural diversity 'suggests the potential to facilitate evasion' of signature-based and similarity-based methods. We will also add an explicit limitations paragraph noting the reliance on proxies and identifying direct validation against real engines as valuable future work. These changes preserve the core contribution while aligning the claims more closely with the evidence presented. revision: yes

  2. Referee: [Pipeline and results sections] Pipeline and results sections: the manuscript provides no details on the exact formulas or implementations used for the AST structural distance and embedding semantic distance, nor on the concrete procedure (test cases, oracles, or sandboxing) used to verify functional correctness and executability in the 'tests and refines' stage. These omissions are load-bearing because the claims of behavioral equivalence and the interpretation of the distance results depend on them.

    Authors: We acknowledge that the current text omits the precise methodological details required for full reproducibility. In the revised manuscript we will insert the following: (1) AST structural distance is computed as normalized tree-edit distance on abstract syntax trees generated by a standard Python parser; (2) embedding semantic distance is cosine similarity between CodeBERT embeddings of the source code. For the test-and-refine stage we will describe the concrete test cases (file-traversal paths, encryption round-trip checks, exfiltration endpoint validation), the automated oracles (success/failure scripts plus runtime monitoring), and the sandbox environment (isolated VMs with syscall and network logging). These additions will make the verification of behavioral equivalence and the distance interpretations fully transparent. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical measurement study with no derivations or self-referential predictions

full rationale

The paper describes a dual-agent pipeline that generates, tests, and refines data-exfiltration payloads under two prompt regimes, then reports observed pairwise AST and embedding distances plus token costs. All claims rest on direct empirical outputs rather than any derivation chain, fitted parameters renamed as predictions, or load-bearing self-citations. The abstract and described methodology contain no equations, uniqueness theorems, or ansatzes that reduce to inputs by construction. The facilitation-of-evasion interpretation is an extrapolation from measured proxies, but the measurements themselves are independent of that interpretation and do not exhibit circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work is an empirical measurement study with no mathematical derivations, fitted parameters, or new postulated entities; it rests on standard assumptions about LLM code generation capability and the validity of AST/embedding distances as proxies for polymorphism.

axioms (2)
  • domain assumption The commercial LLM can generate functionally correct code when given functional requirements for file traversal, encryption, exfiltration, and integration.
    The pipeline assumes generated payloads pass testing and integration stages.
  • domain assumption Pairwise AST distances and embedding distances are valid and sufficient measures of structural and semantic polymorphism relevant to detection evasion.
    These metrics are used to support the claim of high structural diversity with preserved behavior.

pith-pipeline@v0.9.0 · 5597 in / 1446 out tokens · 61716 ms · 2026-05-08T18:33:58.893489+00:00 · methodology

discussion (0)

