pith. machine review for the scientific record.

arxiv: 2604.11506 · v1 · submitted 2026-04-13 · 💻 cs.CR

Recognition: unknown

RedShell: A Generative AI-Based Approach to Ethical Hacking

João Lourenço, João Trindade, Ricardo Bessa, Rui Claro

Pith reviewed 2026-05-10 15:27 UTC · model grok-4.3

classification 💻 cs.CR
keywords generative AI · ethical hacking · PowerShell · malicious code generation · penetration testing · fine-tuning · red teaming · code generation

The pith

RedShell shows a fine-tuned generative model can produce malicious PowerShell code for ethical hacking, with fewer than 10 percent of generated samples failing to parse.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RedShell, a generative AI tool that produces malicious PowerShell scripts to support red team activities in penetration testing. It also supplies a ground truth dataset built from public code samples to train and evaluate such models. Experiments indicate the specialized model yields code with high syntactic correctness and semantic alignment to reference examples, measured by Edit Distance and METEOR scores. This matters for automating parts of ethical hacking workflows in controlled settings. The work frames the approach as a basis for further use of generative models in offensive security tasks.

Core claim

RedShell is a generative AI-based tool for malicious PowerShell code generation. It fine-tunes models on a dataset of publicly available malicious snippets, producing samples of which fewer than 10 percent fail to parse, with mean similarity to reference snippets exceeding 50 percent on Edit Distance and 40 percent on METEOR.
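
The similarity threshold in this claim is easier to evaluate with the metric spelled out. A minimal sketch in Python, assuming Edit Distance similarity is normalized by the longer string's length; the excerpt does not state the paper's exact normalization, so treat this as one plausible reading:

```python
# Sketch of an output-similarity score in the spirit of the paper's Edit
# Distance metric. Normalization by max string length is an assumption here.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def edit_similarity(generated: str, reference: str) -> float:
    """Similarity in [0, 1]; 1.0 means the strings are identical."""
    if not generated and not reference:
        return 1.0
    return 1.0 - levenshtein(generated, reference) / max(len(generated), len(reference))

print(edit_similarity("Get-Process | Stop-Process", "Get-Process | Stop-Process"))  # 1.0
```

Under this reading, "exceeding 50 percent mean similarity" means the average `edit_similarity` over the held-out set is above 0.5.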

What carries the argument

RedShell, the fine-tuned generative model trained on a ground truth dataset of malicious PowerShell code samples, which automates creation of offensive scripts while preserving syntactic validity and semantic consistency with references.

If this is right

  • Ethical hackers gain automation for building malicious code generators during pentest audits.
  • Red teams can produce semantically consistent offensive PowerShell with reduced manual effort.
  • Generative models specialized for malicious code show competitive results on standard similarity metrics.
  • The work serves as a foundation for applying similar techniques in controlled ethical hacking environments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar datasets could be assembled for other scripting languages to extend the approach beyond PowerShell.
  • Real-world deployment in pentests would need additional checks to ensure generated code executes as intended beyond syntactic and similarity metrics.
  • Public release of the dataset or model raises questions about preventing non-ethical uses that the paper does not address.

Load-bearing premise

A ground truth dataset assembled from publicly available code samples is sufficient, representative, and free of biases or legal issues for fine-tuning models that produce useful and ethically deployable malicious PowerShell code.

What would settle it

Testing the model on a fresh collection of prompts and observing more than 10 percent parse errors or METEOR similarity below 40 percent on average would falsify the reported performance.
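
That falsification test can be sketched as a small harness. Everything below is illustrative: `parses_ok` is a stub stand-in (a real check would invoke PowerShell's own parser, for instance via PSScriptAnalyzer, which the paper's references include), and the METEOR scores are assumed to come from an actual metric implementation:

```python
# Sketch of the falsification check: on a fresh sample of generations, does
# either reported threshold fail? The parse check is a placeholder.

def parses_ok(script: str) -> bool:
    # Placeholder syntactic check: balanced braces only. A real harness must
    # call the PowerShell parser instead of this stub.
    depth = 0
    for ch in script:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

def falsified(samples: list[tuple[str, float]],
              max_parse_error_rate: float = 0.10,
              min_mean_meteor: float = 0.40) -> bool:
    """samples: (generated_script, meteor_score_vs_reference) pairs.
    True if either reported claim fails on this fresh sample."""
    error_rate = sum(not parses_ok(s) for s, _ in samples) / len(samples)
    mean_meteor = sum(m for _, m in samples) / len(samples)
    return error_rate > max_parse_error_rate or mean_meteor < min_mean_meteor
```

A single run over one prompt set only probes the point estimates; repeated sampling would be needed to falsify the claims with any statistical confidence.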

Figures

Figures reproduced from arXiv: 2604.11506 by João Lourenço, João Trindade, Ricardo Bessa, Rui Claro.

Figure 1. Ground truth dataset coverage of the MITRE ATT&CK tactics. [PITH_FULL_IMAGE:figures/full_fig_p003_1.png]
Figure 2. Syntactic evaluation of models fine-tuned with the reference dataset. [PITH_FULL_IMAGE:figures/full_fig_p008_2.png]
Figure 3. Syntax report of Qwen2.5-Coder fine-tuned with the reference dataset. [PITH_FULL_IMAGE:figures/full_fig_p008_3.png]
Figure 4. Semantic evaluation of models fine-tuned with the reference dataset. [PITH_FULL_IMAGE:figures/full_fig_p009_4.png]
Figure 5. Semantic evaluation of Qwen2.5-Coder and reference models. [PITH_FULL_IMAGE:figures/full_fig_p009_5.png]
Figure 6. Semantic evaluation of Qwen2.5-Coder and closed models. [PITH_FULL_IMAGE:figures/full_fig_p010_6.png]
Figure 7. Semantic evaluation of Qwen2.5-Coder fine-tuned with different datasets. [PITH_FULL_IMAGE:figures/full_fig_p010_7.png]
read the original abstract

The application of Machine Learning techniques in code generation is now a common practice for most developers. Tools such as ChatGPT from OpenAI leverage the natural language processing capabilities of Large Language Models to generate machine code from natural language descriptions. In the cybersecurity field, red teams can also take advantage of generative models to build malicious code generators, providing more automation to Pentest audits. However, the application of Large Language Models in malicious code generation remains challenging due to the lack of data to train and evaluate offensive code generators. In this work, we propose RedShell, a tool that allows ethical hackers to generate malicious PowerShell code. We also introduce a ground truth dataset, combining publicly available code samples to fine-tune models in malicious PowerShell generation. Our experiments demonstrate the strong capabilities of RedShell in generating syntactically valid PowerShell, with fewer than 10% of the generated samples resulting in parse errors. Furthermore, our specialized model was able to produce samples that were semantically consistent with reference snippets, achieving a competitive performance on standard output similarity metrics such as Edit Distance and METEOR, with their mean similarity scores exceeding 50% and 40%, respectively. This work sheds light on the state-of-the-art research in the field of Generative AI applied to Pentesting, and also serves as a steppingstone for future advancements, highlighting the potential benefits these models hold within such controlled environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces RedShell, a generative AI-based tool for ethical hacking that generates malicious PowerShell code. It presents a ground truth dataset compiled from publicly available code samples used to fine-tune large language models. The experiments claim that the specialized model generates syntactically valid PowerShell code with fewer than 10% parse errors and produces outputs semantically consistent with reference snippets, achieving mean similarity scores exceeding 50% on Edit Distance and 40% on METEOR.

Significance. If the performance claims hold after adding proper controls, RedShell could provide a practical resource for automating PowerShell generation in controlled penetration testing, addressing data scarcity for offensive security applications. The dataset introduction is a concrete contribution that future work could build upon.

major comments (3)
  1. [Experiments] Experiments section: the central claims of 'strong capabilities' and 'competitive performance' rest on absolute figures (<10% parse errors, mean Edit Distance >50%, METEOR >40%) with no baseline results reported for an unfine-tuned LLM, a general code model, or a retrieval baseline on the same held-out prompts and references. Without these controls the metrics cannot be attributed to the RedShell fine-tuning procedure.
  2. [§3] §3 (Dataset Construction): the ground truth dataset is described only as 'combining publicly available code samples' with no reported size, source list, train/test split ratios, or analysis of data leakage risk between training data and the held-out reference snippets used for similarity evaluation.
  3. [Evaluation] Evaluation subsection: mean similarity scores are given without the number of test samples, number of generations per prompt, standard deviations, or statistical tests, rendering the 'exceeding 50%' and 'exceeding 40%' figures difficult to interpret as evidence of reliable semantic consistency.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'standard output similarity metrics' is used without defining how Edit Distance and METEOR are normalized or tokenized for PowerShell code snippets.
  2. Notation for the similarity metrics is introduced without an equation or pseudocode showing the exact implementation.
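
The tokenization question in the minor comments can be made concrete with an explicit lexer. The sketch below is an assumption, not the paper's method: a regex tokenizer that keeps cmdlet names, variables, and operators as single tokens, over which Edit Distance and METEOR could then be computed:

```python
import re

# Hypothetical PowerShell-aware tokenizer; number literals, redirections, and
# here-strings are omitted for brevity. Nothing here is taken from the paper.
TOKEN = re.compile(r"""
    \$[\w:]+                # variables, e.g. $env:TEMP
  | [A-Za-z]\w*(?:-\w+)*    # cmdlets and identifiers, e.g. Invoke-Expression
  | -{1,2}\w+               # parameters and operators, e.g. -EncodedCommand, -eq
  | "(?:[^"`]|`.)*"         # double-quoted strings with backtick escapes
  | '[^']*'                 # single-quoted strings
  | \|\||&&|[|;{}()=+]      # pipes and punctuation
""", re.VERBOSE)

def tokenize_ps(src: str) -> list[str]:
    """Split a PowerShell snippet into metric-ready tokens."""
    return TOKEN.findall(src)

toks = tokenize_ps('Invoke-Expression $cmd | Out-Null')
# ['Invoke-Expression', '$cmd', '|', 'Out-Null']
```

Whether the metrics run over characters, whitespace-split words, or tokens like these materially changes the reported percentages, which is why the referee asks for the definition.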

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important areas for strengthening the experimental rigor, dataset transparency, and evaluation reporting in our manuscript. We have revised the paper to address these points directly.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: the central claims of 'strong capabilities' and 'competitive performance' rest on absolute figures (<10% parse errors, mean Edit Distance >50%, METEOR >40%) with no baseline results reported for an unfine-tuned LLM, a general code model, or a retrieval baseline on the same held-out prompts and references. Without these controls the metrics cannot be attributed to the RedShell fine-tuning procedure.

    Authors: We agree that baseline comparisons are required to attribute performance gains specifically to the fine-tuning procedure. In the revised manuscript we have added results for the unfine-tuned base LLM and a general code model (CodeLlama) evaluated on the identical held-out prompts and references. A retrieval baseline is discussed as less applicable to a generative task, but we note this limitation explicitly rather than claiming full equivalence. revision: partial

  2. Referee: [§3] §3 (Dataset Construction): the ground truth dataset is described only as 'combining publicly available code samples' with no reported size, source list, train/test split ratios, or analysis of data leakage risk between training data and the held-out reference snippets used for similarity evaluation.

    Authors: We have expanded §3 to report the dataset size, enumerate the public sources, specify the train/test split, and include a data-leakage analysis with mitigation steps such as deduplication and verification that held-out references do not overlap with training data. revision: yes

  3. Referee: [Evaluation] Evaluation subsection: mean similarity scores are given without the number of test samples, number of generations per prompt, standard deviations, or statistical tests, rendering the 'exceeding 50%' and 'exceeding 40%' figures difficult to interpret as evidence of reliable semantic consistency.

    Authors: The Evaluation subsection has been revised to state the number of test samples, generations per prompt, standard deviations, and results of statistical tests supporting the reported mean similarity scores. revision: yes
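
The deduplication and leakage analysis promised in response 2 can be sketched minimally. This is illustrative only, assuming near-duplicates are caught by hashing snippets after whitespace and case normalization; real pipelines typically add AST-level or n-gram overlap checks on top:

```python
import hashlib
import re

def fingerprint(snippet: str) -> str:
    """Hash a snippet after collapsing whitespace and lowercasing."""
    normalized = re.sub(r"\s+", " ", snippet).strip().lower()
    return hashlib.sha256(normalized.encode()).hexdigest()

def leakage(train: list[str], held_out: list[str]) -> set[str]:
    """Fingerprints appearing in both splits; should be empty."""
    return {fingerprint(s) for s in train} & {fingerprint(s) for s in held_out}

train = ["Get-Process |  Stop-Process", "whoami"]
test_set = ["get-process | stop-process"]   # same snippet modulo case/spacing
assert leakage(train, test_set)             # overlap detected
```

Exact-hash checks like this miss paraphrased or lightly edited duplicates, which is precisely why the referee's leakage concern needs more than deduplication to resolve.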

Circularity Check

0 steps flagged

No circularity: empirical metrics are direct held-out comparisons

full rationale

The paper reports an empirical ML pipeline: public PowerShell samples are assembled into a ground-truth dataset, models are fine-tuned, and outputs are scored against held-out references using parse-error rate plus standard similarity metrics (Edit Distance, METEOR). No equations, self-definitional quantities, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claims reduce only to ordinary train/test evaluation on external data, which is independent of the reported numbers by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that publicly available code can be repurposed into an effective training set for offensive scripting without introducing domain-specific biases or legal barriers; no new physical or mathematical entities are postulated.

axioms (1)
  • domain assumption Publicly available code samples can be combined into a representative ground truth dataset for fine-tuning malicious PowerShell generators.
    Invoked when the authors describe dataset construction in the abstract.

pith-pipeline@v0.9.0 · 5560 in / 1531 out tokens · 44868 ms · 2026-05-10T15:27:04.963865+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

33 extracted references · 12 canonical work pages · 1 internal anchor

  1. [1] Atomic Red Team: Adversary Emulation for Cybersecurity (2024). https://www.atomicredteam.io/ (visited on 01/31/2025)

  2. [2] Bianou, S.G., Batogna, R.G.: PENTEST-AI, an LLM-Powered Multi-Agents Framework for Penetration Testing Automation Leveraging Mitre Attack. In: 2024 IEEE International Conference on Cyber Security and Resilience (CSR), pp. 763–770. IEEE (2024). https://doi.org/10.1109/CSR61664.2024.10679480

  3. [3] Chowdhary, A., Jha, K., Zhao, M.: Generative Adversarial Network (GAN)-Based Autonomous Penetration Testing for Web Applications. Sensors 23(18), 1–18 (2023). https://doi.org/10.3390/s23188014

  4. [4] Corporation, M.: MITRE ATT&CK Framework (2024). https://attack.mitre.org/ (visited on 01/31/2025)

  5. [5] DeepSeek-AI: DeepSeek Chat Platform (2025). https://chat.deepseek.com/ (visited on 01/31/2025)

  6. [6] Delpy, B.: Mimikatz (2011). https://github.com/gentilkiwi/mimikatz (visited on 01/31/2025)

  7. [7] Deng, G., Liu, Y., Mayoral-Vilches, V., Liu, P., Li, Y., Xu, Y., Zhang, T., Liu, Y., Pinzger, M., Rass, S.: PentestGPT: Evaluating and Harnessing Large Language Models for Automated Penetration Testing. In: 33rd USENIX Security Symposium (USENIX Security 24), pp. 847–864. USENIX Association (2024). https://www.usenix.org/conference/usenixsecurity24/presen...

  8. [8] Face, H.: Hugging Face (2025). https://huggingface.co/ (visited on 01/31/2025)

  9. [9] Fan, A., Gokkaya, B., Harman, M., Lyubarskiy, M., Sengupta, S., Yoo, S., Zhang, J.M.: Large Language Models for Software Engineering: Survey and Open Problems. In: 2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE), pp. 31–53. IEEE (2023). https://doi.org/10.1109/ICSE-FoSE59343.2023.00008

  10. [10] Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., Yang, A., Fan, A., Goyal, A., Hartshorn, A., Yang, A., Mitra, A., Sravankumar, A., Korenev, A.: The Llama 3 Herd of Models (2024). https://arxiv.org/abs/2407.21783 (visited on 01/31/2025)

  11. [11] Hugging Face: evaluate: A Python library for model evaluation and comparison (2025). https://pypi.org/project/evaluate/ (visited on 01/31/2025)

  12. [12] Hui, B., Yang, J., Cui, Z., Yang, J., Liu, D., Zhang, L., Liu, T., Zhang, J., Yu, B., Dang, K., Yang, A., Men, R., Huang, F., Ren, X., Ren, X., Zhou, J., Lin, J.: Qwen2.5-Coder Technical Report (2024). https://arxiv.org/abs/2409.12186 (visited on 01/31/2025)

  13. [13] kuangzh: pylcs: A super fast C++ implementation of classic LCS problems using dynamic programming (2023). https://pypi.org/project/pylcs/ (visited on 01/31/2025)

  14. [14] Liguori, P., Al-Hossami, E., Orbinato, V., Natella, R., Shaikh, S., Cotroneo, D., Cukic, B.: EVIL: Exploiting Software via Natural Language. In: 2021 IEEE 32nd International Symposium on Software Reliability Engineering (ISSRE), pp. 321–. IEEE (2021). https://doi.org/10.1109/ISSRE52982.2021.00042

  16. [16] Liguori, P., Improta, C., Natella, R., Cukic, B., Cotroneo, D.: Who evaluates the evaluators? On automatic metrics for assessing AI-based offensive code generators. Expert Systems with Applications 225 (2023). https://doi.org/10.1016/j.eswa.2023.120073

  17. [17] Liguori, P., Marescalco, C., Natella, R., Orbinato, V., Pianese, L.: The Power of Words: Generating PowerShell Attacks from Natural Language. In: 18th USENIX WOOT Conference on Offensive Technologies (WOOT 24), pp. 27–43. USENIX Association (2024). https://www.usenix.org/conference/woot24/presentation/liguori

  18. [18] Lu, S., Guo, D., Ren, S., Huang, J., Svyatkovskiy, A., Blanco, A., Clement, C.B., Drain, D., Jiang, D., Tang, D., Li, G., Zhou, L., Shou, L., Zhou, L., Tufano, M., Gong, M., Zhou, M., Duan, N., Sundaresan, N., Deng, S.K., Fu, S., Liu, S.: CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation. CoRR abs/2102.04664 (2021). https:...

  19. [19] Microsoft Corporation: PSScriptAnalyzer (2025). https://github.com/PowerShell/PSScriptAnalyzer (visited on 01/31/2025)

  20. [20] Mittal, N.: Nishang - Offensive PowerShell for Red Teams (2018). https://github.com/samratashok/nishang (visited on 01/31/2025)

  21. [21] Mora, S.: ROUGE: A pure Python implementation of the ROUGE metric (2019). https://pypi.org/project/rouge/ (visited on 01/31/2025)

  22. [22] Natella, R., Liguori, P., Improta, C., Cukic, B., Cotroneo, D.: AI Code Generators for Security: Friend or Foe? IEEE Security and Privacy 22(5), 73–81 (2024). https://doi.org/10.1109/MSEC.2024.3355713

  23. [23] OpenAI: ChatGPT: Overview and Features (2025). https://openai.com/chatgpt/overview/ (visited on 01/31/2025)

  24. [24] Recipe, R.T.: PowerShell tips & tricks (2025). https://redteamrecipe.com/powershell-tips-tricks/ (visited on 01/31/2025)

  25. [25] Team, Q.: Qwen2.5: A Party of Foundation Models (2024). https://qwenlm.github.io/blog/qwen2.5/ (visited on 01/31/2025)

  26. [26] TryHackMe: TryHackMe - Learn Cybersecurity, Penetration Testing, and Ethical Hacking (2025). https://tryhackme.com/ (visited on 01/31/2025)

  27. [27] Unsloth AI: Unsloth: Open Source Fine-Tuning for LLMs (2025). https://unsloth.ai/ (visited on 01/31/2025)

  28. [28] Unsloth Team: LoRA Hyperparameters Guide — Unsloth Docs (2024). https://docs.unsloth.ai/get-started/fine-tuning-guide/lora-hyperparameters-guide (visited on 01/31/2025)

  29. [29] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30, 5999–6009 (2017). https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

  30. [30] Vats, P., Mandot, M., Gosain, A.: A Comprehensive Literature Review of Penetration Testing and Its Applications. In: 2020 8th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO), pp. 674–680. IEEE (2020). https://doi.org/10.1109/ICRITO48877.2020.9197961

  31. [31] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Scao, T., Gugger, S., Rush, A.: Transformers: State-of-the-Art Natural Language Processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (Oct 2020). https://doi.org/10.18653/v1/2020.emnlp-demos.6

  32. [32] Yang, G., Chen, X., Zhou, Y., Yu, C.: DualSC: Automatic Generation and Summarization of Shellcode via Transformer and Dual Learning. In: 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 361–372 (2022). https://doi.org/10.1109/SANER53432.2022.00052

  33. [33] Yang, G., Zhou, Y., Chen, X., Zhang, X., Han, T., Chen, T.: ExploitGen: Template-augmented exploit code generation based on CodeBERT. Journal of Systems and Software 197 (2023). https://doi.org/10.1016/j.jss.2022.111577