Enhancing Linux Privilege Escalation Attack Capabilities of Local LLM Agents
Pith reviewed 2026-05-07 09:43 UTC · model grok-4.3
The pith
Targeted interventions enable local open-weight LLMs to exploit up to 83% of tested Linux privilege escalation vulnerabilities
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Open-weight models augmented with chain-of-thought prompting, retrieval-augmented generation, structured prompts, history compression, and reflective analysis as extensions to hackingBuddyGPT can match or outperform cloud-based models such as GPT-4o on autonomous Linux privilege escalation, achieving 83% exploitation for Llama3.1 70B and 67% for Llama3.1 8B and Qwen2.5 7B, with a full-factorial ablation study showing that reflection-based treatments contribute most while vulnerability discovery remains the primary bottleneck.
What carries the argument
Five targeted interventions—chain-of-thought prompting, retrieval-augmented generation, structured prompts, history compression, and reflective analysis—mapped from failure modes of open-weight models and added to hackingBuddyGPT, evaluated through full-factorial ablation.
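As a concrete illustration of the evaluation design, the sketch below enumerates all 2^5 = 32 treatment combinations for a full-factorial ablation. It is a minimal reconstruction under stated assumptions, not hackingBuddyGPT's actual code; `run_agent` and the treatment names are hypothetical placeholders.

```python
# Minimal sketch of a full-factorial ablation over five treatments.
# Hypothetical placeholders: run_agent() and the treatment names are
# illustrative, not hackingBuddyGPT's real interface.
from itertools import product

TREATMENTS = ["cot", "rag", "structured_prompt", "history_compression", "reflection"]

def run_agent(vuln_id: str, enabled: set[str]) -> bool:
    """Run one privilege-escalation attempt; True means a root shell was obtained."""
    raise NotImplementedError  # stand-in for the actual agent loop

def full_factorial(vulns: list[str]) -> dict[frozenset, float]:
    """Exploitation rate for each of the 2^5 = 32 treatment combinations."""
    rates = {}
    for flags in product([False, True], repeat=len(TREATMENTS)):
        enabled = frozenset(t for t, on in zip(TREATMENTS, flags) if on)
        wins = sum(run_agent(v, set(enabled)) for v in vulns)
        rates[enabled] = wins / len(vulns)
    return rates
```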
If this is right
- Llama3.1 70B reaches an 83% exploitation rate once the full set of treatments is enabled.
- Smaller models such as Llama3.1 8B and Qwen2.5 7B reach 67% success when guidance from the interventions is provided.
- Reflection-based treatments account for the largest share of the observed performance lift in the ablation study; a sketch of one such reflective step follows this list.
- Vulnerability discovery stays the dominant remaining bottleneck even after the other interventions are applied.
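Because reflection carries the largest measured lift, here is a minimal sketch of what a reflective-analysis step can look like: the agent critiques the outcome of its last command in a separate model call before proposing the next one. The `llm` callable and the prompt wording are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch of a reflective-analysis step. `llm` is an assumed
# prompt-to-completion callable, not hackingBuddyGPT's real API.
def reflect(llm, command: str, output: str, goal: str) -> str:
    """Ask the model to critique the last command before planning the next one."""
    return llm(
        f"Goal: {goal}\n"
        f"Last command: {command}\n"
        f"Output:\n{output}\n\n"
        "Did this move you closer to the goal? If not, explain why it failed "
        "and what to try differently. Answer in two sentences."
    )

def next_command(llm, goal: str, history: list[tuple[str, str]]) -> str:
    """Generate the next shell command, conditioned on a reflection of the last step."""
    critique = reflect(llm, *history[-1], goal) if history else ""
    return llm(
        f"Goal: {goal}\n"
        f"Reflection on the previous step: {critique}\n"
        "Reply with exactly one shell command to run next."
    )
```

The design point is that the critique is produced in its own model call, so a failed command yields an explicit diagnosis instead of silently accumulating noise in the command-generation prompt.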
Where Pith is reading between the lines
- Organizations could run capable local penetration-testing agents without routing sensitive target information to external cloud services.
- Self-reflection mechanisms appear especially effective at stabilizing LLM agents on long, error-prone command sequences.
- The same failure-mode mapping and intervention set could be tested on other operating systems or attack stages to check whether the gains transfer.
Load-bearing premise
The five selected interventions correctly resolve the main failure modes of open-weight models on privilege escalation and the tested vulnerabilities are representative of real-world Linux cases.
What would settle it
Running the same agents on a fresh collection of Linux privilege escalation vulnerabilities and finding exploitation rates well below 83% or 67% even with all interventions applied would show the performance gains do not generalize.
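A minimal harness for that check might look as follows; `run_agent`, the model tags, and the fresh vulnerability list are hypothetical placeholders, and the reported rates are taken from the abstract.

```python
# Sketch of the generalization check: rerun fully treated agents on an
# unseen vulnerability set and compare against the reported rates.
# run_agent() and the model tags are assumed placeholders.
REPORTED = {"llama3.1:70b": 0.83, "llama3.1:8b": 0.67, "qwen2.5:7b": 0.67}

def generalization_check(run_agent, fresh_vulns: list[str]) -> dict[str, tuple[float, float]]:
    """Return (rate on fresh set, reported rate) per model, all treatments enabled."""
    comparison = {}
    for model, reported in REPORTED.items():
        wins = sum(run_agent(model, v, all_treatments=True) for v in fresh_vulns)
        comparison[model] = (wins / len(fresh_vulns), reported)
    return comparison
```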
Original abstract
Recent research has demonstrated the potential of Large Language Models (LLMs) for autonomous penetration testing, particularly when using cloud-based restricted-weight models. However, reliance on such models introduces security, privacy, and sovereignty concerns, motivating the use of locally hosted open-weight alternatives. Prior work shows that small open-weight models perform poorly on automated Linux privilege escalation, limiting their practical applicability. In this paper, we present a systematic empirical study of whether targeted system-level and prompting interventions can bridge this performance gap. We analyze failure modes of open-weight models in autonomous privilege escalation, map them to established enhancement techniques, and evaluate five concrete interventions (chain-of-thought prompting, retrieval-augmented generation, structured prompts, history compression, and reflective analysis) implemented as extensions to hackingBuddyGPT. Our results show that open-weight models can match or outperform cloud-based baselines such as GPT-4o. With our treatments enabled, Llama3.1 70B exploits 83% of tested vulnerabilities, while smaller models including Llama3.1 8B and Qwen2.5 7B achieve 67% when using guidance. A full-factorial ablation study over all treatment combinations reveals that reflection-based treatments contribute most, while also identifying vulnerability discovery as a remaining bottleneck for local models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a systematic empirical study of five interventions (chain-of-thought prompting, retrieval-augmented generation, structured prompts, history compression, and reflective analysis) applied as extensions to hackingBuddyGPT to improve open-weight LLMs on autonomous Linux privilege escalation. The authors analyze failure modes, map them to the interventions, and report via a full-factorial ablation that treated Llama 3.1 70B succeeds on 83% of tested vulnerabilities while smaller models (Llama 3.1 8B, Qwen2.5 7B) reach 67% with guidance; reflection contributes most, vulnerability discovery remains a bottleneck, and local models can match or exceed cloud baselines such as GPT-4o.
Significance. If the results hold under representative conditions, the work is significant for LLM agents in security: it supplies concrete evidence that open-weight models can be made practically useful for penetration testing without cloud APIs, directly addressing privacy, sovereignty, and security risks. The ablation study offers actionable, comparative data on technique efficacy that can inform future agent designs.
Major comments (2)
- [Results] Results section: the central performance claims (83% for Llama 3.1 70B, 67% for smaller models) are reported without the total number of vulnerabilities tested, the source or selection criteria for the test set, error bars, or statistical significance tests. These omissions make it impossible to judge whether the reported lifts are robust or sensitive to the particular sample.
- [Methodology] Methodology / Failure-mode analysis: the paper states that failure modes were identified and mapped to the five interventions before the ablation, yet provides no quantitative breakdown of failure frequencies and no indication of whether the mapping was derived from blinded logs or from iterative prompting on the evaluation set itself. If the interventions were selected or tuned post-hoc on the same vulnerabilities, the attribution of gains to the treatments cannot be generalized.
Minor comments (2)
- [Abstract] Abstract and Results: the claim that local models 'match or outperform' GPT-4o is stated without the corresponding GPT-4o baseline numbers or experimental conditions for that comparison.
- [Experiments] The manuscript would benefit from an explicit table or appendix listing the vulnerabilities used, even if only by CVE or description, to support reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which helps us improve the clarity and rigor of our empirical study on enhancing local LLMs for Linux privilege escalation. We address each major comment below and outline the corresponding revisions.
Point-by-point responses
- Referee: [Results] Results section: the central performance claims (83% for Llama 3.1 70B, 67% for smaller models) are reported without the total number of vulnerabilities tested, the source or selection criteria for the test set, error bars, or statistical significance tests. These omissions make it impossible to judge whether the reported lifts are robust or sensitive to the particular sample.
  Authors: We agree that these details are necessary for readers to evaluate robustness. In the revised manuscript we will explicitly report the total number of vulnerabilities in the test set, fully describe their source and selection criteria (drawn from public CVE records and standard privilege-escalation benchmarks with a focus on local, command-line exploits), add error bars derived from repeated runs, and include statistical significance tests comparing the intervention conditions (a sketch of such statistics follows these responses). These additions will appear in the Results section, the experimental setup, and the associated tables. Revision: yes.
- Referee: [Methodology] Methodology / Failure-mode analysis: the paper states that failure modes were identified and mapped to the five interventions before the ablation, yet provides no quantitative breakdown of failure frequencies and no indication of whether the mapping was derived from blinded logs or from iterative prompting on the evaluation set itself. If the interventions were selected or tuned post-hoc on the same vulnerabilities, the attribution of gains to the treatments cannot be generalized.
  Authors: The failure-mode analysis was performed on a disjoint pilot set of vulnerabilities before the main evaluation set was finalized, and the interventions were selected on the basis of that pilot analysis together with prior literature. We will add a new table providing quantitative frequencies of each failure mode and an explicit description of the pilot-to-main separation. While the log review was not formally blinded, the temporal and set separation prevents post-hoc tuning on the reported evaluation data; we will state this limitation clearly. Revision: partial.
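As referenced in the first response, the promised error bars and significance tests could take the following shape: a Wilson score interval as the error bar on a binomial exploitation rate, and Fisher's exact test to compare two treatment conditions. This is a hedged sketch; the counts are placeholders, not the paper's data.

```python
# Sketch of the promised robustness statistics. The counts below are
# placeholders, not results from the paper.
from math import sqrt
from scipy.stats import fisher_exact

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion (the error bar)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Placeholder comparison: 10/12 exploited with reflection vs. 5/12 without.
low, high = wilson_interval(10, 12)
_, p_value = fisher_exact([[10, 2], [5, 7]])  # rows: condition; cols: success/failure
print(f"rate 0.83, 95% CI [{low:.2f}, {high:.2f}], Fisher p = {p_value:.3f}")
```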
Circularity Check
No circularity: purely empirical evaluation with no derivations or fitted predictions
Full rationale
The paper reports measured success rates from controlled experiments on a fixed set of vulnerabilities, including a full-factorial ablation over five interventions. No equations, parameters fitted to subsets of the data, or predictions derived from such fits appear anywhere. The failure-mode analysis is described as preceding the choice of interventions, but the central claims rest on direct experimental outcomes rather than on any self-definitional mapping or self-citation chain that would reduce the result to its inputs by construction. The study's conclusions therefore stand on external benchmark measurements rather than on self-referential constructions.
Reference graph
Works this paper leans on
- [1]
- [2]
- [3]
- [4] Dettmers, T., Pagnoni, A., Holtzman, A., Zettlemoyer, L.: QLoRA: efficient fine-tuning of quantized LLMs. In: Proceedings of the 37th International Conference on Neural Information Processing Systems. NIPS '23, Curran Associates Inc., Red Hook, NY, USA (2023)
- [5] Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., Guo, Q., Wang, M., Wang, H.: Retrieval-augmented generation for large language models: A survey. CoRR abs/2312.10997 (2023), https://doi.org/10.48550/arXiv.2312.10997, accessed: 26.3.2025
- [6] Garcia, S., Lukas, O., Rigaki, M., Catania, C.: NetSecGame, a RL env for training and evaluating AI agents in network security tasks. https://github.com/stratosphereips/NetSecGame
- [7] Gu, A., Dao, T.: Mamba: Linear-time sequence modeling with selective state spaces. In: First Conference on Language Modeling (2024), https://openreview.net/forum?id=tEYskw1VY2
- [8] Happe, A., Cito, J.: Getting pwn'd by AI: Penetration testing with large language models. In: Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. pp. 2082–2086. ESEC/FSE 2023, Association for Computing Machinery, New York, NY, USA (2023). https://doi.org/10.1145/3611643...
- [9] Happe, A., Cito, J.: Got root? A Linux priv-esc benchmark (2024), https://arxiv.org/abs/2405.02106, accessed: 26.3.2025
- [10]
- [11] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA: Low-rank adaptation of large language models (2021), https://arxiv.org/abs/2106.09685, accessed: 26.3.2025
- [12] Huang, J., Zhu, Q.: PenHeal: A two-stage LLM framework for automated pentesting and optimal remediation. In: Proceedings of the Workshop on Autonomous Cybersecurity. pp. 11–22. AutonomousCyber '24, Association for Computing Machinery, New York, NY, USA (2024). https://doi.org/10.1145/3689933.3690831
- [13] Kojima, T., Gu, S.S., Reid, M., Matsuo, Y., Iwasawa, Y.: Large language models are zero-shot reasoners. In: Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A. (eds.) Advances in Neural Information Processing Systems. vol. 35, pp. 22199–22213. Curran Associates, Inc. (2022)
- [14] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.t., Rocktäschel, T., Riedel, S., Kiela, D.: Retrieval-augmented generation for knowledge-intensive NLP tasks. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. NIPS '20, Curran Associates Inc., Red Hook, NY, USA (2020)
- [15] Li, H., Li, Y., Tian, A., Tang, T., Xu, Z., Chen, X., Hu, N., Dong, W., Li, Q., Chen, L.: A survey on large language model acceleration based on KV cache management. arXiv preprint arXiv:2412.19442 (2024)
- [16] Liu, N.F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., Liang, P.: Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics 12, 157–173 (2024). https://doi.org/10.1162/tacl_a_00638, https://aclanthology.org/2024.tacl-1.9/
- [17] Miao, J., Thongprayoon, C., Suppadungsuk, S., Krisanapan, P., Radhakrishnan, Y., Cheungpasitporn, W.: Chain of thought utilization in large language models and application in nephrology. Medicina 60(1), 148 (2024)
- [18] Pinna, E., Cardaci, A.: GTFOBins. https://gtfobins.github.io/ (2025), accessed: 2025-07-30
- [19] Polop, C.: HackTricks: Linux privilege escalation. https://book.hacktricks.xyz/linux-hardening/privilege-escalation (2025), accessed: 2025-07-30
- [20] Pratama, D., Suryanto, N., Adiputra, A.A., Le, T.T.H., Kadiptya, A.Y., Iqbal, M., Kim, H.: CIPHER: Cybersecurity intelligent penetration-testing helper for ethical researcher. Sensors 24(21), 6878 (Oct 2024). https://doi.org/10.3390/s24216878
- [21] Ragab, R., Altahhan, A.: Fine-tuning of small/medium LLMs for business QA on structured data. Available at SSRN: https://ssrn.com/abstract=4850031 or https://dx.doi.org/10.2139/ssrn.4850031 (2024), accessed: 26.3.2025
- [22] Rigaki, M., Catania, C.A., García, S.: Building adaptative and transparent cyber agents with local language models. Expert Systems with Applications 299, 129987 (2026). https://doi.org/10.1016/j.eswa.2025.129987
- [23] Rigaki, M., Lukáš, O., Catania, C., Garcia, S.: Out of the cage: How stochastic parrots win in cyber security environments. In: Proceedings of the 16th International Conference on Agents and Artificial Intelligence - Volume 3: ICAART. pp. 774–781. INSTICC, SciTePress (2024). https://doi.org/10.5220/0012391800003636
- [24] Shestov, A., Levichev, R., Mussabayev, R., Maslov, E., Zadorozhny, P., Cheshkov, A., Mussabayev, R., Toleu, A., Tolegen, G., Krassovitskiy, A.: Finetuning large language models for vulnerability detection. IEEE Access 13, 38889–38900 (2025). https://doi.org/10.1109/ACCESS.2025.3546700, accessed: 26.3.2025
- [25]
- [26]
- [27] Wang, F., Zhang, Z., Zhang, X., Wu, Z., Mo, T., Lu, Q., Wang, W., Li, R., Xu, J., Tang, X., He, Q., Ma, Y., Huang, M., Wang, S.: A comprehensive survey of small language models in the era of large language models: Techniques, enhancements, applications, collaboration with LLMs, and trustworthiness. CoRR abs/2411.03350 (2024), https://doi.org/10.48550/arXiv.2411.03350, accessed: 26.3.2025
- [28] Wang, L., Xu, W., Lan, Y., Hu, Z., Lan, Y., Lee, R.K.W., Lim, E.P.: Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. In: Rogers, A., Boyd-Graber, J., Okazaki, N. (eds.) Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (2023)
- [29] Weber, D.M., Tzachristas, I., Sui, A.: PERSES: Unlocking privilege escalation for small LLMs via extensible heterogeneity. In: Proceedings of the 20th ACM Asia Conference on Computer and Communications Security. pp. 344–357. ASIA CCS '25, Association for Computing Machinery, New York, NY, USA (2025). https://doi.org/10.1145/3708821.3736189
- [30] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E.H., Le, Q.V., Zhou, D.: Chain-of-thought prompting elicits reasoning in large language models. In: Proceedings of the 36th International Conference on Neural Information Processing Systems. NIPS '22, Curran Associates Inc., Red Hook, NY, USA (2022)
- [31] Xia, P., Zhu, K., Li, H., Zhu, H., Li, Y., Li, G., Zhang, L., Yao, H.: RULE: Reliable multimodal RAG for factuality in medical vision language models. In: Al-Onaizan, Y., Bansal, M., Chen, Y.N. (eds.) Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. pp. 1081–. Association for Computational Linguistics, Miami, Florida, USA (Nov 2024). https://doi.org/10.18653/v1/2024.emnlp-main.62, https://aclanthology.org/2024.emnlp-main.62/
- [32]
- [33]
- [34] Xu, Z., Liu, Y., Deng, G., Li, Y., Picek, S.: A comprehensive study of jailbreak attack versus defense for large language models. In: Ku, L.W., Martins, A., Srikumar, V. (eds.) Findings of the Association for Computational Linguistics: ACL 2024. pp. 7432–7449. Association for Computational Linguistics, Bangkok, Thailand (Aug 2024). https://doi.org/10.18...
- [35]
- [36] Zhang, B., Yang, H., Zhou, T., Ali Babar, M., Liu, X.Y.: Enhancing financial sentiment analysis via retrieval augmented large language models. In: Proceedings of the Fourth ACM International Conference on AI in Finance. pp. 349–356. ICAIF '23, Association for Computing Machinery, New York, NY, USA (2023). https://doi.org/10.1145/3604237.3626866
- [37] Zheng, J., Hong, H., Wang, X., Su, J., Liang, Y., Wu, S.: Fine-tuning large language models for domain-specific machine translation. CoRR abs/2402.15061 (2024), https://doi.org/10.48550/arXiv.2402.15061, accessed: 26.3.2025