pith. machine review for the scientific record.

arxiv: 2604.27143 · v1 · submitted 2026-04-29 · 💻 cs.CR · cs.AI

Recognition: unknown

Enhancing Linux Privilege Escalation Attack Capabilities of Local LLM Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 09:43 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords privilege escalation · LLM agents · autonomous penetration testing · open-weight models · prompting interventions · reflective analysis · ablation study

The pith

Targeted interventions enable local open-weight LLMs to exploit 83% of Linux privilege escalation vulnerabilities

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether system-level and prompting interventions can close the performance gap for locally hosted open-weight LLMs in autonomous Linux privilege escalation, where prior results showed they lagged behind restricted cloud models. The authors map observed failure modes to five concrete techniques—chain-of-thought prompting, retrieval-augmented generation, structured prompts, history compression, and reflective analysis—and implement them as extensions to hackingBuddyGPT. With these treatments, Llama3.1 70B reaches 83% success while smaller models reach 67%, matching or exceeding GPT-4o baselines. A full-factorial ablation identifies reflection as the largest contributor and leaves vulnerability discovery as the main remaining limit. This matters for applications that require keeping attack data and tools on local hardware for privacy or sovereignty reasons.

Core claim

Open-weight models augmented with chain-of-thought prompting, retrieval-augmented generation, structured prompts, history compression, and reflective analysis as extensions to hackingBuddyGPT can match or outperform cloud-based models such as GPT-4o on autonomous Linux privilege escalation, achieving 83% exploitation for Llama3.1 70B and 67% for Llama3.1 8B and Qwen2.5 7B, with a full-factorial ablation study showing that reflection-based treatments contribute most while vulnerability discovery remains the primary bottleneck.

What carries the argument

Five targeted interventions—chain-of-thought prompting, retrieval-augmented generation, structured prompts, history compression, and reflective analysis—mapped from failure modes of open-weight models and added to hackingBuddyGPT, evaluated through full-factorial ablation.
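
To make the moving parts concrete, here is a minimal sketch of how an agent loop could wire the five treatments together. This is our illustration, not hackingBuddyGPT's actual code: every identifier (run_agent, the llm/execute/retrieve callables, MAX_TURNS, the budgets, the success check) is a hypothetical stand-in.

~~~python
from string import Template
from typing import Callable, List

MAX_TURNS = 40          # iteration cap (assumed; the paper's limit may differ)
HISTORY_BUDGET = 4000   # rough character budget before compression kicks in

# Structured prompt with an explicit chain-of-thought cue.
PROMPT = Template(
    "You are escalating privileges on a Linux host.\n"
    "Think step by step, then reply with exactly one shell command.\n"
    "Session history:\n$history\n\nKnown facts:\n$facts\n\n"
    "Relevant guidance:\n$rag_text\n"
)

def compress(history: List[str], llm: Callable[[str], str]) -> List[str]:
    """History compression: summarize old turns once the log grows too long."""
    joined = "\n".join(history)
    if len(joined) > HISTORY_BUDGET:
        return [llm("Summarize these penetration-testing steps:\n" + joined)]
    return history

def run_agent(llm: Callable[[str], str],
              execute: Callable[[str], str],
              retrieve: Callable[[str], str]) -> bool:
    """One privilege-escalation episode; returns True on a root shell."""
    history: List[str] = []
    facts: List[str] = []
    for _ in range(MAX_TURNS):
        history = compress(history, llm)
        rag_text = retrieve("\n".join(facts))   # retrieval-augmented guidance
        cmd = llm(PROMPT.substitute(history="\n".join(history),
                                    facts="\n".join(facts),
                                    rag_text=rag_text))
        output = execute(cmd)
        # Reflective analysis: a second call distills what the output reveals
        # instead of feeding raw output straight back into the next prompt.
        facts.append(llm(f"What does this output reveal?\n$ {cmd}\n{output}"))
        history.append(f"$ {cmd}\n{output}")
        if "uid=0(root)" in output:             # crude success check (ours)
            return True
    return False
~~~

The structural choice the paper's ablation singles out is the reflection call: the agent reasons over a distilled fact list rather than an ever-growing raw transcript.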

If this is right

  • Llama3.1 70B reaches an 83% exploitation rate once the full set of treatments is enabled.
  • Smaller models such as Llama3.1 8B and Qwen2.5 7B reach 67% success when guidance from the interventions is provided.
  • Reflection-based treatments account for the largest share of the observed performance lift in the ablation study (the factorial grid is sketched after this list).
  • Vulnerability discovery stays the dominant remaining bottleneck even after the other interventions are applied.
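
For readers unfamiliar with the design, a full-factorial ablation over five binary treatments means evaluating every on/off combination, 2^5 = 32 configurations in total. A minimal sketch of the grid (treatment names follow the paper; the benchmark harness that would score each configuration is a hypothetical placeholder):

~~~python
from itertools import product

TREATMENTS = ["cot", "rag", "structured_prompt",
              "history_compression", "reflection"]

def ablation_grid():
    """Yield all 2^5 = 32 treatment configurations."""
    for flags in product([False, True], repeat=len(TREATMENTS)):
        yield dict(zip(TREATMENTS, flags))

assert sum(1 for _ in ablation_grid()) == 32

# Usage sketch: score each configuration on the benchmark, then compare
# marginal success rates with a treatment on vs. off to rank contributions:
# results = {tuple(cfg.items()): run_episodes(cfg) for cfg in ablation_grid()}
~~~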

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Organizations could run capable local penetration-testing agents without routing sensitive target information to external cloud services.
  • Self-reflection mechanisms appear especially effective at stabilizing LLM agents on long, error-prone command sequences.
  • The same failure-mode mapping and intervention set could be tested on other operating systems or attack stages to check whether the gains transfer.

Load-bearing premise

The five selected interventions correctly resolve the main failure modes of open-weight models on privilege escalation and the tested vulnerabilities are representative of real-world Linux cases.

What would settle it

Running the same agents on a fresh collection of Linux privilege escalation vulnerabilities and finding exploitation rates well below 83% or 67% even with all interventions applied would show the performance gains do not generalize.

Figures

Figures reproduced from arXiv: 2604.27143 by Andreas Happe, Benjamin Probst, Jürgen Cito.

Figure 1. Core architecture of hackingBuddyGPT [10]. view at source ↗
Figure 2. Architecture of the prototype. view at source ↗
Figure 3. Ablation study using Llama3, including guidance, both baselines, and all suggested treatment ideas. view at source ↗
Figure 4. Context size distribution (baseline vs. treatments) for each model; WhiteRabbitNeo excluded as it does not finish most runs. view at source ↗
Figure 5. Examples of what each model returns in a single iteration. view at source ↗
Figure 6. Architecture for the initial prototype. view at source ↗
Figure 7. query_next_command prompt. view at source ↗
Figure 8. analyze_cmd prompt. view at source ↗
Figure 9. rag_prompt. view at source ↗
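
Figure 7 shows the query_next_command prompt with ${cmd}, ${resp}, and ${rag_text} placeholders. As an illustration of how such a template might be rendered each iteration, here is a sketch using Python's string.Template; the template text is transcribed from Figure 7 as far as the extraction preserves it, while the rendering code and example values are ours:

~~~python
from string import Template

# Transcribed from Figure 7; the figure cuts off mid-sentence,
# so the tail of the prompt is omitted here.
QUERY_NEXT_COMMAND = Template(
    "You executed the command '${cmd}' and retrieved the following result:\n"
    "\n~~~ bash\n${resp}\n~~~\n\n"
    "You also have the following additional information:\n"
    "---\n${rag_text}\n---\n\n"
    "Analyze if the output of the executed command allows you to determine "
    "a way to escalate your privileges into a root shell."
)

# Hypothetical example values, not from the paper.
print(QUERY_NEXT_COMMAND.substitute(
    cmd="sudo -l",
    resp="User lowpriv may run /usr/bin/vim as root",
    rag_text="GTFOBins: vim can spawn a root shell via 'sudo vim -c :!/bin/sh'",
))
~~~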
Original abstract

Recent research has demonstrated the potential of Large Language Models (LLMs) for autonomous penetration testing, particularly when using cloud-based restricted-weight models. However, reliance on such models introduces security, privacy, and sovereignty concerns, motivating the use of locally hosted open-weight alternatives. Prior work shows that small open-weight models perform poorly on automated Linux privilege escalation, limiting their practical applicability. In this paper, we present a systematic empirical study of whether targeted system-level and prompting interventions can bridge this performance gap. We analyze failure modes of open-weight models in autonomous privilege escalation, map them to established enhancement techniques, and evaluate five concrete interventions (chain-of-thought prompting, retrieval-augmented generation, structured prompts, history compression, and reflective analysis) implemented as extensions to hackingBuddyGPT. Our results show that open-weight models can match or outperform cloud-based baselines such as GPT-4o. With our treatments enabled, Llama3.1 70B exploits 83% of tested vulnerabilities, while smaller models including Llama3.1 8B and Qwen2.5 7B achieve 67% when using guidance. A full-factorial ablation study over all treatment combinations reveals that reflection-based treatments contribute most, while also identifying vulnerability discovery as a remaining bottleneck for local models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a systematic empirical study of five interventions (chain-of-thought prompting, retrieval-augmented generation, structured prompts, history compression, and reflective analysis) applied as extensions to hackingBuddyGPT to improve open-weight LLMs on autonomous Linux privilege escalation. The authors analyze failure modes, map them to the interventions, and report via a full-factorial ablation that treated Llama 3.1 70B succeeds on 83% of tested vulnerabilities while smaller models (Llama 3.1 8B, Qwen2.5 7B) reach 67% with guidance; reflection contributes most, vulnerability discovery remains a bottleneck, and local models can match or exceed cloud baselines such as GPT-4o.

Significance. If the results hold under representative conditions, the work is significant for LLM agents in security: it supplies concrete evidence that open-weight models can be made practically useful for penetration testing without cloud APIs, directly addressing privacy, sovereignty, and security risks. The ablation study offers actionable, comparative data on technique efficacy that can inform future agent designs.

major comments (2)
  1. [Results] Results section: the central performance claims (83% for Llama 3.1 70B, 67% for smaller models) are reported without the total number of vulnerabilities tested, the source or selection criteria for the test set, error bars, or statistical significance tests. These omissions make it impossible to judge whether the reported lifts are robust or sensitive to the particular sample.
  2. [Methodology] Methodology / Failure-mode analysis: the paper states that failure modes were identified and mapped to the five interventions before the ablation, yet provides no quantitative breakdown of failure frequencies and no indication of whether the mapping was derived from blinded logs or from iterative prompting on the evaluation set itself. If the interventions were selected or tuned post-hoc on the same vulnerabilities, the attribution of gains to the treatments cannot be generalized.
minor comments (2)
  1. [Abstract] Abstract and Results: the claim that local models 'match or outperform' GPT-4o is stated without the corresponding GPT-4o baseline numbers or experimental conditions for that comparison.
  2. [Experiments] The manuscript would benefit from an explicit table or appendix listing the vulnerabilities used, even if only by CVE or description, to support reproducibility.
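
On the referee's first major comment, the missing sample size matters because binomial success rates over small vulnerability sets carry wide uncertainty. A quick illustration (our sketch; the n = 12 below is hypothetical, chosen only because 10/12 ≈ 83% and 8/12 ≈ 67%):

~~~python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial success rate."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# E.g. 10/12 ≈ 83% exploited: the interval is roughly (0.55, 0.95),
# wide enough that reporting n and error bars matters.
print(wilson_interval(10, 12))
~~~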

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which helps us improve the clarity and rigor of our empirical study on enhancing local LLMs for Linux privilege escalation. We address each major comment below and outline the corresponding revisions.

Point-by-point responses
  1. Referee: [Results] Results section: the central performance claims (83% for Llama 3.1 70B, 67% for smaller models) are reported without the total number of vulnerabilities tested, the source or selection criteria for the test set, error bars, or statistical significance tests. These omissions make it impossible to judge whether the reported lifts are robust or sensitive to the particular sample.

    Authors: We agree that these details are necessary for readers to evaluate robustness. In the revised manuscript we will explicitly report the total number of vulnerabilities in the test set, fully describe their source and selection criteria (drawn from public CVE records and standard privilege-escalation benchmarks with a focus on local, command-line exploits), add error bars derived from repeated runs, and include statistical significance tests comparing the intervention conditions. These additions will appear in the Results section, the experimental setup, and the associated tables. revision: yes

  2. Referee: [Methodology] Methodology / Failure-mode analysis: the paper states that failure modes were identified and mapped to the five interventions before the ablation, yet provides no quantitative breakdown of failure frequencies and no indication of whether the mapping was derived from blinded logs or from iterative prompting on the evaluation set itself. If the interventions were selected or tuned post-hoc on the same vulnerabilities, the attribution of gains to the treatments cannot be generalized.

    Authors: The failure-mode analysis was performed on a disjoint pilot set of vulnerabilities before the main evaluation set was finalized, and the interventions were selected on the basis of that pilot analysis together with prior literature. We will add a new table providing quantitative frequencies of each failure mode and an explicit description of the pilot-to-main separation. While the log review was not formally blinded, the temporal and set separation prevents post-hoc tuning on the reported evaluation data; we will state this limitation clearly. revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation with no derivations or fitted predictions

full rationale

The paper reports measured success rates from controlled experiments on a fixed set of vulnerabilities, including a full-factorial ablation over five interventions. No equations, parameters fitted to subsets of data, or predictions derived from those fits appear anywhere. Failure-mode analysis is described as preceding the choice of interventions, but the central claims rest on direct experimental outcomes rather than any self-definitional mapping or self-citation chain that reduces the result to its inputs by construction. The study is therefore judged against external benchmark outcomes rather than against its own constructions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical evaluation paper with no mathematical derivations. No free parameters, axioms, or invented entities are required to support the reported experimental outcomes.

pith-pipeline@v0.9.0 · 5531 in / 1111 out tokens · 37919 ms · 2026-05-07T09:43:44.279554+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

37 extracted references · 27 canonical work pages · 2 internal anchors

  1. [1] Alrashedy, K., Aljasser, A., Tambwekar, P., Gombolay, M.: Can LLMs patch security issues? (2024), https://arxiv.org/abs/2312.00024, accessed: 26.3.2025

  2. [2] Bucher, M.J.J., Martini, M.: Fine-tuned 'small' LLMs (still) significantly outperform zero-shot generative AI models in text classification (2024), https://arxiv.org/abs/2406.08660, accessed: 26.3.2025

  3. [3] Deng, G., Liu, Y., Mayoral-Vilches, V., Liu, P., Li, Y., Xu, Y., Zhang, T., Liu, Y., Pinzger, M., Rass, S.: PentestGPT: An LLM-empowered automatic penetration testing tool (2024), https://arxiv.org/abs/2308.06782, accessed: 26.3.2025

  4. [4] Dettmers, T., Pagnoni, A., Holtzman, A., Zettlemoyer, L.: QLoRA: Efficient fine-tuning of quantized LLMs. In: Proceedings of the 37th International Conference on Neural Information Processing Systems. NIPS '23, Curran Associates Inc., Red Hook, NY, USA (2023)

  5. [5] Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., Guo, Q., Wang, M., Wang, H.: Retrieval-augmented generation for large language models: A survey. CoRR abs/2312.10997 (2023), https://doi.org/10.48550/arXiv.2312.10997, accessed: 26.3.2025

  6. [6] Garcia, S., Lukas, O., Rigaki, M., Catania, C.: NetSecGame, a RL env for training and evaluating AI agents in network security tasks. https://github.com/stratosphereips/NetSecGame

  7. [7] Gu, A., Dao, T.: Mamba: Linear-time sequence modeling with selective state spaces. In: First Conference on Language Modeling (2024), https://openreview.net/forum?id=tEYskw1VY2

  8. [8] Happe, A., Cito, J.: Getting pwn'd by AI: Penetration testing with large language models. In: Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. p. 2082–2086. ESEC/FSE 2023, Association for Computing Machinery, New York, NY, USA (2023). https://doi.org/10.1145/3611643...

  9. [9] Happe, A., Cito, J.: Got root? A Linux priv-esc benchmark (2024), https://arxiv.org/abs/2405.02106, accessed: 26.3.2025

  10. [10] Happe, A., Kaplan, A., Cito, J.: LLMs as hackers: Autonomous Linux privilege escalation attacks (2024), https://arxiv.org/abs/2310.11409, accessed: 26.3.2025

  11. [11] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA: Low-rank adaptation of large language models (2021), https://arxiv.org/abs/2106.09685, accessed: 26.3.2025

  12. [12] Huang, J., Zhu, Q.: PenHeal: A two-stage LLM framework for automated pentesting and optimal remediation. In: Proceedings of the Workshop on Autonomous Cybersecurity. p. 11–22. AutonomousCyber '24, Association for Computing Machinery, New York, NY, USA (2024). https://doi.org/10.1145/3689933.3690831

  13. [13] Kojima, T., Gu, S.S., Reid, M., Matsuo, Y., Iwasawa, Y.: Large language models are zero-shot reasoners. In: Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A. (eds.) Advances in Neural Information Processing Systems. vol. 35, pp. 22199–22213. Curran Associates, Inc. (2022), https://proceedings.neurips.cc/paper_files/paper/2022/file/...

  14. [14] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.t., Rocktäschel, T., Riedel, S., Kiela, D.: Retrieval-augmented generation for knowledge-intensive NLP tasks. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. NIPS '20, Curran Associates Inc., Red Hook, NY...

  15. [15] Li, H., Li, Y., Tian, A., Tang, T., Xu, Z., Chen, X., Hu, N., Dong, W., Li, Q., Chen, L.: A survey on large language model acceleration based on KV cache management. arXiv preprint arXiv:2412.19442 (2024)

  16. [16] Liu, N.F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., Liang, P.: Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics 12, 157–173 (2024). https://doi.org/10.1162/tacl_a_00638, https://aclanthology.org/2024.tacl-1.9/

  17. [17] Miao, J., Thongprayoon, C., Suppadungsuk, S., Krisanapan, P., Radhakrishnan, Y., Cheungpasitporn, W.: Chain of thought utilization in large language models and application in nephrology. Medicina 60(1), 148 (2024)

  18. [18] Pinna, E., Cardaci, A.: GTFOBins. https://gtfobins.github.io/ (2025), accessed: 2025-07-30

  19. [19] Polop, C.: HackTricks: Linux privilege escalation. https://book.hacktricks.xyz/linux-hardening/privilege-escalation (2025), accessed: 2025-07-30

  20. [20] Pratama, D., Suryanto, N., Adiputra, A.A., Le, T.T.H., Kadiptya, A.Y., Iqbal, M., Kim, H.: CIPHER: Cybersecurity intelligent penetration-testing helper for ethical researcher. Sensors 24(21), 6878 (Oct 2024). https://doi.org/10.3390/s24216878

  21. [21] Ragab, R., Altahhan, A.: Fine-tuning of small/medium LLMs for business QA on structured data. Available at SSRN: https://ssrn.com/abstract=4850031 or https://dx.doi.org/10.2139/ssrn.4850031 (2024), accessed: 26.3.2025

  22. [22] Rigaki, M., Catania, C.A., García, S.: Building adaptative and transparent cyber agents with local language models. Expert Systems with Applications 299, 129987 (2026). https://doi.org/10.1016/j.eswa.2025.129987, https://www.sciencedirect.com/science/article/pii/S0957417425036024

  23. [23] Rigaki, M., Lukáš, O., Catania, C., Garcia, S.: Out of the cage: How stochastic parrots win in cyber security environments. In: Proceedings of the 16th International Conference on Agents and Artificial Intelligence - Volume 3: ICAART. pp. 774–781. INSTICC, SciTePress (2024). https://doi.org/10.5220/0012391800003636

  24. [24] Shestov, A., Levichev, R., Mussabayev, R., Maslov, E., Zadorozhny, P., Cheshkov, A., Mussabayev, R., Toleu, A., Tolegen, G., Krassovitskiy, A.: Finetuning large language models for vulnerability detection. IEEE Access 13, 38889–38900 (2025). https://doi.org/10.1109/ACCESS.2025.3546700, accessed: 26.3.2025

  25. [25] Sprague, Z., Yin, F., Rodriguez, J.D., Jiang, D., Wadhwa, M., Singhal, P., Zhao, X., Ye, X., Mahowald, K., Durrett, G.: To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning (2024), https://arxiv.org/abs/2409.12183, accessed: 26.3.2025

  26. [26] Tunstall, L., Beeching, E., Lambert, N., Rajani, N., Rasul, K., Belkada, Y., Huang, S., von Werra, L., Fourrier, C., Habib, N., Sarrazin, N., Sanseviero, O., Rush, A.M., Wolf, T.: Zephyr: Direct distillation of LM alignment (2023), https://arxiv.org/abs/2310.16944

  27. [27] Wang, F., Zhang, Z., Zhang, X., Wu, Z., Mo, T., Lu, Q., Wang, W., Li, R., Xu, J., Tang, X., He, Q., Ma, Y., Huang, M., Wang, S.: A comprehensive survey of small language models in the era of large language models: Techniques, enhancements, applications, collaboration with LLMs, and trustworthiness. CoRR abs/2411.03350 (2024), https://doi.org/10.48550/arXiv.2411.03350, accessed: 26.3.2025

  28. [28] Wang, L., Xu, W., Lan, Y., Hu, Z., Lan, Y., Lee, R.K.W., Lim, E.P.: Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. In: Rogers, A., Boyd-Graber, J., Okazaki, N. (eds.) Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics...

  29. [29] Weber, D.M., Tzachristas, I., Sui, A.: Perses: Unlocking privilege escalation for small LLMs via extensible heterogeneity. In: Proceedings of the 20th ACM Asia Conference on Computer and Communications Security. p. 344–357. ASIA CCS '25, Association for Computing Machinery, New York, NY, USA (2025). https://doi.org/10.1145/3708821.3736189

  30. [30] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E.H., Le, Q.V., Zhou, D.: Chain-of-thought prompting elicits reasoning in large language models. In: Proceedings of the 36th International Conference on Neural Information Processing Systems. NIPS '22, Curran Associates Inc., Red Hook, NY, USA (2022)

  31. [31] Xia, P., Zhu, K., Li, H., Zhu, H., Li, Y., Li, G., Zhang, L., Yao, H.: RULE: Reliable multimodal RAG for factuality in medical vision language models. In: Al-Onaizan, Y., Bansal, M., Chen, Y.N. (eds.) Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. pp. 1081–... Association for Computational Linguistics, Miami, Florida, USA (Nov 2024). https://doi.org/10.18653/v1/2024.emnlp-main.62, https://aclanthology.org/2024.emnlp-main.62/

  33. [33] Xu, J., Stokes, J.W., McDonald, G., Bai, X., Marshall, D., Wang, S., Swaminathan, A., Li, Z.: AutoAttacker: A large language model guided system to implement automatic cyber-attacks (2024), https://arxiv.org/abs/2403.01038, accessed: 26.3.2025

  34. [34] Xu, Z., Liu, Y., Deng, G., Li, Y., Picek, S.: A comprehensive study of jailbreak attack versus defense for large language models. In: Ku, L.W., Martins, A., Srikumar, V. (eds.) Findings of the Association for Computational Linguistics: ACL 2024. pp. 7432–7449. Association for Computational Linguistics, Bangkok, Thailand (Aug 2024). https://doi.org/10.18...

  35. [35] Yang, R.: CaseGPT: A case reasoning framework based on language models and retrieval-augmented generation (2024), https://arxiv.org/abs/2407.07913, accessed: 26.3.2025

  36. [36] Zhang, B., Yang, H., Zhou, T., Ali Babar, M., Liu, X.Y.: Enhancing financial sentiment analysis via retrieval augmented large language models. In: Proceedings of the Fourth ACM International Conference on AI in Finance. p. 349–356. ICAIF '23, Association for Computing Machinery, New York, NY, USA (2023). https://doi.org/10.1145/3604237.3626866

  37. [37] Zheng, J., Hong, H., Wang, X., Su, J., Liang, Y., Wu, S.: Fine-tuning large language models for domain-specific machine translation. CoRR abs/2402.15061 (2024), https://doi.org/10.48550/arXiv.2402.15061, accessed: 26.3.2025