arxiv: 2604.14317 · v1 · submitted 2026-04-15 · 💻 cs.CR · cs.AI


Challenges and Future Directions in Agentic Reverse Engineering Systems

Salem Radey, Jack West, Kassem Fawaz


Pith reviewed 2026-05-10 12:46 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords agentic systems, reverse engineering, binary analysis, LLM agents, security challenges, obfuscation, future directions


The pith

Agentic systems for binary reverse engineering still struggle with obfuscation, timing, and unique architectures despite recent advances.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how large language model-based agentic systems perform reverse engineering tasks using static, dynamic, and hybrid approaches. It identifies key limitations such as token constraints, difficulty handling obfuscated code, and the absence of program guardrails. A sympathetic reader would care because these systems are increasingly applied to security-critical tasks like binary analysis, and understanding where they fail shows what must improve before they can be relied on in real-world settings. The authors then outline future directions for overcoming these limitations from a security perspective.

Core claim

Through analysis of existing agentic tool usage in reverse engineering, the paper finds that cutting-edge systems continue to fail in complex scenarios involving obfuscation, timing, and unique architectures. The examination covers static, dynamic, and hybrid agents and highlights limitations including token constraints, struggles with obfuscation, and a lack of program guardrails, leading to outlined challenges and future directions for system designers.
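To make the token-constraint limitation concrete, here is a minimal sketch of how decompiled function listings might be packed into batches that fit a fixed context budget. It is an illustration only and is not taken from the paper: the names `estimate_tokens` and `batch_by_budget` are hypothetical, and the four-characters-per-token ratio is a rough assumption rather than the behavior of any particular tokenizer.

```python
from typing import Iterable, List

CHARS_PER_TOKEN = 4  # rough heuristic (an assumption); real tokenizers differ


def estimate_tokens(text: str) -> int:
    """Approximate a token count by ceiling-dividing the character count."""
    return -(-len(text) // CHARS_PER_TOKEN)


def batch_by_budget(functions: Iterable[str], budget: int) -> List[List[str]]:
    """Greedily pack function listings into batches that fit the budget.

    A listing that exceeds the budget on its own is placed in a batch by
    itself, so the caller can decide how to truncate or summarize it.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    used = 0
    for listing in functions:
        cost = estimate_tokens(listing)
        # Close the current batch if adding this listing would overflow it.
        if current and used + cost > budget:
            batches.append(current)
            current, used = [], 0
        current.append(listing)
        used += cost
    if current:
        batches.append(current)
    return batches
```

Under this sketch, any analysis that needs to relate functions placed in different batches has to carry state across calls, which is one way a token limit can turn into an analysis limit for large or obfuscated binaries.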

What carries the argument

Analysis of agentic tool usage across static, dynamic, and hybrid agents for binary reverse engineering tasks.

Load-bearing premise

The analysis of existing agentic tool usage captures the primary and representative limitations across realistic reverse engineering settings.

What would settle it

Demonstration of an agentic system that successfully performs reverse engineering on obfuscated binaries with unique architectures without hitting token limits or requiring manual guardrails.

Figures

Figures reproduced from arXiv: 2604.14317 by Jack West, Kassem Fawaz, Salem Radey.

Original abstract

Agentic systems built on large language models (LLMs) are increasingly being used for complex security tasks, including binary reverse engineering (RE). Despite recent growth in popularity and capability, these systems continue to face limitations in realistic settings. Cutting-edge systems still fail in complex RE scenarios that involve obfuscation, timing, and unique architecture. In this work, we examine how agentic systems perform reverse engineering tasks with static, dynamic, and hybrid agents. Through an analysis of existing agentic tool usage, we identify several limitations, including token constraints, struggles with obfuscation, and a lack of program guardrails. From these findings, we outline current challenges and position future directions for system designers to overcome from a security perspective.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

This is a clear position paper that organizes known limitations in LLM agents for reverse engineering but adds no new data or techniques.


This paper is a position piece that maps out where current agentic systems fall short on binary reverse engineering tasks. It focuses on static, dynamic, and hybrid setups and calls out recurring problems like token limits, trouble with obfuscated code, and absent guardrails, then suggests directions to fix them from a security viewpoint. The main value is the organized checklist it gives to people building these tools so they can address the same issues faster rather than rediscovering them one by one. The authors ground the points in how existing tools actually behave, which keeps the discussion practical instead of abstract.

The soft spots are straightforward: everything stays observational. There are no new experiments, no quantitative failure rates across a test set of binaries, and no detailed account of how the reviewed tools were chosen. That leaves room for selection bias or incomplete coverage, though the paper does not present its list as exhaustive or statistically validated. The central claim about failures on obfuscation, timing, and unique architectures functions as motivation rather than a measured result.

Readers already working on AI for cybersecurity or binary analysis will find this useful as a quick reference on current pain points. It is not aimed at people looking for novel methods or broad theoretical advances. The work deserves peer review because the structure is clean, the motivation ties directly to real tool behaviors, and the field benefits from targeted overviews that highlight where effort should go next even without fresh empirical results.

Referee Report

0 major / 3 minor

Summary. The paper is a position piece that analyzes the use of LLM-based agentic systems for binary reverse engineering tasks via static, dynamic, and hybrid agent approaches. Drawing on a qualitative review of existing tool usage, it identifies limitations such as token constraints, struggles with code obfuscation, timing dependencies, unique architectures, and insufficient program guardrails. These observations motivate a discussion of current challenges and proposed future directions for more secure and effective agentic RE systems.

Significance. If the limitations identified are broadly representative, the paper provides a timely synthesis of gaps in an emerging area at the intersection of AI and security. Its value lies in framing concrete challenges (obfuscation handling, guardrails) as motivation for future work rather than claiming new empirical results; this can help guide system designers toward more robust designs. The observational approach is appropriate for a position paper and avoids overclaiming.

minor comments (3)

The abstract and introduction would benefit from a brief statement of the scope and selection criteria for the 'existing agentic tool usage' reviewed, to allow readers to evaluate potential selection bias in the identified limitations.
Claims about failures in scenarios involving obfuscation, timing, and unique architectures are central but presented at a high level; adding one or two concrete, cited examples from the reviewed systems would strengthen the motivation without altering the position-piece nature.
The future-directions section could more explicitly link each proposed direction back to the specific limitations enumerated earlier, improving traceability for readers.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of our position paper and for recommending minor revision. We appreciate the recognition that the observational approach is appropriate for this type of work and that the synthesis of limitations can help guide future system design.

Circularity Check

0 steps flagged

No significant circularity


This is an observational position paper whose central claims derive from a qualitative review of external agentic RE tools and literature. Limitations (token constraints, obfuscation struggles, missing guardrails) are listed as direct observations from that review rather than from any fitted parameters, self-referential predictions, or equations. No derivation chain, uniqueness theorem, or ansatz is invoked; future directions follow logically from the enumerated challenges without requiring the analysis to be exhaustive or statistically representative. The paper contains no self-citation load-bearing steps or renamings of known results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on domain assumptions about the current state of LLM agent capabilities in security tasks, with no free parameters, invented entities, or ad-hoc axioms introduced beyond standard expectations for agentic systems.

axioms (1)

Domain assumption: LLM-based agents can be meaningfully evaluated on reverse engineering tasks via static, dynamic, and hybrid modes.
Invoked in the abstract when describing the examination of agent performance.

pith-pipeline@v0.9.0 · 5412 in / 1181 out tokens · 31284 ms · 2026-05-10T12:46:34.483416+00:00 · methodology

