Challenges and Future Directions in Agentic Reverse Engineering Systems
Pith reviewed 2026-05-10 12:46 UTC · model grok-4.3
The pith
Agentic systems for binary reverse engineering still struggle with obfuscation, timing, and unique architectures despite recent advances.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Through analysis of existing agentic tool usage in reverse engineering, the paper finds that cutting-edge systems continue to fail in complex scenarios involving obfuscation, timing, and unique architectures. The examination covers static, dynamic, and hybrid agents and highlights limitations including token constraints, struggles with obfuscation, and a lack of program guardrails, leading to outlined challenges and future directions for system designers.
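To make the token-constraint limitation concrete, the sketch below shows one way an agent harness might split a long disassembly listing into pieces that fit a per-call budget. It is an illustration only and is not taken from the paper: the function names `estimate_tokens` and `chunk_listing`, the 4-characters-per-token heuristic, and the sample listing are all assumptions.

```python
from typing import Iterable, List


def estimate_tokens(line: str) -> int:
    """Crude estimate: assume about 4 characters per token (a heuristic, not a real tokenizer)."""
    return max(1, len(line) // 4)


def chunk_listing(lines: Iterable[str], budget: int) -> List[List[str]]:
    """Group lines into consecutive chunks whose estimated token cost stays within budget.

    A single line that alone exceeds the budget is still emitted as its own chunk.
    """
    chunks: List[List[str]] = []
    current: List[str] = []
    used = 0
    for line in lines:
        cost = estimate_tokens(line)
        # Start a new chunk when adding this line would exceed the budget.
        if current and used + cost > budget:
            chunks.append(current)
            current, used = [], 0
        current.append(line)
        used += cost
    if current:
        chunks.append(current)
    return chunks


# Hypothetical ten-instruction listing; addresses and operands are made up.
listing = [f"0x{0x401000 + 4 * i:x}: mov eax, {i}" for i in range(10)]
chunks = chunk_listing(listing, budget=20)
```

Chunking of this kind keeps each model call within its limit, but it also severs cross-chunk context such as control flow spanning chunk boundaries, which is one reason token constraints remain a limitation rather than a solved engineering detail.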
What carries the argument
Analysis of agentic tool usage across static, dynamic, and hybrid agents for binary reverse engineering tasks.
Load-bearing premise
The analysis of existing agentic tool usage captures the primary and representative limitations across realistic reverse engineering settings.
What would settle it
Demonstration of an agentic system that successfully performs reverse engineering on obfuscated binaries with unique architectures without hitting token limits or requiring manual guardrails.
Original abstract
Agentic systems built on large language models (LLMs) are increasingly being used for complex security tasks, including binary reverse engineering (RE). Despite recent growth in popularity and capability, these systems continue to face limitations in realistic settings. Cutting-edge systems still fail in complex RE scenarios that involve obfuscation, timing, and unique architecture. In this work, we examine how agentic systems perform reverse engineering tasks with static, dynamic, and hybrid agents. Through an analysis of existing agentic tool usage, we identify several limitations, including token constraints, struggles with obfuscation, and a lack of program guardrails. From these findings, we outline current challenges and position future directions for system designers to overcome from a security perspective.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper is a position piece that analyzes the use of LLM-based agentic systems for binary reverse engineering tasks via static, dynamic, and hybrid agent approaches. Drawing on a qualitative review of existing tool usage, it identifies limitations such as token constraints, struggles with code obfuscation, timing dependencies, unique architectures, and insufficient program guardrails. These observations motivate a discussion of current challenges and proposed future directions for more secure and effective agentic RE systems.
Significance. If the limitations identified are broadly representative, the paper provides a timely synthesis of gaps in an emerging area at the intersection of AI and security. Its value lies in framing concrete challenges (obfuscation handling, guardrails) as motivation for future work rather than claiming new empirical results; this can help guide system designers toward more robust designs. The observational approach is appropriate for a position paper and avoids overclaiming.
minor comments (3)
- The abstract and introduction would benefit from a brief statement of the scope and selection criteria for the 'existing agentic tool usage' reviewed, to allow readers to evaluate potential selection bias in the identified limitations.
- Claims about failures in scenarios involving obfuscation, timing, and unique architectures are central but presented at a high level; adding one or two concrete, cited examples from the reviewed systems would strengthen the motivation without altering the position-piece nature.
- The future-directions section could more explicitly link each proposed direction back to the specific limitations enumerated earlier, improving traceability for readers.
Simulated Author's Rebuttal
We thank the referee for their positive assessment of our position paper and for recommending minor revision. We appreciate the recognition that the observational approach is appropriate for this type of work and that the synthesis of limitations can help guide future system design.
Circularity Check
No significant circularity
This is an observational position paper whose central claims derive from a qualitative review of external agentic RE tools and literature. Limitations (token constraints, obfuscation struggles, missing guardrails) are listed as direct observations from that review rather than from any fitted parameters, self-referential predictions, or equations. No derivation chain, uniqueness theorem, or ansatz is invoked; future directions follow logically from the enumerated challenges without requiring the analysis to be exhaustive or statistically representative. The paper contains no self-citation load-bearing steps or renamings of known results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLM-based agents can be meaningfully evaluated on reverse engineering tasks via static, dynamic, and hybrid modes.