Constraint-Guided Multi-Agent Decompilation for Executable Binary Recovery
Pith reviewed 2026-05-08 03:34 UTC · model grok-4.3
The pith
A multi-agent framework with layered syntactic, compilation, and behavioral constraints recovers re-executable source from 84-97% of decompiled binaries.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors present Multi-level Constraint-Guided Decompilation (MCGD), a hierarchical pipeline that validates decompiled code first for syntactic correctness, then for compilability via GCC, and finally for behavioral equivalence through LLM-generated test cases. When any check fails, dedicated LLM agents iteratively edit the code, guided by structured feedback from the failing validator. On 1,641 binaries this yields 84-97% re-executability, outperforming plain decompilers by 28-89 percentage points and other LLM-based methods that use the same GPT-4o backbone.
What carries the argument
The three-level validation pipeline (parsing, GCC compilation, LLM test-case execution) that triggers targeted LLM-agent refinement on detected failures.
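The validation-and-refinement loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the helper names (`compile_with_gcc`, `refine`, the `agent` callable) and the iteration cap are assumptions for the sketch.

```python
import os
import subprocess
import tempfile

MAX_ITERATIONS = 5  # paper reports 90%+ of binaries converge within 2


def compile_with_gcc(source: str) -> tuple[bool, str]:
    """Level-2 check: attempt to compile the candidate C source with GCC.

    Returns (success, stderr); the stderr text is the structured feedback
    an agent would receive on failure.  Assumes gcc is on PATH.
    """
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "candidate.c")
        exe = os.path.join(tmp, "candidate")
        with open(src, "w") as f:
            f.write(source)
        proc = subprocess.run(["gcc", src, "-o", exe],
                              capture_output=True, text=True)
        return proc.returncode == 0, proc.stderr


def refine(source, validators, agent):
    """Run constraint checks in order (syntactic -> compile -> behavioral).

    On the first failing check, hand the code plus the validator's feedback
    to the LLM agent and restart from the cheapest check, up to a cap.
    """
    for _ in range(MAX_ITERATIONS):
        for check in validators:
            ok, feedback = check(source)
            if not ok:
                source = agent(source, feedback)
                break  # re-validate the rewritten code from level 1
        else:
            return source  # all constraint levels satisfied
    return None  # did not converge within the iteration budget
```

The key design point the sketch captures is that cheaper checks gate more expensive ones: behavioral tests only run once the candidate already parses and compiles.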
If this is right
- Decompiled code from standard tools becomes largely re-executable rather than merely readable.
- Behavioral testing must supplement compilation checks to achieve functional correctness.
- The same refinement loop improves results across RetDec, Ghidra, and Angr decompilers.
- Over 90% of binaries reach correctness within two iterations, at an average cost of $0.03-0.05 each.
- Constraint-guided agents outperform direct LLM decompilation when using identical model backbones.
Where Pith is reading between the lines
- Existing reverse-engineering suites could embed this post-processing step to raise the fraction of usable output.
- Replacing or augmenting the LLM test generator with symbolic or coverage-guided oracles might increase path coverage.
- Low cost per binary makes batch processing of large legacy or malware corpora feasible for security teams.
- The same layered-constraint plus agent-refinement pattern could transfer to related tasks such as automated code porting or legacy patching.
Load-bearing premise
LLM-generated test cases exercise enough of the program's behavior to ensure the refined code matches the original binary on all important paths.
What would settle it
A binary for which the refined code passes all generated tests and compiles yet produces different runtime output or crashes on an input outside the test suite.
Figures
Original abstract
Decompilation -- recovering source code from compiled binaries -- is essential for security analysis, malware reverse engineering, and legacy software maintenance. However, existing decompilers produce code that often fails to compile or execute correctly, limiting their practical utility. We present a multi-agent framework that transforms decompiled code into re-executable source through Multi-level Constraint-Guided Decompilation (MCGD). Our approach employs a hierarchical validation pipeline with three constraint levels: (1) syntactic correctness via parsing, (2) compilability via GCC, and (3) behavioral equivalence via LLM-generated test cases. When validation fails, specialized LLM agents iteratively refine the code using structured error feedback. We evaluate our framework on 1,641 real-world binaries from ExeBench across three decompilers (RetDec, Ghidra, and Angr). Our framework achieves 84-97% re-executability, improving baseline decompiler output by 28-89 percentage points. In comparison with state-of-the-art LLM-based decompilation methods using the same GPT-4o backbone, our approach (84.1%) outperforms LLM4Decompile (80.3%), SK2Decompile (73.9%), and SALT4Decompile (61.8%). Our ablation study reveals that execution-based validation is critical: compile-only approaches achieve 0% behavioral correctness despite 91-99% compilation rates. The system converges efficiently, with 90%+ binaries reaching correctness within 2 iterations at an average cost of $0.03-0.05 per binary. Our results demonstrate that constraint-guided agentic refinement can bridge the gap between raw decompiler output and practically useful source code.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a multi-agent framework (MCGD) for decompilation that applies hierarchical constraint-guided validation—syntactic parsing, GCC compilability, and behavioral equivalence via LLM-generated test cases—with iterative agent-based refinement on failures. Evaluated on 1,641 ExeBench binaries from RetDec, Ghidra, and Angr, it reports 84-97% re-executability (28-89 pp gains over baselines) and outperforms other GPT-4o-based methods (84.1% vs. 80.3%, 73.9%, 61.8%). An ablation shows compile-only validation reaches 91-99% compilation but 0% behavioral correctness; the system converges in ~2 iterations at low cost.
Significance. If the behavioral equivalence claims hold under rigorous coverage, the work offers a practical advance in turning unreliable decompiler output into executable source, with direct utility for security analysis and legacy maintenance. Strengths include the large-scale evaluation across three decompilers, direct same-backbone comparisons to LLM4Decompile/SK2Decompile/SALT4Decompile, the clear ablation isolating execution validation, and reported efficiency metrics ($0.03-0.05 per binary).
major comments (2)
- [§5] §5 (Evaluation) and abstract: the 84-97% re-executability and behavioral-correctness claims rest on level-3 validation using only LLM-generated test cases, yet no coverage metrics (branch/path), test-generation details, or independent oracle (e.g., differential execution against the original binary on held-out inputs) are reported. This is load-bearing because the refinement loop terminates on test passage and the ablation already shows compile-only yields 0% behavioral success.
- [§5.1] §5.1 (dataset and test-case generation): potential selection biases in the ExeBench subset and lack of statistical significance testing or variance reporting across the 1,641 binaries undermine the cross-method and cross-decompiler comparisons.
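The independent oracle the first comment asks for could take the form of differential execution: run the original binary and the recompiled candidate on random held-out inputs and flag any divergence. A minimal sketch, with hypothetical binary paths and a toy input format (whitespace-separated integers on stdin):

```python
import random
import subprocess


def run(exe: str, stdin_data: str) -> tuple[int, str]:
    """Execute a binary with the given stdin; return (exit code, stdout)."""
    proc = subprocess.run([exe], input=stdin_data,
                          capture_output=True, text=True, timeout=5)
    return proc.returncode, proc.stdout


def differential_test(original: str, recompiled: str, n_inputs: int = 1000):
    """Compare behavior on random inputs outside the LLM test suite.

    Any divergence is exactly the counterexample the review's
    "What would settle it" section describes.  Agreement on all sampled
    inputs is evidence, not proof, of behavioral equivalence.
    """
    rng = random.Random(0)  # fixed seed for reproducibility
    for _ in range(n_inputs):
        stdin_data = " ".join(str(rng.randint(-1000, 1000)) for _ in range(3))
        if run(original, stdin_data) != run(recompiled, stdin_data):
            return stdin_data  # divergent input found
    return None  # no divergence observed on the sampled inputs
```

Swapping the random generator for a coverage-guided fuzzer or a symbolic-execution engine (as the review itself suggests) would strengthen the oracle along the paths random sampling misses.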
minor comments (2)
- The multi-agent architecture diagram (likely Figure 2 or 3) would benefit from explicit labeling of each agent's input/output and termination condition.
- [§3] Notation for the three constraint levels is introduced in the abstract but could be formalized with a short table or pseudocode in §3 for clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the strengths of our large-scale evaluation, ablation studies, and efficiency metrics. We address each major comment below and will incorporate revisions to improve clarity and rigor.
Point-by-point responses
Referee: [§5] §5 (Evaluation) and abstract: the 84-97% re-executability and behavioral-correctness claims rest on level-3 validation using only LLM-generated test cases, yet no coverage metrics (branch/path), test-generation details, or independent oracle (e.g., differential execution against the original binary on held-out inputs) are reported. This is load-bearing because the refinement loop terminates on test passage and the ablation already shows compile-only yields 0% behavioral success.
Authors: We agree that additional details on test-case generation and validation would strengthen the presentation. The current manuscript describes the hierarchical pipeline and the critical role of execution validation (as shown by the ablation where compile-only yields 0% behavioral success), but does not report coverage statistics or an independent oracle. In the revised version we will: (1) expand the description of the LLM prompting strategy used to generate test cases, (2) report available coverage metrics (e.g., line coverage on the subset of functions where instrumentation is feasible), and (3) add a limitations paragraph discussing why a held-out differential-execution oracle was not employed (primarily due to the difficulty of synthesizing equivalent inputs for arbitrary real-world binaries). We believe the consistent gains across three decompilers and 1,641 binaries still provide evidence of practical utility, but we will make these clarifications explicit. revision: yes
Referee: [§5.1] §5.1 (dataset and test-case generation): potential selection biases in the ExeBench subset and lack of statistical significance testing or variance reporting across the 1,641 binaries undermine the cross-method and cross-decompiler comparisons.
Authors: The 1,641 binaries were drawn from ExeBench using the same filtering criteria applied in prior decompilation studies (compilable C code with no external dependencies) to ensure fair comparison with baselines. We acknowledge that explicit statistical testing and variance reporting are absent. In the revision we will add: (i) a clearer statement of the selection criteria, (ii) standard deviation or confidence intervals for the reported success rates, and (iii) statistical significance tests (e.g., McNemar’s test for paired method comparisons and ANOVA for cross-decompiler results). These additions will directly address concerns about bias and comparability. revision: yes
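The McNemar's test the authors promise operates on paired per-binary pass/fail outcomes, using only the discordant pairs (binaries where exactly one method succeeds). A self-contained sketch with continuity correction; the outcome vectors in the usage note are illustrative, not the paper's data:

```python
import math


def mcnemar(pass_a, pass_b):
    """McNemar's chi-squared test (with continuity correction) on paired
    binary outcomes, e.g. per-binary re-executability under two methods.

    b = binaries where method A succeeds and B fails; c = the reverse.
    Concordant pairs carry no information about the difference.
    Returns (chi-squared statistic, two-sided p-value, 1 d.f.).
    """
    b = sum(1 for x, y in zip(pass_a, pass_b) if x and not y)
    c = sum(1 for x, y in zip(pass_a, pass_b) if not x and y)
    if b + c == 0:
        return 0.0, 1.0  # no discordant pairs: no evidence of a difference
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-squared with 1 d.f.: P(X > chi2) = erfc(sqrt(chi2/2))
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p
```

For example, twenty binaries where method A passes and B fails (and none the other way) give chi2 = 19**2 / 20 = 18.05 and p well below 0.001, a clearly significant paired difference.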
Circularity Check
No circularity in empirical evaluation framework
full rationale
The paper presents an empirical multi-agent decompilation system evaluated directly on the external public ExeBench dataset (1641 binaries) using three decompilers and compared against independent baselines (LLM4Decompile, SK2Decompile, SALT4Decompile) on the same GPT-4o backbone. Reported re-executability rates (84-97%) and ablation results (compile-only yields 0% behavioral correctness) are measured outcomes, not quantities derived from internal fitted parameters or self-referential definitions. No equations, uniqueness theorems, or ansatzes appear; the validation pipeline (syntactic, compilability, LLM test cases) is an explicit design choice whose effectiveness is tested externally rather than assumed by construction. This is a standard self-contained empirical study.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLMs can generate test cases that adequately capture behavioral equivalence for the binaries under test
invented entities (1)
- Specialized LLM agents for iterative refinement (no independent evidence)