Constraint-Guided Multi-Agent Decompilation for Executable Binary Recovery
Pith reviewed 2026-05-08 03:34 UTC · model grok-4.3
The pith
A multi-agent framework with layered syntactic, compilation, and behavioral constraints recovers re-executable source from 84-97% of decompiled binaries.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors present Multi-level Constraint-Guided Decompilation (MCGD), a hierarchical pipeline that validates decompiled code first for syntactic correctness, then for compilability via GCC, and finally for behavioral equivalence through LLM-generated test cases. When any check fails, dedicated LLM agents iteratively edit the code, guided by structured feedback from the failing validator. On 1,641 binaries this yields 84-97% re-executability, outperforming plain decompilers by 28-89 percentage points and other LLM-based methods that use the same GPT-4o backbone.
What carries the argument
The three-level validation pipeline (parsing, GCC compilation, LLM test-case execution) that triggers targeted LLM-agent refinement on detected failures.
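The validation-and-refinement loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the helper names (`compile_with_gcc`, `refine`, the `agent` callable) and the iteration cap are assumptions for the sketch.

```python
import os
import subprocess
import tempfile

MAX_ITERATIONS = 5  # paper reports 90%+ of binaries converge within 2


def compile_with_gcc(source: str) -> tuple[bool, str]:
    """Level-2 check: attempt to compile the candidate C source with GCC.

    Returns (success, stderr); the stderr text is the structured feedback
    an agent would receive on failure.  Assumes gcc is on PATH.
    """
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "candidate.c")
        exe = os.path.join(tmp, "candidate")
        with open(src, "w") as f:
            f.write(source)
        proc = subprocess.run(["gcc", src, "-o", exe],
                              capture_output=True, text=True)
        return proc.returncode == 0, proc.stderr


def refine(source, validators, agent):
    """Run constraint checks in order (syntactic -> compile -> behavioral).

    On the first failing check, hand the code plus the validator's feedback
    to the LLM agent and restart from the cheapest check, up to a cap.
    """
    for _ in range(MAX_ITERATIONS):
        for check in validators:
            ok, feedback = check(source)
            if not ok:
                source = agent(source, feedback)
                break  # re-validate the rewritten code from level 1
        else:
            return source  # all constraint levels satisfied
    return None  # did not converge within the iteration budget
```

The key design point the sketch captures is that cheaper checks gate more expensive ones: behavioral tests only run once the candidate already parses and compiles.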
If this is right
- Decompiled code from standard tools becomes largely re-executable rather than merely readable.
- Behavioral testing must supplement compilation checks to achieve functional correctness.
- The same refinement loop improves results across RetDec, Ghidra, and Angr decompilers.
- Over 90% of binaries reach correctness within two iterations, at an average cost of $0.03-0.05 each.
- Constraint-guided agents outperform direct LLM decompilation when using identical model backbones.
Where Pith is reading between the lines
- Existing reverse-engineering suites could embed this post-processing step to raise the fraction of usable output.
- Replacing or augmenting the LLM test generator with symbolic or coverage-guided oracles might increase path coverage.
- Low cost per binary makes batch processing of large legacy or malware corpora feasible for security teams.
- The same layered-constraint plus agent-refinement pattern could transfer to related tasks such as automated code porting or legacy patching.
Load-bearing premise
LLM-generated test cases exercise enough of the program's behavior to ensure the refined code matches the original binary on all important paths.
What would settle it
A binary for which the refined code passes all generated tests and compiles yet produces different runtime output or crashes on an input outside the test suite.
Figures
Original abstract
Decompilation -- recovering source code from compiled binaries -- is essential for security analysis, malware reverse engineering, and legacy software maintenance. However, existing decompilers produce code that often fails to compile or execute correctly, limiting their practical utility. We present a multi-agent framework that transforms decompiled code into re-executable source through Multi-level Constraint-Guided Decompilation (MCGD). Our approach employs a hierarchical validation pipeline with three constraint levels: (1) syntactic correctness via parsing, (2) compilability via GCC, and (3) behavioral equivalence via LLM-generated test cases. When validation fails, specialized LLM agents iteratively refine the code using structured error feedback. We evaluate our framework on 1,641 real-world binaries from ExeBench across three decompilers (RetDec, Ghidra, and Angr). Our framework achieves 84-97% re-executability, improving baseline decompiler output by 28-89 percentage points. In comparison with state-of-the-art LLM-based decompilation methods using the same GPT-4o backbone, our approach (84.1%) outperforms LLM4Decompile (80.3%), SK2Decompile (73.9%), and SALT4Decompile (61.8%). Our ablation study reveals that execution-based validation is critical: compile-only approaches achieve 0% behavioral correctness despite 91-99% compilation rates. The system converges efficiently, with 90%+ binaries reaching correctness within 2 iterations at an average cost of $0.03-0.05 per binary. Our results demonstrate that constraint-guided agentic refinement can bridge the gap between raw decompiler output and practically useful source code.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a multi-agent framework (MCGD) for decompilation that applies hierarchical constraint-guided validation—syntactic parsing, GCC compilability, and behavioral equivalence via LLM-generated test cases—with iterative agent-based refinement on failures. Evaluated on 1,641 ExeBench binaries from RetDec, Ghidra, and Angr, it reports 84-97% re-executability (28-89 pp gains over baselines) and outperforms other GPT-4o-based methods (84.1% vs. 80.3%, 73.9%, 61.8%). An ablation shows compile-only validation reaches 91-99% compilation but 0% behavioral correctness; the system converges in ~2 iterations at low cost.
Significance. If the behavioral equivalence claims hold under rigorous coverage, the work offers a practical advance in turning unreliable decompiler output into executable source, with direct utility for security analysis and legacy maintenance. Strengths include the large-scale evaluation across three decompilers, direct same-backbone comparisons to LLM4Decompile/SK2Decompile/SALT4Decompile, the clear ablation isolating execution validation, and reported efficiency metrics ($0.03-0.05 per binary).
major comments (2)
- [§5] §5 (Evaluation) and abstract: the 84-97% re-executability and behavioral-correctness claims rest on level-3 validation using only LLM-generated test cases, yet no coverage metrics (branch/path), test-generation details, or independent oracle (e.g., differential execution against the original binary on held-out inputs) are reported. This is load-bearing because the refinement loop terminates on test passage and the ablation already shows compile-only yields 0% behavioral success.
- [§5.1] §5.1 (dataset and test-case generation): potential selection biases in the ExeBench subset and lack of statistical significance testing or variance reporting across the 1,641 binaries undermine the cross-method and cross-decompiler comparisons.
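The independent oracle the first comment asks for could take the form of differential execution: run the original binary and the recompiled candidate on random held-out inputs and flag any divergence. A minimal sketch, with hypothetical binary paths and a toy input format (whitespace-separated integers on stdin):

```python
import random
import subprocess


def run(exe: str, stdin_data: str) -> tuple[int, str]:
    """Execute a binary with the given stdin; return (exit code, stdout)."""
    proc = subprocess.run([exe], input=stdin_data,
                          capture_output=True, text=True, timeout=5)
    return proc.returncode, proc.stdout


def differential_test(original: str, recompiled: str, n_inputs: int = 1000):
    """Compare behavior on random inputs outside the LLM test suite.

    Any divergence is exactly the counterexample the review's
    "What would settle it" section describes.  Agreement on all sampled
    inputs is evidence, not proof, of behavioral equivalence.
    """
    rng = random.Random(0)  # fixed seed for reproducibility
    for _ in range(n_inputs):
        stdin_data = " ".join(str(rng.randint(-1000, 1000)) for _ in range(3))
        if run(original, stdin_data) != run(recompiled, stdin_data):
            return stdin_data  # divergent input found
    return None  # no divergence observed on the sampled inputs
```

Swapping the random generator for a coverage-guided fuzzer or a symbolic-execution engine (as the review itself suggests) would strengthen the oracle along the paths random sampling misses.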
minor comments (2)
- The multi-agent architecture diagram (likely Figure 2 or 3) would benefit from explicit labeling of each agent's input/output and termination condition.
- [§3] Notation for the three constraint levels is introduced in the abstract but could be formalized with a short table or pseudocode in §3 for clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the strengths of our large-scale evaluation, ablation studies, and efficiency metrics. We address each major comment below and will incorporate revisions to improve clarity and rigor.
Point-by-point responses
Referee: [§5] §5 (Evaluation) and abstract: the 84-97% re-executability and behavioral-correctness claims rest on level-3 validation using only LLM-generated test cases, yet no coverage metrics (branch/path), test-generation details, or independent oracle (e.g., differential execution against the original binary on held-out inputs) are reported. This is load-bearing because the refinement loop terminates on test passage and the ablation already shows compile-only yields 0% behavioral success.
Authors: We agree that additional details on test-case generation and validation would strengthen the presentation. The current manuscript describes the hierarchical pipeline and the critical role of execution validation (as shown by the ablation where compile-only yields 0% behavioral success), but does not report coverage statistics or an independent oracle. In the revised version we will: (1) expand the description of the LLM prompting strategy used to generate test cases, (2) report available coverage metrics (e.g., line coverage on the subset of functions where instrumentation is feasible), and (3) add a limitations paragraph discussing why a held-out differential-execution oracle was not employed (primarily due to the difficulty of synthesizing equivalent inputs for arbitrary real-world binaries). We believe the consistent gains across three decompilers and 1,641 binaries still provide evidence of practical utility, but we will make these clarifications explicit. revision: yes
Referee: [§5.1] §5.1 (dataset and test-case generation): potential selection biases in the ExeBench subset and lack of statistical significance testing or variance reporting across the 1,641 binaries undermine the cross-method and cross-decompiler comparisons.
Authors: The 1,641 binaries were drawn from ExeBench using the same filtering criteria applied in prior decompilation studies (compilable C code with no external dependencies) to ensure fair comparison with baselines. We acknowledge that explicit statistical testing and variance reporting are absent. In the revision we will add: (i) a clearer statement of the selection criteria, (ii) standard deviation or confidence intervals for the reported success rates, and (iii) statistical significance tests (e.g., McNemar’s test for paired method comparisons and ANOVA for cross-decompiler results). These additions will directly address concerns about bias and comparability. revision: yes
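The McNemar's test the authors promise operates on paired per-binary pass/fail outcomes, using only the discordant pairs (binaries where exactly one method succeeds). A self-contained sketch with continuity correction; the outcome vectors in the usage note are illustrative, not the paper's data:

```python
import math


def mcnemar(pass_a, pass_b):
    """McNemar's chi-squared test (with continuity correction) on paired
    binary outcomes, e.g. per-binary re-executability under two methods.

    b = binaries where method A succeeds and B fails; c = the reverse.
    Concordant pairs carry no information about the difference.
    Returns (chi-squared statistic, two-sided p-value, 1 d.f.).
    """
    b = sum(1 for x, y in zip(pass_a, pass_b) if x and not y)
    c = sum(1 for x, y in zip(pass_a, pass_b) if not x and y)
    if b + c == 0:
        return 0.0, 1.0  # no discordant pairs: no evidence of a difference
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-squared with 1 d.f.: P(X > chi2) = erfc(sqrt(chi2/2))
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p
```

For example, twenty binaries where method A passes and B fails (and none the other way) give chi2 = 19**2 / 20 = 18.05 and p well below 0.001, a clearly significant paired difference.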
Circularity Check
No circularity in empirical evaluation framework
full rationale
The paper presents an empirical multi-agent decompilation system evaluated directly on the external public ExeBench dataset (1641 binaries) using three decompilers and compared against independent baselines (LLM4Decompile, SK2Decompile, SALT4Decompile) on the same GPT-4o backbone. Reported re-executability rates (84-97%) and ablation results (compile-only yields 0% behavioral correctness) are measured outcomes, not quantities derived from internal fitted parameters or self-referential definitions. No equations, uniqueness theorems, or ansatzes appear; the validation pipeline (syntactic, compilability, LLM test cases) is an explicit design choice whose effectiveness is tested externally rather than assumed by construction. This is a standard self-contained empirical study.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLMs can generate test cases that adequately capture behavioral equivalence for the binaries under test
invented entities (1)
- Specialized LLM agents for iterative refinement (no independent evidence)