Pith · machine review for the scientific record

arxiv: 2604.23940 · v2 · submitted 2026-04-27 · 💻 cs.SE · cs.AI

Recognition: unknown

Constraint-Guided Multi-Agent Decompilation for Executable Binary Recovery

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 03:34 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords decompilation · binary recovery · multi-agent systems · constraint validation · executable source · LLM refinement · reverse engineering · behavioral equivalence

The pith

A multi-agent framework with layered syntactic, compile, and behavioral constraints recovers re-executable source from 84-97% of decompiled binaries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a multi-agent system that transforms raw decompiler output into code that actually compiles and runs equivalently to the original binary. It applies three sequential constraint checks—syntax parsing, GCC compilation, and behavioral matching via LLM-generated tests—and routes failures to specialized agents that refine the code using the error details. Tested on 1,641 real-world ExeBench binaries decompiled with RetDec, Ghidra, and Angr, the method lifts success rates dramatically over baselines and over other GPT-4o decompilation approaches. Ablation results show that execution validation is indispensable, since compile-only pipelines reach zero functional correctness despite high compilation rates. The process finishes quickly and cheaply for most inputs, directly addressing the practical gap between readable decompiled code and usable source.

Core claim

The authors present Multi-level Constraint-Guided Decompilation (MCGD), a hierarchical pipeline that validates decompiled code first for syntactic correctness, then for compilability via GCC, and finally for behavioral equivalence through LLM-produced test cases. When any check fails, dedicated LLM agents iteratively edit the code guided by structured feedback from the failing validator. On 1,641 binaries this yields 84-97% re-executability, outperforming plain decompiler output by 28-89 percentage points and other LLM-based methods built on the same GPT-4o backbone.

What carries the argument

The three-level validation pipeline (parsing, GCC compilation, LLM test-case execution) that triggers targeted LLM-agent refinement on detected failures.
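
The control flow of that mechanism can be sketched in a few lines (a minimal sketch, assuming stub `validators` that return `(ok, error)` and stub repair `agents`; in the paper the real L1-L3 checks are a parser, GCC, and test execution, and the agents are LLM calls — none of these names come from the paper):

```python
MAX_ITERS = 5  # the paper reports 90%+ convergence within 2 iterations

def refine_loop(code, validators, agents, max_iters=MAX_ITERS):
    """Push `code` through constraint levels L1..L3 in order; on the
    first failing level, hand the error to that level's repair agent
    and restart validation from L1 with the revised code."""
    for _ in range(max_iters):
        for level, validate in enumerate(validators):
            ok, error = validate(code)
            if not ok:
                code = agents[level](code, error)  # targeted repair
                break  # any edit must re-clear earlier levels too
        else:
            return code, True  # passed syntax, compile, and behavior
    return code, False  # budget exhausted without convergence
```

Restarting from L1 after every repair is what makes the hierarchy a funnel: a behavioral fix cannot silently reintroduce a syntax or compile error.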

If this is right

  • Decompiled code from standard tools becomes largely re-executable rather than merely readable.
  • Behavioral testing must supplement compilation checks to achieve functional correctness.
  • The same refinement loop improves results across RetDec, Ghidra, and Angr decompilers.
  • Ninety percent of cases reach correctness within two iterations at low per-binary cost.
  • Constraint-guided agents outperform direct LLM decompilation when using identical model backbones.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Existing reverse-engineering suites could embed this post-processing step to raise the fraction of usable output.
  • Replacing or augmenting the LLM test generator with symbolic or coverage-guided oracles might increase path coverage.
  • Low cost per binary makes batch processing of large legacy or malware corpora feasible for security teams.
  • The same layered-constraint plus agent-refinement pattern could transfer to related tasks such as automated code porting or legacy patching.
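
The oracle-replacement point can be made concrete with a toy greedy selector (purely illustrative: `trace` is a hypothetical instrumentation hook returning the set of branches an input exercises, standing in for a KLEE- or AFL-style engine; nothing here is from the paper):

```python
import random

def coverage_guided_tests(trace, candidates, budget=1000, seed=0):
    """Greedily keep an input only when its execution trace covers a
    branch no earlier input reached (a toy stand-in for symbolic or
    fuzzing-based test generation)."""
    rng = random.Random(seed)
    covered, suite = set(), []
    for _ in range(budget):
        x = rng.choice(candidates)
        new_branches = trace(x) - covered
        if new_branches:
            covered |= new_branches
            suite.append(x)
    return suite, covered
```

Swapping such a selector in for free-form LLM test generation would let the L3 check report branch coverage rather than only pass/fail.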

Load-bearing premise

LLM-generated test cases exercise enough of the program's behavior to ensure the refined code matches the original binary on all important paths.

What would settle it

A binary for which the refined code passes all generated tests and compiles yet produces different runtime output or crashes on an input outside the test suite.
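
Such a witness is exactly what a differential harness searches for; in this sketch `run_original` and `run_recovered` are hypothetical callables wrapping the two executables (in practice, subprocess invocations feeding the same input to the original binary and the recompiled source):

```python
def find_divergence(run_original, run_recovered, inputs):
    """Return the first input on which the two executables disagree
    (a behavioral counterexample), or None if the sweep finds none."""
    for x in inputs:
        try:
            out_orig, out_recov = run_original(x), run_recovered(x)
        except Exception:
            return x  # a crash on either side also settles it
        if out_orig != out_recov:
            return x
    return None
```

Absence of a returned input is only as strong as the sweep: over a finite `inputs` set this is evidence of equivalence, not proof.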

Figures

Figures reproduced from arXiv: 2604.23940 by Kevin Leach, Xiaohan Wang, Yifan Zhang, Yueke Zhang, Yu Huang.

Figure 1. Motivating example: (a) Raw Ghidra output contains undefined functions and type errors. (b) Multi-level constraints …
Figure 2. Overview of Agent4Decompile. A binary is first processed by a traditional decompiler (e.g., Ghidra) to produce initial code C₀. The code then passes through a three-level constraint hierarchy: L1 (syntax), L2 (compilation), and L3 (execution). At each level, failures trigger a specialized LLM agent that repairs the code using error feedback. Code that passes all three levels yields re-executable output C …
Figure 4. Convergence analysis: re-executability improves …
Figure 5. Failure root cause distribution on the 1,641-binary …
Original abstract

Decompilation -- recovering source code from compiled binaries -- is essential for security analysis, malware reverse engineering, and legacy software maintenance. However, existing decompilers produce code that often fails to compile or execute correctly, limiting their practical utility. We present a multi-agent framework that transforms decompiled code into re-executable source through Multi-level Constraint-Guided Decompilation (MCGD). Our approach employs a hierarchical validation pipeline with three constraint levels: (1) syntactic correctness via parsing, (2) compilability via GCC, and (3) behavioral equivalence via LLM-generated test cases. When validation fails, specialized LLM agents iteratively refine the code using structured error feedback. We evaluate our framework on 1,641 real-world binaries from ExeBench across three decompilers (RetDec, Ghidra, and Angr). Our framework achieves 84-97% re-executability, improving baseline decompiler output by 28-89 percentage points. In comparison with state-of-the-art LLM-based decompilation methods using the same GPT-4o backbone, our approach (84.1%) outperforms LLM4Decompile (80.3%), SK2Decompile (73.9%), and SALT4Decompile (61.8%). Our ablation study reveals that execution-based validation is critical: compile-only approaches achieve 0% behavioral correctness despite 91-99% compilation rates. The system converges efficiently, with 90%+ binaries reaching correctness within 2 iterations at an average cost of $0.03-0.05 per binary. Our results demonstrate that constraint-guided agentic refinement can bridge the gap between raw decompiler output and practically useful source code.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a multi-agent framework (MCGD) for decompilation that applies hierarchical constraint-guided validation—syntactic parsing, GCC compilability, and behavioral equivalence via LLM-generated test cases—with iterative agent-based refinement on failures. Evaluated on 1,641 ExeBench binaries from RetDec, Ghidra, and Angr, it reports 84-97% re-executability (28-89 pp gains over baselines) and outperforms other GPT-4o-based methods (84.1% vs. 80.3%, 73.9%, 61.8%). An ablation shows compile-only validation reaches 91-99% compilation but 0% behavioral correctness; the system converges in ~2 iterations at low cost.

Significance. If the behavioral equivalence claims hold under rigorous coverage, the work offers a practical advance in turning unreliable decompiler output into executable source, with direct utility for security analysis and legacy maintenance. Strengths include the large-scale evaluation across three decompilers, direct same-backbone comparisons to LLM4Decompile/SK2Decompile/SALT4Decompile, the clear ablation isolating execution validation, and reported efficiency metrics ($0.03-0.05 per binary).

major comments (2)
  1. [§5] §5 (Evaluation) and abstract: the 84-97% re-executability and behavioral-correctness claims rest on level-3 validation using only LLM-generated test cases, yet no coverage metrics (branch/path), test-generation details, or independent oracle (e.g., differential execution against the original binary on held-out inputs) are reported. This is load-bearing because the refinement loop terminates on test passage and the ablation already shows compile-only yields 0% behavioral success.
  2. [§5.1] §5.1 (dataset and test-case generation): potential selection biases in the ExeBench subset and lack of statistical significance testing or variance reporting across the 1,641 binaries undermine the cross-method and cross-decompiler comparisons.
minor comments (2)
  1. The multi-agent architecture diagram (likely Figure 2 or 3) would benefit from explicit labeling of each agent's input/output and termination condition.
  2. [§3] Notation for the three constraint levels is introduced in the abstract but could be formalized with a short table or pseudocode in §3 for clarity.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the strengths of our large-scale evaluation, ablation studies, and efficiency metrics. We address each major comment below and will incorporate revisions to improve clarity and rigor.

Point-by-point responses
  1. Referee: [§5] §5 (Evaluation) and abstract: the 84-97% re-executability and behavioral-correctness claims rest on level-3 validation using only LLM-generated test cases, yet no coverage metrics (branch/path), test-generation details, or independent oracle (e.g., differential execution against the original binary on held-out inputs) are reported. This is load-bearing because the refinement loop terminates on test passage and the ablation already shows compile-only yields 0% behavioral success.

    Authors: We agree that additional details on test-case generation and validation would strengthen the presentation. The current manuscript describes the hierarchical pipeline and the critical role of execution validation (as shown by the ablation where compile-only yields 0% behavioral success), but does not report coverage statistics or an independent oracle. In the revised version we will: (1) expand the description of the LLM prompting strategy used to generate test cases, (2) report available coverage metrics (e.g., line coverage on the subset of functions where instrumentation is feasible), and (3) add a limitations paragraph discussing why a held-out differential-execution oracle was not employed (primarily due to the difficulty of synthesizing equivalent inputs for arbitrary real-world binaries). We believe the consistent gains across three decompilers and 1,641 binaries still provide evidence of practical utility, but we will make these clarifications explicit. revision: yes

  2. Referee: [§5.1] §5.1 (dataset and test-case generation): potential selection biases in the ExeBench subset and lack of statistical significance testing or variance reporting across the 1,641 binaries undermine the cross-method and cross-decompiler comparisons.

    Authors: The 1,641 binaries were drawn from ExeBench using the same filtering criteria applied in prior decompilation studies (compilable C code with no external dependencies) to ensure fair comparison with baselines. We acknowledge that explicit statistical testing and variance reporting are absent. In the revision we will add: (i) a clearer statement of the selection criteria, (ii) standard deviation or confidence intervals for the reported success rates, and (iii) statistical significance tests (e.g., McNemar’s test for paired method comparisons and ANOVA for cross-decompiler results). These additions will directly address concerns about bias and comparability. revision: yes
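
The paired test the rebuttal proposes is cheap to compute; a sketch of the continuity-corrected McNemar statistic over per-binary 0/1 success vectors (illustrative code, not the authors' analysis):

```python
def mcnemar_statistic(success_a, success_b):
    """Continuity-corrected McNemar chi-square for paired binary
    outcomes: only discordant pairs (one method succeeds where the
    other fails) carry information; concordant pairs drop out."""
    b = sum(1 for x, y in zip(success_a, success_b) if x and not y)
    c = sum(1 for x, y in zip(success_a, success_b) if not x and y)
    if b + c == 0:
        return 0.0  # the methods never disagree
    return (abs(b - c) - 1) ** 2 / (b + c)
```

Compared against a chi-square distribution with one degree of freedom, this tests whether, for example, MCGD's 84.1% versus LLM4Decompile's 80.3% reflects systematic per-binary wins rather than noise.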

Circularity Check

0 steps flagged

No circularity in empirical evaluation framework

full rationale

The paper presents an empirical multi-agent decompilation system evaluated directly on the external public ExeBench dataset (1,641 binaries) using three decompilers and compared against independent baselines (LLM4Decompile, SK2Decompile, SALT4Decompile) on the same GPT-4o backbone. Reported re-executability rates (84-97%) and ablation results (compile-only yields 0% behavioral correctness) are measured outcomes, not quantities derived from internal fitted parameters or self-referential definitions. No equations, uniqueness theorems, or ansatzes appear; the validation pipeline (syntactic, compilability, LLM test cases) is an explicit design choice whose effectiveness is tested externally rather than assumed by construction. This is a standard self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the domain assumption that LLMs can reliably generate test cases and perform code repair from error feedback; no explicit free parameters are fitted to data, and the only invented entity is the set of specialized refinement agents whose effectiveness is demonstrated only internally.

axioms (1)
  • domain assumption LLMs can generate test cases that adequately capture behavioral equivalence for the binaries under test
    Invoked in the behavioral equivalence validation step and the ablation study.
invented entities (1)
  • Specialized LLM agents for iterative refinement no independent evidence
    purpose: To receive structured error feedback from the three validation levels and rewrite code until it passes
    Core component of the multi-agent framework; no independent evidence outside the reported results is provided.

pith-pipeline@v0.9.0 · 5613 in / 1420 out tokens · 71909 ms · 2026-05-08T03:34:11.848593+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

50 extracted references · 16 canonical work pages · 7 internal anchors

  1. [1]

    Manish Acharya, Yifan Zhang, Kevin Leach, and Yu Huang. 2025. Optimizing code runtime performance through context-aware retrieval-augmented generation. arXiv preprint arXiv:2501.16692 (2025)

  2. [2]

    Toufique Ahmed and Premkumar Devanbu. 2022. Few-shot Training LLMs for Project-specific Code Summarization. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering (ASE)

  3. [3]

    Jordi Armengol-Estapé, Jackson Woodruff, Alexander Brauckmann, José Wesley de Souza Magalhães, and Michael F. P. O'Boyle. 2022. ExeBench: An ML-Scale Dataset of Executable C Functions. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming. ACM. doi:10.1145/3520312.3534867

  4. [4]

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732 (2021)

  5. [5]

    Gogul Balakrishnan and Thomas Reps. 2007. DIVINE: Discovering Variables in Executables. In Proceedings of the 8th International Conference on Verification, Model Checking, and Abstract Interpretation (VMCAI). 1–28

  6. [6]

    Cristian Cadar, Daniel Dunbar, and Dawson Engler. 2008. KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs. In Proceedings of the 8th USENIX Symposium on Operating Systems Design and Implementation (OSDI). 209–224

  7. [7]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

  8. [8]

    Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. 2024. Teaching Large Language Models to Self-Debug. In International Conference on Learning Representations (ICLR)

  9. [9]

    Vitaly Chipounov, Volodymyr Kuznetsov, and George Candea. 2011. S2E: A Platform for In-Vivo Multi-Path Analysis of Software Systems. In Proceedings of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). 265–278

  10. [10]

    Cristina Cifuentes. 1994. Reverse Compilation Techniques. PhD thesis, Queensland University of Technology

  11. [11]

    Cheng Fu, Huili Chen, Haolan Liu, Xinyun Chen, Yuandong Tian, Farinaz Koushanfar, and Jishen Zhao. 2019. CODA: An End-to-End Neural Program Decompiler. In Advances in Neural Information Processing Systems (NeurIPS)

  12. [12]

    Luca Gazzola, Daniela Micucci, and Leonardo Mariani. 2019. Automatic Software Repair: A Survey. IEEE Transactions on Software Engineering 45, 1 (2019), 34–67

  13. [13]

    Hex-Rays. 2018. IDA Pro: The Interactive Disassembler. https://hex-rays.com/ida-pro/

  14. [14]

    Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. 2023. MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework. arXiv preprint arXiv:2308.00352 (2023)

  15. [15]

    Deborah S. Katz, Jason Ruchti, and Eric Schulte. 2018. Using Recurrent Neural Networks for Decompilation. In 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER). 346–356

  16. [16]

    Junaed Younus Khan and Gias Uddin. 2022. Automatic Code Documentation Generation Using GPT-3. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering (ASE)

  17. [17]

    Johannes Kinder and Helmut Veith. 2008. Jakstab: A Static Analysis Platform for Binaries. In Proceedings of the 20th International Conference on Computer Aided Verification (CAV). 423–427

  18. [18]

    Jakub Křoustek, Peter Matula, and Avast Software. 2017. RetDec: A Retargetable Machine-Code Decompiler Based on LLVM. https://github.com/avast/retdec. Presented at Botconf 2017

  19. [19]

    Jeremy Lacomis, Pengcheng Yin, Edward J. Schwartz, Miltiadis Allamanis, Claire Le Goues, Graham Neubig, and Bogdan Vasilescu. 2019. DIRE: A Neural Approach to Decompiled Identifier Naming. In Proceedings of the 34th IEEE/ACM International Conference on Automated Software Engineering (ASE). 628–639

  20. [20]

    Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven C. H. Hoi. 2022. CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning. Advances in Neural Information Processing Systems (NeurIPS) (2022)

  21. [21]

    Xuan Bach D. Le, David Lo, and Claire Le Goues. 2016. History Driven Program Repair. IEEE Transactions on Software Engineering 42, 4 (2016), 318–339

  22. [22]

    Claire Le Goues, ThanhVu Nguyen, Stephanie Forrest, and Westley Weimer. 2012. GenProg: A Generic Method for Automatic Software Repair. IEEE Transactions on Software Engineering 38 (2012), 54–72

  23. [23]

    JongHyup Lee, Thanassis Avgerinos, and David Brumley. 2011. TIE: Principled Reverse Engineering of Types in Binary Programs. In Proceedings of the 18th Network and Distributed System Security Symposium (NDSS)

  24. [24]

    Jiliang Li, Yifan Zhang, Yu Huang, and Kevin Leach. 2025. Malmixer: Few-shot malware classification with retrieval-augmented semi-supervised learning. In 2025 IEEE 10th European Symposium on Security and Privacy (EuroS&P). IEEE, 268–288

  25. [25]

    Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Log...

  26. [26]

    StarCoder: May the Source Be with You! arXiv preprint arXiv:2305.06161 (2023)

  27. [27]

    Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d'Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushm...

  28. [28]

    Ruigang Liang, Ying Cao, Peiwei Hu, and Kai Chen. 2021. Neutron: An Attention-based Neural Decompiler. In Proceedings of the 2021 IEEE/ACM International Conference on Automated Software Engineering (ASE)

  29. [29]

    Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. 2025. DeepSeek-V3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556 (2025)

  30. [30]

    National Security Agency. 2019. Ghidra. https://ghidra-sre.org/. Software Reverse Engineering Framework

  31. [31]

    Hoang Duong Thien Nguyen, Dawei Qi, Abhik Roychoudhury, and Satish Chandra. 2013. SemFix: Program Repair via Semantic Analysis. In Proceedings of the 35th International Conference on Software Engineering (ICSE). 772–781

  32. [32]

    Theo X. Olausson, Jeevana Priya Inala, Chenglong Wang, Jianfeng Gao, and Armando Solar-Lezama. 2023. Demystifying GPT Self-Repair for Code Generation. arXiv preprint arXiv:2306.09896 (2023)

  33. [33]

    Hammond Pearce, Baleegh Ahmad, Benjamin Tan, Brendan Dolan-Gavitt, and Ramesh Karri. 2022. Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions. In 2022 IEEE Symposium on Security and Privacy (SP). 754–768

  34. [34]

    Jannik Pewny, Behrad Garmany, Robert Gawlik, Christian Rossow, and Thorsten Holz. 2015. Cross-Architecture Bug Search in Binary Executables. In 2015 IEEE Symposium on Security and Privacy (SP). 709–724

  35. [35]

    Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, ...

  36. [36]

    Kosta Serebryany. 2016. Continuous fuzzing with libFuzzer and AddressSanitizer. In 2016 IEEE Cybersecurity Development (SecDev). IEEE, 157–157

  37. [37]

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language Agents with Verbal Reinforcement Learning. In Advances in Neural Information Processing Systems (NeurIPS)

  38. [38]

    Yan Shoshitaishvili, Ruoyu Wang, Christopher Salls, Nick Stephens, Mario Polino, Andrew Dutcher, John Grosen, Siji Feng, Christophe Hauser, Christopher Kruegel, and Giovanni Vigna. 2016. SoK: (State of) The Art of War: Offensive Techniques in Binary Analysis. In 2016 IEEE Symposium on Security and Privacy (SP). IEEE, 138–157. doi:10.1109/SP.2016.17

  39. [39]

    André Silva, Sen Fang, and Martin Monperrus. 2023. RepairLLaMA: Efficient Representations and Fine-Tuned Adapters for Program Repair. arXiv preprint arXiv:2312.15698 (2023)

  40. [40]

    Hanzhuo Tan, Weihao Li, Xiaolong Tian, Siyi Wang, Jiaming Liu, Jing Li, and Yuqun Zhang. 2025. SK2Decompile: LLM-based Two-Phase Binary Decompilation from Skeleton to Skin. arXiv preprint arXiv:2509.22114 (2025)

  41. [41]

    Hanzhuo Tan, Qi Luo, Jing Li, and Yuqun Zhang. 2024. LLM4Decompile: Decompiling Binary Code with Large Language Models. arXiv:2403.05286 [cs.PL] https://arxiv.org/abs/2403.05286

  42. [42]

    Xiaohan Wang, Yuxin Hu, and Kevin Leach. 2025. Context-Guided Decompilation: A Step Towards Re-executability. arXiv preprint arXiv:2511.01763 (2025)

  43. [43]

    Yongpan Wang, Xin Xu, Xiaojie Zhu, Xiaodong Gu, and Beijun Shen. 2025. SALT4Decompile: Inferring source-level abstract logic tree for LLM-based binary decompilation. arXiv preprint arXiv:2509.14646 (2025)

  44. [44]

    Cerdic Wei Kit Wong. 2022. American fuzzy lop (AFL) fuzzer. (2022)

  45. [45]

    Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. 2023. Automated Program Repair in the Era of Large Pre-trained Language Models. In Proceedings of the 45th International Conference on Software Engineering (ICSE). IEEE

  46. [46]

    Xiaojun Xu, Chang Liu, Qian Feng, Heng Yin, Le Song, and Dawn Song. 2017. Neural Network-based Graph Embedding for Cross-Platform Binary Code Similarity Detection. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (CCS). 363–376

  47. [47]

    Khaled Yakdan, Sebastian Eschweiler, Elmar Gerhards-Padilla, and Matthew Smith. 2015. No More Gotos: Decompilation Using Pattern-Independent Control-Flow Structuring and Semantic-Preserving Transformations. In Proceedings of the 22nd Network and Distributed System Security Symposium (NDSS). Internet Society

  48. [48]

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. In International Conference on Learning Representations (ICLR)

  49. [49]

    Yifan Zhang, Chen Huang, Kevin Cao, Yueke Zhang, Scott Thomas Andersen, Huajie Shao, Kevin Leach, and Yu Huang. 2022. Pre-training representations of binary code using contrastive learning. arXiv preprint arXiv:2210.05102 (2022)

  50. [50]

    Yifan Zhang and Kevin Leach. 2025. Training Large Language Models to Comprehend LLVM IR via Feedback-Driven Optimization. In Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering. 1477–1478