pith. machine review for the scientific record.

arxiv: 2605.10597 · v1 · submitted 2026-05-11 · 💻 cs.SE · cs.AI

Recognition: no theorem link

CrackMeBench: Binary Reverse Engineering for Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:39 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords binary reverse engineering · language model agents · CrackMeBench · benchmark · cybersecurity · executable analysis · validation logic recovery

The pith

CrackMeBench introduces a benchmark to test language-model agents on binary reverse engineering tasks using only executables and oracles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes CrackMeBench as a way to evaluate agents on recovering validation logic from compiled binaries in educational CrackMe problems. It uses a standardized shell interface in a sandboxed environment with common reverse engineering tools, scoring submissions based on whether they pass the program's checks. The benchmark features eight public tasks and twelve generated ones from C, Rust, and Go templates to allow reproducible testing. Model evaluations within a five-minute limit reveal differences in capabilities, particularly on harder tasks. This provides a focused testbed for advancing from source code reasoning to autonomous binary analysis.
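
A minimal sketch of what the agent-to-sandbox loop could look like, assuming the container has already been started with something like `docker run -d --network none --name crackme_sandbox <image>`; the container name, the agent interface, and the output truncation limit are illustrative assumptions, not details taken from the paper.

```python
import subprocess

SANDBOX = "crackme_sandbox"   # hypothetical container name
STEP_TIMEOUT_S = 30           # per-command timeout; the paper's overall budget is five minutes

def run_in_sandbox(command: str) -> str:
    """Execute one agent-issued shell command inside the no-network sandbox
    and return its combined stdout/stderr for the model's next turn."""
    proc = subprocess.run(
        ["docker", "exec", SANDBOX, "bash", "-lc", command],
        capture_output=True, text=True, timeout=STEP_TIMEOUT_S,
    )
    return (proc.stdout + proc.stderr)[:4000]   # truncation limit is an assumption

def episode(agent, task_prompt: str, max_steps: int = 40):
    """Alternate between the model proposing a command and the sandbox executing it,
    keeping the full command trace for later analysis."""
    trace, observation = [], task_prompt
    for _ in range(max_steps):
        command = agent.next_command(observation)   # hypothetical agent interface
        observation = run_in_sandbox(command)
        trace.append((command, observation))
        if agent.ready_to_submit():                 # hypothetical agent interface
            break
    return trace
```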

Core claim

CrackMeBench is a benchmark for language-model agents on deterministic binary validation problems: it combines public calibration CrackMes with generated tasks, has agents use local tools to produce inputs or keys the binary accepts, and logs detailed performance metrics for each run.

What carries the argument

The central mechanism is the use of executable oracles and externally scored submissions in a no-network Docker sandbox, allowing precise measurement of agent success on symbol-poor binaries.
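
Concretely, external scoring can amount to re-running the oracle binary on the candidate key outside the agent's control. A minimal sketch, assuming acceptance is signalled by exit code 0; the paper does not state the exact acceptance convention, so that criterion is an assumption.

```python
import subprocess

def score_submission(binary_path: str, candidate: str, timeout_s: float = 10.0) -> bool:
    """Feed the agent's candidate input/key to the task binary and let the
    program's own validation logic decide acceptance."""
    try:
        proc = subprocess.run(
            [binary_path],
            input=candidate + "\n",
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    # Assumed convention: the CrackMe exits 0 only when the check passes.
    return proc.returncode == 0
```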

If this is right

  • Models achieve sharply different pass@3 rates on the generated tasks, with the strongest model reaching 11 out of 12 (the pass@k bookkeeping is sketched after this list).
  • The generated half of the tasks separates model performances more clearly than the public split.
  • Detailed records of command traces and tool usage enable analysis of agent strategies.
  • The benchmark restricts scope to purpose-built educational programs to keep the measurement focused on progress in binary analysis.
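
A small sketch of how pass@1 and pass@3 could be tallied from per-task submission records; the record format here is hypothetical, not the paper's actual logging schema.

```python
from collections import defaultdict

def aggregate_pass_rates(records):
    """records: iterable of (model, task_id, submission_index, accepted), with
    submission_index in {1, 2, 3}. Returns pass@1 and pass@3 per model."""
    first = defaultdict(dict)    # accepted on the first scored submission
    within3 = defaultdict(dict)  # accepted on any of up to three submissions
    for model, task, idx, accepted in records:
        if idx == 1:
            first[model][task] = accepted
        within3[model][task] = within3[model].get(task, False) or accepted
    return {
        model: {
            "pass@1": sum(first[model].values()) / len(tasks),
            "pass@3": sum(tasks.values()) / len(tasks),
        }
        for model, tasks in within3.items()
    }

# With the 12 generated tasks, a model that solves 11 within three scored
# submissions reports pass@3 = 11/12 ≈ 0.92, matching the headline figure.
```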

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The benchmark could be extended to include tasks that require handling more complex obfuscation or larger binaries to better simulate real scenarios.
  • Success on this benchmark might correlate with improved agent performance in broader cybersecurity capture-the-flag challenges.
  • By providing a reproducible testbed, it allows tracking incremental improvements in agent capabilities over time as models evolve.

Load-bearing premise

The twelve generated tasks built from seeded templates represent the difficulty and structure of real-world binary reverse-engineering problems.
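
For intuition, here is an invented example of what one seeded template could look like: a Python generator that emits a small C validator whose accepted keys are fixed by the seed. The checksum rule, parameter ranges, and template are illustrative only; the paper does not document its actual templates, variation parameters, or difficulty filters at this level of detail (see the referee's major comment below).

```python
import random

C_TEMPLATE = """#include <stdio.h>
#include <string.h>

int main(void) {{
    char buf[64];
    if (!fgets(buf, sizeof buf, stdin)) return 1;
    buf[strcspn(buf, "\\n")] = 0;
    unsigned sum = 0;
    for (const char *p = buf; *p; ++p) sum = sum * 31u + (unsigned char)*p;
    if (strlen(buf) == {length} && sum % {modulus}u == {target}u) {{
        puts("Access granted");
        return 0;
    }}
    return 1;
}}
"""

def generate_task(seed: int):
    """Return (c_source, reference_key) for one seeded checksum-style task."""
    rng = random.Random(seed)
    length = rng.randint(6, 12)
    modulus = rng.choice([97, 251, 509])
    key = "".join(rng.choice("abcdefghijklmnopqrstuvwxyz") for _ in range(length))
    checksum = 0
    for ch in key:
        checksum = (checksum * 31 + ord(ch)) % 2**32   # mirror C's unsigned overflow
    target = checksum % modulus
    return C_TEMPLATE.format(length=length, modulus=modulus, target=target), key
```

Whether a dozen such seeded variants span the difficulty and structure of real-world binaries is exactly the premise at stake.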

What would settle it

Running the top-performing models on unmodified public CrackMe binaries from outside the benchmark's calibration set, and observing pass rates well below those reported on the generated split, would undercut the claim that the generated tasks are representative.

Figures

Figures reproduced from arXiv: 2605.10597 by Arthur Gervais, Isaac David.

Figure 1. End-to-end CrackMeBench execution pipeline. The model never receives direct filesystem […]
Figure 2. pass@3 by model and split. Generated tasks are the main score; public CrackMes are […]
Figure 3. Average wall-clock time, provider-reported token usage, and estimated USD cost per task.
read the original abstract

Benchmarks for coding agents increasingly measure source-level software repair, and cybersecurity benchmarks increasingly measure broad capture-the-flag performance. Classical binary reverse engineering remains less precisely specified: given only an executable, can an agent recover validation logic and produce an input, serial, artifact, or key generator accepted by the program? We introduce CrackMeBench, a benchmark for evaluating language-model agents on educational CrackMe-style reverse-engineering tasks. CrackMeBench focuses on deterministic binary validation problems with executable oracles, symbol-poor binaries, explicit local tool access, and externally scored submissions rather than free-form explanations. The v0 benchmark combines eight public calibration CrackMes with twelve generated main-score tasks built from seeded C, Rust, and Go templates, and agents run through an equal shell interface in a no-network Linux Docker sandbox with standard reverse-engineering tools. In a three-model evaluation with a five-minute budget and three scored submissions per task, pass@3 on the generated split is 11/12 tasks (92%) for GPT-5.5, 7/12 (58%) for Claude Opus 4.7, and 5/12 (42%) for Kimi K2. The harder generated half separates the models more sharply, with pass@3 of 5/6, 2/6, and 1/6, respectively; on the eight-task public calibration split, pass@3 is 3/8, 2/8, and 1/8. CrackMeBench records pass@1 and pass@3, scored submissions, wall-clock time, command traces, tool categories, provider-reported token usage, estimated cost, and qualitative failure labels, providing a reproducible testbed for measuring progress from source-code reasoning toward autonomous binary analysis while restricting scope to educational, purpose-built programs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces CrackMeBench, a benchmark for language-model agents performing educational CrackMe-style binary reverse engineering. It comprises eight public calibration tasks and twelve generated tasks derived from seeded C/Rust/Go templates, executed in a no-network Linux Docker sandbox with standard RE tools and externally scored submissions. The central empirical claim is that, under a five-minute budget with three submissions per task, pass@3 reaches 11/12 (92%) for GPT-5.5, 7/12 (58%) for Claude Opus 4.7, and 5/12 (42%) for Kimi K2 on the generated split (with sharper separation on the harder half), while public-split pass@3 is lower at 3/8, 2/8, and 1/8 respectively. The benchmark logs pass@1/3, command traces, tool usage, token counts, costs, and qualitative failure labels to support reproducible measurement of progress from source-level reasoning toward autonomous binary analysis, while explicitly limiting scope to purpose-built educational programs.

Significance. If the generated tasks adequately sample the targeted class of deterministic validation problems, the work supplies a concrete, tool-equipped, and externally scored testbed that fills a gap between source-code repair benchmarks and broad CTF evaluations. Strengths include the emphasis on executable oracles rather than free-form explanations, the provision of full interaction traces, and the separation of performance on easier versus harder generated tasks. These elements enable direct, falsifiable comparisons of agent capabilities in disassembly, data-flow reasoning, and key recovery.

major comments (1)
  1. [§3.2] §3.2 (Generated Task Construction): The description of how the twelve main-score tasks were produced from seeded templates provides no explicit list of templates, variation parameters, difficulty-filtering rules, or balancing criteria across languages and control-flow patterns. Without these details it is impossible to determine whether the reported 92% pass@3 for GPT-5.5 reflects general reverse-engineering competence or exploitation of recurring template structures (simple string comparisons, checksums, or limited brute-force opportunities) that may not appear in other educational CrackMes.
minor comments (2)
  1. [Abstract and §5] The abstract and §5 refer to model names (GPT-5.5, Claude Opus 4.7) that are not standard; a footnote or appendix clarifying the exact model identifiers and API versions used would improve reproducibility.
  2. [Results tables] Table 2 (or equivalent results table) would benefit from an additional column reporting the distribution of command categories (disassembly, debugging, scripting) across successful versus failed runs to make the tool-usage analysis more quantitative (a minimal tabulation sketch follows below).
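
A minimal sketch of the kind of tabulation that comment asks for, assuming command traces are available as (outcome, tool_category) records; the category names and record format are illustrative assumptions.

```python
from collections import Counter, defaultdict

def tool_usage_by_outcome(trace_records):
    """trace_records: iterable of (outcome, category) pairs, where outcome is
    'pass' or 'fail' and category is e.g. 'disassembly', 'debugging', 'scripting'.
    Returns, per outcome, the fraction of commands in each tool category."""
    counts = defaultdict(Counter)
    for outcome, category in trace_records:
        counts[outcome][category] += 1
    return {
        outcome: {cat: n / sum(c.values()) for cat, n in c.items()}
        for outcome, c in counts.items()
    }
```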

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and the recommendation of minor revision. The feedback highlights an important area for improving the clarity and reproducibility of the generated task construction process.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Generated Task Construction): The description of how the twelve main-score tasks were produced from seeded templates provides no explicit list of templates, variation parameters, difficulty-filtering rules, or balancing criteria across languages and control-flow patterns. Without these details it is impossible to determine whether the reported 92% pass@3 for GPT-5.5 reflects general reverse-engineering competence or exploitation of recurring template structures (simple string comparisons, checksums, or limited brute-force opportunities) that may not appear in other educational CrackMes.

    Authors: We agree that the current description of the generated tasks in §3.2 is insufficiently detailed for full reproducibility and to rule out potential template-specific biases. In the revised manuscript we will add: (1) an explicit enumeration of the four seeded templates per language (C, Rust, Go), (2) the complete set of variation parameters (control-flow transformations, validation predicate complexity, and input domain sizes), (3) the deterministic difficulty-filtering rules applied to select the final twelve tasks from the larger candidate pool, and (4) the balancing criteria used to ensure coverage across languages and control-flow patterns. These additions will be presented in a new table and accompanying text so that readers can independently judge whether the observed performance differences reflect general reverse-engineering capabilities. revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical measurements on benchmark tasks

full rationale

The paper introduces CrackMeBench, a benchmark mixing eight public calibration CrackMes with twelve tasks generated from seeded C/Rust/Go templates, then reports pass@3 scores from direct agent evaluations under fixed time and tool constraints. No equations, parameter fits, predictions derived from inputs, uniqueness theorems, or self-citations appear as load-bearing steps in any derivation chain. All reported results are straightforward empirical counts of successful submissions on externally scored binaries, with no reduction of outputs to the benchmark construction itself.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that the sandbox environment and generated tasks constitute a fair test of autonomous binary analysis; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption Agents receive only the executable, local tools, and an equal shell interface inside a no-network Linux Docker sandbox.
    Stated directly as the execution setting for all scored submissions.
  • domain assumption The twelve generated tasks are deterministic binary validation problems whose oracles can be run externally to score submissions.
    Core to the benchmark construction and scoring method.

pith-pipeline@v0.9.0 · 5620 in / 1280 out tokens · 42794 ms · 2026-05-12T04:39:03.242405+00:00 · methodology

discussion (0)

