REBENCH: A Procedural, Fair-by-Construction Benchmark for LLMs on Stripped-Binary Types and Names (Extended Version)
Pith reviewed 2026-05-07 09:27 UTC · model grok-4.3
The pith
REBench is a standardized benchmark for evaluating LLMs at recovering names and types from stripped binaries, with ground truth derived from stored byte-level stack data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
REBench consolidates a superset of existing datasets and adopts a knowledge-base-driven methodology that stores byte-level stack information to generate ground truth. This is claimed to preserve task difficulty while maintaining universal applicability. The design enables fair evaluation across tasks while avoiding simplifications that could bias results. Applied to measure LLM performance, the benchmark shows that models struggle on complex tasks.
What carries the argument
The knowledge-base-driven methodology that stores byte-level stack information to generate ground truth labels.
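To make the idea concrete, a knowledge-base entry of this kind might record each stack slot's byte-level location together with its pre-strip debug identity, so labels can be emitted for the stripped binary without leaking source-level cues. This is a minimal sketch; the schema and names (`StackSlot`, `ground_truth`) are hypothetical, not taken from the paper:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StackSlot:
    """One byte-level stack record, as such a KB might store it (hypothetical schema)."""
    frame_offset: int  # byte offset from the frame base
    size: int          # width of the slot in bytes
    type_name: str     # ground-truth type from pre-strip debug info
    var_name: str      # ground-truth name from pre-strip debug info


def ground_truth(slots):
    """Map each anonymous (offset, size) location seen in the stripped binary
    to its ground-truth (name, type) label supplied by the knowledge base."""
    return {(s.frame_offset, s.size): (s.var_name, s.type_name) for s in slots}


slots = [StackSlot(-8, 8, "char *", "buf"), StackSlot(-12, 4, "int", "len")]
labels = ground_truth(slots)
assert labels[(-12, 4)] == ("len", "int")
```

The point of the byte-level keying is that the evaluated model only ever sees the anonymous (offset, size) side; the KB holds the answer key.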
If this is right
- Existing disparate datasets can be replaced by a single superset that spans multiple architectures and optimization levels.
- LLM approaches can be compared directly using identical ground truth without preprocessing differences.
- Task difficulty remains consistent with real reverse engineering rather than being reduced by artificial simplifications.
- Performance gaps between simple and complex reverse engineering subtasks become measurable in a repeatable way.
Where Pith is reading between the lines
- The benchmark could serve as a training signal for fine-tuning models specifically on binary analysis patterns.
- Hybrid systems combining LLMs with traditional static analysis tools might be scored more reliably against the same ground truth.
- Extending the knowledge base to additional binary formats or runtime behaviors would test whether the current fairness properties hold at larger scale.
Load-bearing premise
Byte-level stack information stored in the knowledge base accurately reflects the real difficulty and challenges of reverse engineering without introducing new biases or simplifications.
What would settle it
If models evaluated on REBench produce accuracy scores that diverge sharply from their accuracy on an independently assembled collection of real stripped binaries drawn from the same source distributions, the claim of preserved difficulty would be open to question.
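A minimal version of that check would compare per-task accuracy on REBench against an independently collected set of real stripped binaries and flag any sharp divergence. The task names, numbers, and tolerance below are invented purely for illustration:

```python
def divergence_report(bench_acc, real_acc, tol=0.10):
    """Return the tasks (and gaps) where benchmark accuracy diverges from
    real-binary accuracy by more than `tol` absolute accuracy points."""
    return {task: abs(bench_acc[task] - real_acc[task])
            for task in bench_acc
            if abs(bench_acc[task] - real_acc[task]) > tol}


# Illustrative numbers only, not results from the paper.
bench_acc = {"name_recovery": 0.62, "type_inference": 0.48}
real_acc = {"name_recovery": 0.59, "type_inference": 0.31}

flagged = divergence_report(bench_acc, real_acc)
assert set(flagged) == {"type_inference"}
```

A non-empty report on tasks drawn from the same source distributions would be evidence against the preserved-difficulty claim.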
read the original abstract
Large Language Models (LLMs) have achieved remarkable progress in recent years, driving their adoption across a wide range of domains, including computer security. In reverse engineering, LLMs are increasingly applied to critical tasks such as function and variable name recovery and type inference. However, despite the rapid growth of research in this area, progress has been hindered by the absence of a standardized dataset. Existing studies rely on disparate datasets, preprocessing pipelines, and evaluation metrics, making fair comparisons between approaches difficult and obscuring a clear understanding of LLM capabilities in binary analysis. To address these challenges, we present REBench, a comprehensive benchmark dataset for evaluating LLMs on binary reverse engineering tasks. REBench consolidates a superset of existing datasets, comprising hundreds of millions of lines of source code and a diverse collection of binaries spanning multiple architectures and optimization levels. REBench adopts a knowledge-base-driven methodology that stores byte-level stack information to generate ground truth, ensuring that task difficulty is preserved while maintaining universal applicability. This design enables fair evaluation across tasks while avoiding simplifications that could bias results. As a use case, we apply REBench to measure the reverse engineering performance of LLMs and the result demonstrates difficulties in complex tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces REBench, a large-scale benchmark dataset for evaluating LLMs on stripped-binary reverse engineering tasks such as type inference and name recovery. It consolidates hundreds of millions of lines of source code and binaries across multiple architectures and optimization levels, and employs a knowledge-base-driven methodology that stores byte-level stack information to synthesize ground-truth labels. The authors claim this procedural construction preserves real task difficulty, ensures fairness and universal applicability, and avoids simplifications or biases that could distort results. As a use case, they apply REBench to LLMs and report that the models exhibit difficulties on complex tasks.
Significance. If the KB-driven construction can be shown to faithfully reproduce the information loss and aliasing of real stripped, optimized binaries without leaking exploitable higher-level cues, REBench would provide a much-needed standardized, reproducible evaluation framework for LLM-based binary analysis. This could enable fair cross-paper comparisons and clearer measurement of LLM capabilities in security-critical reverse-engineering settings.
major comments (2)
- [Abstract] The central claim that the knowledge-base-driven methodology 'ensures that task difficulty is preserved' and 'avoid[s] simplifications that could bias results' is load-bearing for the entire contribution, yet the abstract provides no explicit argument, measurement, or validation that byte-level stack records faithfully reproduce stripped-binary information loss or prevent leakage of consistent naming/stack-frame patterns that an LLM could exploit. This directly matches the weakest assumption identified in the stress-test note.
- [Abstract] (use-case paragraph) The statement that 'the result demonstrates difficulties in complex tasks' is presented without any metrics, task definitions, baselines, or quantitative breakdown of what constitutes 'complex' versus simpler tasks. Because the paper's value rests on demonstrating LLM limitations via this benchmark, the absence of supporting data undermines the claim.
minor comments (2)
- The phrase 'hundreds of millions of lines of source code' is ambiguous: clarify whether this refers to total source lines across all programs, unique functions, or another quantity, and state the exact number of binaries and functions in the consolidated dataset.
- The abstract refers to 'a superset of existing datasets' but does not list which prior datasets are included or how duplicates and preprocessing differences were resolved; this should be stated explicitly in the dataset-construction section.
Simulated Author's Rebuttal
Thank you for your thorough review and valuable suggestions. We address the major comments point by point below, indicating where revisions will be made to the manuscript.
read point-by-point responses
- Referee: [Abstract] The central claim that the knowledge-base-driven methodology 'ensures that task difficulty is preserved' and 'avoid[s] simplifications that could bias results' is load-bearing for the entire contribution, yet the abstract provides no explicit argument, measurement, or validation that byte-level stack records faithfully reproduce stripped-binary information loss or prevent leakage of consistent naming/stack-frame patterns that an LLM could exploit. This directly matches the weakest assumption identified in the stress-test note.
Authors: We agree that the abstract, due to length constraints, does not detail validation. The full manuscript (Section 3) describes how the knowledge base records only byte-level stack frames and locations extracted from debug information prior to stripping, with ground-truth labels synthesized without injecting higher-level cues. This construction is intended to replicate real information loss. We will revise the abstract to reference the methodology and the empirical comparisons in the evaluation section that assess label fidelity and task difficulty against real stripped binaries. revision: yes
- Referee: [Abstract] (use-case paragraph) The statement that 'the result demonstrates difficulties in complex tasks' is presented without any metrics, task definitions, baselines, or quantitative breakdown of what constitutes 'complex' versus simpler tasks. Because the paper's value rests on demonstrating LLM limitations via this benchmark, the absence of supporting data undermines the claim.
Authors: The abstract summarizes the use-case findings at a high level. The full paper's evaluation section defines complex tasks via explicit criteria (e.g., optimization level O3, functions with multiple variables or large stack frames, cross-architecture settings) and reports quantitative metrics including accuracy/F1 scores with breakdowns and comparisons to prior baselines. We will revise the abstract to include a concise quantitative summary of these results and task definitions. revision: yes
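The rebuttal's criteria could be operationalized as a simple complex/simple bucketing of functions before computing per-bucket accuracy. The thresholds and field names below are illustrative guesses, not the paper's definitions:

```python
def is_complex(fn, max_vars=5, max_frame=64):
    """Illustrative complexity rule: high optimization level, many variables,
    or a large stack frame (thresholds are hypothetical)."""
    return (fn["opt"] == "O3"
            or fn["n_vars"] > max_vars
            or fn["frame_bytes"] > max_frame)


def bucket_accuracy(results):
    """Accuracy broken down by the complex/simple split.
    `results` is a list of (function-metadata, prediction-correct) pairs."""
    buckets = {"complex": [], "simple": []}
    for fn, correct in results:
        buckets["complex" if is_complex(fn) else "simple"].append(correct)
    return {k: sum(v) / len(v) for k, v in buckets.items() if v}


# Toy example: the model succeeds on the simple function, fails on complex ones.
results = [
    ({"opt": "O0", "n_vars": 2, "frame_bytes": 32}, True),
    ({"opt": "O3", "n_vars": 9, "frame_bytes": 256}, False),
    ({"opt": "O3", "n_vars": 3, "frame_bytes": 48}, False),
]
assert bucket_accuracy(results) == {"complex": 0.0, "simple": 1.0}
```

Reporting this split alongside overall accuracy is what would substantiate the abstract's "difficulties in complex tasks" claim.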
Circularity Check
No significant circularity; benchmark construction is self-contained
full rationale
The paper presents REBench as a procedural benchmark built via a knowledge-base storing byte-level stack information to synthesize ground truth for type and name recovery tasks. This is a dataset construction methodology with no mathematical derivations, equations, fitted parameters, or predictions that reduce to their inputs by construction. The abstract's claim that the design 'ensures that task difficulty is preserved' and 'avoids simplifications that could bias results' is a direct statement of the construction choices rather than a self-referential loop or load-bearing self-citation. No uniqueness theorems, ansatzes smuggled via prior work, or renaming of known results appear. The contribution stands as an independent artifact whose validity can be assessed externally against real binaries and human RE difficulty, placing any concerns under correctness rather than circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Byte-level stack information stored in the knowledge base provides accurate and unbiased ground truth for stripped-binary type and name recovery tasks.
Reference graph
Works this paper leans on
- [1] 2023. Large Language Model for Software Engineering. https://github.com/gai4se/LLM4SE?tab=readme-ov-file#model-list
- [2] 2023. Python Transformers. https://pypi.org/project/transformers/
- [3] Unsloth AI. 2024. Unsloth Phi-4. https://huggingface.co/unsloth/phi-4
- [4]
- [5] Qibin Chen, Jeremy Lacomis, Edward J. Schwartz, Claire Le Goues, Graham Neubig, and Bogdan Vasilescu. 2022. Augmenting decompiler output with learned variable names and types. In 31st USENIX Security Symposium (USENIX Security 22). 4327–4343.
- [6] Yaniv David, Uri Alon, and Eran Yahav. 2020. Neural reverse engineering of stripped binaries using augmented control flow graphs. Proceedings of the ACM on Programming Languages 4, OOPSLA (2020), 1–28.
- [7] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. QLoRA: Efficient finetuning of quantized LLMs. Advances in Neural Information Processing Systems 36 (2023), 10088–10115.
- [8] Yue Duan, Xuezixiang Li, Jinghan Wang, and Heng Yin. 2020. DeepBinDiff: Learning program-wide code representations for binary diffing. In Network and Distributed System Security Symposium.
- [9] Dror G. Feitelson, Ayelet Mizrahi, Nofar Noy, Aviad Ben Shabat, Or Eliyahu, and Roy Sheffer. 2020. How developers choose names. IEEE Transactions on Software Engineering 48, 1 (2020), 37–52.
- [10] Bo Feng, Alejandro Mera, and Long Lu. 2020. P2IM: Scalable and hardware-independent firmware testing via automatic peripheral interface modeling. In 29th USENIX Security Symposium (USENIX Security 20). 1237–1254.
- [11] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. 2025. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025).
- [12, 13] Jingxuan He, Pesho Ivanov, Petar Tsankov, Veselin Raychev, and Martin Vechev. 2018. Debin: Predicting debug information in stripped binaries. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security. 1667–1680.
- [14, 15] Abram Hindle, Earl T. Barr, Mark Gabel, Zhendong Su, and Premkumar Devanbu. 2016. On the naturalness of software. Commun. ACM 59, 5 (2016), 122–131.
- [16] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. 2022. LoRA: Low-rank adaptation of large language models. ICLR 1, 2 (2022), 3.
- [17]
- [18] Linxi Jiang, Xin Jin, and Zhiqiang Lin. 2025. Beyond Classification: Inferring Function Names in Stripped Binaries via Domain Adapted LLMs. In NDSS.
- [19]
- [20] Xin Jin, Kexin Pei, Jun Yeon Won, and Zhiqiang Lin. 2022. SymLM: Predicting function names in stripped binaries via context-sensitive execution-aware code embeddings. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security. 1631–1645.
- [21] Hyunjin Kim, Jinyeong Bak, Kyunghyun Cho, and Hyungjoon Koo. 2023. A Transformer-based Function Symbol Name Inference Model from an Assembly Language for Binary Reversing. In Proceedings of the 2023 ACM Asia Conference on Computer and Communications Security. 951–965.
- [22] Jeremy Lacomis, Pengcheng Yin, Edward Schwartz, Miltiadis Allamanis, Claire Le Goues, Graham Neubig, and Bogdan Vasilescu. 2019. DIRE: A neural approach to decompiled identifier naming. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 628–639.
- [23] Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 4 (2020), 1234–1240.
- [24, 25] Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. 2023. StarCoder: may the source be with you! arXiv preprint arXiv:2305.06161 (2023).
- [26]
- [27] NSA. [n. d.]. Ghidra Decompiler. https://ghidra-sre.org/. Accessed: 2025-04-01.
- [28] OpenAI. [n. d.]. ChatGPT. https://chat.openai.com/. Accessed: 2026-04-10.
- [29] Kuntal Kumar Pal, Ati Priya Bajaj, Pratyay Banerjee, Audrey Dutcher, Mutsumi Nakamura, Zion Leonahenahe Basque, Himanshu Gupta, Saurabh Arjun Sawant, Ujjwala Anantheswaran, Yan Shoshitaishvili, et al. 2024. "Len or index or count, anything but v1": Predicting Variable Names in Decompilation Output with Transfer Learning. In 2024 IEEE Symposium on Security...
- [30] Kexin Pei, Jonas Guan, Matthew Broughton, Zhongtian Chen, Songchen Yao, David Williams-King, Vikas Ummadisetty, Junfeng Yang, Baishakhi Ray, and Suman Jana. 2021. StateFormer: Fine-grained type recovery from binaries using generative state modeling. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on th...
- [31]
- [32] Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. 2023. Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023).
- [33] Xiuwei Shang, Shaoyin Cheng, Guoqiang Chen, Yanming Zhang, Li Hu, Xiao Yu, Gangyang Li, Weiming Zhang, and Nenghai Yu. 2024. How Far Have We Gone in Binary Code Understanding Using Large Language Models. In 2024 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 1–12.
- [34] Qwen Team. 2024. Qwen2.5: A Party of Foundation Models. https://qwenlm.github.io/blog/qwen2.5/
- [35] Qwen Team. 2025. Qwen3 Technical Report. arXiv:2505.09388 [cs.CL]. https://arxiv.org/abs/2505.09388
- [36] Lisa Torrey and Jude Shavlik. 2010. Transfer learning. In Handbook of research on machine learning applications and trends: algorithms, methods, and techniques. IGI Global, 242–264.
- [37] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).
- [38] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).
- [39, 40] Danning Xie, Zhuo Zhang, Nan Jiang, Xiangzhe Xu, Lin Tan, and Xiangyu Zhang. 2024. ReSym: Harnessing LLMs to recover variable and data structure symbols from stripped binaries. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security. 4554–4568.
- [41] Xiangzhe Xu, Shiwei Feng, Yapeng Ye, Guangyu Shen, Zian Su, Siyuan Cheng, Guanhong Tao, Qingkai Shi, Zhuo Zhang, and Xiangyu Zhang. 2023. Improving Binary Code Similarity Transformer Models by Semantics-Driven Instruction Deemphasis. (2023).
- [42]
- [43] Jia Yang, Cai Fu, Xiao-Yang Liu, Heng Yin, and Pan Zhou. 2021. Codee: A tensor embedding scheme for binary code search. IEEE Transactions on Software Engineering 48, 7 (2021), 2224–2244.
- [44] Xinze Yang, Chunkai Zhang, Yizhi Sun, Kairui Pang, Luru Jing, Shiyun Wa, and Chunli Lv. 2023. FinChain-BERT: A High-Accuracy Automatic Fraud Detection Model Based on NLP Methods for Financial Scenarios. Information 14, 9 (2023), 499.
- [45] Zhuo Zhang, Yapeng Ye, Wei You, Guanhong Tao, Wen-chuan Lee, Yonghwi Kwon, Yousra Aafer, and Xiangyu Zhang. 2021. Osprey: Recovery of variable and data structure via probabilistic analysis for stripped binary. In 2021 IEEE Symposium on Security and Privacy (SP). IEEE, 813–832.
discussion (0)