pith. machine review for the scientific record.

arxiv: 2604.27319 · v1 · submitted 2026-04-30 · 💻 cs.CR · cs.LG · cs.SE

Recognition: unknown

REBENCH: A Procedural, Fair-by-Construction Benchmark for LLMs on Stripped-Binary Types and Names (Extended Version)

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 09:27 UTC · model grok-4.3

classification 💻 cs.CR · cs.LG · cs.SE
keywords reverse engineering · large language models · benchmark dataset · stripped binaries · type inference · name recovery · binary analysis · ground truth generation

The pith

REBench provides a standardized benchmark for evaluating LLMs on recovering names and types from stripped binaries, using stored byte-level stack data as ground truth.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces REBench as a consolidated dataset for evaluating large language models on reverse engineering tasks such as function and variable name recovery and type inference from stripped binaries. It draws from hundreds of millions of lines of source code and binaries across architectures and optimization levels to address the lack of standardized evaluation resources. The central mechanism stores byte-level stack information in a knowledge base to generate ground truth labels. This approach aims to keep the original task difficulty intact and support fair comparisons without the simplifications that have biased prior work. As a demonstration, the benchmark reveals that current LLMs face notable difficulties on the more complex instances of these tasks.

Core claim

REBench consolidates a superset of existing datasets and adopts a knowledge-base-driven methodology that stores byte-level stack information to generate ground truth. This preserves the original task difficulty while maintaining universal applicability, and the design enables fair evaluation across tasks without the simplifications that could bias results. When used to measure LLM performance, the benchmark shows that current models struggle on its more complex tasks.
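To make the mechanism concrete, here is a minimal sketch of what a knowledge-base record of byte-level stack information might look like. All field names and the schema are illustrative assumptions, not the paper's actual implementation; the idea is only that labels are keyed to byte offsets recorded before stripping, so the stripped binary the model sees is unchanged.

```python
from dataclasses import dataclass

# Hypothetical knowledge-base entry for one stack slot, recorded from
# debug information *before* the binary is stripped. Field names are
# illustrative; the paper's actual schema may differ.
@dataclass(frozen=True)
class StackSlot:
    binary_id: str      # which binary / optimization level the slot came from
    func_addr: int      # function entry address
    frame_offset: int   # byte offset within the stack frame
    size: int           # slot size in bytes
    type_name: str      # ground-truth type (e.g. "int", "char*")
    var_name: str       # ground-truth variable name

# The knowledge base maps (binary, function) -> its recorded stack slots.
kb: dict[tuple[str, int], list[StackSlot]] = {}

def record(slot: StackSlot) -> None:
    kb.setdefault((slot.binary_id, slot.func_addr), []).append(slot)

def ground_truth(binary_id: str, func_addr: int) -> dict[int, tuple[str, str]]:
    """Labels for a stripped function, keyed by frame offset.

    Because labels come from byte-level offsets rather than from any
    decompiler-side renaming, the stripped binary itself is untouched:
    the model still faces the full, unsimplified task.
    """
    return {s.frame_offset: (s.type_name, s.var_name)
            for s in kb.get((binary_id, func_addr), [])}

record(StackSlot("coreutils-O2", 0x401000, -8, 8, "char*", "path"))
record(StackSlot("coreutils-O2", 0x401000, -16, 4, "int", "fd"))
labels = ground_truth("coreutils-O2", 0x401000)
```

A grader would then match a model's predictions for a stripped function against `labels` by frame offset, so every approach is scored against identical ground truth.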

What carries the argument

The knowledge-base-driven methodology that stores byte-level stack information to generate ground truth labels.

If this is right

  • Existing disparate datasets can be replaced by a single superset that spans multiple architectures and optimization levels.
  • LLM approaches can be compared directly using identical ground truth without preprocessing differences.
  • Task difficulty remains consistent with real reverse engineering rather than being reduced by artificial simplifications.
  • Performance gaps between simple and complex reverse engineering subtasks become measurable in a repeatable way.
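On the last point, name-recovery work in this area is commonly scored with token-level F1 over split identifiers, so that `file_read` and `read_file` match while unrelated names do not. A minimal sketch of that style of metric (an assumption about the evaluation, not the paper's exact scoring code):

```python
import re

def name_tokens(name: str) -> list[str]:
    # Split snake_case and camelCase identifiers into lowercase tokens.
    parts = re.split(r"[_\W]+", name)
    tokens: list[str] = []
    for p in parts:
        tokens += re.findall(r"[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])", p)
    return [t.lower() for t in tokens if t]

def name_f1(pred: str, truth: str) -> float:
    """Token-level F1 between a predicted and a ground-truth name."""
    p, t = set(name_tokens(pred)), set(name_tokens(truth))
    common = len(p & t)
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(t)
    return 2 * precision * recall / (precision + recall)
```

With identical ground truth across approaches, scores like this become directly comparable between papers, which is the repeatability claim above.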

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The benchmark could serve as a training signal for fine-tuning models specifically on binary analysis patterns.
  • Hybrid systems combining LLMs with traditional static analysis tools might be scored more reliably against the same ground truth.
  • Extending the knowledge base to additional binary formats or runtime behaviors would test whether the current fairness properties hold at larger scale.

Load-bearing premise

Byte-level stack information stored in the knowledge base accurately reflects the real difficulty and challenges of reverse engineering without introducing new biases or simplifications.

What would settle it

If models evaluated on REBench produce accuracy scores that diverge sharply from their accuracy on an independently assembled collection of real stripped binaries drawn from the same source distributions, the claim of preserved difficulty would be open to question.
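That falsification test could be run mechanically: score the same model on REBench and on an independently assembled set of real stripped binaries, and flag divergence. A minimal sketch, where the tolerance value is an illustrative choice and not from the paper:

```python
def difficulty_preserved(acc_rebench: float, acc_real: float,
                         tolerance: float = 0.05) -> bool:
    """Check whether benchmark and real-world accuracy agree.

    If a model's REBench accuracy and its accuracy on independently
    collected stripped binaries diverge by more than `tolerance`,
    the preserved-difficulty claim would need re-examination.
    The 5-point tolerance is hypothetical.
    """
    return abs(acc_rebench - acc_real) <= tolerance
```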

Figures

Figures reproduced from arXiv: 2604.27319 by Jun Yeon Won, Shiqing Ma, Xin Jin, Zhiqiang Lin.

Figure 1. Differences of Decompiled Code Between With and Without Debug Symbols.
Figure 2. The Process of Normalization. The Example with Debug Symbols is Used for Better Understanding.
Figure 3. Example of LLM Inference Results. Comments are the
Figure 4. The Prompt Used to Infer Types and Names.
Figure 5. F1 Score on Function and Variable Name Recovery with and without CodeWordNet in x64 Architecture via Decompiled
Figure 6. F1 Score of Name Recovery and Type Inference with Different Sizes of CodeLlama in x64 Architecture.
original abstract

Large Language Models (LLMs) have achieved remarkable progress in recent years, driving their adoption across a wide range of domains, including computer security. In reverse engineering, LLMs are increasingly applied to critical tasks such as function and variable name recovery and type inference. However, despite the rapid growth of research in this area, progress has been hindered by the absence of a standardized dataset. Existing studies rely on disparate datasets, preprocessing pipelines, and evaluation metrics, making fair comparisons between approaches difficult and obscuring a clear understanding of LLM capabilities in binary analysis. To address these challenges, we present REBench, a comprehensive benchmark dataset for evaluating LLMs on binary reverse engineering tasks. REBench consolidates a superset of existing datasets, comprising hundreds of millions of lines of source code and a diverse collection of binaries spanning multiple architectures and optimization levels. REBench adopts a knowledge-base-driven methodology that stores byte-level stack information to generate ground truth, ensuring that task difficulty is preserved while maintaining universal applicability. This design enables fair evaluation across tasks while avoiding simplifications that could bias results. As a use case, we apply REBench to measure the reverse engineering performance of LLMs and the result demonstrates difficulties in complex tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces REBench, a large-scale benchmark dataset for evaluating LLMs on stripped-binary reverse engineering tasks such as type inference and name recovery. It consolidates hundreds of millions of lines of source code and binaries across multiple architectures and optimization levels, and employs a knowledge-base-driven methodology that stores byte-level stack information to synthesize ground-truth labels. The authors claim this procedural construction preserves real task difficulty, ensures fairness and universal applicability, and avoids simplifications or biases that could distort results. As a use case, they apply REBench to LLMs and report that the models exhibit difficulties on complex tasks.

Significance. If the KB-driven construction can be shown to faithfully reproduce the information loss and aliasing of real stripped, optimized binaries without leaking exploitable higher-level cues, REBench would provide a much-needed standardized, reproducible evaluation framework for LLM-based binary analysis. This could enable fair cross-paper comparisons and clearer measurement of LLM capabilities in security-critical reverse-engineering settings.

major comments (2)
  1. [Abstract] Abstract: The central claim that the knowledge-base-driven methodology 'ensures that task difficulty is preserved' and 'avoid[s] simplifications that could bias results' is load-bearing for the entire contribution, yet the abstract provides no explicit argument, measurement, or validation that byte-level stack records faithfully reproduce stripped-binary information loss or prevent leakage of consistent naming/stack-frame patterns that an LLM could exploit. This directly matches the weakest assumption identified in the stress-test note.
  2. [Abstract] Abstract (use-case paragraph): The statement that 'the result demonstrates difficulties in complex tasks' is presented without any metrics, task definitions, baselines, or quantitative breakdown of what constitutes 'complex' versus simpler tasks. Because the paper's value rests on demonstrating LLM limitations via this benchmark, the absence of supporting data undermines the claim.
minor comments (2)
  1. The phrase 'hundreds of millions of lines of source code' is ambiguous: clarify whether this refers to total source lines across all programs, unique functions, or another quantity, and state the exact number of binaries and functions in the consolidated dataset.
  2. The abstract refers to 'a superset of existing datasets' but does not list which prior datasets are included or how duplicates and preprocessing differences were resolved; this should be stated explicitly in the dataset-construction section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for your thorough review and valuable suggestions. We address the major comments point by point below, indicating where revisions will be made to the manuscript.

point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that the knowledge-base-driven methodology 'ensures that task difficulty is preserved' and 'avoid[s] simplifications that could bias results' is load-bearing for the entire contribution, yet the abstract provides no explicit argument, measurement, or validation that byte-level stack records faithfully reproduce stripped-binary information loss or prevent leakage of consistent naming/stack-frame patterns that an LLM could exploit. This directly matches the weakest assumption identified in the stress-test note.

    Authors: We agree that the abstract, due to length constraints, does not detail validation. The full manuscript (Section 3) describes how the knowledge base records only byte-level stack frames and locations extracted from debug information prior to stripping, with ground-truth labels synthesized without injecting higher-level cues. This construction is intended to replicate real information loss. We will revise the abstract to reference the methodology and the empirical comparisons in the evaluation section that assess label fidelity and task difficulty against real stripped binaries. revision: yes

  2. Referee: [Abstract] Abstract (use-case paragraph): The statement that 'the result demonstrates difficulties in complex tasks' is presented without any metrics, task definitions, baselines, or quantitative breakdown of what constitutes 'complex' versus simpler tasks. Because the paper's value rests on demonstrating LLM limitations via this benchmark, the absence of supporting data undermines the claim.

    Authors: The abstract summarizes the use-case findings at a high level. The full paper's evaluation section defines complex tasks via explicit criteria (e.g., optimization level O3, functions with multiple variables or large stack frames, cross-architecture settings) and reports quantitative metrics including accuracy/F1 scores with breakdowns and comparisons to prior baselines. We will revise the abstract to include a concise quantitative summary of these results and task definitions. revision: yes
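The complexity criteria the rebuttal names (optimization level, variable count, frame size, cross-architecture settings) amount to a simple bucketing rule over benchmark instances. A hypothetical sketch, with thresholds chosen for illustration rather than taken from the paper:

```python
# Bucket benchmark instances into "simple" vs "complex" using the kinds
# of criteria the rebuttal describes. Thresholds are illustrative only.
def is_complex(opt_level: str, n_vars: int, frame_bytes: int,
               cross_arch: bool = False) -> bool:
    return (opt_level == "O3"
            or n_vars > 5
            or frame_bytes > 128
            or cross_arch)

instances = [
    {"opt_level": "O0", "n_vars": 2, "frame_bytes": 32},
    {"opt_level": "O3", "n_vars": 3, "frame_bytes": 48},
    {"opt_level": "O2", "n_vars": 8, "frame_bytes": 256},
]
buckets = {"simple": 0, "complex": 0}
for inst in instances:
    buckets["complex" if is_complex(**inst) else "simple"] += 1
```

Reporting accuracy/F1 per bucket, as the authors propose, would turn "difficulties in complex tasks" into a quantitative, checkable claim.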

Circularity Check

0 steps flagged

No significant circularity; benchmark construction is self-contained

full rationale

The paper presents REBench as a procedural benchmark built via a knowledge-base storing byte-level stack information to synthesize ground truth for type and name recovery tasks. This is a dataset construction methodology with no mathematical derivations, equations, fitted parameters, or predictions that reduce to their inputs by construction. The abstract's claim that the design 'ensures that task difficulty is preserved' and 'avoids simplifications that could bias results' is a direct statement of the construction choices rather than a self-referential loop or load-bearing self-citation. No uniqueness theorems, ansatzes smuggled via prior work, or renaming of known results appear. The contribution stands as an independent artifact whose validity can be assessed externally against real binaries and human RE difficulty, placing any concerns under correctness rather than circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The benchmark rests on the assumption that byte-level stack information from source code can serve as unbiased ground truth for real reverse-engineering difficulty across all tasks and architectures.

axioms (1)
  • domain assumption: Byte-level stack information stored in the knowledge base provides accurate and unbiased ground truth for stripped-binary type and name recovery tasks.
    Invoked to justify the fair-by-construction claim without post-hoc simplifications.

pith-pipeline@v0.9.0 · 5532 in / 1153 out tokens · 52743 ms · 2026-05-07T09:27:07.058751+00:00 · methodology

