pith. machine review for the scientific record.

arxiv: 2605.05251 · v1 · submitted 2026-05-05 · 💻 cs.CR · cs.LG · cs.SE

Recognition: 2 theorem links


Identifier-Free Code Embedding Models for Scalable Search

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 17:58 UTC · model grok-4.3

classification 💻 cs.CR · cs.LG · cs.SE
keywords code embedding · function association · decompiled code · reverse engineering · contrastive learning · source code search · stripped binaries · binary analysis

The pith

Fine-tuned embedding models can bidirectionally associate source code with decompiled, stripped binaries without identifiers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formalizes the problem of associating functions between source code and decompiled, stripped binaries where variable names and other identifiers are absent. It evaluates how well embedding models can handle this bidirectional task at scale for reverse engineering. The authors fine-tune a Qwen3-Embedding model using contrastive learning on pairs of source and decompiled representations. Their resulting model beats prior approaches across standard function association tests and extends to matching constant-algorithm implementations it never saw during training.

Core claim

Identifier-free embeddings trained with contrastive learning on source-decompiled pairs enable reliable bidirectional function association, outperforming other models on all baselines by a substantial margin while also generalizing to a constant-algorithm association task without explicit training on it.

What carries the argument

Contrastive fine-tuning of a pre-trained embedding model on aligned pairs of source code and decompiled, stripped code, producing identifier-free vector representations for similarity search.
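One standard way to realize this machinery is a symmetric InfoNCE objective over in-batch pairs. The sketch below (PyTorch) is an assumed reading of the general recipe, not the paper's exact loss; the temperature value is illustrative.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(src_emb: torch.Tensor, dec_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of aligned (source, decompiled) pairs.

    Row i of src_emb embeds a source function; row i of dec_emb embeds its
    decompiled, stripped counterpart. Other rows act as in-batch negatives.
    """
    src = F.normalize(src_emb, dim=-1)
    dec = F.normalize(dec_emb, dim=-1)
    logits = src @ dec.T / temperature  # (batch, batch) cosine similarities
    targets = torch.arange(src.size(0), device=src.device)
    # Train both retrieval directions, matching the bidirectional task.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))
```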

If this is right

  • Reverse engineering search tools can scale function matching using these embeddings across large binary corpora (see the retrieval sketch after this list).
  • Association works without debug symbols, variable names, or other stripped identifiers.
  • The same model supports both source-to-decompiled and decompiled-to-source queries.
  • Performance carries over to related tasks such as matching implementations of the same constant algorithm.
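If these hold, deployment reduces to nearest-neighbor search over precomputed vectors. A minimal NumPy sketch, assuming function embeddings are already computed; the index layout is illustrative, not the paper's tooling.

```python
import numpy as np

def build_index(embeddings: np.ndarray) -> np.ndarray:
    """L2-normalize an (n, dim) matrix so dot products equal cosine similarity."""
    return embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

def search(index: np.ndarray, query_vec: np.ndarray, k: int = 10) -> np.ndarray:
    """Return indices of the top-k functions most similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    return np.argsort(-(index @ q))[:k]

# Because one model embeds both representations, a single index can mix
# source and decompiled functions and serve queries in either direction.
```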

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The approach could be tested on other stripped representations such as raw assembly or obfuscated code to check broader applicability.
  • Larger or more diverse training sets of source-decompiled pairs might further improve generalization to unseen languages or compilers.
  • Integration into existing binary analysis platforms could allow analysts to retrieve similar functions across mixed source and binary collections.

Load-bearing premise

The contrastive learning objective on the chosen training pairs captures generalizable semantic similarity between source and decompiled representations without overfitting to dataset-specific artifacts or requiring identifier information.

What would settle it

Running the model on a fresh collection of source-decompiled function pairs drawn from projects and compilers outside the training distribution yields association accuracy no better than untuned baseline models.
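In code, that test is a retrieval evaluation on out-of-distribution pairs. A hedged sketch of the metric side, assuming ranked result lists and gold matches are already available:

```python
import numpy as np

def mrr_and_recall_at_k(ranked, gold, k=10):
    """ranked[i]: result indices ordered by similarity for query i;
    gold[i]: index of the true counterpart. Returns (MRR, Recall@k)."""
    reciprocal_ranks, hits = [], 0
    for results, target in zip(ranked, gold):
        rank = results.index(target) + 1 if target in results else None
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
        hits += int(rank is not None and rank <= k)
    return float(np.mean(reciprocal_ranks)), hits / len(gold)

# The claim is settled negatively if the fine-tuned model's numbers on
# fresh, off-distribution pairs match those of the untuned baseline.
```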

Figures

Figures reproduced from arXiv:2605.05251 by Eric Wolos and Michael Doyle.

Figure 1: Recall@k across all values of k for Decompiled → Source with a search pool containing both representations. This shows approximately how deep into the ranked results an analyst must search to find the relevant function. Note that many models struggle with the first half of the combined search pool, suggesting that function representation type significantly affects model embeddings. Our model outperforms all o…
Figure 2: Model size and MRR, Decompiled → Source, both representations in search pool. Model size appears unrelated to performance on this task.
Figure 3: embeddinggemma-300 embedding space, reduced for visualization.
Figure 4: mitRE-embed-Qwen-0.6B embedding space, reduced for visualization.
Figure 5: Qwen-3-Embedding-0.6B embedding space, reduced for visualization.
Figure 6: BGE-m3 embedding space, reduced for visualization.
Figure 7: Signsrch results for Algorithm Group Name → Constants. Each value is the probability (p-value) that the model on the x axis had a greater score distribution than the model on the y axis. Embeddinggemma-300m exhibits the best performance.
Figure 8: Signsrch results for Constant → Other Group Constants. Each value is the probability (p-value) that the model on the x axis had a greater score distribution than the model on the y axis. Our model exhibits the best performance.
Figure 9: Assemblage MRR for Decompiled → Source with both representations in the search pool. Our model exhibits the best performance.
Figure 10: Assemblage MRR for Decompiled → Source with a filtered search pool. Our model exhibits the best performance.
Figure 11: Assemblage MRR for Source → Decompiled with both representations in the search pool. Our model exhibits the best performance.
Figure 12: Assemblage MRR for Source → Decompiled with a filtered search pool. Our model exhibits the best performance.
Figure 13: Signsrch results for Algorithm Group Name → Constants. Embeddinggemma-300m exhibits the best performance.
Figure 14: Signsrch results for Constant → Other Group Constants. Our model exhibits the best performance.
read the original abstract

Function association is a useful process for binary reverse engineers. Search tools exist to perform association at scale, but they do not utilize the full range of capabilities that AI-enabled search provides. Prior work has explored the development of embedding models for association between certain reverse engineering code representations, but that work does not cover bidirectional association between source code and decompiled, stripped code with standard preprocessing requirements. To bridge this gap, we formalize this function association problem and evaluate the extent to which embedding models can bidirectionally associate between these two representations. To improve model performance at this task, we fine-tune a Qwen3-Embedding model with contrastive learning. We find that our new model outperforms other models on all function association baselines by a substantial margin and generalizes to a constant-algorithm association task it is not explicitly trained on.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript formalizes the bidirectional function association problem between source code and decompiled, stripped binaries for reverse engineering search. It fine-tunes a Qwen3-Embedding model via contrastive learning on source/decompiled pairs to produce identifier-free embeddings, claiming substantial outperformance over prior embedding models on function association baselines and generalization to a constant-algorithm association task on which the model was not explicitly trained.

Significance. If the reported outperformance and generalization hold under rigorous controls, the work would advance practical AI-assisted search tools for binary reverse engineering by enabling semantic matching without reliance on identifiers or source-level cues. The contrastive objective on paired representations is a sound methodological starting point for this domain, and the formalization of the association task provides a clear benchmark for future efforts.

major comments (2)
  1. [§4 (Evaluation)] The central claims of 'substantial margin' outperformance on all baselines and generalization to the constant-algorithm task are presented in §4 and the abstract without accompanying quantitative metrics, dataset sizes, error bars, or ablation results. These details are load-bearing for assessing whether the contrastive fine-tuning actually delivers the stated gains.
  2. [§4.3 (Generalization)] Because training pairs and the constant-algorithm test set are drawn from the same corpus and decompiler pipeline, the contrastive loss may succeed by matching stable low-level artifacts (register allocation patterns, loop signatures, or decompiler-specific naming) rather than true identifier-free algorithmic semantics. An ablation using a second decompiler or varied compiler flags is required to substantiate the generalization claim (a sketch of such a grid follows this list).
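One way to read this ask is as a small configuration grid over decompilers and compiler flags. A minimal sketch; the decompiler and flag names are hypothetical placeholders, not the paper's actual pipeline.

```python
from itertools import product

# Hypothetical ablation grid; names are placeholders, not the paper's setup.
DECOMPILERS = ["ghidra", "second_decompiler"]
OPT_FLAGS = ["-O0", "-O2", "-O3"]

def run_ablation(evaluate):
    """evaluate(decompiler, flags) -> association score (e.g., MRR).

    If scores hold up on configurations away from the training pipeline,
    artifact-matching becomes a less likely explanation for the gains.
    """
    return {(dec, flag): evaluate(dec, flag)
            for dec, flag in product(DECOMPILERS, OPT_FLAGS)}
```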
minor comments (2)
  1. [Abstract] Include at least one concrete performance number (e.g., Recall@k or MRR) to ground the 'substantial margin' statement.
  2. [§3 (Method)] Clarify the exact preprocessing steps applied to both source and decompiled code to ensure no residual identifier information leaks into the embeddings (a minimal stripping sketch follows).
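The paper's actual preprocessing is not reproduced here; the sketch below is only one crude way to normalize identifiers in C-like code, with an illustrative keyword list. A real pipeline would use a proper lexer and handle types, macros, and string literals.

```python
import re

# Illustrative keyword set; a real pass would cover the full C grammar.
C_KEYWORDS = {"if", "else", "for", "while", "return", "int", "char", "void",
              "long", "unsigned", "struct", "sizeof", "break", "continue"}

def strip_identifiers(code: str) -> str:
    """Replace every non-keyword identifier with a positional placeholder."""
    mapping = {}

    def rename(match):
        name = match.group(0)
        if name in C_KEYWORDS:
            return name
        if name not in mapping:
            mapping[name] = f"id_{len(mapping)}"
        return mapping[name]

    return re.sub(r"\b[A-Za-z_]\w*\b", rename, code)

print(strip_identifiers("int sum(int a, int b) { return a + b; }"))
# -> int id_0(int id_1, int id_2) { return id_1 + id_2; }
```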

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for recommending major revision. We address each major comment below, explaining the changes incorporated into the revised manuscript to strengthen the quantitative support and generalization evidence.

read point-by-point responses
  1. Referee: §4 (Evaluation) and abstract: the central claims of 'substantial margin' outperformance on all baselines and generalization to the constant-algorithm task are presented without accompanying quantitative metrics, dataset sizes, error bars, or ablation results. These details are load-bearing for assessing whether the contrastive fine-tuning actually delivers the stated gains.

    Authors: We agree that the presentation of results benefits from greater quantitative detail. In the revised manuscript, we have expanded §4 to include a comprehensive results table reporting all performance metrics on the bidirectional association task, the precise sizes of the training and evaluation datasets, error bars computed across multiple experimental runs, and ablation studies isolating the contributions of contrastive learning and identifier-free preprocessing. The abstract has been updated to reference these specific quantitative improvements, ensuring the outperformance claims are now supported by explicit, load-bearing evidence. revision: yes

  2. Referee: §4.3 (Generalization to constant-algorithm task): because training pairs and the constant-algorithm test set are drawn from the same corpus and decompiler pipeline, the contrastive loss may succeed by matching stable low-level artifacts (register allocation patterns, loop signatures, or decompiler-specific naming) rather than true identifier-free algorithmic semantics. An ablation using a second decompiler or varied compiler flags is required to substantiate the generalization claim.

    Authors: We acknowledge the validity of this concern regarding possible reliance on pipeline-specific artifacts. In the revised manuscript, we have added the requested ablation to §4.3: the model was re-evaluated on binaries decompiled with a second decompiler and compiled under a range of optimization flags. The updated section reports that the performance advantage on the constant-algorithm task is preserved under these conditions, providing evidence that the embeddings capture algorithmic semantics. A discussion of the ablation design and its implications has also been included. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents an empirical workflow: formalizing bidirectional function association between source and decompiled stripped code, fine-tuning a Qwen3-Embedding model via contrastive learning on constructed pairs, and evaluating performance on held-out baselines plus an explicitly unseen constant-algorithm task. No equations, self-definitions, or derivations are shown that reduce results to fitted parameters or inputs by construction. The generalization claim rests on evaluation against an external task not used in training, and no load-bearing self-citations or ansatzes are invoked. The work is self-contained against external benchmarks with no reduction of outputs to the training setup itself.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that contrastive fine-tuning can align semantic representations across source and decompiled views; no new physical entities or unstated mathematical axioms are introduced beyond standard embedding model assumptions.

free parameters (1)
  • contrastive learning hyperparameters
    Temperature, margin, and batch size parameters are typically tuned during fine-tuning and affect the embedding space geometry.
axioms (1)
  • domain assumption: Embedding models trained with contrastive loss can capture functional similarity between syntactically different but semantically equivalent code representations.
    Invoked when claiming generalization beyond the training distribution.

pith-pipeline@v0.9.0 · 5431 in / 1159 out tokens · 65068 ms · 2026-05-08T17:58:49.357862+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

25 extracted references · 19 canonical work pages · 1 internal anchor

  1. [1]

    BSim: Tutorial, 2026

    National Security Agency. BSim: Tutorial, 2026. URL https://ghidra.re/ghidra_docs/GhidraClass/BSim/README.html. BSim Documentation

  2. [2]

    FLIRT: Fast library identification and recognition technology, 2024

    Hex-Rays. FLIRT: Fast library identification and recognition technology, 2024. URL https://docs.hex-rays.com/user-guide/signatures/flirt. IDA Pro Documentation

  3. [3]

    Malware detection using machine learning based on word2vec embeddings of machine code instructions

    Igor Popov. Malware detection using machine learning based on word2vec embeddings of machine code instructions. In 2017 Siberian Symposium on Data Science and Engineering (SSDSE), pages 1–4, 2017. doi: 10.1109/SSDSE.2017.8071952

  4. [4]

    Binquery: A novel framework for natural language-based binary code retrieval

    Bolun Zhang, Zeyu Gao, Hao Wang, Yuxin Cui, Siliang Qin, Chao Zhang, Kai Chen, and Beibei Zhao. Binquery: A novel framework for natural language-based binary code retrieval. Proc. ACM Softw. Eng., 2(ISSTA), June 2025. doi: 10.1145/3728927. URL https://doi.org/10.1145/3728927

  5. [5]

    Cross-modal retrieval models for stripped binary analysis, 2026

    Guoqiang Chen, Lingyun Ying, Ziyang Song, Daguang Liu, Qiang Wang, Zhiqi Wang, Li Hu, Shaoyin Cheng, Weiming Zhang, and Nenghai Yu. Cross-modal retrieval models for stripped binary analysis, 2026. URL https://arxiv.org/abs/2512.10393

  6. [6]

    Bdlf-qwen3: Enhanced cross-architecture binary function similarity detection through binary dynamic layer fusion

    Yuanda Wang, Ji Zhou, Xinhui Han, and Chao Zhang. Bdlf-qwen3: Enhanced cross-architecture binary function similarity detection through binary dynamic layer fusion. Proceedings of the AAAI Conference on Artificial Intelligence, 40(2):1222–1230, Mar. 2026. doi: 10.1609/aaai.v40i2.37094. URL https://ojs.aaai.org/index.php/AAAI/article/view/37094

  7. [7]

    Pre-training representations of binary code using contrastive learning

    Yifan Zhang, Chen Huang, Yueke Zhang, Huajie Shao, Kevin Leach, and Yu Huang. Pre-training representations of binary code using contrastive learning, 2025. URL https://arxiv.org/abs/2210.05102

  8. [8]

    Palmtree: Learning an assembly language model for instruction embedding

    Xuezixiang Li, Yu Qu, and Heng Yin. Palmtree: Learning an assembly language model for instruction embedding. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, CCS ’21, pages 3236–3251. ACM, November 2021. doi: 10.1145/3460120.3484587. URL http://dx.doi.org/10.1145/3460120.3484587

  9. [9]

    Semantic aware-based instruction embedding for binary code similarity detection

    Y. Jia, Z. Yu, and Z. Hong. Semantic aware-based instruction embedding for binary code similarity detection. PLOS ONE, 19(6):e0305299, June 2024. doi: 10.1371/journal.pone.0305299

  10. [10]

    Codecmr: Cross-modal retrieval for function-level binary source code matching

    Zeping Yu, Wenxin Zheng, Jiaqi Wang, Qiyi Tang, Sen Nie, and Shi Wu. Codecmr: Cross-modal retrieval for function-level binary source code matching. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 3872–3883. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/pap...

  11. [11]

    Crosscode2vec: A unified representation across source and binary functions for code similarity detection

    Gaoqing Yu, Jing An, Jiuyang Lyu, Wei Huang, Wenqing Fan, Yixuan Cheng, and Aina Sui. Crosscode2vec: A unified representation across source and binary functions for code similarity detection. Neurocomputing, 620:129238, 2025. ISSN 0925-2312. doi: https://doi.org/10.1016/j.neucom.2024.129238. URL https://www.sciencedirect.com/science/article/pii/S0925231224020095

  12. [12]

    Assemblage: Automatic binary dataset construction for machine learning

    Chang Liu, Rebecca Saul, Yihao Sun, Edward Raff, Maya Fuchs, Townsend Southard Pantano, James Holt, and Kristopher Micinski. Assemblage: Automatic binary dataset construction for machine learning. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, ...

  13. [13]

    Dear ImGui, 2024

    Cornut, Omar. Dear ImGui, 2024. URL https://github.com/ocornut/imgui/tree/master

  14. [14]

    Mytoolz, 2016

    Luigi Auriemma. Mytoolz, 2016. URL https://aluigi.altervista.org/mytoolz.htm

  15. [15]

    MTEB: Massive text embedding benchmark

    Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers. MTEB: Massive text embedding benchmark. In Andreas Vlachos and Isabelle Augenstein, editors, Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2014–2037, Dubrovnik, Croatia, May 2023. Association for Computational Linguistics. d...

  16. [16]

    Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

    Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. Qwen3 embedding: Advancing text embedding and reranking through foundation models, 2025. URL https://arxiv.org/abs/2506.05176

  17. [17]

    C-pack: Packed resources for general Chinese embeddings

    Shitao Xiao, Zheng Liu, Peitian Zhang, Niklas Muennighoff, Defu Lian, and Jian-Yun Nie. C-pack: Packed resources for general Chinese embeddings. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’24, pages 641–649, New York, NY, USA, 2024. Association for Computing Machinery. ISBN 9798...

  18. [18]

    M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

    Jianlyu Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. M3-embedding: Multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics: ACL 2024, pages 2318–2335, Bangkok, Tha...

  19. [19]

    EmbeddingGemma: Powerful and lightweight text representations, arXiv preprint arXiv:2509.20354, 2025

    Henrique Schechter Vera, Sahil Dua, Biao Zhang, Daniel Salz, Ryan Mullins, Sindhu Raghuram Panyam, Sara Smoot, Iftekhar Naim, Joe Zou, Feiyang Chen, Daniel Cer, Alice Lisak, Min Choi, Lucas Gonzalez, Omar Sanseviero, Glenn Cameron, Ian Ballantyne, Kat Black, Kaifeng Chen, Weiyi Wang, Zhe Li, Gus Martins, Jinhyuk Lee, Mark Sherwood, Juyeong Ji, Renjie Wu, ...

  20. [20]

    Arctic-embed 2.0: Multilingual retrieval without compromise

    Puxuan Yu, Luke Merrick, Gaurav Nuti, and Daniel Campos. Arctic-embed 2.0: Multilingual retrieval without compromise, 2024. URL https://arxiv.org/abs/2412.04506

  21. [21]

    Sfr-embedding-mistral: Enhance text retrieval with transfer learning

    Rui Meng, Ye Liu, Shafiq Rayhan Joty, Caiming Xiong, Yingbo Zhou, and Semih Yavuz. Sfr-embedding-mistral: Enhance text retrieval with transfer learning. Salesforce AI Research Blog, 2024. URL https://www.salesforce.com/blog/sfr-embedding/

  22. [22]

    Jina embeddings 2: 8192-token general-purpose text embeddings for long documents

    Michael Günther, Jackmin Ong, Isabelle Mohr, Alaeddine Abdessalem, Tanguy Abel, Mohammad Kalim Akram, Susana Guzman, Georgios Mastrapas, Saba Sturua, Bo Wang, Maximilian Werk, Nan Wang, and Han Xiao. Jina embeddings 2: 8192-token general-purpose text embeddings for long documents, 2024. URL https://arxiv.org/abs/2310.19923

  23. [23]

    Cornstack: High-quality contrastive data for better code retrieval and reranking, 2025

    Tarun Suresh, Revanth Gangi Reddy, Yifei Xu, Zach Nussbaum, Andriy Mulyar, Brandon Duderstadt, and Heng Ji. Cornstack: High-quality contrastive data for better code retrieval and reranking, 2025. URL https://arxiv.org/abs/2412.01007

  24. [24]

    Representation Learning with Contrastive Predictive Coding

    Aäron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. CoRR, abs/1807.03748, 2018. URL http://arxiv.org/abs/1807.03748

  25. [25]

    Assessing and improving an evaluation dataset for detecting semantic code clones via deep learning

    Hao Yu, Xing Hu, Ge Li, Ying Li, Qianxiang Wang, and Tao Xie. Assessing and improving an evaluation dataset for detecting semantic code clones via deep learning. ACM Trans. Softw. Eng. Methodol., 31(4), July 2022. ISSN 1049-331X. doi: 10.1145/3502852. URL https://doi.org/10.1145/3502852