arxiv: 2605.09961 · v1 · submitted 2026-05-11 · 💻 cs.CR

Recognition: no theorem link

Towards LLM-Based Analysis of Virtualization-Obfuscated Code through Automated Data Generation

Sangjun An , Hyeyeon Park , Yejin Son , Seoksu Lee , Eun-Sun Cho

Authors on Pith no claims yet

Pith reviewed 2026-05-12 03:38 UTC · model grok-4.3

classification 💻 cs.CR

keywords virtualization obfuscationLLM-based analysisstatic analysiscode decompositiondataset generationbinary obfuscationreverse engineering

0 comments

The pith

Virtualization-obfuscated binaries are decomposed into LLM-sized units labeled by structural roles using an automated static analysis framework.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the challenge of using large language models to analyze virtualization-obfuscated code, which produces very large and complex binaries that exceed typical LLM input limits and lack labeled data for training. It proposes focusing on structural analysis by breaking the binaries into the largest possible semantically coherent units that fit within constraints and labeling them based on their roles in the structure. A static analysis framework automates this labeling process to generate large-scale datasets. This enables LLMs to handle such obfuscated code more effectively without needing full semantic understanding upfront.

Core claim

Obfuscated binaries can be decomposed into the largest semantically coherent units that fit within LLM constraints and labeled according to their structural roles through an automated static analysis framework, which enables large-scale dataset generation and shows strong performance on real-world virtualization obfuscators.

What carries the argument

The static analysis framework for decomposing binaries into coherent units and assigning structural role labels to automate dataset creation.

If this is right

LLMs can process obfuscated code despite size limits by analyzing these labeled units.
Large-scale labeled datasets become available for training models on virtualization-obfuscated code.
Analysis of real-world obfuscated binaries improves without manual labeling.
Structural focus allows useful insights even when full semantics are hard to recover.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Such decomposition might be combined with other analysis techniques to enhance overall reverse engineering.
Extending this to dynamic analysis could provide additional labels for better LLM performance.
This could reduce the barrier for researchers studying malware protected by virtualization obfuscation.

Load-bearing premise

Structural role labels assigned by static analysis are sufficient for LLMs to perform useful analysis and the decomposition preserves necessary information without introducing systematic errors.

What would settle it

Demonstrating that LLMs trained or prompted with these labeled units perform no better on downstream tasks like identifying obfuscation handlers than those using random or no labels would falsify the utility of the approach.

Figures

Figures reproduced from arXiv: 2605.09961 by Eun-Sun Cho, Hyeyeon Park, Sangjun An, Seoksu Lee, Yejin Son.

**Figure 2.** Figure 2: Overall framework of the proposed automated labeling and LLM-based analysis pipeline. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Architectural comparison of VM structures based on dispatch options: Loop-Switch versus Threaded. [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Complete visualized CFG with color-coded virtualization components identified by BERT. [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

read the original abstract

Virtualization-based obfuscation produces extremely large and structurally complex binaries, posing challenges for LLM-based analysis due to input size limits and the need for large-scale labeled data. We address this by focusing on structural rather than full semantic analysis. Obfuscated binaries are decomposed into the largest semantically coherent units that fit within LLM constraints and are labeled according to their structural roles. We implement a static analysis framework to automate labeling and enable large-scale dataset generation. Our prototype shows strong performance on real-world virtualization obfuscators.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper sketches a static-analysis pipeline to auto-generate labeled chunks of virtualization-obfuscated binaries for LLMs, but supplies no numbers to show the labels are reliable.

read the letter

The core idea here is straightforward: break large obfuscated binaries into the biggest pieces that fit an LLM context window, then use static analysis to tag each piece with its structural role (handler, dispatcher, bytecode, etc.). This targets the data-scarcity problem that has slowed LLM work on virtualization obfuscation, where standard CFG recovery fails and manual labeling does not scale. That combination of decomposition plus automated structural labeling is the concrete step forward; prior work on binary analysis or LLM prompting for code has not, from the abstract, put these pieces together for this exact setting.

Referee Report

2 major / 1 minor

Summary. The paper proposes decomposing virtualization-obfuscated binaries into the largest semantically coherent units that fit within LLM context windows, then labeling those units by structural role (handler, dispatcher, bytecode, etc.) via an automated static-analysis framework. The resulting labeled units are intended to support large-scale dataset generation for LLM-based reverse engineering. The manuscript asserts that the prototype achieves strong performance on real-world virtualization obfuscators.

Significance. If the static-analysis labeling step proves reliable, the pipeline would supply a scalable source of training data for LLMs on a class of binaries that currently defeats both standard decompilers and direct LLM ingestion. The emphasis on structural rather than full semantic labels is a pragmatic engineering choice given current LLM limits. The automated nature of the labeling is a clear strength for reproducibility and scale.

major comments (2)

[Abstract] Abstract: the claim that the prototype 'shows strong performance on real-world virtualization obfuscators' is unsupported by any quantitative metrics (precision/recall of structural-role labels, number of binaries processed, error analysis, or ground-truth comparison). Because the central contribution is the automated labeling pipeline that enables LLM analysis, the absence of these figures makes the performance assertion unevaluable.
[Static Analysis Framework] Static-analysis framework description: virtualization obfuscation replaces conventional control flow with a custom VM interpreter and data-driven dispatch, which defeats standard CFG recovery. The manuscript provides no validation (manual inspection, inter-rater agreement, or comparison against hand-annotated ground truth) that the framework correctly identifies handlers, dispatchers, and bytecode on such binaries. Systematic mislabeling would render the generated dataset unusable for downstream LLM tasks.

minor comments (1)

The abstract would be clearer if it named the specific real-world obfuscators (e.g., VMProtect, Themida) on which the prototype was tested.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed review. The comments correctly identify areas where the current manuscript requires additional evidence to support its claims. We address each major comment below and commit to revisions that will strengthen the paper without altering its core contribution.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that the prototype 'shows strong performance on real-world virtualization obfuscators' is unsupported by any quantitative metrics (precision/recall of structural-role labels, number of binaries processed, error analysis, or ground-truth comparison). Because the central contribution is the automated labeling pipeline that enables LLM analysis, the absence of these figures makes the performance assertion unevaluable.

Authors: We agree that the abstract's assertion of strong performance is currently unsupported by quantitative evidence and therefore unevaluable. In the revised manuscript we will (1) qualify or remove the unqualified claim from the abstract and (2) add a dedicated evaluation section that reports precision and recall for each structural-role label, the total number of binaries processed, error analysis, and direct comparisons against hand-annotated ground truth on a representative subset of real-world samples. These additions will make the performance claims concrete and directly address the referee's concern. revision: yes
Referee: [Static Analysis Framework] Static-analysis framework description: virtualization obfuscation replaces conventional control flow with a custom VM interpreter and data-driven dispatch, which defeats standard CFG recovery. The manuscript provides no validation (manual inspection, inter-rater agreement, or comparison against hand-annotated ground truth) that the framework correctly identifies handlers, dispatchers, and bytecode on such binaries. Systematic mislabeling would render the generated dataset unusable for downstream LLM tasks.

Authors: The referee accurately describes the validation gap. We will add a new subsection under the static-analysis framework that presents (a) manual inspection results on a sample of labeled units drawn from real-world obfuscated binaries, (b) any inter-rater agreement statistics obtained during labeling review, and (c) quantitative comparison against hand-annotated ground truth where such annotations exist. This validation will demonstrate the framework's reliability and directly mitigate the risk that systematic mislabeling would compromise downstream LLM training data. revision: yes

Circularity Check

0 steps flagged

No significant circularity; engineering pipeline with external validation

full rationale

The paper describes an implementation of a static analysis framework to decompose obfuscated binaries and automate structural-role labeling for LLM input. No mathematical derivations, equations, parameter fittings, or self-citation chains are present that reduce any claim to its own inputs by construction. Performance claims are positioned as testable against real-world virtualization obfuscators and human annotations, making the work self-contained against external benchmarks rather than internally forced.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on standard assumptions from static binary analysis and LLM context-window engineering. No free parameters are fitted, no new entities are postulated, and no ad-hoc axioms are introduced beyond the domain assumption that structural labels are useful proxies for semantic understanding.

axioms (1)

domain assumption Static analysis can reliably identify structural roles (dispatcher, handler, etc.) in virtualization-obfuscated code without false positives that would corrupt the training set.
Invoked when the static framework is said to automate labeling.

pith-pipeline@v0.9.0 · 5387 in / 1255 out tokens · 42410 ms · 2026-05-12T03:38:16.332867+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 1 internal anchor

[1]

A taxonomy of obfuscating transformations,

C. Collberg, C. Thomborson, and D. Low, “A taxonomy of obfuscating transformations,” Department of Computer Science, University of Auckland, New Zealand, Tech. Rep. 148, 1997

work page 1997
[2]

A survey of code obfuscation techniques,

V . J. M. Salis and A. S. Sani, “A survey of code obfuscation techniques,” ACM Comput. Surv., vol. 53, no. 3, pp. 1–36, 2020

work page 2020
[3]

In2025 IEEE Symposium on Security and Privacy (SP)

N. Zhang, et al., “Inspecting Virtual Machine Diversification Inside Virtualization Obfuscation,” 2025 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 2025, pp. 3051–3069, doi: 10.1109/SP61157.2025.00071

work page doi:10.1109/sp61157.2025.00071 2025
[4]

Chosen-Instruction Attack Against Commercial Code Virtualization Obfuscators,

S. Li, et al., “Chosen-Instruction Attack Against Commercial Code Virtualization Obfuscators,” 29th Annual Network and Distributed System Security Symposium, NDSS 2022, San Diego, California, USA, April 24–28, 2022

work page 2022
[5]

Virtualization-based obfuscation: A survey,

J. Teschke and T. Kurth, “Virtualization-based obfuscation: A survey,” in Proc. 12th Int. Conf. Availability, Rel. Secur. (ARES), 2017, pp. 1–10

work page 2017
[6]

The tigress diversifying c compiler,

B. Kinder, “The tigress diversifying c compiler,” 2023. [Online]. Available: http://tigress.cs.arizona.edu/

work page 2023
[7]

LLVM: A compilation framework for lifelong program analysis & transformation,

C. Lattner and V . Adve, “LLVM: A compilation framework for lifelong program analysis & transformation,” in Proc. Int. Symp. Code Gener. Optim. (CGO), 2004, pp. 75–86. 7

work page 2004
[8]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

J. Devlin, M. W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” 2018, arXiv:1810.04805. [Online]. Available: https://arxiv.org/abs/1810.04805

work page internal anchor Pith review Pith/arXiv arXiv 2018
[9]

Palmtree: Learning an assembly language model for instruction embedding,

B. Li, S. Feng, and G. Tan, “Palmtree: Learning an assembly language model for instruction embedding,” in Proc. 2021 ACM SIGSAC Conf. Comput. Commun. Secur. (CCS), 2021, pp. 3236–3251

work page 2021
[10]

DeepBinDiff: Learning program-wide code representations for binary diffing,

H. Pei, B. G. Chun, and S. Kim, “DeepBinDiff: Learning program-wide code representations for binary diffing,” in Proc. 27th Network Distrib. Syst. Secur. Symp. (NDSS), 2020

work page 2020
[11]

Identifying virtualized code via deep learning,

G. Luan, S. Feng, and G. Tan, “Identifying virtualized code via deep learning,” in Proc. 36th IEEE/ACM Int. Conf. Autom. Softw. Eng. (ASE), 2021, pp. 1178–1189

work page 2021
[12]

BinBert: Binary code understanding with BERT,

X. Wang, J. Yang, and Z. Lin, “BinBert: Binary code understanding with BERT,” in Proc. 32nd IEEE Int. Conf. Softw. Maint. Evol. (ICSME), 2022

work page 2022
[13]

Attention is all you need,

S. Vaswani et al., “Attention is all you need,” in Proc. 31st Int. Conf. Neural Inf. Process. Syst. (NIPS), 2017, pp. 5998–6008

work page 2017
[14]

CNN based Virtualized Obfuscated Malware Detection Technique,

D. Yoo et al., “CNN based Virtualized Obfuscated Malware Detection Technique,” in Proc. Symposium of the Korean Institute of Communications and Information Sciences, 2024, pp. 637–638

work page 2024
[15]

Analysis of anti-reversing functionalities of vmprotect and bypass method using pin,

S. Park and Y . Park, “Analysis of anti-reversing functionalities of vmprotect and bypass method using pin,” KIPS Trans. Comp. Comm. Syst., vol. 10, no. 11, pp. 297–304, 2021

work page 2021
[16]

VMHunt: A verifiable approach to partially-virtualized binary code simplification,

D. Xu et al., “VMHunt: A verifiable approach to partially-virtualized binary code simplification,” in Proc. ACM CCS, 2018, pp. 442–458

work page 2018
[17]

VMAttack: Deobfuscating virtualization-based packed binaries,

A. Kalysch et al., “VMAttack: Deobfuscating virtualization-based packed binaries,” in Proc. ARES, 2017, pp. 1–10. 8

work page 2017