pith. machine review for the scientific record. sign in

arxiv: 2605.09961 · v1 · submitted 2026-05-11 · 💻 cs.CR

Recognition: no theorem link

Towards LLM-Based Analysis of Virtualization-Obfuscated Code through Automated Data Generation

Authors on Pith no claims yet

Pith reviewed 2026-05-12 03:38 UTC · model grok-4.3

classification 💻 cs.CR
keywords virtualization obfuscationLLM-based analysisstatic analysiscode decompositiondataset generationbinary obfuscationreverse engineering
0
0 comments X

The pith

Virtualization-obfuscated binaries are decomposed into LLM-sized units labeled by structural roles using an automated static analysis framework.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the challenge of using large language models to analyze virtualization-obfuscated code, which produces very large and complex binaries that exceed typical LLM input limits and lack labeled data for training. It proposes focusing on structural analysis by breaking the binaries into the largest possible semantically coherent units that fit within constraints and labeling them based on their roles in the structure. A static analysis framework automates this labeling process to generate large-scale datasets. This enables LLMs to handle such obfuscated code more effectively without needing full semantic understanding upfront.

Core claim

Obfuscated binaries can be decomposed into the largest semantically coherent units that fit within LLM constraints and labeled according to their structural roles through an automated static analysis framework, which enables large-scale dataset generation and shows strong performance on real-world virtualization obfuscators.

What carries the argument

The static analysis framework for decomposing binaries into coherent units and assigning structural role labels to automate dataset creation.

If this is right

  • LLMs can process obfuscated code despite size limits by analyzing these labeled units.
  • Large-scale labeled datasets become available for training models on virtualization-obfuscated code.
  • Analysis of real-world obfuscated binaries improves without manual labeling.
  • Structural focus allows useful insights even when full semantics are hard to recover.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such decomposition might be combined with other analysis techniques to enhance overall reverse engineering.
  • Extending this to dynamic analysis could provide additional labels for better LLM performance.
  • This could reduce the barrier for researchers studying malware protected by virtualization obfuscation.

Load-bearing premise

Structural role labels assigned by static analysis are sufficient for LLMs to perform useful analysis and the decomposition preserves necessary information without introducing systematic errors.

What would settle it

Demonstrating that LLMs trained or prompted with these labeled units perform no better on downstream tasks like identifying obfuscation handlers than those using random or no labels would falsify the utility of the approach.

Figures

Figures reproduced from arXiv: 2605.09961 by Eun-Sun Cho, Hyeyeon Park, Sangjun An, Seoksu Lee, Yejin Son.

Figure 1
Figure 1. Figure 1: CFG before and after virtualization obfuscation. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overall framework of the proposed automated labeling and LLM-based analysis pipeline. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Architectural comparison of VM structures based on dispatch options: Loop-Switch versus Threaded. [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Complete visualized CFG with color-coded virtualization components identified by BERT. [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
read the original abstract

Virtualization-based obfuscation produces extremely large and structurally complex binaries, posing challenges for LLM-based analysis due to input size limits and the need for large-scale labeled data. We address this by focusing on structural rather than full semantic analysis. Obfuscated binaries are decomposed into the largest semantically coherent units that fit within LLM constraints and are labeled according to their structural roles. We implement a static analysis framework to automate labeling and enable large-scale dataset generation. Our prototype shows strong performance on real-world virtualization obfuscators.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes decomposing virtualization-obfuscated binaries into the largest semantically coherent units that fit within LLM context windows, then labeling those units by structural role (handler, dispatcher, bytecode, etc.) via an automated static-analysis framework. The resulting labeled units are intended to support large-scale dataset generation for LLM-based reverse engineering. The manuscript asserts that the prototype achieves strong performance on real-world virtualization obfuscators.

Significance. If the static-analysis labeling step proves reliable, the pipeline would supply a scalable source of training data for LLMs on a class of binaries that currently defeats both standard decompilers and direct LLM ingestion. The emphasis on structural rather than full semantic labels is a pragmatic engineering choice given current LLM limits. The automated nature of the labeling is a clear strength for reproducibility and scale.

major comments (2)
  1. [Abstract] Abstract: the claim that the prototype 'shows strong performance on real-world virtualization obfuscators' is unsupported by any quantitative metrics (precision/recall of structural-role labels, number of binaries processed, error analysis, or ground-truth comparison). Because the central contribution is the automated labeling pipeline that enables LLM analysis, the absence of these figures makes the performance assertion unevaluable.
  2. [Static Analysis Framework] Static-analysis framework description: virtualization obfuscation replaces conventional control flow with a custom VM interpreter and data-driven dispatch, which defeats standard CFG recovery. The manuscript provides no validation (manual inspection, inter-rater agreement, or comparison against hand-annotated ground truth) that the framework correctly identifies handlers, dispatchers, and bytecode on such binaries. Systematic mislabeling would render the generated dataset unusable for downstream LLM tasks.
minor comments (1)
  1. The abstract would be clearer if it named the specific real-world obfuscators (e.g., VMProtect, Themida) on which the prototype was tested.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed review. The comments correctly identify areas where the current manuscript requires additional evidence to support its claims. We address each major comment below and commit to revisions that will strengthen the paper without altering its core contribution.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that the prototype 'shows strong performance on real-world virtualization obfuscators' is unsupported by any quantitative metrics (precision/recall of structural-role labels, number of binaries processed, error analysis, or ground-truth comparison). Because the central contribution is the automated labeling pipeline that enables LLM analysis, the absence of these figures makes the performance assertion unevaluable.

    Authors: We agree that the abstract's assertion of strong performance is currently unsupported by quantitative evidence and therefore unevaluable. In the revised manuscript we will (1) qualify or remove the unqualified claim from the abstract and (2) add a dedicated evaluation section that reports precision and recall for each structural-role label, the total number of binaries processed, error analysis, and direct comparisons against hand-annotated ground truth on a representative subset of real-world samples. These additions will make the performance claims concrete and directly address the referee's concern. revision: yes

  2. Referee: [Static Analysis Framework] Static-analysis framework description: virtualization obfuscation replaces conventional control flow with a custom VM interpreter and data-driven dispatch, which defeats standard CFG recovery. The manuscript provides no validation (manual inspection, inter-rater agreement, or comparison against hand-annotated ground truth) that the framework correctly identifies handlers, dispatchers, and bytecode on such binaries. Systematic mislabeling would render the generated dataset unusable for downstream LLM tasks.

    Authors: The referee accurately describes the validation gap. We will add a new subsection under the static-analysis framework that presents (a) manual inspection results on a sample of labeled units drawn from real-world obfuscated binaries, (b) any inter-rater agreement statistics obtained during labeling review, and (c) quantitative comparison against hand-annotated ground truth where such annotations exist. This validation will demonstrate the framework's reliability and directly mitigate the risk that systematic mislabeling would compromise downstream LLM training data. revision: yes

Circularity Check

0 steps flagged

No significant circularity; engineering pipeline with external validation

full rationale

The paper describes an implementation of a static analysis framework to decompose obfuscated binaries and automate structural-role labeling for LLM input. No mathematical derivations, equations, parameter fittings, or self-citation chains are present that reduce any claim to its own inputs by construction. Performance claims are positioned as testable against real-world virtualization obfuscators and human annotations, making the work self-contained against external benchmarks rather than internally forced.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on standard assumptions from static binary analysis and LLM context-window engineering. No free parameters are fitted, no new entities are postulated, and no ad-hoc axioms are introduced beyond the domain assumption that structural labels are useful proxies for semantic understanding.

axioms (1)
  • domain assumption Static analysis can reliably identify structural roles (dispatcher, handler, etc.) in virtualization-obfuscated code without false positives that would corrupt the training set.
    Invoked when the static framework is said to automate labeling.

pith-pipeline@v0.9.0 · 5387 in / 1255 out tokens · 42410 ms · 2026-05-12T03:38:16.332867+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 1 internal anchor

  1. [1]

    A taxonomy of obfuscating transformations,

    C. Collberg, C. Thomborson, and D. Low, “A taxonomy of obfuscating transformations,” Department of Computer Science, University of Auckland, New Zealand, Tech. Rep. 148, 1997

  2. [2]

    A survey of code obfuscation techniques,

    V . J. M. Salis and A. S. Sani, “A survey of code obfuscation techniques,” ACM Comput. Surv., vol. 53, no. 3, pp. 1–36, 2020

  3. [3]

    In2025 IEEE Symposium on Security and Privacy (SP)

    N. Zhang, et al., “Inspecting Virtual Machine Diversification Inside Virtualization Obfuscation,” 2025 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 2025, pp. 3051–3069, doi: 10.1109/SP61157.2025.00071

  4. [4]

    Chosen-Instruction Attack Against Commercial Code Virtualization Obfuscators,

    S. Li, et al., “Chosen-Instruction Attack Against Commercial Code Virtualization Obfuscators,” 29th Annual Network and Distributed System Security Symposium, NDSS 2022, San Diego, California, USA, April 24–28, 2022

  5. [5]

    Virtualization-based obfuscation: A survey,

    J. Teschke and T. Kurth, “Virtualization-based obfuscation: A survey,” in Proc. 12th Int. Conf. Availability, Rel. Secur. (ARES), 2017, pp. 1–10

  6. [6]

    The tigress diversifying c compiler,

    B. Kinder, “The tigress diversifying c compiler,” 2023. [Online]. Available: http://tigress.cs.arizona.edu/

  7. [7]

    LLVM: A compilation framework for lifelong program analysis & transformation,

    C. Lattner and V . Adve, “LLVM: A compilation framework for lifelong program analysis & transformation,” in Proc. Int. Symp. Code Gener. Optim. (CGO), 2004, pp. 75–86. 7

  8. [8]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    J. Devlin, M. W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” 2018, arXiv:1810.04805. [Online]. Available: https://arxiv.org/abs/1810.04805

  9. [9]

    Palmtree: Learning an assembly language model for instruction embedding,

    B. Li, S. Feng, and G. Tan, “Palmtree: Learning an assembly language model for instruction embedding,” in Proc. 2021 ACM SIGSAC Conf. Comput. Commun. Secur. (CCS), 2021, pp. 3236–3251

  10. [10]

    DeepBinDiff: Learning program-wide code representations for binary diffing,

    H. Pei, B. G. Chun, and S. Kim, “DeepBinDiff: Learning program-wide code representations for binary diffing,” in Proc. 27th Network Distrib. Syst. Secur. Symp. (NDSS), 2020

  11. [11]

    Identifying virtualized code via deep learning,

    G. Luan, S. Feng, and G. Tan, “Identifying virtualized code via deep learning,” in Proc. 36th IEEE/ACM Int. Conf. Autom. Softw. Eng. (ASE), 2021, pp. 1178–1189

  12. [12]

    BinBert: Binary code understanding with BERT,

    X. Wang, J. Yang, and Z. Lin, “BinBert: Binary code understanding with BERT,” in Proc. 32nd IEEE Int. Conf. Softw. Maint. Evol. (ICSME), 2022

  13. [13]

    Attention is all you need,

    S. Vaswani et al., “Attention is all you need,” in Proc. 31st Int. Conf. Neural Inf. Process. Syst. (NIPS), 2017, pp. 5998–6008

  14. [14]

    CNN based Virtualized Obfuscated Malware Detection Technique,

    D. Yoo et al., “CNN based Virtualized Obfuscated Malware Detection Technique,” in Proc. Symposium of the Korean Institute of Communications and Information Sciences, 2024, pp. 637–638

  15. [15]

    Analysis of anti-reversing functionalities of vmprotect and bypass method using pin,

    S. Park and Y . Park, “Analysis of anti-reversing functionalities of vmprotect and bypass method using pin,” KIPS Trans. Comp. Comm. Syst., vol. 10, no. 11, pp. 297–304, 2021

  16. [16]

    VMHunt: A verifiable approach to partially-virtualized binary code simplification,

    D. Xu et al., “VMHunt: A verifiable approach to partially-virtualized binary code simplification,” in Proc. ACM CCS, 2018, pp. 442–458

  17. [17]

    VMAttack: Deobfuscating virtualization-based packed binaries,

    A. Kalysch et al., “VMAttack: Deobfuscating virtualization-based packed binaries,” in Proc. ARES, 2017, pp. 1–10. 8