arxiv: 2604.19526 · v1 · submitted 2026-04-21 · 💻 cs.CR · cs.LG· cs.SE

Recognition: unknown

Evaluating LLM-Generated Obfuscated XSS Payloads for Machine Learning-Based Detection

Divyesh Gabbireddy , Suman Saha

Authors on Pith no claims yet

Pith reviewed 2026-05-10 02:24 UTC · model grok-4.3

classification 💻 cs.CR cs.LGcs.SE

keywords cross-site scriptingXSS obfuscationLLM generationmachine learning detectionweb securitypayload generationruntime behavioradversarial examples

0 comments

The pith

Fine-tuning LLMs on behavior-preserving pairs raises XSS obfuscation runtime match rate from 0.15 to 0.22, yet generated payloads do not improve ML detectors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a pipeline that uses large language models to produce obfuscated cross-site scripting payloads and then checks whether those payloads retain the original malicious behavior when run in a browser. Baseline models match behavior 15 percent of the time; fine-tuning on valid source-target pairs lifts the rate to 22 percent. Even so, adding the generated samples does not raise the performance of downstream machine-learning detectors, though behavior-filtered samples can be included without harm. A sympathetic reader would care because reliable ways to create diverse yet functionally equivalent adversarial examples could strengthen training data for security tools that currently fail against obfuscated attacks.

Core claim

The authors show that current large language models, even after fine-tuning on behavior-preserving obfuscation pairs, achieve only modest success at generating XSS payloads whose runtime behavior matches the original, with match rates increasing from 0.15 to 0.22. A structured pipeline that combines deterministic transformations, LLM generation, and browser-based runtime evaluation is used to assess samples by observable execution rather than syntax alone. Downstream classifier tests reveal that unfiltered generated payloads do not improve detection performance, while behavior-filtered ones can be added without material degradation.

What carries the argument

Browser-based runtime evaluation procedure that compares observable execution behavior of original and obfuscated payloads inside a controlled environment to verify behavioral preservation.

If this is right

Runtime behavior must be verified explicitly rather than inferred from syntactic changes alone for generated samples to be useful.
Fine-tuning on valid source-target obfuscation pairs produces a measurable but still limited gain in generation quality.
Behavior-filtered LLM outputs can be incorporated into training sets without degrading detector performance.
Current LLMs require further advances to generate behaviorally valid adversarial security data at high volume.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Larger or more targeted fine-tuning datasets built from execution traces could push match rates higher than 0.22.
The runtime-validation approach could extend to other injection vulnerabilities where semantic equivalence matters for evasion.
Hybrid generation that pairs LLMs with rule-based obfuscators might close the remaining gap in behavioral fidelity.
If modest gains persist, detection research should prioritize runtime-semantic features over surface-form diversity.

Load-bearing premise

The browser-based runtime evaluation procedure accurately determines whether an obfuscated payload preserves the original malicious behavior.

What would settle it

A test case in which a payload passes the runtime match check yet fails to perform the expected malicious action when executed in an unmodified standard browser would falsify the evaluation method.

Figures

Figures reproduced from arXiv: 2604.19526 by Divyesh Gabbireddy, Suman Saha.

**Figure 2.** Figure 2: Example of a multi-step obfuscation chain showing sequential [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Training and validation loss curves across epochs for the fine-tuned [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Runtime behavior match rates across different transformation pair [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

read the original abstract

Cross-site scripting (XSS) remains a persistent web security vulnerability, especially because obfuscation can change the surface form of a malicious payload while preserving its behavior. These transformations make it difficult for traditional and machine learning-based detection systems to reliably identify attacks. Existing approaches for generating obfuscated payloads often emphasize syntactic diversity, but they do not always ensure that the generated samples remain behaviorally valid. This paper presents a structured pipeline for generating and evaluating obfuscated XSS payloads using large language models (LLMs). The pipeline combines deterministic transformation techniques with LLM-based generation and uses a browser- based runtime evaluation procedure to compare payload behavior in a controlled execution environment. This allows generated samples to be assessed through observable runtime behavior rather than syntactic similarity alone. In the evaluation, an untuned baseline language model achieves a runtime behavior match rate of 0.15, while fine-tuning on behavior-preserving source-target obfuscation pairs improves the match rate to 0.22. Although this represents a measurable improvement, the results show that current LLMs still struggle to generate obfuscations that preserve observed runtime behavior. A downstream classifier evaluation further shows that adding generated payloads does not improve detection performance in this setting, although behavior- filtered generated samples can be incorporated without materially degrading performance. Overall, the study demonstrates both the promise and the limits of applying generative models to adversarial security data generation and emphasizes the importance of runtime behavior checks in improving the quality of generated data for downstream detection systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents a pipeline for generating obfuscated XSS payloads using LLMs that combines deterministic transformations with LLM-based generation. It evaluates the outputs via a browser-based runtime procedure that measures observable behavior preservation rather than syntactic similarity alone. The evaluation reports an untuned baseline match rate of 0.15 that rises to 0.22 after fine-tuning on behavior-preserving source-target pairs; downstream experiments show that the generated payloads do not improve ML-based XSS detection performance, although behavior-filtered samples can be added without degrading it.

Significance. If the runtime behavior match metric is shown to be reliable, the work would usefully document the current limits of LLMs for producing valid adversarial security samples and would reinforce the value of runtime checks when generating training data for detection systems. The concrete empirical numbers and the negative downstream result supply a clear, falsifiable baseline for future LLM-based adversarial generation research.

major comments (3)

[Evaluation / Runtime Behavior Match Rate] The central claims rest on the browser-based runtime evaluation procedure that produces the 0.15 and 0.22 match rates. No quantitative validation (precision, recall, or agreement with manual ground-truth labels on a held-out set of known preserving/non-preserving pairs) is reported for this procedure, leaving open the possibility that the modest improvement and the downstream classifier result are artifacts of incomplete observable coverage or environment-specific effects.
[Results and Abstract] The abstract and results sections report concrete match rates and classifier effects without stating sample sizes, number of trials, statistical significance tests, or confidence intervals. This absence makes it impossible to assess whether the 0.07 absolute improvement from fine-tuning is distinguishable from noise or whether the “no improvement” classifier finding is robust.
[Downstream Classifier Evaluation] The downstream classifier experiment claims that adding generated payloads does not improve detection while filtered samples do not degrade it. The manuscript provides insufficient detail on the base classifier architecture, training/test splits, feature representation, and how the behavior filter is applied, preventing evaluation of whether the negative result is load-bearing or merely an artifact of the experimental setup.

minor comments (2)

[Abstract] The abstract states that “behavior-filtered generated samples can be incorporated without materially degrading performance” but does not define the filtering criterion or the degradation threshold used.
[Methods] Notation for the runtime match rate (e.g., how observable effects are encoded and compared) is introduced without an explicit equation or pseudocode in the methods description.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments identify key areas where additional rigor and transparency will strengthen the manuscript. We address each major comment below and commit to revisions that directly respond to the concerns without altering the core empirical findings.

read point-by-point responses

Referee: [Evaluation / Runtime Behavior Match Rate] The central claims rest on the browser-based runtime evaluation procedure that produces the 0.15 and 0.22 match rates. No quantitative validation (precision, recall, or agreement with manual ground-truth labels on a held-out set of known preserving/non-preserving pairs) is reported for this procedure, leaving open the possibility that the modest improvement and the downstream classifier result are artifacts of incomplete observable coverage or environment-specific effects.

Authors: We agree that explicit validation of the runtime procedure against manual labels would increase confidence in the match rates. The procedure records observable DOM mutations, console output, and network requests in a headless browser to determine behavioral equivalence. In the revised manuscript we will add a dedicated validation subsection that manually labels a held-out set of 150 source-target pairs and reports precision, recall, and Cohen's kappa between the automated procedure and human judgments. This addition will directly address the possibility of measurement artifacts. revision: yes
Referee: [Results and Abstract] The abstract and results sections report concrete match rates and classifier effects without stating sample sizes, number of trials, statistical significance tests, or confidence intervals. This absence makes it impossible to assess whether the 0.07 absolute improvement from fine-tuning is distinguishable from noise or whether the “no improvement” classifier finding is robust.

Authors: We accept that the current presentation lacks the statistical context needed to evaluate robustness. Although the full results section describes the evaluation protocol, we will revise both the abstract and results to include the exact sample sizes, the number of independent trials performed, appropriate statistical significance tests for the observed improvement, and 95% confidence intervals around the reported match rates and classifier metrics. revision: yes
Referee: [Downstream Classifier Evaluation] The downstream classifier experiment claims that adding generated payloads does not improve detection while filtered samples do not degrade it. The manuscript provides insufficient detail on the base classifier architecture, training/test splits, feature representation, and how the behavior filter is applied, preventing evaluation of whether the negative result is load-bearing or merely an artifact of the experimental setup.

Authors: We recognize that greater experimental detail is required for reproducibility and to allow readers to judge the strength of the negative finding. In the revision we will expand the classifier subsection to specify the model architecture and hyperparameters, the precise train/test split ratios and stratification method, the feature extraction pipeline, and the exact threshold and application logic of the behavior filter. These additions will make the experimental setup fully transparent. revision: yes

Circularity Check

0 steps flagged

No circularity: results rest on independent empirical runtime measurements

full rationale

The paper's central claims derive from direct browser-based execution of payloads to compute observable behavior match rates (0.15 baseline, 0.22 fine-tuned) followed by separate classifier accuracy tests. These are standalone experimental observations with no equations, fitted parameters renamed as predictions, self-citations invoked as uniqueness theorems, or ansatzes smuggled in. The runtime procedure is an external measurement step rather than a self-referential definition, so the reported improvement and negative downstream result do not reduce to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, the paper relies on standard assumptions from machine learning and web security evaluation without introducing new free parameters, axioms, or invented entities. Full text would be required to audit any implicit choices in data selection or model training.

pith-pipeline@v0.9.0 · 5566 in / 1230 out tokens · 45456 ms · 2026-05-10T02:24:55.259954+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

6 extracted references · 2 canonical work pages · 2 internal anchors

[1]

Cross Site Scripting (XSS),

OW ASP Foundation, “Cross Site Scripting (XSS),” https://owasp.org/ www-community/attacks/xss/, accessed: 2026-04-21

2026
[2]

OW ASP Top 10: Web Application Security Risks,

——, “OW ASP Top 10: Web Application Security Risks,” https://owasp. org/www-project-top-ten/, accessed: 2026-04-21

2026
[3]

Outside the closed world: On using machine learning for network intrusion detection,

R. Sommer and V . Paxson, “Outside the closed world: On using machine learning for network intrusion detection,” inProceedings of the 2010 IEEE Symposium on Security and Privacy, 2010, pp. 305–316

2010
[4]

Evaluating Large Language Models Trained on Code

M. Chenet al., “Evaluating large language models trained on code,”arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[5]

CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation

Y . Wanget al., “CodeT5: Identifier-aware unified pre-trained encoder- decoder models for code understanding and generation,”arXiv preprint arXiv:2109.00859, 2021

work page internal anchor Pith review arXiv 2021
[6]

Vulnerability disclosure in the age of social media: Exploiting twitter for predicting real-world exploits,

C. Sabottke, O. Suciu, and T. Dumitras, “Vulnerability disclosure in the age of social media: Exploiting twitter for predicting real-world exploits,” inProceedings of the 24th USENIX Security Symposium, 2015, pp. 1041– 1056

2015