Leveraging Large Language Models to Obscure Code Stylometry: A Comparative Study of GPT-3.5 and GPT-4

Benjamin Tan; Saman Pordanesh

arxiv: 2606.22306 · v1 · pith:STNWZPASnew · submitted 2026-06-21 · 💻 cs.SE · cs.AI

Leveraging Large Language Models to Obscure Code Stylometry: A Comparative Study of GPT-3.5 and GPT-4

Saman Pordanesh , Benjamin Tan This is my paper

Pith reviewed 2026-06-26 10:23 UTC · model grok-4.3

classification 💻 cs.SE cs.AI

keywords code stylometrylarge language modelsauthorship attributionprompt engineeringGPTcode obfuscationcybersecuritysoftware engineering

0 comments

The pith

Large language models can alter code to evade stylometry-based authorship attribution, with effectiveness varying by prompting method.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines whether GPT-3.5 and GPT-4 can modify source code to hide the original author's stylistic fingerprints. A sympathetic reader would care because successful obfuscation would weaken tools used to attribute code in security investigations or intellectual property disputes. The experiments compare single-shot and multi-shot prompting approaches and check both the drop in a Random Forest classifier's accuracy and whether the code still executes correctly after changes.

Core claim

The study finds that LLMs can obscure code stylometry, with multi-shot methods outperforming single-shot ones and detailed structured prompts proving more effective, while also showing that preserving the original functionality remains a significant challenge after LLM modifications.

What carries the argument

A Random Forest classifier trained on code features for authorship attribution, against which the success of LLM-based style alteration is measured.

If this is right

Multi-shot prompting achieves greater reduction in attribution accuracy than single-shot prompting.
Detailed and structured prompts lead to better obfuscation results.
Functionality of the modified code is often not preserved, limiting practical use.
Authorship attribution techniques face new challenges from advanced AI code modification capabilities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Existing code attribution systems may require retraining or new features to handle LLM-generated variations.
This technique could be applied by developers to anonymize their contributions in collaborative projects.
Further tests with other LLMs or different classifiers could reveal if the observed differences are model-specific.

Load-bearing premise

The Random Forest classifier provides a reliable measure of whether stylometric signatures have been successfully obscured.

What would settle it

If a new classifier trained specifically on examples of LLM-modified code achieves high attribution accuracy on the test set, this would indicate that the obfuscation is not robust.

Figures

Figures reproduced from arXiv: 2606.22306 by Benjamin Tan, Saman Pordanesh.

**Figure 2.** Figure 2: Overview of the data gathering process and dataset [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Overview of the feature extraction process. [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 4.** Figure 4: Overview of the prompt engineering and strategies [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗

**Figure 5.** Figure 5: Overview of the LLM communication pipeline and [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

read the original abstract

In the rapidly evolving field of software development, code stylometry analyzing unique stylistic signatures of programmers plays a crit-ical role in authorship attribution and cybersecurity. Recent advancements in artificial intelligence, particularly Large Language Models (LLMs) like GPT-3.5 and GPT-4, have introduced new dimensions to this field, challenging traditional stylometry techniques. This study investigates the effectiveness of LLMs in altering code stylometry while preserving functionality and evaluates the impact of various prompt engineering strategies. Through comprehensive experiments, we assess how well these models can obscure stylistic signatures to avoid detection by a Random Forest classifier trained for authorship attribution. The results reveal significant differences in effectiveness between single-shot and multi-shot methods and highlight the importance of detailed, structured prompts. Additionally, functionality preservation checks demonstrate the challenges in maintaining code integrity post-modification. This research provides critical insights into the robustness of authorship attribution techniques against advanced AI capabilities, informing future cybersecurity and software engineering developments

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper applies GPT-3.5 and GPT-4 to rewrite code for stylometry evasion but the Random Forest results cannot be interpreted without a reported baseline accuracy on clean code.

read the letter

The main takeaway is that this work tests whether GPT models can rewrite code to lower detection rates from a stylometry classifier, with a comparison of single-shot versus multi-shot prompts and some attention to whether the output still runs. The experiments appear to be the first direct head-to-head on these two model versions for this specific task.

The paper does a reasonable job of including functionality preservation checks alongside the detection metric. That choice makes the evaluation more realistic than pure accuracy drops alone.

The soft spot is exactly the one flagged in the stress test. The abstract and available description give no baseline accuracy, cross-validation score, or feature details for the Random Forest on the original code. Without that number it is impossible to know whether post-modification drops reflect successful obfuscation or just a weak starting classifier. The same holds for the functionality checks: no pass rates or test-suite sizes are mentioned. These omissions are central rather than minor because the whole claim rests on measured changes in classifier output.

The work is aimed at people who study code attribution, adversarial ML in security, or prompt engineering for code tasks. A reader already running stylometry experiments could extract the prompting comparisons as a useful reference point even if they have to redo the controls themselves.

It deserves peer review. The topic is timely, the experimental direction is clear, and the gaps are fixable with additional reporting rather than a fundamental redesign.

Referee Report

2 major / 1 minor

Summary. The manuscript claims that GPT-3.5 and GPT-4 can obscure code stylometry signatures through single-shot and multi-shot prompt engineering, producing measurable drops in accuracy for a Random Forest authorship attribution classifier while attempting to preserve functionality; it reports differences in effectiveness across prompting strategies and notes challenges in post-modification code integrity.

Significance. If the central experimental claims are supported by proper validation, the work would illustrate concrete limitations of traditional stylometry against LLM-based code transformation, informing both offensive obfuscation techniques and defensive attribution methods in cybersecurity and software engineering.

major comments (2)

[Abstract] Abstract and experimental description: no baseline accuracy, cross-validation scores, feature importance, or comparison to alternative classifiers (e.g., SVM) is reported for the Random Forest model on unmodified code from the same authors and dataset. Without establishing that the classifier performs substantially above chance on clean code, post-modification accuracy drops cannot be interpreted as successful stylometry obfuscation; this is load-bearing for the central claim.
[Functionality preservation checks] Functionality preservation section: the manuscript states that checks 'demonstrate the challenges' but supplies no quantitative metrics such as test-suite pass rates, compilation success, or behavioral equivalence measures. This prevents assessment of whether obfuscation success comes at the cost of broken code.

minor comments (1)

[Abstract] Abstract contains a hyphenation artifact: 'crit-ical' should read 'critical'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments identify key gaps in reporting that affect interpretability of the central claims. We have revised the manuscript to incorporate the requested baseline metrics and quantitative functionality results.

read point-by-point responses

Referee: [Abstract] Abstract and experimental description: no baseline accuracy, cross-validation scores, feature importance, or comparison to alternative classifiers (e.g., SVM) is reported for the Random Forest model on unmodified code from the same authors and dataset. Without establishing that the classifier performs substantially above chance on clean code, post-modification accuracy drops cannot be interpreted as successful stylometry obfuscation; this is load-bearing for the central claim.

Authors: We agree that baseline performance on unmodified code is essential for validating the obfuscation results. The original manuscript reported accuracy drops but did not include sufficient detail on initial classifier performance. In the revision we have added the Random Forest baseline accuracy on clean code (well above chance), 5-fold cross-validation scores, feature importance analysis, and a direct comparison against an SVM classifier trained on the same dataset and features to confirm the observed drops reflect genuine stylometry changes rather than classifier weakness. revision: yes
Referee: [Functionality preservation checks] Functionality preservation section: the manuscript states that checks 'demonstrate the challenges' but supplies no quantitative metrics such as test-suite pass rates, compilation success, or behavioral equivalence measures. This prevents assessment of whether obfuscation success comes at the cost of broken code.

Authors: We acknowledge the absence of quantitative metrics in the functionality section limits evaluation of the trade-off. The revision now reports concrete measures including compilation success rates, test-suite pass rates for snippets with available tests, and behavioral equivalence checks via execution on sample inputs. These additions make explicit the extent to which functionality is preserved or degraded under each prompting strategy. revision: yes

Circularity Check

0 steps flagged

No circularity: experimental comparison with external classifier metric

full rationale

This is an empirical study that applies LLMs via prompts to modify code and then measures the effect using a separately trained Random Forest authorship classifier. No derivation chain, equations, fitted parameters renamed as predictions, or self-citation load-bearing steps appear in the abstract or described experiments. The central claim rests on observed accuracy changes rather than any reduction to the paper's own inputs by construction. The noted absence of baseline classifier metrics is a validity concern, not a circularity issue under the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The study depends on the assumption that the chosen classifier and functionality checks are valid measures, with no free parameters explicitly mentioned.

axioms (1)

domain assumption The Random Forest classifier accurately captures code stylometry features for authorship attribution.
The paper relies on this to measure success of obfuscation.

pith-pipeline@v0.9.1-grok · 5699 in / 1174 out tokens · 29719 ms · 2026-06-26T10:23:20.447592+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

16 extracted references · 1 linked inside Pith

[1]

De-anonymizing program- mers via code stylometry,

A. Caliskan-Islam et al., “De-anonymizing program- mers via code stylometry,” in24th USENIX security symposium (USENIX Security 15), 2015, pp. 255–270

2015
[2]

Distinguishing ai-and human-generated code: A case study,

S. Bukhari, B. Tan, and L. De Carli, “Distinguishing ai-and human-generated code: A case study,” inPro- ceedings of the 2023 Workshop on Software Supply Chain Offensive Research and Ecosystem Defenses, 2023, pp. 17–25

2023
[3]

[Online]

OpenAI,Gpt-3.5. [Online]. Available: https://openai. com/blog/gpt-3-5-turbo-fine-tuning-and-api-updates
[4]

OpenAI,Gpt-4, Mar. 2023. [Online]. Available: https: //openai.com/research/gpt-4

2023
[5]

Robust learning against relational adversaries,

Y . Wang, M. Alhanahnah, X. Meng, K. Wang, M. Christodorescu, and S. Jha, “Robust learning against relational adversaries,”Advances in Neural Information Processing Systems, vol. 35, pp. 16 246–16 260, 2022

2022
[6]

Doppelg¨anger finder: Taking stylometry to the underground,

S. Afroz, A. C. Islam, A. Stolerman, R. Greenstadt, and D. McCoy, “Doppelg¨anger finder: Taking stylometry to the underground,” in2014 IEEE Symposium on Security and Privacy, IEEE, 2014, pp. 212–226

2014
[7]

Practical attacks against authorship recognition techniques,

M. R. Brennan and R. Greenstadt, “Practical attacks against authorship recognition techniques,” inTwenty- First IAAI Conference, 2009. UNIVERSITY OF CALGARY , SCHULICH SCHOOL OF ENGINEERING, UNDERGRADUATE RESEARCH THESIS, WINTER 2024 10

2009
[8]

Source code authorship attribu- tion using long short-term memory based networks,

B. Alsulami, E. Dauber, R. Harang, S. Mancoridis, and R. Greenstadt, “Source code authorship attribu- tion using long short-term memory based networks,” inComputer Security–ESORICS 2017: 22nd European Symposium on Research in Computer Security, Oslo, Norway, September 11-15, 2017, Proceedings, Part I 22, Springer, 2017, pp. 65–82

2017
[9]

When coding style survives compi- lation: De-anonymizing programmers from executable binaries,

A. Caliskan et al., “When coding style survives compi- lation: De-anonymizing programmers from executable binaries,”arXiv preprint arXiv:1512.08546, 2015

Pith/arXiv arXiv 2015
[10]

Apr. 2024. [Online]. Available: https://en.wikipedia.org/ wiki/Abstract syntax tree#References

2024
[11]

Long short-term memory,

S. Hochreiter and J. Schmidhuber, “Long short-term memory,”Neural computation, vol. 9, no. 8, pp. 1735– 1780, 1997

1997
[12]

[Online]

microsoft, May 2024. [Online]. Available: https : / / github.com/microsoft/methods2test

2024
[13]

Unit test case generation with transformers and focal context,

M. Tufano, D. Drain, A. Svyatkovskiy, S. K. Deng, and N. Sundaresan, “Unit test case generation with transformers and focal context,”arXiv preprint arXiv:2009.05617, 2020

arXiv 2009
[14]

Rebryk,Rebryk/code stylometry, Sep

Y . Rebryk,Rebryk/code stylometry, Sep. 2023. [On- line]. Available: https : / / github . com / rebryk / code stylometry

2023
[15]

Pordanesh,Sinapordanesh/llms on code stylometry, May 2024

S. Pordanesh,Sinapordanesh/llms on code stylometry, May 2024. [Online]. Available: https : / / github . com / sinapordanesh/LLMs on Code Stylometry

2024
[16]

Random forests,

L. Breiman, “Random forests,”Machine learning, vol. 45, pp. 5–32, 2001. APPENDIX A.1 Methods2Test Dataset JSON Structure. Fig. 6: Methods2Test Dataset JSON Structure

2001

[1] [1]

De-anonymizing program- mers via code stylometry,

A. Caliskan-Islam et al., “De-anonymizing program- mers via code stylometry,” in24th USENIX security symposium (USENIX Security 15), 2015, pp. 255–270

2015

[2] [2]

Distinguishing ai-and human-generated code: A case study,

S. Bukhari, B. Tan, and L. De Carli, “Distinguishing ai-and human-generated code: A case study,” inPro- ceedings of the 2023 Workshop on Software Supply Chain Offensive Research and Ecosystem Defenses, 2023, pp. 17–25

2023

[3] [3]

[Online]

OpenAI,Gpt-3.5. [Online]. Available: https://openai. com/blog/gpt-3-5-turbo-fine-tuning-and-api-updates

[4] [4]

OpenAI,Gpt-4, Mar. 2023. [Online]. Available: https: //openai.com/research/gpt-4

2023

[5] [5]

Robust learning against relational adversaries,

Y . Wang, M. Alhanahnah, X. Meng, K. Wang, M. Christodorescu, and S. Jha, “Robust learning against relational adversaries,”Advances in Neural Information Processing Systems, vol. 35, pp. 16 246–16 260, 2022

2022

[6] [6]

Doppelg¨anger finder: Taking stylometry to the underground,

S. Afroz, A. C. Islam, A. Stolerman, R. Greenstadt, and D. McCoy, “Doppelg¨anger finder: Taking stylometry to the underground,” in2014 IEEE Symposium on Security and Privacy, IEEE, 2014, pp. 212–226

2014

[7] [7]

Practical attacks against authorship recognition techniques,

M. R. Brennan and R. Greenstadt, “Practical attacks against authorship recognition techniques,” inTwenty- First IAAI Conference, 2009. UNIVERSITY OF CALGARY , SCHULICH SCHOOL OF ENGINEERING, UNDERGRADUATE RESEARCH THESIS, WINTER 2024 10

2009

[8] [8]

Source code authorship attribu- tion using long short-term memory based networks,

B. Alsulami, E. Dauber, R. Harang, S. Mancoridis, and R. Greenstadt, “Source code authorship attribu- tion using long short-term memory based networks,” inComputer Security–ESORICS 2017: 22nd European Symposium on Research in Computer Security, Oslo, Norway, September 11-15, 2017, Proceedings, Part I 22, Springer, 2017, pp. 65–82

2017

[9] [9]

When coding style survives compi- lation: De-anonymizing programmers from executable binaries,

A. Caliskan et al., “When coding style survives compi- lation: De-anonymizing programmers from executable binaries,”arXiv preprint arXiv:1512.08546, 2015

Pith/arXiv arXiv 2015

[10] [10]

Apr. 2024. [Online]. Available: https://en.wikipedia.org/ wiki/Abstract syntax tree#References

2024

[11] [11]

Long short-term memory,

S. Hochreiter and J. Schmidhuber, “Long short-term memory,”Neural computation, vol. 9, no. 8, pp. 1735– 1780, 1997

1997

[12] [12]

[Online]

microsoft, May 2024. [Online]. Available: https : / / github.com/microsoft/methods2test

2024

[13] [13]

Unit test case generation with transformers and focal context,

M. Tufano, D. Drain, A. Svyatkovskiy, S. K. Deng, and N. Sundaresan, “Unit test case generation with transformers and focal context,”arXiv preprint arXiv:2009.05617, 2020

arXiv 2009

[14] [14]

Rebryk,Rebryk/code stylometry, Sep

Y . Rebryk,Rebryk/code stylometry, Sep. 2023. [On- line]. Available: https : / / github . com / rebryk / code stylometry

2023

[15] [15]

Pordanesh,Sinapordanesh/llms on code stylometry, May 2024

S. Pordanesh,Sinapordanesh/llms on code stylometry, May 2024. [Online]. Available: https : / / github . com / sinapordanesh/LLMs on Code Stylometry

2024

[16] [16]

Random forests,

L. Breiman, “Random forests,”Machine learning, vol. 45, pp. 5–32, 2001. APPENDIX A.1 Methods2Test Dataset JSON Structure. Fig. 6: Methods2Test Dataset JSON Structure

2001