pith. sign in

arxiv: 2606.22306 · v1 · pith:STNWZPASnew · submitted 2026-06-21 · 💻 cs.SE · cs.AI

Leveraging Large Language Models to Obscure Code Stylometry: A Comparative Study of GPT-3.5 and GPT-4

Pith reviewed 2026-06-26 10:23 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords code stylometrylarge language modelsauthorship attributionprompt engineeringGPTcode obfuscationcybersecuritysoftware engineering
0
0 comments X

The pith

Large language models can alter code to evade stylometry-based authorship attribution, with effectiveness varying by prompting method.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines whether GPT-3.5 and GPT-4 can modify source code to hide the original author's stylistic fingerprints. A sympathetic reader would care because successful obfuscation would weaken tools used to attribute code in security investigations or intellectual property disputes. The experiments compare single-shot and multi-shot prompting approaches and check both the drop in a Random Forest classifier's accuracy and whether the code still executes correctly after changes.

Core claim

The study finds that LLMs can obscure code stylometry, with multi-shot methods outperforming single-shot ones and detailed structured prompts proving more effective, while also showing that preserving the original functionality remains a significant challenge after LLM modifications.

What carries the argument

A Random Forest classifier trained on code features for authorship attribution, against which the success of LLM-based style alteration is measured.

If this is right

  • Multi-shot prompting achieves greater reduction in attribution accuracy than single-shot prompting.
  • Detailed and structured prompts lead to better obfuscation results.
  • Functionality of the modified code is often not preserved, limiting practical use.
  • Authorship attribution techniques face new challenges from advanced AI code modification capabilities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Existing code attribution systems may require retraining or new features to handle LLM-generated variations.
  • This technique could be applied by developers to anonymize their contributions in collaborative projects.
  • Further tests with other LLMs or different classifiers could reveal if the observed differences are model-specific.

Load-bearing premise

The Random Forest classifier provides a reliable measure of whether stylometric signatures have been successfully obscured.

What would settle it

If a new classifier trained specifically on examples of LLM-modified code achieves high attribution accuracy on the test set, this would indicate that the obfuscation is not robust.

Figures

Figures reproduced from arXiv: 2606.22306 by Benjamin Tan, Saman Pordanesh.

Figure 1
Figure 1. Figure 1: Overview of the experimental design process. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the data gathering process and dataset [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overview of the feature extraction process. [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Overview of the prompt engineering and strategies [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Overview of the LLM communication pipeline and [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
read the original abstract

In the rapidly evolving field of software development, code stylometry analyzing unique stylistic signatures of programmers plays a crit-ical role in authorship attribution and cybersecurity. Recent advancements in artificial intelligence, particularly Large Language Models (LLMs) like GPT-3.5 and GPT-4, have introduced new dimensions to this field, challenging traditional stylometry techniques. This study investigates the effectiveness of LLMs in altering code stylometry while preserving functionality and evaluates the impact of various prompt engineering strategies. Through comprehensive experiments, we assess how well these models can obscure stylistic signatures to avoid detection by a Random Forest classifier trained for authorship attribution. The results reveal significant differences in effectiveness between single-shot and multi-shot methods and highlight the importance of detailed, structured prompts. Additionally, functionality preservation checks demonstrate the challenges in maintaining code integrity post-modification. This research provides critical insights into the robustness of authorship attribution techniques against advanced AI capabilities, informing future cybersecurity and software engineering developments

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript claims that GPT-3.5 and GPT-4 can obscure code stylometry signatures through single-shot and multi-shot prompt engineering, producing measurable drops in accuracy for a Random Forest authorship attribution classifier while attempting to preserve functionality; it reports differences in effectiveness across prompting strategies and notes challenges in post-modification code integrity.

Significance. If the central experimental claims are supported by proper validation, the work would illustrate concrete limitations of traditional stylometry against LLM-based code transformation, informing both offensive obfuscation techniques and defensive attribution methods in cybersecurity and software engineering.

major comments (2)
  1. [Abstract] Abstract and experimental description: no baseline accuracy, cross-validation scores, feature importance, or comparison to alternative classifiers (e.g., SVM) is reported for the Random Forest model on unmodified code from the same authors and dataset. Without establishing that the classifier performs substantially above chance on clean code, post-modification accuracy drops cannot be interpreted as successful stylometry obfuscation; this is load-bearing for the central claim.
  2. [Functionality preservation checks] Functionality preservation section: the manuscript states that checks 'demonstrate the challenges' but supplies no quantitative metrics such as test-suite pass rates, compilation success, or behavioral equivalence measures. This prevents assessment of whether obfuscation success comes at the cost of broken code.
minor comments (1)
  1. [Abstract] Abstract contains a hyphenation artifact: 'crit-ical' should read 'critical'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments identify key gaps in reporting that affect interpretability of the central claims. We have revised the manuscript to incorporate the requested baseline metrics and quantitative functionality results.

read point-by-point responses
  1. Referee: [Abstract] Abstract and experimental description: no baseline accuracy, cross-validation scores, feature importance, or comparison to alternative classifiers (e.g., SVM) is reported for the Random Forest model on unmodified code from the same authors and dataset. Without establishing that the classifier performs substantially above chance on clean code, post-modification accuracy drops cannot be interpreted as successful stylometry obfuscation; this is load-bearing for the central claim.

    Authors: We agree that baseline performance on unmodified code is essential for validating the obfuscation results. The original manuscript reported accuracy drops but did not include sufficient detail on initial classifier performance. In the revision we have added the Random Forest baseline accuracy on clean code (well above chance), 5-fold cross-validation scores, feature importance analysis, and a direct comparison against an SVM classifier trained on the same dataset and features to confirm the observed drops reflect genuine stylometry changes rather than classifier weakness. revision: yes

  2. Referee: [Functionality preservation checks] Functionality preservation section: the manuscript states that checks 'demonstrate the challenges' but supplies no quantitative metrics such as test-suite pass rates, compilation success, or behavioral equivalence measures. This prevents assessment of whether obfuscation success comes at the cost of broken code.

    Authors: We acknowledge the absence of quantitative metrics in the functionality section limits evaluation of the trade-off. The revision now reports concrete measures including compilation success rates, test-suite pass rates for snippets with available tests, and behavioral equivalence checks via execution on sample inputs. These additions make explicit the extent to which functionality is preserved or degraded under each prompting strategy. revision: yes

Circularity Check

0 steps flagged

No circularity: experimental comparison with external classifier metric

full rationale

This is an empirical study that applies LLMs via prompts to modify code and then measures the effect using a separately trained Random Forest authorship classifier. No derivation chain, equations, fitted parameters renamed as predictions, or self-citation load-bearing steps appear in the abstract or described experiments. The central claim rests on observed accuracy changes rather than any reduction to the paper's own inputs by construction. The noted absence of baseline classifier metrics is a validity concern, not a circularity issue under the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The study depends on the assumption that the chosen classifier and functionality checks are valid measures, with no free parameters explicitly mentioned.

axioms (1)
  • domain assumption The Random Forest classifier accurately captures code stylometry features for authorship attribution.
    The paper relies on this to measure success of obfuscation.

pith-pipeline@v0.9.1-grok · 5699 in / 1174 out tokens · 29719 ms · 2026-06-26T10:23:20.447592+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

16 extracted references · 1 linked inside Pith

  1. [1]

    De-anonymizing program- mers via code stylometry,

    A. Caliskan-Islam et al., “De-anonymizing program- mers via code stylometry,” in24th USENIX security symposium (USENIX Security 15), 2015, pp. 255–270

  2. [2]

    Distinguishing ai-and human-generated code: A case study,

    S. Bukhari, B. Tan, and L. De Carli, “Distinguishing ai-and human-generated code: A case study,” inPro- ceedings of the 2023 Workshop on Software Supply Chain Offensive Research and Ecosystem Defenses, 2023, pp. 17–25

  3. [3]

    [Online]

    OpenAI,Gpt-3.5. [Online]. Available: https://openai. com/blog/gpt-3-5-turbo-fine-tuning-and-api-updates

  4. [4]

    OpenAI,Gpt-4, Mar. 2023. [Online]. Available: https: //openai.com/research/gpt-4

  5. [5]

    Robust learning against relational adversaries,

    Y . Wang, M. Alhanahnah, X. Meng, K. Wang, M. Christodorescu, and S. Jha, “Robust learning against relational adversaries,”Advances in Neural Information Processing Systems, vol. 35, pp. 16 246–16 260, 2022

  6. [6]

    Doppelg¨anger finder: Taking stylometry to the underground,

    S. Afroz, A. C. Islam, A. Stolerman, R. Greenstadt, and D. McCoy, “Doppelg¨anger finder: Taking stylometry to the underground,” in2014 IEEE Symposium on Security and Privacy, IEEE, 2014, pp. 212–226

  7. [7]

    Practical attacks against authorship recognition techniques,

    M. R. Brennan and R. Greenstadt, “Practical attacks against authorship recognition techniques,” inTwenty- First IAAI Conference, 2009. UNIVERSITY OF CALGARY , SCHULICH SCHOOL OF ENGINEERING, UNDERGRADUATE RESEARCH THESIS, WINTER 2024 10

  8. [8]

    Source code authorship attribu- tion using long short-term memory based networks,

    B. Alsulami, E. Dauber, R. Harang, S. Mancoridis, and R. Greenstadt, “Source code authorship attribu- tion using long short-term memory based networks,” inComputer Security–ESORICS 2017: 22nd European Symposium on Research in Computer Security, Oslo, Norway, September 11-15, 2017, Proceedings, Part I 22, Springer, 2017, pp. 65–82

  9. [9]

    When coding style survives compi- lation: De-anonymizing programmers from executable binaries,

    A. Caliskan et al., “When coding style survives compi- lation: De-anonymizing programmers from executable binaries,”arXiv preprint arXiv:1512.08546, 2015

  10. [10]

    Apr. 2024. [Online]. Available: https://en.wikipedia.org/ wiki/Abstract syntax tree#References

  11. [11]

    Long short-term memory,

    S. Hochreiter and J. Schmidhuber, “Long short-term memory,”Neural computation, vol. 9, no. 8, pp. 1735– 1780, 1997

  12. [12]

    [Online]

    microsoft, May 2024. [Online]. Available: https : / / github.com/microsoft/methods2test

  13. [13]

    Unit test case generation with transformers and focal context,

    M. Tufano, D. Drain, A. Svyatkovskiy, S. K. Deng, and N. Sundaresan, “Unit test case generation with transformers and focal context,”arXiv preprint arXiv:2009.05617, 2020

  14. [14]

    Rebryk,Rebryk/code stylometry, Sep

    Y . Rebryk,Rebryk/code stylometry, Sep. 2023. [On- line]. Available: https : / / github . com / rebryk / code stylometry

  15. [15]

    Pordanesh,Sinapordanesh/llms on code stylometry, May 2024

    S. Pordanesh,Sinapordanesh/llms on code stylometry, May 2024. [Online]. Available: https : / / github . com / sinapordanesh/LLMs on Code Stylometry

  16. [16]

    Random forests,

    L. Breiman, “Random forests,”Machine learning, vol. 45, pp. 5–32, 2001. APPENDIX A.1 Methods2Test Dataset JSON Structure. Fig. 6: Methods2Test Dataset JSON Structure