Leveraging Large Language Models to Obscure Code Stylometry: A Comparative Study of GPT-3.5 and GPT-4
Pith reviewed 2026-06-26 10:23 UTC · model grok-4.3
The pith
Large language models can alter code to evade stylometry-based authorship attribution, with effectiveness varying by prompting method.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The study finds that LLMs can obscure code stylometry, with multi-shot methods outperforming single-shot ones and detailed structured prompts proving more effective, while also showing that preserving the original functionality remains a significant challenge after LLM modifications.
What carries the argument
A Random Forest classifier trained on code features for authorship attribution, against which the success of LLM-based style alteration is measured.
If this is right
- Multi-shot prompting achieves greater reduction in attribution accuracy than single-shot prompting.
- Detailed and structured prompts lead to better obfuscation results.
- Functionality of the modified code is often not preserved, limiting practical use.
- Authorship attribution techniques face new challenges from advanced AI code modification capabilities.
Where Pith is reading between the lines
- Existing code attribution systems may require retraining or new features to handle LLM-generated variations.
- This technique could be applied by developers to anonymize their contributions in collaborative projects.
- Further tests with other LLMs or different classifiers could reveal if the observed differences are model-specific.
Load-bearing premise
The Random Forest classifier provides a reliable measure of whether stylometric signatures have been successfully obscured.
What would settle it
If a new classifier trained specifically on examples of LLM-modified code achieves high attribution accuracy on the test set, this would indicate that the obfuscation is not robust.
Figures
read the original abstract
In the rapidly evolving field of software development, code stylometry analyzing unique stylistic signatures of programmers plays a crit-ical role in authorship attribution and cybersecurity. Recent advancements in artificial intelligence, particularly Large Language Models (LLMs) like GPT-3.5 and GPT-4, have introduced new dimensions to this field, challenging traditional stylometry techniques. This study investigates the effectiveness of LLMs in altering code stylometry while preserving functionality and evaluates the impact of various prompt engineering strategies. Through comprehensive experiments, we assess how well these models can obscure stylistic signatures to avoid detection by a Random Forest classifier trained for authorship attribution. The results reveal significant differences in effectiveness between single-shot and multi-shot methods and highlight the importance of detailed, structured prompts. Additionally, functionality preservation checks demonstrate the challenges in maintaining code integrity post-modification. This research provides critical insights into the robustness of authorship attribution techniques against advanced AI capabilities, informing future cybersecurity and software engineering developments
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that GPT-3.5 and GPT-4 can obscure code stylometry signatures through single-shot and multi-shot prompt engineering, producing measurable drops in accuracy for a Random Forest authorship attribution classifier while attempting to preserve functionality; it reports differences in effectiveness across prompting strategies and notes challenges in post-modification code integrity.
Significance. If the central experimental claims are supported by proper validation, the work would illustrate concrete limitations of traditional stylometry against LLM-based code transformation, informing both offensive obfuscation techniques and defensive attribution methods in cybersecurity and software engineering.
major comments (2)
- [Abstract] Abstract and experimental description: no baseline accuracy, cross-validation scores, feature importance, or comparison to alternative classifiers (e.g., SVM) is reported for the Random Forest model on unmodified code from the same authors and dataset. Without establishing that the classifier performs substantially above chance on clean code, post-modification accuracy drops cannot be interpreted as successful stylometry obfuscation; this is load-bearing for the central claim.
- [Functionality preservation checks] Functionality preservation section: the manuscript states that checks 'demonstrate the challenges' but supplies no quantitative metrics such as test-suite pass rates, compilation success, or behavioral equivalence measures. This prevents assessment of whether obfuscation success comes at the cost of broken code.
minor comments (1)
- [Abstract] Abstract contains a hyphenation artifact: 'crit-ical' should read 'critical'.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments identify key gaps in reporting that affect interpretability of the central claims. We have revised the manuscript to incorporate the requested baseline metrics and quantitative functionality results.
read point-by-point responses
-
Referee: [Abstract] Abstract and experimental description: no baseline accuracy, cross-validation scores, feature importance, or comparison to alternative classifiers (e.g., SVM) is reported for the Random Forest model on unmodified code from the same authors and dataset. Without establishing that the classifier performs substantially above chance on clean code, post-modification accuracy drops cannot be interpreted as successful stylometry obfuscation; this is load-bearing for the central claim.
Authors: We agree that baseline performance on unmodified code is essential for validating the obfuscation results. The original manuscript reported accuracy drops but did not include sufficient detail on initial classifier performance. In the revision we have added the Random Forest baseline accuracy on clean code (well above chance), 5-fold cross-validation scores, feature importance analysis, and a direct comparison against an SVM classifier trained on the same dataset and features to confirm the observed drops reflect genuine stylometry changes rather than classifier weakness. revision: yes
-
Referee: [Functionality preservation checks] Functionality preservation section: the manuscript states that checks 'demonstrate the challenges' but supplies no quantitative metrics such as test-suite pass rates, compilation success, or behavioral equivalence measures. This prevents assessment of whether obfuscation success comes at the cost of broken code.
Authors: We acknowledge the absence of quantitative metrics in the functionality section limits evaluation of the trade-off. The revision now reports concrete measures including compilation success rates, test-suite pass rates for snippets with available tests, and behavioral equivalence checks via execution on sample inputs. These additions make explicit the extent to which functionality is preserved or degraded under each prompting strategy. revision: yes
Circularity Check
No circularity: experimental comparison with external classifier metric
full rationale
This is an empirical study that applies LLMs via prompts to modify code and then measures the effect using a separately trained Random Forest authorship classifier. No derivation chain, equations, fitted parameters renamed as predictions, or self-citation load-bearing steps appear in the abstract or described experiments. The central claim rests on observed accuracy changes rather than any reduction to the paper's own inputs by construction. The noted absence of baseline classifier metrics is a validity concern, not a circularity issue under the enumerated patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The Random Forest classifier accurately captures code stylometry features for authorship attribution.
Reference graph
Works this paper leans on
-
[1]
De-anonymizing program- mers via code stylometry,
A. Caliskan-Islam et al., “De-anonymizing program- mers via code stylometry,” in24th USENIX security symposium (USENIX Security 15), 2015, pp. 255–270
2015
-
[2]
Distinguishing ai-and human-generated code: A case study,
S. Bukhari, B. Tan, and L. De Carli, “Distinguishing ai-and human-generated code: A case study,” inPro- ceedings of the 2023 Workshop on Software Supply Chain Offensive Research and Ecosystem Defenses, 2023, pp. 17–25
2023
-
[3]
[Online]
OpenAI,Gpt-3.5. [Online]. Available: https://openai. com/blog/gpt-3-5-turbo-fine-tuning-and-api-updates
-
[4]
OpenAI,Gpt-4, Mar. 2023. [Online]. Available: https: //openai.com/research/gpt-4
2023
-
[5]
Robust learning against relational adversaries,
Y . Wang, M. Alhanahnah, X. Meng, K. Wang, M. Christodorescu, and S. Jha, “Robust learning against relational adversaries,”Advances in Neural Information Processing Systems, vol. 35, pp. 16 246–16 260, 2022
2022
-
[6]
Doppelg¨anger finder: Taking stylometry to the underground,
S. Afroz, A. C. Islam, A. Stolerman, R. Greenstadt, and D. McCoy, “Doppelg¨anger finder: Taking stylometry to the underground,” in2014 IEEE Symposium on Security and Privacy, IEEE, 2014, pp. 212–226
2014
-
[7]
Practical attacks against authorship recognition techniques,
M. R. Brennan and R. Greenstadt, “Practical attacks against authorship recognition techniques,” inTwenty- First IAAI Conference, 2009. UNIVERSITY OF CALGARY , SCHULICH SCHOOL OF ENGINEERING, UNDERGRADUATE RESEARCH THESIS, WINTER 2024 10
2009
-
[8]
Source code authorship attribu- tion using long short-term memory based networks,
B. Alsulami, E. Dauber, R. Harang, S. Mancoridis, and R. Greenstadt, “Source code authorship attribu- tion using long short-term memory based networks,” inComputer Security–ESORICS 2017: 22nd European Symposium on Research in Computer Security, Oslo, Norway, September 11-15, 2017, Proceedings, Part I 22, Springer, 2017, pp. 65–82
2017
-
[9]
When coding style survives compi- lation: De-anonymizing programmers from executable binaries,
A. Caliskan et al., “When coding style survives compi- lation: De-anonymizing programmers from executable binaries,”arXiv preprint arXiv:1512.08546, 2015
Pith/arXiv arXiv 2015
-
[10]
Apr. 2024. [Online]. Available: https://en.wikipedia.org/ wiki/Abstract syntax tree#References
2024
-
[11]
Long short-term memory,
S. Hochreiter and J. Schmidhuber, “Long short-term memory,”Neural computation, vol. 9, no. 8, pp. 1735– 1780, 1997
1997
-
[12]
[Online]
microsoft, May 2024. [Online]. Available: https : / / github.com/microsoft/methods2test
2024
-
[13]
Unit test case generation with transformers and focal context,
M. Tufano, D. Drain, A. Svyatkovskiy, S. K. Deng, and N. Sundaresan, “Unit test case generation with transformers and focal context,”arXiv preprint arXiv:2009.05617, 2020
arXiv 2009
-
[14]
Rebryk,Rebryk/code stylometry, Sep
Y . Rebryk,Rebryk/code stylometry, Sep. 2023. [On- line]. Available: https : / / github . com / rebryk / code stylometry
2023
-
[15]
Pordanesh,Sinapordanesh/llms on code stylometry, May 2024
S. Pordanesh,Sinapordanesh/llms on code stylometry, May 2024. [Online]. Available: https : / / github . com / sinapordanesh/LLMs on Code Stylometry
2024
-
[16]
Random forests,
L. Breiman, “Random forests,”Machine learning, vol. 45, pp. 5–32, 2001. APPENDIX A.1 Methods2Test Dataset JSON Structure. Fig. 6: Methods2Test Dataset JSON Structure
2001
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.