arxiv: 2504.16584 · v1 · pith:BEQN6PSTnew · submitted 2025-04-23 · 💻 cs.CR · cs.AI

Case Study: Fine-tuning Small Language Models for Accurate and Private CWE Detection in Python Code

Pith reviewed 2026-05-22 18:36 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords small language models · CWE detection · Python code · fine-tuning · vulnerability detection · privacy-preserving · on-premise analysis · software security

The pith

Fine-tuned small language models can detect MITRE Top 25 CWEs in Python code at 99 percent accuracy while running locally to keep code private.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether a 350-million parameter pre-trained code model can be adapted through fine-tuning to spot common security weaknesses in Python programs. Large language models already handle this analysis effectively but require sending code to remote servers, which creates privacy risks and high costs for proprietary work. The authors generated a 500-example dataset through synthetic creation followed by human review, then applied instruction-following fine-tuning; the base model detected none of the weaknesses while the tuned version reached 99 percent accuracy, 98.08 percent precision, 100 percent recall, and 99.04 percent F1-score on the held-out test set. A reader would care because the results point to a workable path for embedding accurate vulnerability scanning inside local tools and workflows without external data exposure.

Core claim

After instruction-following fine-tuning on a 500-example dataset of Python code covering the MITRE Top 25 CWEs, the specialized 350-million parameter codegen-mono model achieves approximately 99 percent accuracy, 98.08 percent precision, 100 percent recall, and 99.04 percent F1-score on CWE detection. The unmodified base model failed to identify any CWEs in the test samples. These outcomes indicate that small language models can be turned into precise, on-premise tools for vulnerability analysis in Python.
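
The four headline metrics are linked through the confusion matrix, which this page does not report. As a minimal sketch, the counts below are hypothetical: they assume a 100-example test set and were picked only because they reproduce the stated accuracy, precision, and recall. With these counts F1 comes out near 99.03 percent, marginally below the reported 99.04 percent, so the paper's actual counts or rounding may differ.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts (NOT taken from the paper): 51 true positives,
# 1 false positive, 0 false negatives, 48 true negatives.
hypothetical = classification_metrics(tp=51, fp=1, fn=0, tn=48)
for name, value in hypothetical.items():
    print(f"{name}: {value:.2%}")
```

Under these assumed counts, a single false positive accounts for the whole gap between precision and recall, which shows how coarse the percentages are on a test set of this size.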

What carries the argument

Instruction-following fine-tuning of the pre-trained codegen-mono 350-million parameter model on a semi-supervised dataset of 500 Python examples generated synthetically and reviewed by humans.
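
To make "instruction-following" concrete, the sketch below shows one way a labeled snippet could be turned into a single training string. The prompt template, the `Example` class, and the label format are all assumptions for illustration; the paper's actual prompt wording and output format are not given on this page.

```python
from dataclasses import dataclass

# Hypothetical prompt template; the wording is an assumption, not the paper's.
PROMPT_TEMPLATE = (
    "### Instruction:\n"
    "Identify any CWE present in the following Python code.\n\n"
    "### Code:\n{code}\n\n"
    "### Response:\n"
)


@dataclass
class Example:
    code: str   # Python snippet to analyze
    label: str  # assumed label format, e.g. a CWE identifier


def to_training_text(example: Example) -> str:
    """Join the prompt and the expected answer into one training string."""
    return PROMPT_TEMPLATE.format(code=example.code.strip()) + example.label


# Illustrative snippet: user input concatenated into a shell command.
sample = Example(
    code='import os\n\ndef ping(host):\n    os.system("ping -c 1 " + host)\n',
    label="CWE-78",
)
print(to_training_text(sample))
```

During fine-tuning, a causal language model would be trained to continue the prompt with the label; at inference time only the prompt portion is supplied.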

If this is right

  • Fine-tuned small language models serve as accurate alternatives to large cloud models for detecting code weaknesses.
  • The method supports privacy-preserving analysis because inference can occur entirely on local hardware.
  • Near-perfect recall and high precision allow reliable integration of CWE checks into day-to-day development tools.
  • The approach lowers computational cost and removes the need to transmit sensitive code off-site.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same fine-tuning recipe could be applied to other programming languages to create local scanners beyond Python.
  • Embedding the resulting model inside an IDE would let developers receive immediate local warnings about potential CWEs while writing code.
  • Refining the synthetic data process with more varied real-world examples might improve performance on unseen code patterns.

Load-bearing premise

The 500-example dataset built from LLM synthetic generation and human review represents real-world Python vulnerabilities without systematic labeling errors that would inflate the test metrics.

What would settle it

Running the fine-tuned model on a fresh collection of independently labeled real-world Python projects that contain verified MITRE Top 25 CWEs and measuring whether accuracy stays near 99 percent.

Figures

Figures reproduced from arXiv: 2504.16584 by Hossen A Mustafa, Md. Azizul Hakim Bappy, and Prottoy Saha (Institute of Information and Communication Technology, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh) and Rajinus Salehat (Hajee Mohammad Danesh Science and Technology University, Dinajpur, Bangladesh).

Figure 1: Semi-Supervised Dataset Creation and Fine-Tuning Pipeline
Figure 2: Dataset Example

3.3 Model Selection. We selected the codegen-mono 350M model [24] as our base SLM. This model was chosen because: (a) It is a publicly available, pre-trained model specifically designed for code-related tasks. (b) Its 5 […]

Original abstract

Large Language Models (LLMs) have demonstrated significant capabilities in understanding and analyzing code for security vulnerabilities, such as Common Weakness Enumerations (CWEs). However, their reliance on cloud infrastructure and substantial computational requirements pose challenges for analyzing sensitive or proprietary codebases due to privacy concerns and inference costs. This work explores the potential of Small Language Models (SLMs) as a viable alternative for accurate, on-premise vulnerability detection. We investigated whether a 350-million parameter pre-trained code model (codegen-mono) could be effectively fine-tuned to detect the MITRE Top 25 CWEs specifically within Python code. To facilitate this, we developed a targeted dataset of 500 examples using a semi-supervised approach involving LLM-driven synthetic data generation coupled with meticulous human review. Initial tests confirmed that the base codegen-mono model completely failed to identify CWEs in our samples. However, after applying instruction-following fine-tuning, the specialized SLM achieved remarkable performance on our test set, yielding approximately 99% accuracy, 98.08% precision, 100% recall, and a 99.04% F1-score. These results strongly suggest that fine-tuned SLMs can serve as highly accurate and efficient tools for CWE detection, offering a practical and privacy-preserving solution for integrating advanced security analysis directly into development workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript presents a case study on fine-tuning the 350-million-parameter codegen-mono model on a 500-example dataset of Python code snippets, synthetically generated via LLM and refined by human review, to detect MITRE Top 25 CWEs. The base model fails to identify vulnerabilities, but after instruction-following fine-tuning the specialized SLM reports approximately 99% accuracy, 98.08% precision, 100% recall, and 99.04% F1-score on a held-out test set drawn from the same synthetic distribution, positioning the approach as a practical, privacy-preserving, on-premise alternative to large cloud LLMs for security analysis in development workflows.

Significance. If the reported performance generalizes, the work would illustrate that small language models can be specialized for high-accuracy CWE detection with modest data and compute, offering clear privacy and deployment advantages. The explicit contrast between the untuned model's complete failure and the fine-tuned model's strong metrics on the test set provides a useful existence proof for targeted SLM adaptation in code security tasks.

major comments (2)
  1. [Experiments / Results] The evaluation is performed exclusively on a test split from the 500-example LLM-synthetic dataset (described in the Dataset and Experiments sections). No external validation set drawn from real-world sources such as GitHub repositories, public CVEs, or production Python codebases is reported, which directly undermines the claim that the model constitutes a 'practical ... solution for integrating advanced security analysis directly into development workflows.'
  2. [Dataset Construction] The manuscript provides no quantification of distribution shift, no analysis of potential artifacts from the LLM synthetic generation process, and no details on train/test split sizes or cross-validation (Abstract and Dataset sections). Given that the headline metrics rest on this single synthetic distribution, these omissions leave the 99% accuracy and 99.04% F1-score vulnerable to inflation from memorization or generation biases.
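
One inexpensive way to probe the memorization concern is to look for near-duplicate snippets across the train and test splits. The sketch below uses token 3-gram Jaccard similarity; the function names, the regex tokenizer, and the 0.8 threshold are assumptions chosen for illustration, not anything the paper or the referee specifies.

```python
import re


def shingles(code: str, n: int = 3) -> set:
    """Return the set of token n-grams for a snippet (whitespace is ignored)."""
    tokens = re.findall(r"\w+|[^\w\s]", code)
    return {tuple(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 1))}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def flag_near_duplicates(train, test, threshold=0.8):
    """List (test_index, best_similarity) for test snippets close to any train snippet."""
    train_sets = [shingles(code) for code in train]
    flagged = []
    for idx, code in enumerate(test):
        current = shingles(code)
        best = max((jaccard(current, t) for t in train_sets), default=0.0)
        if best >= threshold:
            flagged.append((idx, best))
    return flagged


# Toy data: the first test snippet differs from the train snippet only in spacing.
train = ['os.system("ping " + host)']
test = ['os.system( "ping "+host )', "hashlib.md5(data).hexdigest()"]
print(flag_near_duplicates(train, test))
```

A high share of flagged test examples would suggest that the held-out score partly measures recall of templates seen in training rather than detection of new patterns.
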
minor comments (3)
  1. [Abstract] The abstract states a dataset of 500 examples but does not report the test-set size or split ratio; adding these figures would allow readers to assess the statistical reliability of the reported metrics.
  2. [Experiments] Consider including a direct comparison against at least one additional baseline (e.g., a different small code model or a simple static-analysis rule set) to isolate the contribution of the instruction fine-tuning step.
  3. [Task Definition] Clarify the exact CWE detection formulation (binary classification per CWE, multi-label, or ranked list) and whether the model outputs CWE identifiers or only vulnerability flags.
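
The first minor comment turns on how much a single accuracy figure can say when the test set is small. As a rough sketch, the Wilson score interval below is applied to 99 correct out of 100; the test-set size of 100 is an assumption here, since the abstract does not state it.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half_width, center + half_width


# Assumed counts: 99 correct predictions out of 100 test examples.
low, high = wilson_interval(successes=99, n=100)
print(f"95% Wilson interval: [{low:.3f}, {high:.3f}]")
```

Under that assumption the interval spans roughly 95 to 99.8 percent, so the point estimate alone overstates how tightly the accuracy is pinned down.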

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and have revised the manuscript to improve transparency and balance the claims.

Point-by-point responses
  1. Referee: [Experiments / Results] The evaluation is performed exclusively on a test split from the 500-example LLM-synthetic dataset (described in the Dataset and Experiments sections). No external validation set drawn from real-world sources such as GitHub repositories, public CVEs, or production Python codebases is reported, which directly undermines the claim that the model constitutes a 'practical ... solution for integrating advanced security analysis directly into development workflows.'

    Authors: We agree that the absence of external validation on real-world sources limits the strength of the practicality claim. This manuscript is presented as a case study demonstrating the feasibility of fine-tuning a 350M-parameter model on a modest human-reviewed synthetic dataset, where the base model fails completely but the fine-tuned version achieves high metrics on the held-out split. To address the concern, we have added a new Limitations section that explicitly notes the synthetic evaluation and outlines the need for future validation on GitHub repositories and CVEs. We have also revised the abstract, introduction, and conclusion to describe the results as showing promise for privacy-preserving on-premise analysis rather than claiming a fully practical deployed solution. revision: yes

  2. Referee: [Dataset Construction] The manuscript provides no quantification of distribution shift, no analysis of potential artifacts from the LLM synthetic generation process, and no details on train/test split sizes or cross-validation (Abstract and Dataset sections). Given that the headline metrics rest on this single synthetic distribution, these omissions leave the 99% accuracy and 99.04% F1-score vulnerable to inflation from memorization or generation biases.

    Authors: We thank the referee for highlighting these omissions. The original Dataset section described the LLM-assisted generation followed by human review but did not report the train/test split (we used an 80/20 random split: 400 training, 100 test), cross-validation rationale, or explicit analysis of artifacts and shift. In the revision we have expanded the section to include: the split sizes and the decision to use a single split to maximize training data given the small corpus; details on the human review criteria used to reduce synthetic artifacts and ensure realistic CWE patterns; and a qualitative discussion of distribution shift, noting that prompts were designed to cover diverse Python scenarios while acknowledging that synthetic data may not fully capture production distributions. These additions provide greater transparency regarding the reported metrics. revision: yes
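
The 80/20 random split described in this response can be made reproducible with a fixed seed. The sketch below is one plain way to do it; the function name and the seed value are hypothetical, and nothing here is taken from the authors' code.

```python
import random


def split_dataset(examples, test_fraction=0.2, seed=0):
    """Shuffle with a fixed seed, then return (train, test) lists."""
    items = list(examples)
    random.Random(seed).shuffle(items)  # local RNG, so global state is untouched
    n_test = round(len(items) * test_fraction)
    return items[n_test:], items[:n_test]


# Stand-in for the 500 examples: integer IDs instead of real snippets.
train, test = split_dataset(range(500))
print(len(train), len(test))
```

A split stratified by CWE type would be a reasonable alternative when some weaknesses have only a handful of examples, since a purely random draw can leave a class out of the test set.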

Circularity Check

0 steps flagged

No significant circularity in empirical fine-tuning evaluation

Full rationale

The paper describes a standard supervised fine-tuning workflow: synthetic dataset generation via LLM plus human review to create 500 examples, instruction fine-tuning of the base codegen-mono model, and reporting of accuracy/precision/recall/F1 on a held-out test split drawn from the same dataset. No equations, self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation. The reported metrics are direct empirical measurements from conventional ML evaluation rather than quantities that reduce to the inputs by construction, rendering the chain self-contained.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central performance claim rests on the correctness of human-reviewed synthetic labels and the assumption that the small test distribution matches real codebases; no new physical entities or mathematical axioms are introduced.

free parameters (1)
  • fine-tuning hyperparameters
    Learning rate, epochs, and batch size are chosen during training but not reported in the abstract; these choices directly affect the final metrics.
axioms (1)
  • domain assumption: Human review of LLM-generated examples produces accurate CWE labels without systematic bias
    The entire evaluation pipeline depends on this labeling step being reliable.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SLM Finetuning for Natural Language to Domain Specific Code Generation in Production

    cs.LG 2026-04 unverdicted novelty 3.0

    Fine-tuned small language models outperform larger models in natural language to domain-specific code generation with improved performance, latency, and the ability to adapt to customer-specific scenarios without losi...

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · cited by 1 Pith paper · 6 internal anchors

  1. [1]

    H. Mittal. Software security: Threats, solutions and challenges. Comput. Softw. Media Appl., 6(1):3769, Feb. 2024. doi: 10.24294/csma.v6i1.3769

  2. [2]

    R. A. Martin and S. Barnum. Common weakness enumeration (cwe) status update. ACM SIGAda Ada Lett., XXVIII(1):88–91, Apr. 2008. doi: 10.1145/1387830.1387835

  3. [3]

    A. Bagnato. Security in model-driven architecture. In Proceedings on European Workshop on Security in Model Driven Architecture, 2009

  4. [4]

    D. Nikolic, D. Stefanovic, D. Dakic, S. Sladojevic, and S. Ristic. Analysis of the tools for static code analysis. In 2021 20th International Symposium INFOTEH-JAHORINA (INFOTEH), pages 1–6, East Sarajevo, Bosnia and Herzegovina, Mar. 2021. IEEE. doi: 10.1109/INFOTEH51037.2021.9400688

  5. [5]

    M. Vieira. Engineering trustworthy software: A mission for llms. arXiv preprint arXiv:2411.17981, 2024. doi: 10.48550/ARXIV.2411.17981

  6. [6]

    K. Dozono, T. E. Gasiba, and A. Stocco. Large language models for secure code assessment: A multi-language empirical study. arXiv preprint arXiv:2408.06428, 2024. doi: 10.48550/ARXIV.2408.06428

  7. [7]

    M. Kazim and S. Ying. A survey on top security threats in cloud computing. Int. J. Adv. Comput. Sci. Appl., 6(3), 2015. doi: 10.14569/IJACSA.2015.060316

  8. [8]

    N. Sehrawat, S. Vashisht, and N. Kaur. Edge-computing paradigm: Survey and analysis on security threads. In 2021 International Conference on Computing Sciences (ICCS), pages 254–259, Phagwara, India, Dec. 2021. IEEE. doi: 10.1109/ICCS54944.2021.00057

  9. [9]

    P. G. Recasens et al. Towards pareto optimal throughput in small language model serving. In Proceedings of the 4th Workshop on Machine Learning and Systems, pages 144–152, Athens, Greece, Apr. 2024. ACM. doi: 10.1145/3642970.3655832

  10. [10]

    CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis

    E. Nijkamp et al. Codegen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474, Feb. 2023. doi: 10.48550/arXiv.2203.13474

  11. [11]

    J. White. Bandit algorithms for website optimization. O’Reilly Media, Inc., 2013

  12. [12]

    G. Bennett, T. Hall, E. Winter, and S. Counsell. Semgrep*: Improving the limited performance of static application security testing (sast) tools. In Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering, pages 614–623, 2024

  13. [13]

    A. H. Jerónimo, P. M. Moreno, J. A. V. Camacho, and G. C. Vega. Techniques of sast tools in the early stages of secure software development: A systematic literature review. In 2024 IEEE International Conference on Engineering Veracruz (ICEV), pages 1–8. IEEE, 2024

  14. [14]

    A. Eghbali and M. Pradel. Dynapyt: a dynamic analysis framework for python. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2022

  15. [15]

    T. Farasat and J. Posegga. Machine learning techniques for python source code vulnerability detection. In Proceedings of the Fourteenth ACM Conference on Data and Application Security and Privacy (CODASPY ’24), pages 151–153, Jun. 2024. ACM

  16. [16]

    W. Charoenwet, P. Thongtanunam, V.-T. Pham, and C. Treude. An empirical study of static analysis tools for secure code review. arXiv preprint arXiv:2407.12241 [cs.SE], 2024

  17. [17]

    A. Kumar and P. Sharma. Open ai codex: An inevitable future? International Journal for Research in Applied Science and Engineering Technology, 11:539–543, 2023

  18. [18]

    A. Maddy. Integrating ai services into semantic kernel: A case study on enhancing functionality with google palm and large language models. Transactions on Open Source Software Projects, 1(1), 2024

  19. [19]

    GPT-4 Technical Report

    J. Achiam et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  20. [20]

    Harnessing the power of LLMs in source code vulnerability detection

    A. A. Mahyari. Harnessing the power of llms in source code vulnerability detection. In MILCOM 2024 - 2024 IEEE Military Communications Conference (MILCOM), pages 251–256, Oct. 2024. IEEE

  21. [21]

    X. Du et al. Vul-rag: Enhancing llm-based vulnerability detection via knowledge-level rag. arXiv preprint arXiv:2406.13749, 2024

  22. [22]

    N. T. Islam et al. Llm-powered code vulnerability repair with reinforcement learning and semantic reward. arXiv preprint arXiv:2401.03374, 2024

  23. [23]

    Y. Zhang, W. Song, Z. Ji, N. Meng, et al. How well does llm generate security tests? arXiv preprint arXiv:2310.00710, 2023

  24. [24]

    CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis

    E. Nijkamp et al. Codegen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474, 2022

  25. [25]

    Y. Wang, W. Wang, S. Joty, and S. C. Hoi. Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. arXiv preprint arXiv:2109.00859, 2021

  26. [26]

    StarCoder: may the source be with you!

    R. Li et al. Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161, 2023

  27. [27]

    W. Murphy et al. Combining llm code generation with formal specifications and reactive program synthesis. arXiv preprint arXiv:2410.19736, 2024

  28. [28]

    S. Ugare, T. Suresh, H. Kang, S. Misailovic, and G. Singh. Syncode: Llm generation with grammar augmentation. arXiv preprint arXiv:2403.01632, 2024

  29. [29]

    J. Leinonen, P. Denny, O. Kiljunen, S. MacNeil, S. Sarsa, and A. Hellas. Llm-itation is the sincerest form of data: Generating synthetic buggy code submissions for computing education. arXiv preprint arXiv:2404.01898, 2024

  30. [30]

    MITRE CWE. CWE - CWE Top 25 Most Dangerous Software Weaknesses. https://cwe.mitre.org/top25/. Accessed: 15-Apr-2025

  31. [31]

    A. Bagheri and P. Hegedűs. Towards a block-level conformer-based python vulnerability detection. Software, 3(3):310–327, 2024. MDPI

  32. [32]

    S. Singh, S. Jeevan, N. Reddy, M. Moharir, et al. Temporal analysis and common weakness enumeration (cwe) code prediction for software vulnerabilities using machine learning. In 2024 8th International Conference on Computational System and Information Technology for Sustainable Solutions (CSITSS), pages 1–6. IEEE, 2024

  33. [33]

    A. Mechri, M. A. Ferrag, and M. Debbah. Secureqwen: Leveraging llms for vulnerability detection in python codebases. Computers & Security, 148:104151, 2025. Elsevier

  34. [34]

    B. Steenhoek et al. To err is machine: Vulnerability detection challenges llm reasoning. arXiv preprint arXiv:2403.17218, 2024

  35. [35]

    A. Shestov et al. Finetuning large language models for vulnerability detection. IEEE Access,

  36. [36]

    J. Li, F. Rabbi, C. Cheng, A. Sangalay, Y. Tian, and J. Yang. An exploratory study on fine-tuning large language models for secure code generation. arXiv preprint arXiv:2408.09078, 2024

  37. [37]

    X. Jiang et al. Investigating large language models for code vulnerability detection: An experimental study. arXiv preprint arXiv:2412.18260, 2024