Case Study: Fine-tuning Small Language Models for Accurate and Private CWE Detection in Python Code
Pith reviewed 2026-05-22 18:36 UTC · model grok-4.3
The pith
Fine-tuned small language models can detect MITRE Top 25 CWEs in Python code at roughly 99 percent accuracy on the paper's own test set, while running locally to keep code private.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
After instruction-following fine-tuning on a 500-example dataset of Python code covering the MITRE Top 25 CWEs, the specialized 350-million parameter codegen-mono model achieves approximately 99 percent accuracy, 98.08 percent precision, 100 percent recall, and 99.04 percent F1-score on CWE detection. The unmodified base model failed to identify any CWEs in the test samples. These outcomes indicate that small language models can be turned into precise, on-premise tools for vulnerability analysis in Python.
What carries the argument
Instruction-following fine-tuning of the pre-trained codegen-mono 350-million parameter model on a semi-supervised dataset of 500 Python examples generated synthetically and reviewed by humans.
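As a rough illustration of what one instruction-following training record might look like, the sketch below turns a labeled snippet into a single prompt-plus-answer string. The template wording, the `CweExample` type, and the "No CWE detected" answer are assumptions made for this example; the paper's actual prompt format is not shown in this review.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical prompt template; the paper's real template is not given here.
PROMPT_TEMPLATE = (
    "### Instruction:\n"
    "Identify any MITRE Top 25 CWE present in the following Python code.\n\n"
    "### Code:\n{code}\n\n"
    "### Response:\n"
)


@dataclass
class CweExample:
    code: str
    cwe_id: Optional[str]  # e.g. "CWE-89", or None for a snippet with no weakness


def to_training_text(example: CweExample) -> str:
    """Join the prompt and the target answer into one string for causal-LM fine-tuning."""
    answer = example.cwe_id if example.cwe_id is not None else "No CWE detected"
    return PROMPT_TEMPLATE.format(code=example.code.strip()) + answer


# SQL built by string concatenation, the classic CWE-89 (SQL injection) pattern.
sample = CweExample(
    code='cursor.execute("SELECT * FROM users WHERE name = \'" + name + "\'")',
    cwe_id="CWE-89",
)
print(to_training_text(sample))
```

Keeping the answer as a short, fixed-format string is one possible design choice: it makes the model output easy to parse when scoring predictions.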
If this is right
- Fine-tuned small language models serve as accurate alternatives to large cloud models for detecting code weaknesses.
- The method supports privacy-preserving analysis because inference can occur entirely on local hardware.
- Near-perfect recall and high precision allow reliable integration of CWE checks into day-to-day development tools.
- The approach lowers computational cost and removes the need to transmit sensitive code off-site.
Where Pith is reading between the lines
- The same fine-tuning recipe could be applied to other programming languages to create local scanners beyond Python.
- Embedding the resulting model inside an IDE would let developers receive immediate local warnings about potential CWEs while writing code.
- Refining the synthetic data process with more varied real-world examples might improve performance on unseen code patterns.
Load-bearing premise
The 500-example dataset built from LLM synthetic generation and human review represents real-world Python vulnerabilities without systematic labeling errors that would inflate the test metrics.
What would settle it
Running the fine-tuned model on a fresh collection of independently labeled real-world Python projects that contain verified MITRE Top 25 CWEs and measuring whether accuracy stays near 99 percent.
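A minimal sketch of such a check, assuming the task is treated as a binary "vulnerable or not" decision per snippet. The `toy_detect` function is a placeholder standing in for the fine-tuned model, and the four samples are invented for illustration; neither comes from the paper.

```python
from collections import Counter
from typing import Callable, Iterable, Tuple

# Each sample is (source_code, is_vulnerable), labeled independently of the model.
Sample = Tuple[str, bool]


def confusion_counts(detect: Callable[[str], bool], samples: Iterable[Sample]) -> Counter:
    """Tally true/false positives and negatives for a binary detector."""
    counts = Counter(tp=0, fp=0, tn=0, fn=0)
    for code, is_vulnerable in samples:
        flagged = detect(code)
        if flagged and is_vulnerable:
            counts["tp"] += 1
        elif flagged and not is_vulnerable:
            counts["fp"] += 1
        elif not flagged and is_vulnerable:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def accuracy(counts: Counter) -> float:
    """Fraction of samples classified correctly."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total if total else 0.0


# Placeholder detector (hypothetical); swap in a call to the real model.
def toy_detect(code: str) -> bool:
    return "eval(" in code or "os.system(" in code


external_samples = [
    ("eval(user_input)", True),
    ("os.system('ls ' + path)", True),
    ("print('hello')", False),
    ("subprocess.run(cmd, shell=True)", True),  # missed by the toy detector
]
counts = confusion_counts(toy_detect, external_samples)
print(dict(counts), accuracy(counts))
```

Passing the detector in as a callable keeps the harness independent of how the model is loaded or served.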
Original abstract
Large Language Models (LLMs) have demonstrated significant capabilities in understanding and analyzing code for security vulnerabilities, such as Common Weakness Enumerations (CWEs). However, their reliance on cloud infrastructure and substantial computational requirements pose challenges for analyzing sensitive or proprietary codebases due to privacy concerns and inference costs. This work explores the potential of Small Language Models (SLMs) as a viable alternative for accurate, on-premise vulnerability detection. We investigated whether a 350-million parameter pre-trained code model (codegen-mono) could be effectively fine-tuned to detect the MITRE Top 25 CWEs specifically within Python code. To facilitate this, we developed a targeted dataset of 500 examples using a semi-supervised approach involving LLM-driven synthetic data generation coupled with meticulous human review. Initial tests confirmed that the base codegen-mono model completely failed to identify CWEs in our samples. However, after applying instruction-following fine-tuning, the specialized SLM achieved remarkable performance on our test set, yielding approximately 99% accuracy, 98.08% precision, 100% recall, and a 99.04% F1-score. These results strongly suggest that fine-tuned SLMs can serve as highly accurate and efficient tools for CWE detection, offering a practical and privacy-preserving solution for integrating advanced security analysis directly into development workflows.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a case study on fine-tuning the 350-million-parameter codegen-mono model on a 500-example dataset of Python code snippets, synthetically generated via LLM and refined by human review, to detect MITRE Top 25 CWEs. The base model fails to identify vulnerabilities, but after instruction-following fine-tuning the specialized SLM reports approximately 99% accuracy, 98.08% precision, 100% recall, and 99.04% F1-score on a held-out test set drawn from the same synthetic distribution, positioning the approach as a practical, privacy-preserving, on-premise alternative to large cloud LLMs for security analysis in development workflows.
Significance. If the reported performance generalizes, the work would illustrate that small language models can be specialized for high-accuracy CWE detection with modest data and compute, offering clear privacy and deployment advantages. The explicit contrast between the untuned model's complete failure and the fine-tuned model's strong metrics on the test set provides a useful existence proof for targeted SLM adaptation in code security tasks.
major comments (2)
- [Experiments / Results] The evaluation is performed exclusively on a test split from the 500-example LLM-synthetic dataset (described in the Dataset and Experiments sections). No external validation set drawn from real-world sources such as GitHub repositories, public CVEs, or production Python codebases is reported, which directly undermines the claim that the model constitutes a 'practical ... solution for integrating advanced security analysis directly into development workflows.'
- [Dataset Construction] The manuscript provides no quantification of distribution shift, no analysis of potential artifacts from the LLM synthetic generation process, and no details on train/test split sizes or cross-validation (Abstract and Dataset sections). Given that the headline metrics rest on this single synthetic distribution, these omissions leave the 99% accuracy and 99.04% F1-score vulnerable to inflation from memorization or generation biases.
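One simple way to probe the memorization concern is to check whether any test snippet is a near-verbatim copy of a training snippet. The sketch below is an assumption-laden illustration, not the authors' procedure: the normalization is deliberately crude (it strips anything after `#`, including inside string literals), and the split fraction and seed are arbitrary.

```python
import hashlib
import random
import re
from typing import List, Tuple


def normalize(code: str) -> str:
    """Strip '#' comments and collapse whitespace so trivially different copies match."""
    no_comments = re.sub(r"#.*", "", code)
    return re.sub(r"\s+", " ", no_comments).strip()


def fingerprint(code: str) -> str:
    """Stable hash of the normalized snippet."""
    return hashlib.sha256(normalize(code).encode("utf-8")).hexdigest()


def split_dataset(
    items: List[str], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Seeded random split into (train, test); fraction and seed are illustrative."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def leaked(train: List[str], test: List[str]) -> List[str]:
    """Return test snippets whose normalized form also appears in the training set."""
    seen = {fingerprint(code) for code in train}
    return [code for code in test if fingerprint(code) in seen]


train_demo = ["x = 1  # set x", "y = 2"]
test_demo = ["x = 1", "z = 3"]
print(leaked(train_demo, test_demo))
```

Exact-match hashing only catches the most obvious duplicates; templated synthetic examples that differ by a variable name would need a fuzzier comparison.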
minor comments (3)
- [Abstract] The abstract states a dataset of 500 examples but does not report the test-set size or split ratio; adding these figures would allow readers to assess the statistical reliability of the reported metrics.
- [Experiments] Consider including a direct comparison against at least one additional baseline (e.g., a different small code model or a simple static-analysis rule set) to isolate the contribution of the instruction fine-tuning step.
- [Task Definition] Clarify the exact CWE detection formulation (binary classification per CWE, multi-label, or ranked list) and whether the model outputs CWE identifiers or only vulnerability flags.
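To make the first minor comment concrete, the sketch below computes a Wilson score interval for an accuracy figure. The test-set size of 100 and the 99 correct predictions are assumptions chosen for illustration, since the abstract does not state the test-set size.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    )
    return center - half_width, center + half_width


# Assumed figures (illustrative only): 99 correct out of 100 test examples.
low, high = wilson_interval(successes=99, trials=100)
print(f"95% interval for accuracy: {low:.3f} to {high:.3f}")
```

With a test set this small, the interval spans several percentage points, which is why the split size matters for interpreting a headline accuracy.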
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major comment below and have revised the manuscript to improve transparency and balance the claims.
- Referee: [Experiments / Results] The evaluation is performed exclusively on a test split from the 500-example LLM-synthetic dataset (described in the Dataset and Experiments sections). No external validation set drawn from real-world sources such as GitHub repositories, public CVEs, or production Python codebases is reported, which directly undermines the claim that the model constitutes a 'practical ... solution for integrating advanced security analysis directly into development workflows.'
  Authors: We agree that the absence of external validation on real-world sources limits the strength of the practicality claim. This manuscript is presented as a case study demonstrating the feasibility of fine-tuning a 350M-parameter model on a modest human-reviewed synthetic dataset, where the base model fails completely but the fine-tuned version achieves high metrics on the held-out split. To address the concern, we have added a new Limitations section that explicitly notes the synthetic evaluation and outlines the need for future validation on GitHub repositories and CVEs. We have also revised the abstract, introduction, and conclusion to describe the results as showing promise for privacy-preserving on-premise analysis rather than claiming a fully practical deployed solution.
  Revision: yes
- Referee: [Dataset Construction] The manuscript provides no quantification of distribution shift, no analysis of potential artifacts from the LLM synthetic generation process, and no details on train/test split sizes or cross-validation (Abstract and Dataset sections). Given that the headline metrics rest on this single synthetic distribution, these omissions leave the 99% accuracy and 99.04% F1-score vulnerable to inflation from memorization or generation biases.
  Authors: We thank the referee for highlighting these omissions. The original Dataset section described the LLM-assisted generation followed by human review but did not report the train/test split (we used an 80/20 random split: 400 training, 100 test), cross-validation rationale, or explicit analysis of artifacts and shift. In the revision we have expanded the section to include: the split sizes and the decision to use a single split to maximize training data given the small corpus; details on the human review criteria used to reduce synthetic artifacts and ensure realistic CWE patterns; and a qualitative discussion of distribution shift, noting that prompts were designed to cover diverse Python scenarios while acknowledging that synthetic data may not fully capture production distributions. These additions provide greater transparency regarding the reported metrics.
  Revision: yes
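The headline metrics can be related to one another with a short worked calculation. The confusion counts below are hypothetical: they are one set that is consistent with 99% accuracy, 98.08% precision, and 100% recall on a 100-example test set (the size given in the simulated rebuttal above), and they are not reported in the paper.

```python
def precision(tp: int, fp: int) -> float:
    """Share of flagged snippets that are truly vulnerable."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Share of truly vulnerable snippets that were flagged."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, written in terms of counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Share of all snippets classified correctly."""
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0


# Hypothetical counts (an assumption, not from the paper): a single false positive.
tp, fp, tn, fn = 51, 1, 48, 0

print(f"accuracy:  {accuracy(tp, fp, tn, fn):.2%}")
print(f"precision: {precision(tp, fp):.2%}")
print(f"recall:    {recall(tp, fn):.2%}")
print(f"f1:        {f1_score(tp, fp, fn):.2%}")
```

Under these assumed counts the harmonic-mean F1 comes out at about 99.03%, slightly below the quoted 99.04%; the small gap may reflect rounding or a different averaging convention in the original. The broader point is that a single misclassified example accounts for the entire difference from perfect scores.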
Circularity Check
No significant circularity in empirical fine-tuning evaluation
The paper describes a standard supervised fine-tuning workflow: synthetic dataset generation via LLM plus human review to create 500 examples, instruction fine-tuning of the base codegen-mono model, and reporting of accuracy/precision/recall/F1 on a held-out test split drawn from the same dataset. No equations, self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation. The reported metrics are direct empirical measurements from conventional ML evaluation rather than quantities that reduce to the inputs by construction, rendering the chain self-contained.
Axiom & Free-Parameter Ledger
free parameters (1)
- fine-tuning hyperparameters
axioms (1)
- domain assumption: Human review of LLM-generated examples produces accurate CWE labels without systematic bias
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: We developed a targeted dataset of 500 examples using a semi-supervised approach involving LLM-driven synthetic data generation coupled with meticulous human review... after applying instruction-following fine-tuning, the specialized SLM achieved remarkable performance on our test set, yielding approximately 99% accuracy...
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: fine-tuned SLMs can serve as highly accurate and efficient tools for CWE detection, offering a practical and privacy-preserving solution
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- SLM Finetuning for Natural Language to Domain Specific Code Generation in Production
  Fine-tuned small language models outperform larger models in natural language to domain-specific code generation with improved performance, latency, and the ability to adapt to customer-specific scenarios without losi...