pith. machine review for the scientific record.

arxiv: 2605.13138 · v1 · submitted 2026-05-13 · 💻 cs.SE · cs.CR · cs.LG

Recognition: no theorem link

Code-Centric Detection of Vulnerability-Fixing Commits: A Unified Benchmark and Empirical Study

Felix Mächtle, Joseph Bienhüls, Kristoffer Hempel, Nils Loose, Thomas Eisenbarth

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 18:35 UTC · model grok-4.3

classification 💻 cs.SE · cs.CR · cs.LG
keywords vulnerability-fixing commits · code language models · commit detection · security patches · benchmark evaluation · attention attribution · software security

The pith

Code language models acquire no transferable security understanding from vulnerability-fixing code changes alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper consolidates more than 20 prior datasets into one benchmark covering over 180,000 commits and runs over 180 experiments on fine-tuned code models ranging from 125 million to 14 billion parameters. It reports that models show no evidence of learning security-relevant patterns from code diffs alone. When commit messages are present, they dominate model attention; when messages are removed, even extra intra-procedural semantic context added to the diffs fails to redirect attention toward the actual code changes. Group-stratified splits produce roughly 17 percent performance drops relative to random splits, temporal splits on aggregated data prove unreliable due to project-distribution shifts, and at a 0.5 percent false-positive rate all code-only models miss more than 93 percent of vulnerabilities.
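To make the split distinction concrete, here is a minimal sketch, not the authors' released framework, contrasting a random commit split with a project-grouped split; the data layout and the scikit-learn calls are illustrative assumptions.

```python
# Minimal sketch (assumed data layout, not the authors' framework): contrast
# a random commit split with a project-grouped split to expose the project
# leakage behind the roughly 17 percent gap.
from sklearn.model_selection import GroupShuffleSplit, train_test_split

def random_split(texts, labels, seed=0):
    # Commits from the same project can land on both sides of the split.
    return train_test_split(texts, labels, test_size=0.2, random_state=seed)

def grouped_split(texts, labels, projects, seed=0):
    # Whole projects are held out, so no project straddles train and test.
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)
    train_idx, test_idx = next(splitter.split(texts, labels, groups=projects))
    return train_idx, test_idx
```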

Core claim

We find no evidence that models acquire transferable security-relevant code understanding from code changes alone. When commit messages are available, they dominate model attention, and when removed, an attribution analysis shows that enriching diffs with additional intra-procedural semantic context does not shift model attention toward the code changes.

What carries the argument

A unified consolidation framework that merges fragmented VFC datasets and applies attention attribution analysis to compare model behavior on diffs alone versus diffs plus commit messages.
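As one way to picture that machinery, the sketch below computes token-level integrated gradients for a CodeBERT-style classifier with Captum; the checkpoint name and the binary "VFC" head are assumptions for illustration, not the paper's released artifacts.

```python
# Hedged sketch of the attribution step: integrated gradients over the
# embedding layer of a CodeBERT-style classifier, reported per token.
import torch
from captum.attr import LayerIntegratedGradients
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=2)
model.eval()

def vfc_score(input_ids):
    # Logit of the hypothetical "vulnerability-fixing" class.
    return model(input_ids).logits[:, 1]

def token_attributions(commit_text):
    ids = tok(commit_text, return_tensors="pt", truncation=True).input_ids
    baseline = torch.full_like(ids, tok.pad_token_id)  # all-padding reference
    lig = LayerIntegratedGradients(vfc_score, model.roberta.embeddings)
    attr = lig.attribute(ids, baselines=baseline).sum(dim=-1).squeeze(0)
    return list(zip(tok.convert_ids_to_tokens(ids[0]), attr.abs().tolist()))
```

Summing attributions over tokens belonging to the diff versus the message is what lets a study like this say which part of the input the model actually leans on.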

If this is right

  • Group-stratified evaluation produces approximately 17 percent performance drops compared with random splits.
  • Temporal splits on aggregated datasets become unreliable because of compositional shifts in the underlying project distributions.
  • At a false-positive rate of 0.5 percent, all fine-tuned code-only models miss more than 93 percent of vulnerabilities (a sketch of this metric follows the list).
  • Larger and more diverse training data or generative approaches yield preliminary gains but leave the core limitations intact.
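The low-FPR operating point in the third bullet can be read directly off a score distribution; `labels` and `scores` below are hypothetical toy arrays, not the paper's data.

```python
# Illustrative only: miss rate at a fixed 0.5% false-positive rate.
import numpy as np
from sklearn.metrics import roc_curve

def miss_rate_at_fpr(labels, scores, target_fpr=0.005):
    fpr, tpr, _ = roc_curve(labels, scores)
    feasible = fpr <= target_fpr  # operating points within the FPR budget
    recall = tpr[feasible].max() if feasible.any() else 0.0
    return 1.0 - recall  # fraction of true VFCs the detector misses

labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.10, 0.40, 0.35, 0.05, 0.80, 0.70, 0.20, 0.90])
print(miss_rate_at_fpr(labels, scores))  # 0.25 on this toy data
```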

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Models appear to learn surface-level textual cues in messages rather than deeper code semantics, implying that purely code-centric detectors may require entirely different training signals.
  • The attention results suggest that adding intra-procedural context alone is insufficient; richer inter-procedural or data-flow features might need to be tested explicitly.
  • Future benchmarks could isolate code changes by deliberately withholding messages during both training and inference to measure genuine code understanding (see the sketch after this list).
  • The 17 percent drop in group-stratified settings indicates that project-level leakage in random splits may inflate reported performance across many security tasks.
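A minimal sketch of the message-withholding idea from the third bullet, assuming a hypothetical `Commit` record: build a diff-only view and a diff+message view of each commit so either can be used for both training and inference.

```python
# Sketch of the ablation suggested above (Commit structure is assumed).
from dataclasses import dataclass

@dataclass
class Commit:
    message: str
    diff: str
    label: int  # 1 = vulnerability-fixing, 0 = other

def render(commit, with_message):
    parts = ([commit.message] if with_message else []) + [commit.diff]
    return "\n".join(parts)

def make_views(commits):
    code_only = [(render(c, False), c.label) for c in commits]
    code_plus_message = [(render(c, True), c.label) for c in commits]
    return code_only, code_plus_message
```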

Load-bearing premise

The merged datasets contain accurate, unbiased labels for vulnerability-fixing commits, and the random, group-stratified, and temporal splits reflect realistic deployment conditions without unmeasured shifts.

What would settle it

Training a code-only model on diffs without messages and observing it reach high recall at low false-positive rate in a temporal split on previously unseen projects would contradict the reported findings.
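A hedged sketch of that decisive split: train on early commits, then test only on later commits from projects never seen in training. The field names are assumptions about the commit records, not the released framework's schema.

```python
# Temporal split whose test side contains only previously unseen projects.
def temporal_unseen_project_split(commits, train_frac=0.6):
    ordered = sorted(commits, key=lambda c: c["timestamp"])
    cut = int(len(ordered) * train_frac)
    train = ordered[:cut]
    trained_projects = {c["project"] for c in train}
    # Later commits only, and only from projects absent during training.
    test = [c for c in ordered[cut:] if c["project"] not in trained_projects]
    return train, test
```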

Figures

Figures reproduced from arXiv: 2605.13138 by Felix Mächtle, Joseph Bienhüls, Kristoffer Hempel, Nils Loose, Thomas Eisenbarth.

Figure 1: Temporal overview of existing VFC datasets, their label source, size, and dependent datasets. The advisory label class includes …

Figure 2: a. D1 - Manually reviewed C/C++: Combines all C/C++ commits from datasets where the original authors report some manual validation of label quality. D2 - D1 + Advisory-based C/C++: Our primary evaluation dataset. D2 extends D1 by including all advisory-mapped C/C++ commits without requiring additional manual verification. D3 - D2 + Automated tooling C/C++: All C/C++ commits, including those label…

Figure 4: Token-level analysis of D2. (a) Stacked ridge plots showing token distributions for six input representations. Each ridge decomposes the total count into changed lines (red), file headers/hunks (gray), context (green), and commit message (blue). Dashed lines mark medians, red vertical lines indicate common sequence limits. (b) Fraction of discarded tokens that are code changes under naive end-truncation v…

Figure 3: t-SNE visualizations of D2 based on CodeBERT [13] embeddings (perplexity = 150, 2,000 iterations, PCA pre-reduction to 50 dimensions). Top row: points colored by the five most frequent source repositories, remaining projects in gray. Bottom row: points colored by commit timestamp with lighter colors representing more recent commits. Left column uses stock CodeBERT, right column uses fine-tuned model on di…

Figure 5: Example VFC from FFmpeg with integrated gradients …

Figure 6: Sensitivity of CodeBERT on the diff representation to temporal split placement. A fixed-size window (20% train, 20% val, 20% test) slides across the chronologically ordered dataset in 5% increments. Dotted markers indicate three non-overlapping windows. All metrics are on a shared [0, 1] scale.

Figure 7: Integrated gradients attribution (absolute values, top-…

Figure 8: t-SNE visualizations of D2 based on fine-tuned CodeBERT [13] embeddings and colored by VFC label. Left column shows all samples from D2, right column shows only test-set samples. Overall, among the fine-tuned models and representations tested we find no evidence that models acquire transferable security-relevant code understanding. Models exploit commit messages when available. When messages are removed, …

Figure 9: Training dynamics across different evaluation settings. Each plot shows the evolution of the F1 on the validation set after each …
original abstract

Automated detection of vulnerability-fixing commits (VFCs) is critical for timely security patch deployment, as advisory databases lag patch releases by a median of 25 days and many fixes never receive advisories. We present a comprehensive evaluation of code language model based VFC detection through a unified framework consolidating over 20 fragmented datasets spanning more than 180000 commits. Across over 180 experiments with fine-tuned models from 125 M to 14 B parameters, we find no evidence that models acquire transferable security-relevant code understanding from code changes alone. When commit messages are available, they dominate model attention, and when removed, an attribution analysis shows that enriching diffs with additional intra-procedural semantic context does not shift model attention toward the code changes. Group-stratified evaluation exposes approximately 17% performance drops compared to random splits, while temporal splits on aggregated datasets prove unreliable due to compositional shift in the underlying project distributions. At a false positive rate of 0.5% all fine-tuned code-only models miss over 93% of vulnerabilities. Larger and more diverse training data or generative approaches show preliminary improvements but do not resolve the underlying limitations. To support future research on code-centric VFC detection, we release our unified framework and evaluation suite.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper consolidates over 20 prior VFC datasets into a unified benchmark of >180k commits and conducts >180 experiments with code LMs (125M–14B parameters). It reports no evidence that models learn transferable security-relevant understanding from code changes alone: commit messages dominate attention when present; enriching diffs with intra-procedural context does not redirect attribution toward the changes; group-stratified splits drop performance ~17% and temporal splits are unreliable due to project-distribution shift. At 0.5% FPR, code-only models miss >93% of vulnerabilities. The authors release the framework and evaluation suite.

Significance. If the negative findings are robust, the work is significant for establishing the practical limits of purely code-centric VFC detection and for supplying a reusable benchmark that future work can build upon. The scale of the experimental sweep across model sizes and the use of attribution analysis are strengths that support the empirical conclusions.

major comments (1)
  1. [Dataset Construction and Evaluation Methodology] The central claim that models acquire no transferable security understanding from code changes rests on the assumption that the consolidated labels (>180k commits from >20 sources) are sufficiently accurate and unbiased proxies for actual vulnerability fixes. No precision/recall figures on a manually verified held-out subset are reported, leaving open the possibility that label noise correlated with commit-message keywords or project identity explains the observed code-only performance and attribution results.
minor comments (1)
  1. [Abstract] The abstract states that temporal splits are 'unreliable due to compositional shift' but does not quantify the magnitude of the shift or provide per-project statistics that would allow readers to assess its severity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and constructive feedback. We address the single major comment below and will revise the manuscript to incorporate additional analysis of label quality.

point-by-point responses
  1. Referee: [Dataset Construction and Evaluation Methodology] The central claim that models acquire no transferable security understanding from code changes rests on the assumption that the consolidated labels (>180k commits from >20 sources) are sufficiently accurate and unbiased proxies for actual vulnerability fixes. No precision/recall figures on a manually verified held-out subset are reported, leaving open the possibility that label noise correlated with commit-message keywords or project identity explains the observed code-only performance and attribution results.

    Authors: We acknowledge this is a valid methodological concern. The unified benchmark aggregates labels directly from more than 20 previously published datasets that have been used as proxies for VFCs in the literature; we did not perform a fresh end-to-end manual verification of all 180k+ commits. To address the referee's point, the revised manuscript will add a new subsection on label quality. We will manually inspect a stratified random sample of 1,000 commits (balanced across sources, labels, and presence/absence of commit messages), report estimated precision and recall, and analyze whether any observed noise correlates with commit-message keywords or project identity. We will also discuss how the consistency of our attribution, message-ablation, and group/temporal-split results across independent source datasets provides supporting evidence that is not fully explained by uniform label noise. We believe these additions will strengthen the presentation of our negative findings without altering the core conclusions. revision: partial
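A sketch of the audit the response promises, with assumed field names (including a hypothetical `manual` field holding the reviewer's judgment): stratify by source, label, and message presence, sample, then estimate label precision and recall.

```python
# Sketch of the proposed label audit; record schema is assumed.
import random
from collections import defaultdict

def stratified_sample(commits, n=1000, seed=0):
    strata = defaultdict(list)
    for c in commits:
        strata[(c["source"], c["label"], bool(c["message"]))].append(c)
    rng = random.Random(seed)
    per_stratum = max(1, n // len(strata))
    picked = []
    for group in strata.values():
        picked.extend(rng.sample(group, min(per_stratum, len(group))))
    return picked[:n]

def label_precision_recall(sample):
    tp = sum(c["label"] == 1 and c["manual"] == 1 for c in sample)
    fp = sum(c["label"] == 1 and c["manual"] == 0 for c in sample)
    fn = sum(c["label"] == 0 and c["manual"] == 1 for c in sample)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```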

Circularity Check

0 steps flagged

No circularity: purely empirical benchmarking with held-out evaluation

full rationale

The paper is an empirical study that consolidates >20 prior datasets into >180k commits and reports model performance across fine-tuning experiments (125M–14B parameters), split variants, and attribution analyses. All central claims (no transferable code-only understanding, message dominance, 17% group-stratified drop, >93% miss rate at 0.5% FPR) are direct outputs of standard train/test procedures on held-out data; no equations, fitted parameters, or self-citation chains are invoked to derive the results. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the quality and representativeness of the aggregated datasets and the assumption that standard fine-tuning and attribution methods reveal true model behavior.

axioms (1)
  • domain assumption: Existing datasets provide accurate labels for vulnerability-fixing commits without significant noise or selection bias.
    The study aggregates over 20 prior datasets and treats their labels as ground truth.

pith-pipeline@v0.9.0 · 5539 in / 1382 out tokens · 62124 ms · 2026-05-14T18:35:52.994809+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

69 extracted references · 48 canonical work pages · 3 internal anchors

  1. [1] Jafar Akhoundali, Sajad Rahim Nouri, Kristian F. D. Rietveld, and Olga Gadyatskaya. 2024. MoreFixes: A Large-Scale Dataset of CVE Fix Commits Mined through Enhanced Repository Discovery. In Proceedings of the 20th International Conference on Predictive Models and Data Analytics in Software Engineering, PROMISE 2024. ACM. https://doi.org/10.1145/3663533.3664036
  2. [2] Daniel Arp, Erwin Quiring, Feargus Pendlebury, Alexander Warnecke, Fabio Pierazzi, Christian Wressnegger, Lorenzo Cavallaro, and Konrad Rieck. 2022. Dos and Don'ts of Machine Learning in Computer Security. In 31st USENIX Security Symposium, USENIX Security 2022, Boston, MA, USA, August 10-12, 2022, Kevin R. B. Butler and Kurt Thomas (Eds.). USENIX Association, 397...
  3. [3] Guru Prasad Bhandari, Amara Naseer, and Leon Moonen. 2021. CVEfixes: automated collection of vulnerabilities and their fixes from open-source software. In PROMISE '21: 17th International Conference on Predictive Models and Data Analytics in Software Engineering,...
  4. [4] Max Brunsfeld. [n.d.]. Tree-sitter. https://github.com/tree-sitter/tree-sitter
  5. [5] Tianyu Chen, Lin Li, Taotao Qian, Jingyi Liu, Wei Yang, Ding Li, Guangtai Liang, Qianxiang Wang, and Tao Xie. 2024. CompVPD: Iteratively Identifying Vulnerability Patches Based on Human Validation Results with a Precise Context. arXiv:2310.02530 [cs.CR] https://arxiv.org/abs/2310.02530
  6. [6] Yizheng Chen, Zhoujie Ding, Lamya Alowain, Xinyun Chen, and David A. Wagner. 2023. DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection. In Proceedings of the 26th International Symposium on Research in Attacks, Intrusions and Defenses, RAID 2023. ACM. https://doi.org/10.1145/3607199.3607242
  7. [7] Theo Chow, Mario D'Onghia, Lorenz Linhardt, Zeliang Kan, Daniel Arp, Lorenzo Cavallaro, and Fabio Pierazzi. 2026. Beyond the TESSERACT: Trustworthy Dataset Curation for Sound Evaluations of Android Malware Classifiers. In Proceedings of the 4th IEEE Conference on Secure and Trustworthy Machine Learning. https://discovery.ucl.ac.uk/id/eprint/10220473/
  8. [9] Vulnerability Detection with Code Language Models: How Far Are We? CoRR abs/2403.18624 (2024). https://doi.org/10.48550/ARXIV.2403.18624 arXiv:2403.18624
  9. [10] Yangruibo Ding, Yanjun Fu, Omniyyah Ibrahim, Chawin Sitawarin, Xinyun Chen, Basel Alomair, David A. Wagner, Baishakhi Ray, and Yizheng Chen.
  10. [11] Vulnerability Detection with Code Language Models: How Far are We?. In 47th IEEE/ACM International Conference on Software Engineering, ICSE 2025, Ottawa, ON, Canada, April 26 - May 6, 2025. IEEE, 1729–1741. https://doi.org/10.1109/ICSE55347.2025.00038
  11. [12] Trevor Dunlap, Elizabeth Lin, William Enck, and Bradley Reaves. 2024. VFCFinder: Pairing Security Advisories and Patches. In Proceedings of the 19th ACM Asia Conference on Computer and Communications Security, ASIA CCS 2024, Singapore, July 1-5, 2024, Jianying Zhou, Tony Q. S. Quek, Debin Gao, and Alvaro A. Cárdenas (Eds.). ACM. https://doi.org/10.1145/3634737.3657007
  12. [13] Jonathan Evertz, Niklas Risse, Nicolai Neuer, Andreas Müller, Philipp Normann, Gaetano Sapia, Srishti Gupta, David Pape, Soumya Shaw, Devansh Srivastav, Christian Wressnegger, Erwin Quiring, Thorsten Eisenhofer, Daniel Arp, and Lea Schönherr. 2025. Chasing Shadows: Pitfalls in LLM Security Research. CoRR abs/2512.09549 (2025). https://doi.org/10.48550/ARXI...
  13. [14] Jean-Rémy Falleri and Matias Martinez. 2024. Fine-grained, accurate and scalable source differencing. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering, ICSE 2024, Lisbon, Portugal, April 14-20,
  14. [15] ACM, 231:1–231:12. https://doi.org/10.1145/3597503.3639148
  15. [16] Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020 (Findings of ACL, Vol. EMNLP 2020), Trevor Cohn, Yul...
  16. [17] GLM. 2025. GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models. CoRR abs/2508.06471 (2025). https://doi.org/10.48550/ARXIV.2508.06471 arXiv:2508.06471
  17. [18] Daya Guo, Shuai Lu, Nan Duan, Yanlin Wang, Ming Zhou, and Jian Yin. 2022. UniXcoder: Unified Cross-Modal Pre-training for Code Representation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (...
  18. [19] Jingxuan He and Martin T. Vechev. 2023. Large Language Models for Code: Security Hardening and Adversarial Testing. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, CCS 2023, Copenhagen, Denmark, November 26-30, 2023. ACM. https://doi.org/10.1145/3576915.3623175
  19. [20] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net. https://openreview.net/forum?id=nZeVKeeFYf9
  20. [21] Nasif Imtiaz, Aniqa Khanom, and Laurie A. Williams. 2023. Open or Sneaky? Fast or Slow? Light or Heavy?: Investigating Security Releases of Open Source Packages. IEEE Trans. Software Eng. 49, 4 (2023), 1540–1560. https://doi.org/10.1109/TSE.2022.3181010
  21. [22] Zeliang Kan, Shae McFadden, Daniel Arp, Feargus Pendlebury, Roberto Jordaney, Johannes Kinder, Fabio Pierazzi, and Lorenzo Cavallaro. 2024. TESSERACT: Eliminating Experimental Bias in Malware Classification across Space and Time (Extended Version). CoRR abs/2402.01359 (2024). https://doi.org/10.48550/ARXIV.2402.01359 arXiv:2402.01359
  22. [23] Jian Yi David Lee and Hai Leong Chieu. 2021. Co-training for Commit Classification. In Proceedings of the Seventh Workshop on Noisy User-generated Text, W-NUT 2021, Online, November 11, 2021, Wei Xu, Alan Ritter, Tim Baldwin, and Afshin Rahimi (Eds.). Association for Computational Linguistics, 389–395. https://doi.org/10.18653/V1/2021.WNUT-1.43
  23. [24] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han.
  24. [25] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. CoRR abs/2306.00978 (2023). https://doi.org/10.48550/ARXIV.2306.00978 arXiv:2306.00978
  25. [26] Shangqing Liu, Yanzhou Li, and Yang Liu. 2022. CommitBART: A Large Pre-trained Model for GitHub Commits. CoRR abs/2208.08100 (2022). https://doi.org/10.48550/ARXIV.2208.08100 arXiv:2208.08100
  26. [27] Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net. https://openreview.net/forum?id=Bkg6RiCqY7
  27. [28] Chaomeng Lu, Tianyu Li, Toon Dehaene, and Bert Lagaisse. 2025. ICVul: A Well-labeled C/C++ Vulnerability Dataset with Comprehensive Metadata and VCCs. In 22nd IEEE/ACM International Conference on Mining Software Repositories, MSR@ICSE 2025, Ottawa, ON, Canada, April 28-29, 2025. IEEE, 154–158. https://doi.org/10.1109/MSR66628.2025.00034
  28. [29] Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin B. Clement, Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, and Shujie Liu. 2021. CodeXGLUE: A Machine Learning Benchmark Dataset for Code Unde...
  29. [30] Changhua Luo, Wei Meng, and Shuai Wang. 2024. Strengthening Supply Chain Security with Fine-grained Safe Patch Identification. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering, ICSE 2024, Lisbon, Portugal, April 14-20, 2024. ACM, 89:1–89:12. https://doi.org/10.1145/3597503.3639104
  30. [31] Giang Nguyen-Truong, Hong Jin Kang, David Lo, Abhishek Sharma, Andrew E. Santosa, Asankhaya Sharma, and Ming Yi Ang. 2022. HERMES: Using Commit-Issue Linking to Detect Vulnerability-Fixing Commits. In IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2022, Honolulu, HI, USA, March 15-18, 2022. IEEE, 51–62. https://doi....
  31. [32] Chao Ni, Liyu Shen, Xiaohu Yang, Yan Zhu, and Shaohua Wang. 2024. MegaVul: A C/C++ Vulnerability Dataset with Comprehensive Code Representations. In 21st IEEE/ACM International Conference on Mining Software Repositories, MSR 2024, Lisbon, Portugal, April 15-16, 2024, Diomidis Spinellis, Alberto Bacchelli, and Eleni Constantinou (Eds.). ACM, 738–742. https...
  32. [33] Georgios Nikitopoulos, Konstantina Dritsa, Panos Louridas, and Dimitris Mitropoulos. 2021. CrossVul: a cross-language vulnerability dataset with commit data. In ESEC/FSE '21: 29th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ACM. https://doi.org/10.1145/3468264.3473122
  33. [34] NIST. [n.d.]. National Vulnerability Database. https://nvd.nist.gov/ accessed 2025-09
  34. [35] Serena Elisa Ponta, Henrik Plate, Antonino Sabetta, Michele Bezzi, and Cédric Dangremont. 2019. A manually-curated dataset of fixes to vulnerabilities of open-source software. In Proceedings of the 16th International Conference on Mining Software Repositories, MSR 2019. IEEE / ACM. https://doi.org/10.1109/MSR.2019.00064
  35. [36] Sofia Reis and Rui Abreu. 2017. SECBENCH: A Database of Real Security Vulnerabilities. In Proceedings of the International Workshop on Secure Software Engineering in DevOps and Agile Development co-located with the 22nd European Symposium on Research in Computer Security (ESORICS 2017) (CEUR Workshop Proceedings, Vol. 1977). CEUR-WS.org. https://ceur-ws.org...
  36. [37] Sofia Reis and Rui Abreu. 2021. A ground-truth dataset of real security patches. CoRR abs/2110.09635 (2021). arXiv:2110.09635 https://arxiv.org/abs/2110.09635
  37. [38] Niklas Risse and Marcel Böhme. 2024. Uncovering the Limits of Machine Learning for Automatic Vulnerability Detection. In 33rd USENIX Security Symposium, USENIX Security 2024. USENIX Association. https://www.usenix.org/conference/usenixsecurity24/presentation/risse
  38. [39] Niklas Risse, Jing Liu, and Marcel Böhme. 2025. Top Score on the Wrong Exam: On Benchmarking in Machine Learning for Vulnerability Detection. Proc. ACM Softw. Eng. 2, ISSTA (2025), 388–410. https://doi.org/10.1145/3728887
  39. [40] Antonino Sabetta and Michele Bezzi. 2018. A Practical Approach to the Automatic Classification of Security-Relevant Commits. In 2018 IEEE International Conference on Software Maintenance and Evolution, ICSME 2018, Madrid, Spain, September 23-29, 2018. IEEE Computer Socie...
  40. [41] Zayne Rea Sprague, Fangcong Yin, Juan Diego Rodriguez, Dongwei Jiang, Manya Wadhwa, Prasann Singhal, Xinyu Zhao, Xi Ye, Kyle Mahowald, and Greg Durrett.
  41. [42] To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net. https://openreview.net/forum?id=w6nlcS8Kkn
  42. [43] Benjamin Steenhoek, Md Mahbubur Rahman, Richard Jiles, and Wei Le. 2023. An Empirical Study of Deep Learning Models for Vulnerability Detection. In 45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, Melbourne, Australia, May 14-20, 2023. IEEE, 2237–2248. https://doi.org/10.1109/ICSE48619.2023.00188
  43. [44] Jiamou Sun, Zhenchang Xing, Qinghua Lu, Xiwei Xu, Liming Zhu, Thong Hoang, and Dehai Zhao. 2023. Silent Vulnerable Dependency Alert Prediction with Vulnerability Key Aspect Explanation. In 45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, Melbourne, Australia, May 14-20, 2023. IEEE, 970–982. https://doi.org/10.1109/ICSE48619.2023.00089
  44. [45] Shiyu Sun, Shu Wang, Xinda Wang, Yunlong Xing, Elisa Zhang, and Kun Sun. 2023. Exploring Security Commits in Python. In IEEE International Conference on Software Maintenance and Evolution, ICSME 2023. IEEE. https://doi.org/10.1109/ICSME58846.2023.00027
  45. [46] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. 2017. Axiomatic Attribution for Deep Networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017 (Proceedings of Machine Learning Research), Doina Precup and Yee Whye Teh (Eds.). PMLR, 3319–3328. http://proceedings.mlr.press/v70/sundararajan17a.html
  46. [47] Xunzhu Tang, Zhenghan Chen, Saad Ezzini, Haoye Tian, Yewei Song, Jacques Klein, and Tegawendé F. Bissyandé. 2023. Multilevel Semantic Embedding of Software Patches: A Fine-to-Coarse Grained Approach Towards Security Patch Detection. CoRR abs/2308.15233 (2023). https://doi.org/10.48550/ARXIV.2308.15233 arXiv:2308.15233
  47. [48] Xunzhu Tang, Kisub Kim, Saad Ezzini, Yewei Song, Haoye Tian, Jacques Klein, and Tegawendé Bissyandé. 2025. Just-in-Time Detection of Silent Security Patches. ACM Trans. Softw. Eng. Methodol. (July 2025). https://doi.org/10.1145/3749370 Just Accepted
  48. [49] VFCDetective Artifact 2026. https://doi.org/10.5281/zenodo.19250701 Zenodo artifact archive
  49. [50] Shu Wang, Xinda Wang, Kun Sun, Sushil Jajodia, Haining Wang, and Qi Li. 2023. GraphSPD: Graph-Based Security Patch Detection with Enriched Code Semantics. In 2023 IEEE Symposium on Security and Privacy (SP). IEEE Computer Society, Los Alamitos, CA, USA, 2409–2426. https://doi.org/10.1109/SP46215.2023.00035
  50. [51] Shichao Wang, Yun Zhang, Lingfeng Bao, Xin Xia, and Minghui Wu. 2022. VCMatch: A Ranking-based Approach for Automatic Security Patches Localization for OSS Vulnerabilities. In IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2022, Honolulu, HI, USA, March 15-18,
  51. [52] IEEE, 589–600. https://doi.org/10.1109/SANER53432.2022.00076
  52. [53] Xinda Wang, Shu Wang, Pengbin Feng, Kun Sun, and Sushil Jajodia. 2021. PatchDB: A Large-Scale Security Patch Dataset. In 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2021. IEEE. https://doi.org/10.1109/DSN48987.2021.00030
  53. [54] Xinda Wang, Shu Wang, Pengbin Feng, Kun Sun, Sushil Jajodia, Sanae Benchaaboun, and Frank Geck. 2021. PatchRNN: A Deep Learning-Based System for Security Patch Identification. In 2021 IEEE Military Communications Conference, MILCOM 2021, San Diego, CA, USA, November 29 - Dec. 2, 2021. IEEE, 595–600. https://doi.org/10.1109/MILCOM52596.2021.9652940
  54. [55] Yue Wang, Weishi Wang, Shafiq R. Joty, and Steven C. H. Hoi. 2021. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. In EMNLP. Association for Computational Linguistics, 8696–8708
  55. [56] Laura Wartschinski, Yannic Noller, Thomas Vogel, Timo Kehrer, and Lars Grunske.
  56. [57] VUDENC: Vulnerability Detection with Deep Learning on a Natural Codebase for Python. Inf. Softw. Technol. 144 (2022). https://doi.org/10.1016/J.INFSOF.2021.106809
  57. [58] Xin-Cheng Wen, Zirui Lin, Cuiyun Gao, Hongyu Zhang, Yong Wang, and Qing Liao. 2024. Repository-Level Graph Representation Learning for Enhanced Security Patch Detection. arXiv:2412.08068 [cs.SE] https://arxiv.org/abs/2412.08068
  58. [59] Congying Xu, Bihuan Chen, Chenhao Lu, Kaifeng Huang, Xin Peng, and Yang Liu.
  59. [60] Tracking patches for open source software vulnerabilities. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2022, Singapore, Singapore, November 14-18, 2022, Abhik Roychoudhury, Cristian Cadar, and Miryung Kim (Eds.). ACM, 860–871. https://doi.org/10.1145/3540250.3549125
  60. [63] Fabian Yamaguchi, Nico Golde, Daniel Arp, and Konrad Rieck. 2014. Modeling and Discovering Vulnerabilities with Code Property Graphs. In 2014 IEEE Symposium on Security and Privacy, SP 2014, Berkeley, CA, USA, May 18-21,
  61. [64] IEEE Computer Society, 590–604. https://doi.org/10.1109/SP.2014.44
  62. [65] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu Xia, Xingzha...
  63. [66] Xu Yang, Wenhan Zhu, Michael Pacheco, Jiayuan Zhou, Shaowei Wang, Xing Hu, and Kui Liu. 2025. Code Change Intention, Development Artifact, and History Vulnerability: Putting Them Together for Vulnerability Fix Detection by LLM. Proc. ACM Softw. Eng. 2, FSE (2025), 489–510. https://doi.org/10.1145/3715738
  64. [67] Jiayuan Zhou, Michael Pacheco, Jinfu Chen, Xing Hu, Xin Xia, David Lo, and Ahmed E. Hassan. 2023. CoLeFunDa: Explainable Silent Vulnerability Fix Identification. In 45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, Melbourne, Australia, May 14-20, 2023. IEEE, 2565–
  65. [68] https://doi.org/10.1109/ICSE48619.2023.00214
  66. [69] Jiayuan Zhou, Michael Pacheco, Zhiyuan Wan, Xin Xia, David Lo, Yuan Wang, and Ahmed E. Hassan. 2021. Finding A Needle in a Haystack: Automated Mining of Silent Vulnerability Fixes. In 36th IEEE/ACM International Conference on Automated Software Engineering, ASE 2021, Melbourne, Australia, November 15-19, 2021. IEEE, 705–716. https://doi.org/10.1109/ASE5152...
  67. [70] Shuyan Zhou, Uri Alon, Sumit Agarwal, and Graham Neubig. 2023. CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Lin...
  68. [71] Yaqin Zhou, Shangqing Liu, Jing Kai Siow, Xiaoning Du, and Yang Liu. 2019. Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019. https://proceedings.neurips...
  69. [72] Yaqin Zhou, Jing Kai Siow, Chenyu Wang, Shangqing Liu, and Yang Liu. 2022. SPI: Automated Identification of Security Patches via Commits. ACM Trans. Softw. Eng. Methodol. 31, 1 (2022), 13:1–13:27. https://doi.org/10.1145/3468854