pith · machine review for the scientific record

arxiv: 2604.25599 · v1 · submitted 2026-04-28 · 💻 cs.SE · cs.LG


PLMGH: What Matters in PLM-GNN Hybrids for Code Classification and Vulnerability Detection


Pith reviewed 2026-05-07 16:01 UTC · model grok-4.3

classification: 💻 cs.SE · cs.LG
keywords: pretrained language models · graph neural networks · code classification · vulnerability detection · hybrid models · empirical study · model design

The pith

PLM choice affects hybrid performance more than GNN backbone for code classification and vulnerability detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper runs a controlled comparison by pairing each of three code-specialized pretrained language models with each of three graph neural network architectures, then tests the resulting hybrids against pure-PLM and pure-GNN baselines on Java250 for classification and Devign for vulnerability detection, including an obfuscated-identifier variant. Hybrids beat the GNN-only baselines across both tasks and often improve ranking metrics over frozen PLMs alone. On Devign, the results prove more sensitive to which PLM supplies the initial features than to which GNN processes the graph; larger PLMs are not reliably better feature sources; and overall, the PLM selection exerts greater influence than the GNN selection. These observations are distilled into concrete guidelines for choosing components when constructing PLM-GNN hybrids for code tasks.
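
To make the design concrete, the comparison grid described above can be sketched in a few lines of Python. The component names are placeholders, since this page does not name the three PLMs or GNNs, and the assumption that every configuration also runs on the obfuscated Devign variant is ours.

```python
# Hypothetical sketch of the study's 3x3 grid plus baselines; names are
# placeholders, not the paper's actual components.
from itertools import product

plms = ["plm_a", "plm_b", "plm_c"]
gnns = ["gnn_x", "gnn_y", "gnn_z"]
datasets = ["Java250", "Devign", "Devign-obfuscated"]

configs = ([("hybrid", p, g) for p, g in product(plms, gnns)]  # 9 hybrids
           + [("plm_only", p, None) for p in plms]             # 3 PLM baselines
           + [("gnn_only", None, g) for g in gnns])            # 3 GNN baselines
runs = [(config, ds) for config in configs for ds in datasets]
print(len(runs))  # 15 configurations x 3 evaluation settings = 45 runs
```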

Core claim

Across both code classification and vulnerability detection tasks, PLM-GNN hybrids consistently outperform GNN-only baselines and often improve ranking quality over frozen PLMs. On Devign, performance and robustness are more sensitive to the PLM feature source than to the GNN backbone. Larger PLMs are not necessarily better feature extractors in this pipeline, and the PLM choice has more impact than the GNN choice.

What carries the argument

The controlled empirical pipeline that feeds PLM-derived embeddings as node features into a GNN for joint semantic and structural processing of code graphs.
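
As one concrete rendering of that pipeline, here is a minimal sketch in PyTorch, PyTorch Geometric, and Hugging Face Transformers: a frozen code PLM embeds each node's token span, and a GNN propagates those embeddings over the program graph. The CodeBERT checkpoint, GCN backbone, first-token pooling, and mean readout are illustrative assumptions; the paper's exact fusion mechanism is not described on this page.

```python
# Minimal PLM->GNN hybrid sketch under the assumptions stated above.
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer
from torch_geometric.nn import GCNConv, global_mean_pool

class PLMGNNHybrid(nn.Module):
    def __init__(self, plm_name="microsoft/codebert-base", hidden=256, num_classes=2):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(plm_name)
        self.plm = AutoModel.from_pretrained(plm_name)
        self.plm.requires_grad_(False)  # frozen PLM, as in the study
        self.conv1 = GCNConv(self.plm.config.hidden_size, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.head = nn.Linear(hidden, num_classes)

    @torch.no_grad()
    def embed_nodes(self, node_snippets):
        # One embedding per graph node, taken from the PLM's first-token output.
        enc = self.tokenizer(node_snippets, padding=True, truncation=True,
                             return_tensors="pt")
        return self.plm(**enc).last_hidden_state[:, 0]

    def forward(self, node_snippets, edge_index, batch):
        x = self.embed_nodes(node_snippets)           # semantic node features
        x = torch.relu(self.conv1(x, edge_index))     # structural message passing
        x = torch.relu(self.conv2(x, edge_index))
        return self.head(global_mean_pool(x, batch))  # graph-level prediction
```

One consequence of this layering is that swapping the PLM touches only the node-embedding step, which is what makes the paper's controlled component swaps tractable.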

If this is right

  • Hybrids should be preferred over GNN-only models for code classification and vulnerability detection.
  • Model builders should allocate more effort to selecting and testing PLMs than to selecting GNN backbones.
  • Larger PLMs cannot be assumed to provide superior features for downstream GNN stages.
  • The derived guidelines can be used to narrow design choices before full training runs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar PLM-dominant sensitivity may appear in other code-graph tasks such as clone detection or bug localization.
  • Repeating the study with fine-tuned rather than frozen PLMs could test whether the current ranking of feature quality persists after adaptation.
  • The finding that size does not predict feature quality suggests efficiency gains from trying smaller PLMs first in hybrid pipelines.

Load-bearing premise

That the three chosen PLMs, three GNN architectures, and two datasets are representative enough to yield general design guidelines for PLM-GNN hybrids in code tasks.

What would settle it

A follow-up experiment on a third code dataset would settle it: if swapping the GNN backbone changes performance and robustness rankings more than swapping the PLM does, the claim that the PLM source dominates is falsified.
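
One hedged way to operationalize that test in Python: measure how far scores move when one component is swapped while the other is held fixed. The scores below are made-up placeholders, and mean score spread is our stand-in for the paper's performance and robustness rankings, not a metric the paper defines.

```python
# Hypothetical sensitivity check; all numbers are placeholders.
from statistics import mean

plms = ["plm_a", "plm_b", "plm_c"]
gnns = ["gnn_x", "gnn_y", "gnn_z"]
scores = {("plm_a", "gnn_x"): 0.61, ("plm_a", "gnn_y"): 0.63, ("plm_a", "gnn_z"): 0.60,
          ("plm_b", "gnn_x"): 0.66, ("plm_b", "gnn_y"): 0.68, ("plm_b", "gnn_z"): 0.65,
          ("plm_c", "gnn_x"): 0.58, ("plm_c", "gnn_y"): 0.59, ("plm_c", "gnn_z"): 0.57}

def swap_sensitivity(varied, fixed, score):
    """Mean score spread as `varied` changes while each member of `fixed` is held."""
    return mean(max(score(v, f) for v in varied) - min(score(v, f) for v in varied)
                for f in fixed)

plm_sens = swap_sensitivity(plms, gnns, lambda p, g: scores[(p, g)])
gnn_sens = swap_sensitivity(gnns, plms, lambda g, p: scores[(p, g)])
# The paper's claim predicts plm_sens > gnn_sens (these placeholders give
# roughly 0.083 vs 0.027); the reverse outcome would falsify PLM dominance.
print(f"PLM-swap sensitivity: {plm_sens:.3f}, GNN-swap sensitivity: {gnn_sens:.3f}")
```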

Figures

Figures reproduced from arXiv: 2604.25599 by Edward Zulkoski, Mohamed Taoufik Kaouthar El Idrissi, Mohammad Hamdaqa.

Figure 1. Overview of the PLM→GNN approach.
Original abstract

Code understanding models increasingly rely on pretrained language models (PLMs) and graph neural networks (GNNs), which capture complementary semantic and structural information. We conduct a controlled empirical study of PLM-GNN hybrids for code classification and vulnerability detection tasks by systematically pairing three code-specialized PLMs with three foundational GNN architectures. We compare these hybrids against PLM-only and GNN-only baselines on Java250 and Devign, including an identifier-obfuscation setting. Across both tasks, hybrids consistently outperform GNN-only baselines and often improve ranking quality over frozen PLMs. On Devign, performance and robustness are more sensitive to the PLM feature source than to the GNN backbone. We also find that larger PLMs are not necessarily better feature extractors in this pipeline, and that the PLM choice has more impact than the GNN choice. Finally, we distill these findings into practical guidelines for PLM-GNN design choices in code classification and vulnerability detection.
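
The identifier-obfuscation setting mentioned in the abstract probes whether models rely on variable names rather than program structure. Below is a minimal Python sketch of one such transformation, consistently renaming Java-like identifiers to opaque tokens; the paper's actual procedure (tooling, scope, treatment of type and method names) is not given here, so this is an illustrative assumption.

```python
# Naive identifier obfuscation for Java-like code; a real pipeline would use a
# parser rather than a regex, and would handle strings, comments, and API names.
import re

JAVA_KEYWORDS = {"public", "private", "static", "void", "int", "return",
                 "if", "else", "for", "while", "class", "new", "boolean"}

def obfuscate_identifiers(code: str) -> str:
    mapping = {}
    def rename(match):
        name = match.group(0)
        if name in JAVA_KEYWORDS:
            return name
        mapping.setdefault(name, f"v{len(mapping)}")  # stable per-name token
        return mapping[name]
    return re.sub(r"\b[A-Za-z_][A-Za-z0-9_]*\b", rename, code)

print(obfuscate_identifiers("int total = count + offset; return total;"))
# -> int v0 = v1 + v2; return v0;
```

A model whose score drops sharply between the clean and obfuscated variants is leaning on names; a stable score suggests it is using structure.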

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript conducts a controlled empirical study of PLM-GNN hybrid models for code classification (Java250) and vulnerability detection (Devign). It pairs three code-specialized PLMs with three GNN architectures, compares hybrids to PLM-only and GNN-only baselines (including identifier-obfuscated settings), and reports that hybrids consistently outperform GNN-only baselines and often improve over frozen PLMs, with greater sensitivity to the PLM feature source than to the GNN backbone on Devign. Larger PLMs are not always better, PLM choice matters more than GNN choice, and the findings are distilled into practical guidelines for design choices.

Significance. If the empirical findings hold under rigorous controls, the work offers actionable insights for practitioners and researchers designing hybrid models for code understanding tasks. It highlights the complementary nature of PLM semantic features and GNN structural information, and provides evidence that design choices should prioritize PLM selection. The obfuscation experiments add value by testing robustness. However, the significance is tempered by the narrow scope of models and datasets, which may not support broad guidelines without further validation.

major comments (2)
  1. [Experimental Setup] The abstract reports consistent outperformance and sensitivity findings but provides no details on statistical tests, hyperparameter controls, or exact training protocols. Without these, the central empirical claims (e.g., hybrids outperforming baselines and PLM source mattering more than GNN backbone) cannot be verified or reproduced.
  2. [Results and Discussion] The practical guidelines distilled from the results rest on patterns observed with only three PLMs, three GNNs, and two datasets. The claim that performance and robustness are more sensitive to PLM feature source than GNN backbone (and that larger PLMs are not necessarily better) may be an artifact of the specific fusion mechanism and chosen models rather than a general principle; no ablations on additional architectures or datasets are described to test robustness.
minor comments (2)
  1. [Abstract] The abstract could explicitly name the three PLMs and three GNN architectures to allow readers to immediately assess the scope of the study.
  2. [Tables] Tables reporting performance metrics should include standard deviations across multiple runs or confidence intervals to substantiate claims of 'consistent' outperformance.
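
For minor comment 2, the kind of reporting that would substantiate 'consistent' outperformance (together with the paired t-test the rebuttal below cites) might look like this Python sketch; the per-seed scores are placeholders, not the paper's results.

```python
# Hypothetical per-seed scores; real values would come from repeated runs.
import numpy as np
from scipy import stats

hybrid = np.array([0.672, 0.668, 0.681, 0.675, 0.670])
gnn_only = np.array([0.641, 0.649, 0.638, 0.645, 0.643])

def summarize(x, label):
    lo, hi = stats.t.interval(0.95, len(x) - 1, loc=x.mean(), scale=stats.sem(x))
    print(f"{label}: {x.mean():.3f} ± {x.std(ddof=1):.3f} (95% CI {lo:.3f}-{hi:.3f})")

summarize(hybrid, "hybrid")
summarize(gnn_only, "GNN-only")
t, p = stats.ttest_rel(hybrid, gnn_only)  # paired across matched seeds
print(f"paired t-test: t={t:.2f}, p={p:.4f}")
```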

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. We appreciate the acknowledgment of the value in our controlled experiments, the comparison to baselines, and the inclusion of identifier-obfuscation settings. We address each major comment below, indicating the revisions we will make to improve clarity, reproducibility, and appropriate qualification of our findings.

Point-by-point responses
  1. Referee: [Experimental Setup] The abstract reports consistent outperformance and sensitivity findings but provides no details on statistical tests, hyperparameter controls, or exact training protocols. Without these, the central empirical claims (e.g., hybrids outperforming baselines and PLM source mattering more than GNN backbone) cannot be verified or reproduced.

    Authors: We agree that the abstract would benefit from a concise summary of key experimental controls to support immediate verification of the claims. The full manuscript (Section 4) already specifies the training protocols, hyperparameter ranges, grid-search procedure, early-stopping criteria, and evaluation metrics, along with the use of paired t-tests (p < 0.05) for significance. To address the concern directly, we will revise the abstract to include a brief clause noting the consistent hyperparameter controls, statistical testing, and availability of reproducibility artifacts. This change will make the central claims more readily verifiable without altering the manuscript's length or focus. revision: yes

  2. Referee: [Results and Discussion] The practical guidelines distilled from the results rest on patterns observed with only three PLMs, three GNNs, and two datasets. The claim that performance and robustness are more sensitive to PLM feature source than GNN backbone (and that larger PLMs are not necessarily better) may be an artifact of the specific fusion mechanism and chosen models rather than a general principle; no ablations on additional architectures or datasets are described to test robustness.

    Authors: We acknowledge that the study scope is limited to the three PLMs, three GNN architectures, two datasets, and the chosen fusion mechanism, as already stated in the manuscript. The guidelines are presented as empirical observations from this controlled setting rather than universal principles. We will revise the Results and Discussion sections to more explicitly qualify the sensitivity findings and guidelines as specific to the examined configurations. We will also add a dedicated Limitations and Threats to Validity subsection that discusses the narrow model and dataset selection and recommends future validation on additional architectures and tasks. These textual revisions will better contextualize the claims while preserving the contribution of the systematic comparisons and obfuscation experiments. revision: yes

Circularity Check

0 steps flagged

Purely empirical study with no derivation chain or self-referential predictions

Full rationale

The paper performs controlled experiments pairing three PLMs with three GNNs, evaluates hybrids vs. baselines on Java250 and Devign (including obfuscation), and summarizes observed patterns into guidelines. No equations, fitted parameters, uniqueness theorems, or ansatzes are present; claims rest on direct performance measurements rather than any reduction to inputs by construction. Self-citations, if any, are not load-bearing for a central premise. This matches the default non-circular case for empirical comparison papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claims rest on the assumption that the selected PLMs, GNNs, and benchmarks are representative; no free parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5482 in / 1109 out tokens · 26726 ms · 2026-05-07T16:01:31.546448+00:00 · methodology


Reference graph

Works this paper leans on

41 extracted references · 22 canonical work pages · 9 internal anchors

  1. Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021. Unified pre-training for program understanding and generation. arXiv preprint arXiv:2103.06333 (2021).
  2. Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. 2019. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
  3. Miltiadis Allamanis, Earl T Barr, Premkumar Devanbu, and Charles Sutton. 2018. A survey of machine learning for big code and naturalness. ACM Computing Surveys (CSUR) 51, 4 (2018), 1–37.
  4. Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud Khademi. 2017. Learning to represent programs with graphs. arXiv preprint arXiv:1711.00740 (2017).
  5. Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud Khademi. 2018. Learning to represent programs with graphs. In International Conference on Learning Representations.
  6. Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research 3, Feb (2003), 1137–1155.
  7. Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021).
  8. Junyan Cheng, Iordanis Fostiropoulos, and Barry Boehm. 2021. GN-Transformer: Fusing sequence and graph representation for improved code summarization. arXiv preprint arXiv:2111.08874 (2021).
  9. Vijay Prakash Dwivedi, Chaitanya K Joshi, Anh Tuan Luu, Thomas Laurent, Yoshua Bengio, and Xavier Bresson. 2023. Benchmarking graph neural networks. Journal of Machine Learning Research 24, 43 (2023), 1–48.
  10. Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. 2020. CodeBERT: A pre-trained model for programming and natural languages. arXiv preprint arXiv:2002.08155 (2020).
  11. Daya Guo, Shuai Lu, Nan Duan, Yanlin Wang, Ming Zhou, and Jian Yin. 2022. UniXcoder: Unified cross-modal pre-training for code representation. arXiv preprint arXiv:2203.03850 (2022).
  12. Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, et al. 2020. GraphCodeBERT: Pre-training code representations with data flow. arXiv preprint arXiv:2009.08366 (2020).
  13. Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yu Wu, YK Li, et al. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming – The Rise of Code Intelligence. arXiv preprint arXiv:2401.14196 (2024).
  14. Abram Hindle, Earl T Barr, Mark Gabel, Zhendong Su, and Premkumar Devanbu. 2016. On the naturalness of software. Commun. ACM 59, 5 (2016), 122–131.
  15. Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. 2024. Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186 (2024).
  16. Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
  17. Alexander LeClair, Sakib Haque, Lingfei Wu, and Collin McMillan. 2020. Improved code summarization via a graph neural network. In Proceedings of the 28th International Conference on Program Comprehension. 184–195.
  18. Ruitong Liu, Yanbin Wang, Haitao Xu, Jianguo Sun, Fan Zhang, Peiyue Li, and Zhenhao Guo. 2025. Vul-LMGNNs: Fusing language models and online-distilled graph neural networks for code vulnerability detection. Information Fusion 115 (2025), 102748.
  19. Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, et al. 2024. StarCoder 2 and The Stack v2: The next generation. arXiv preprint arXiv:2402.19173 (2024).
  20. Van-Anh Nguyen, Dai Quoc Nguyen, Van Nguyen, Trung Le, Quan Hung Tran, and Dinh Phung. 2022. ReGVD: Revisiting graph neural networks for vulnerability detection. In Proceedings of the ACM/IEEE 44th International Conference on Software Engineering: Companion Proceedings. 178–182.
  21. José Carlos Paiva, José Paulo Leal, and Álvaro Figueira. 2024. Comparing semantic graph representations of source code: the case of automatic feedback on programming assignments. Computer Science and Information Systems 21, 1 (2024), 117–142.
  22. Ruchir Puri, David S Kung, Geert Janssen, Wei Zhang, Giacomo Domeniconi, Vladimir Zolotov, Julian Dolby, Jie Chen, Mihir Choudhury, Lindsey Decker, et al. 2021. CodeNet: A large-scale AI for code dataset for learning a diversity of coding tasks. arXiv preprint arXiv:2105.12655 (2021).
  23. Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. 2023. Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023).
  24. Yunsheng Shi, Zhengjie Huang, Shikun Feng, Hui Zhong, Wenjin Wang, and Yu Sun. 2020. Masked label prediction: Unified message passing model for semi-supervised classification. arXiv preprint arXiv:2009.03509 (2020).
  25. Ze Tang, Xiaoyu Shen, Chuanyi Li, Jidong Ge, Liguo Huang, Zhelin Zhu, and Bin Luo. 2022. AST-Trans: Code summarization with efficient tree-structured attention. In Proceedings of the 44th International Conference on Software Engineering. 150–162.
  26. Hoai-Chau Tran, Anh-Duy Tran, and Kim-Hung Le. 2025. DetectVul: A statement-level code vulnerability detection for Python. Future Generation Computer Systems 163 (2025), 107504. doi:10.1016/j.future.2024.107504
  27. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).
  28. Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903 (2017).
  29. Yue Wang, Weishi Wang, Shafiq Joty, and Steven CH Hoi. 2021. CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. arXiv preprint arXiv:2109.00859 (2021).
  30. Jiayi Wei, Maruth Goyal, Greg Durrett, and Isil Dillig. 2020. LambdaNet: Probabilistic type inference using graph neural networks. arXiv preprint arXiv:2005.02161 (2020).
  31. Hongqiu Wu, Hai Zhao, and Min Zhang. 2020. Code summarization with structure-induced transformer. arXiv preprint arXiv:2012.14710 (2020).
  32. Aidan ZH Yang, Haoye Tian, He Ye, Ruben Martins, and Claire Le Goues. 2024. Security vulnerability detection with multitask self-instructed fine-tuning of large language models. arXiv preprint arXiv:2406.05892 (2024).
  33. Yufan Ye, Pu Pang, Ting Zhang, and Hua Huang. 2025. GNN-Coder: Boosting semantic code retrieval with combined GNNs and Transformer. arXiv preprint arXiv:2502.15202 (2025).
  34. Kechi Zhang, Zhuo Li, Zhi Jin, and Ge Li. 2023. Implant global and local hierarchy information to sequence based code representation models. In 2023 IEEE/ACM 31st International Conference on Program Comprehension (ICPC). IEEE, 157–168.
  35. Kechi Zhang, Wenhan Wang, Huangzhao Zhang, Ge Li, and Zhi Jin. 2022. Learning to represent programs with heterogeneous graphs. In Proceedings of the 30th IEEE/ACM International Conference on Program Comprehension. 378–389.
  36. Yuguo Zhang, Jia Yang, and Ou Ruan. 2024. Cross-language source code clone detection based on graph neural network. In Proceedings of the 2024 3rd International Conference on Cryptography, Network Security and Communication Technology. 189–194.
  37. Yaqin Zhou, Shangqing Liu, Jingkai Siow, Xiaoning Du, and Yang Liu. 2019. Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks. Advances in Neural Information Processing Systems 32 (2019).
  38. Yufan Zhuang, Sahil Suneja, Veronika Thost, Giacomo Domeniconi, Alessandro Morari, and Jim Laredo. 2021. Software vulnerability detection via deep learning over disaggregated code graph representation. arXiv preprint arXiv:2109.03341 (2021).