pith. machine review for the scientific record.

arxiv: 2604.17715 · v1 · submitted 2026-04-20 · 💻 cs.SE · cs.LG

Recognition: unknown

Program Structure-aware Language Models: Targeted Software Testing beyond Textual Semantics


Pith reviewed 2026-05-10 05:03 UTC · model grok-4.3

classification 💻 cs.SE cs.LG
keywords GLMTest · program structure-aware · targeted test case generation · code property graphs · graph neural networks · branch coverage · software testing · large language models

The pith

GLMTest steers language models to specific code branches by feeding them program graphs instead of text prompts alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GLMTest, a framework that combines code property graphs with large language models through a graph neural network. This integration conditions test generation on particular execution branches rather than relying only on textual semantics in prompts. The result is more controllable targeting of high-risk paths that standard LLM approaches often miss. Experiments on real-world projects and the TestGenEval benchmark show branch accuracy rising from 27.4% to 50.2% when GLMTest is built on a Qwen2.5-Coder-7B-Instruct model, outperforming Claude-Sonnet-4.5 and GPT-4o-mini.
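The branch-targeting idea can be made concrete with a toy tracer: different inputs to the same function traverse distinct lines, and a targeted generator aims inputs at a chosen branch rather than hoping a prompt lands there. This is an illustrative sketch only, not the paper's instrumentation; `classify` and `executed_offsets` are invented for the example.

```python
import sys

def classify(x, y):
    if x > 0 and y > 0:
        return "both-positive"   # branch A
    return "other"               # branch B

def executed_offsets(func, *args):
    """Trace which lines of func (as offsets from its def line) run."""
    hits = set()
    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is func.__code__:
            hits.add(frame.f_lineno - func.__code__.co_firstlineno)
        return tracer
    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(None)
    return hits

# Distinct inputs exercise distinct branches; a branch-targeted
# generator would pick inputs for a *chosen* branch deliberately.
branch_a = executed_offsets(classify, 1, 2)
branch_b = executed_offsets(classify, -1, 2)
```

The two trace sets share the condition line but diverge on the return lines, which is exactly the signal a branch-accuracy metric checks.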

Core claim

GLMTest is the first program structure-aware LLM framework for targeted test case generation. It seamlessly integrates code property graphs and code semantics using a graph neural network and a language model to condition test case generation on execution branches. This structured conditioning enables controllable and branch-targeted test case generation, thereby potentially enhancing bug and security risk discovery. On the TestGenEval benchmark, GLMTest built on a Qwen2.5-Coder-7B-Instruct model improves branch accuracy from 27.4% to 50.2% compared with state-of-the-art LLMs.

What carries the argument

Code property graphs processed by a graph neural network that conditions the language model on specific execution branches for test generation.
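As a rough illustration of what "conditions the language model" could mean mechanically, the sketch below embeds a tiny hand-made graph with mean-aggregation message passing, mean-pools it, and concatenates the result onto a toy textual embedding. The node features, graph, and dimensions are all invented; the paper's actual GNN architecture and fusion details are not reproduced here.

```python
def mean_aggregate(features, neighbours):
    """One message-passing round: each node becomes the mean of
    itself and its neighbours."""
    new = []
    for i, feat in enumerate(features):
        group = [feat] + [features[j] for j in neighbours[i]]
        new.append([sum(col) / len(group) for col in zip(*group)])
    return new

def graph_embedding(features, neighbours, rounds=2):
    """A few rounds of message passing, then mean-pool node
    features into a single structural vector."""
    for _ in range(rounds):
        features = mean_aggregate(features, neighbours)
    return [sum(col) / len(features) for col in zip(*features)]

# Hypothetical 4-node code property graph for a target branch:
# entry -> condition -> {then, else}
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
adj = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
structural = graph_embedding(feats, adj)

# Toy textual embedding of the prompt; structural conditioning
# concatenates the two before they reach the language model.
textual = [0.2, 0.8, 0.4]
conditioned = textual + structural
```

The point of the sketch is the shape of the interface: the graph side produces a fixed-size vector that rides alongside the prompt embedding, so the decoder can be steered by structure the text alone does not expose.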

If this is right

  • Test generation becomes steerable toward chosen high-risk branches instead of random or prompt-driven coverage.
  • The same integration of graphs and language models can be applied to other code models beyond the Qwen2.5 base used in the experiments.
  • Branch-targeted generation offers a direct lever for improving discovery of security-relevant paths.
  • The framework works on real-world projects, showing practical applicability beyond synthetic benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If branch accuracy gains persist across larger codebases, the approach could reduce the manual effort needed to reach critical execution paths in testing workflows.
  • The same graph-conditioning mechanism might extend to related tasks such as automated program repair or vulnerability patching.
  • Direct measurement of bug-finding rates, rather than branch accuracy alone, would strengthen the case for adoption in security-critical software.

Load-bearing premise

Higher accuracy at hitting chosen branches will automatically produce tests that uncover more subtle bugs and vulnerabilities than current methods.

What would settle it

A head-to-head experiment on the same real-world projects that counts actual bugs or vulnerabilities discovered by GLMTest-generated tests versus tests from Claude-Sonnet-4.5 or GPT-4o-mini.
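A minimal version of that experiment is mutation-style: seed a known fault and count which generated suites actually catch it, rather than scoring branch accuracy alone. Everything below (the `clamp` function, the seeded off-by-one, both suites) is an invented stand-in for generator outputs, not data from the paper.

```python
def clamp(x, lo, hi):
    """Reference implementation under test."""
    return max(lo, min(x, hi))

def clamp_mutant(x, lo, hi):
    """Seeded fault: off-by-one in the upper clamp, visible only
    to inputs that exercise the x > hi branch."""
    return max(lo, min(x, hi + 1))

# Test suites as (x, lo, hi, expected) tuples; suite_a targets the
# upper-clamp branch, suite_b never reaches it.
suites = {
    "suite_a": [(5, 0, 10, 5), (20, 0, 10, 10)],
    "suite_b": [(5, 0, 10, 5), (7, 0, 10, 7)],
}

def kills(suite, impl):
    """A suite kills the mutant if any of its assertions fail."""
    return any(impl(x, lo, hi) != want for (x, lo, hi, want) in suite)

killed = {name: kills(cases, clamp_mutant) for name, cases in suites.items()}
# Only the suite that reaches the faulty branch detects the bug.
```

Run over many seeded faults per project, the kill counts per generator give exactly the bug-discovery comparison the branch-accuracy numbers only gesture at.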

Figures

Figures reproduced from arXiv: 2604.17715 by Cristian Borcea, Khang Tran, Khoa Nguyen, NhatHai Phan.

Figure 1
Figure 1. A Python function example annotated with line numbers and branch paths. Two test cases (Lines 12 and 15) are shown with their corresponding execution branches (Lines 13 and 16), illustrating how different input combinations traverse distinct branches.
Figure 2
Figure 2. GLMTest pipeline: the prompt is tokenized into textual embeddings, which are concatenated with the structural embeddings and passed to the LLM module f_lm to generate executable test cases that exercise the targeted branch b. GLMTest is trained end-to-end on a high-quality dataset curated from real-world projects.
Figure 4
Figure 4. Branch accuracy and branch overlap with the targeted branches of GLMTest vs. baselines with execution feedback.
Figure 5
Figure 5. Branch accuracy and branch overlap with the targeted branches of GLMTest vs. RAG augmentation baselines. GLMTest reaches the targeted branch more often and more consistently across tasks, and per-repository results mirror the overall performance: on the django repository, branch accuracy improves from 0.46 with Claude-Sonnet-4.5 to 0.68 with GLMTest.
Figure 6
Figure 6. Pass@1 and BranchCov when using branch-targeted inference. In this setting, 4o-mini and Sonnet-4.5 achieve 26.5% and 28.5% BranchAcc with 59.5% and 60.2% BranchOverlap, respectively, lower than GLMTest's 50.2% BranchAcc and 80.2% BranchOverlap, indicating that iterative execution feedback, while providing some guidance, remains significantly less effective than explicit structural conditioning.
Figure 7
Figure 7. Prompt template used by GLMTest to instruct the LLM to generate branch-targeted Python test cases.
Original abstract

Recent advances in large language models for test case generation have improved branch coverage via prompt-engineered mutations. However, they still lack principled mechanisms for steering models toward specific high-risk execution branches, limiting their effectiveness for discovering subtle bugs and security vulnerabilities. We propose GLMTest, the first program structure-aware LLM framework for targeted test case generation that seamlessly integrates code property graphs and code semantics using a graph neural network and a language model to condition test case generation on execution branches. This structured conditioning enables controllable and branch-targeted test case generation, thereby potentially enhancing bug and security risk discovery. Experiments on real-world projects show that GLMTest built on a Qwen2.5-Coder-7B-Instruct model improves branch accuracy from 27.4% to 50.2% on TestGenEval benchmark compared with state-of-the-art LLMs, i.e., Claude-Sonnet-4.5 and GPT-4o-mini.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes GLMTest, a program structure-aware LLM framework for targeted test case generation. It integrates code property graphs (CPGs) and graph neural networks (GNNs) to condition test generation on specific execution branches rather than relying solely on textual semantics. The central empirical claim is that GLMTest, instantiated on the Qwen2.5-Coder-7B-Instruct model, raises branch accuracy from 27.4% to 50.2% on the TestGenEval benchmark relative to commercial baselines Claude-Sonnet-4.5 and GPT-4o-mini. The authors position the approach as enabling controllable, branch-targeted testing that could improve discovery of subtle bugs and security vulnerabilities.

Significance. If the branch-accuracy gains prove robust, the work would constitute a useful advance by demonstrating a concrete mechanism for injecting program-structure information (via CPGs and GNNs) into LLM-based test generation. This moves beyond prompt-engineering techniques and supplies a principled route to branch-specific conditioning. The explicit integration of graph-based program representations with language models is a clear methodological contribution that could be extended to other software-engineering tasks.

major comments (2)
  1. [Abstract] The claim that the CPG+GNN conditioning "potentially enhanc[es] bug and security risk discovery" is unsupported by the reported results. The only quantitative finding is the branch-accuracy improvement on TestGenEval; no mutation scores, fault-detection rates, or vulnerability-specific benchmarks are presented, leaving the central motivation unverified.
  2. [Experiments] The 27.4% to 50.2% accuracy comparison lacks reported details on baseline prompt templates, temperature settings, statistical significance tests, or ablation studies isolating the contributions of the CPG and GNN components. Without these controls, the magnitude of the claimed improvement cannot be confidently attributed to the proposed structure-aware mechanism.
minor comments (2)
  1. [Abstract] The abstract introduces the base model Qwen2.5-Coder-7B-Instruct only at the end; moving this information earlier would improve readability.
  2. [Introduction] A brief description or citation for the TestGenEval benchmark in the introduction would help readers unfamiliar with the dataset.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating where revisions will be made to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The claim that the CPG+GNN conditioning "potentially enhanc[es] bug and security risk discovery" is unsupported by the reported results. The only quantitative finding is the branch-accuracy improvement on TestGenEval; no mutation scores, fault-detection rates, or vulnerability-specific benchmarks are presented, leaving the central motivation unverified.

    Authors: We agree that the abstract phrasing implies a direct link to bug and vulnerability discovery that is not supported by the current experiments, which focus solely on branch accuracy. The improvement in targeted branch coverage is a foundational step, but we did not measure fault detection or security benchmarks. In the revised version we will remove the forward-looking claim about bug and security risk discovery from the abstract and limit the stated contribution to the demonstrated gains in branch-targeted test generation. revision: yes

  2. Referee: [Experiments] The 27.4% to 50.2% accuracy comparison lacks reported details on baseline prompt templates, temperature settings, statistical significance tests, or ablation studies isolating the contributions of the CPG and GNN components. Without these controls, the magnitude of the claimed improvement cannot be confidently attributed to the proposed structure-aware mechanism.

    Authors: The referee correctly identifies missing experimental controls. The current manuscript reports only aggregate accuracy figures without the requested details. In the revision we will expand the Experiments section to include: (1) the exact prompt templates used for all baselines, (2) temperature settings (0.2, i.e., near-deterministic decoding), (3) statistical significance testing (e.g., McNemar's test with p-values), and (4) ablation studies that separately disable the CPG and GNN components to quantify their individual contributions. These additions will allow readers to attribute performance gains more precisely to the structure-aware mechanism. revision: yes
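For the promised significance testing, a continuity-corrected McNemar's test on paired per-task outcomes would look roughly like this. The discordant counts below are invented for illustration; `math.erfc` supplies the chi-square (1 df) survival function.

```python
import math

def mcnemar_p(b, c):
    """Continuity-corrected McNemar's test on paired binary outcomes.
    b: tasks only method 1 solves; c: tasks only method 2 solves.
    For X ~ chi-square(1 df), P(X > x) = erfc(sqrt(x / 2))."""
    if b + c == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    return math.erfc(math.sqrt(stat / 2))

# Hypothetical per-task counts: method 1 alone hits 40 targeted
# branches that method 2 misses; method 2 alone hits 12.
p_strong = mcnemar_p(40, 12)   # asymmetric disagreement
p_weak = mcnemar_p(10, 10)     # symmetric disagreement
```

McNemar's test fits this setting because the two generators are evaluated on the same tasks, so only the discordant pairs carry information about which one is better.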

Circularity Check

0 steps flagged

No circularity; empirical gains measured on external public benchmark

Full rationale

The paper's central claim rests on an empirical evaluation: GLMTest (CPG + GNN conditioning) raises branch accuracy from 27.4% to 50.2% on the named TestGenEval benchmark versus independent commercial models (Claude-Sonnet-4.5, GPT-4o-mini). No equation, parameter fit, or self-citation is shown to define the reported accuracy by construction. The framework description and integration steps are architectural choices whose outputs are measured externally rather than renamed or forced. The untested mapping from branch accuracy to bug discovery is an evidence gap, not a circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient detail to enumerate free parameters, axioms, or invented entities; the approach appears to rest on standard assumptions about GNN message passing and LLM prompting that are not spelled out.

pith-pipeline@v0.9.0 · 5462 in / 1245 out tokens · 59572 ms · 2026-05-10T05:03:23.142276+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

77 extracted references · 22 canonical work pages · 3 internal anchors
