pith. machine review for the scientific record.

arxiv: 2605.06279 · v1 · submitted 2026-05-07 · 💻 cs.SE · cs.AI

Recognition: unknown

Correct Code, Vulnerable Dependencies: A Large Scale Measurement Study of LLM-Specified Library Versions

Chengjie Wang, Chen Zhao, Jingzheng Wu, Tianyue Luo, Xiang Ling

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 08:49 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords LLM code generation · vulnerable dependencies · library version selection · CVE exposure · Python security · compatibility failures · systemic bias · manifest files

The pith

LLMs specify library versions carrying known security vulnerabilities in 37 to 56 percent of generated tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper measures the security and compatibility risks created when large language models write Python code that imports third-party libraries with explicit version numbers. It tests ten different models on one thousand programming tasks drawn from Stack Overflow, tracking how often the models add version identifiers and then checking those versions against known vulnerability databases. When versions are supplied, more than one third of the resulting tasks include at least one publicly documented CVE, and roughly two thirds of those CVEs are rated critical or high severity. Most of the vulnerabilities were disclosed before the models' training data ended, and every model tends to settle on the same narrow set of problematic release numbers. The work treats version choice as a distinct failure mode that persists even when the surrounding code is otherwise correct.
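A minimal sketch of the lookup step described above, assuming the public OSV service (osv.dev) stands in for whichever vulnerability databases the pipeline actually queries; the paper's own tooling may rely on NVD or other sources, and response fields can differ, so this is a reading aid rather than the authors' code.

    # Illustrative only: check one exact (package, version) pin from generated
    # code against the public OSV vulnerability database. Field names follow
    # the OSV API; the paper's pipeline may use other sources.
    import json
    import urllib.request

    OSV_QUERY_URL = "https://api.osv.dev/v1/query"

    def known_vulnerabilities(package: str, version: str) -> list[str]:
        """Return CVE/advisory identifiers affecting an exact PyPI pin."""
        payload = json.dumps({
            "version": version,
            "package": {"name": package, "ecosystem": "PyPI"},
        }).encode()
        req = urllib.request.Request(
            OSV_QUERY_URL, data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=30) as resp:
            result = json.load(resp)
        ids = []
        for vuln in result.get("vulns", []):
            # Prefer CVE aliases when present, otherwise keep the OSV/GHSA id.
            cves = [a for a in vuln.get("aliases", []) if a.startswith("CVE-")]
            ids.extend(cves or [vuln.get("id", "")])
        return ids

    # Example: a deliberately dated pin of the kind the study observes.
    # known_vulnerabilities("requests", "2.19.1")

Whatever scanner is used, a per-task tally of pins with a non-empty result is the shape of the headline CVE-exposure rate.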

Core claim

The authors show that LLMs include version identifiers in 26.83 to 95.18 percent of cases when prompted directly and in 6.45 to 59.19 percent of cases when asked to produce a manifest file. Among those specified versions, 36.70 to 55.70 percent of tasks contain at least one known CVE, and 62.75 to 74.51 percent of the CVEs are critical or high. In 72.27 to 91.37 percent of cases the CVEs were published before the model's knowledge cutoff. All ten models converge on the same small collection of risky release versions rather than distributing errors independently. Static compatibility checks pass in only 19.70 to 63.20 percent of cases, with installation failures as the main cause, and dynamic test pass rates fall to 6.49 to 48.62 percent.

What carries the argument

The PinTrace benchmark of 1,000 Stack Overflow tasks used to prompt models, extract version strings from generated code or manifests, and match them against public CVE records.
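A condensed illustration of what extracting version strings from generated code can look like; the released PinTrace pipeline is the authoritative implementation, so treat this ast-plus-regex sketch as a reading aid only.

    # Reading aid, not the released PinTrace extractor: pull top-level imports
    # from a generated snippet with ast and exact version pins with a regex,
    # so each (package, version) pair can be handed to a CVE lookup.
    import ast
    import re

    PIN_RE = re.compile(r"\b([A-Za-z0-9][A-Za-z0-9._-]*)\s*==\s*([0-9][\w.]*)")

    def imported_modules(code: str) -> set[str]:
        """Top-level module names imported by a generated snippet."""
        names = set()
        for node in ast.walk(ast.parse(code)):
            if isinstance(node, ast.Import):
                names.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                names.add(node.module.split(".")[0])
        return names

    def pinned_versions(text: str) -> dict[str, str]:
        """Exact pkg==x.y.z pins found in comments, manifests, or pip commands."""
        return {name.lower(): ver for name, ver in PIN_RE.findall(text)}

    # snippet = "import numpy as np  # tested with numpy==1.19.5"
    # imported_modules(snippet) -> {"numpy"}; pinned_versions(snippet) -> {"numpy": "1.19.5"}

Import names and PyPI distribution names do not always coincide (cv2 versus opencv-python, for example), which is presumably why a real extractor also needs a package-to-import mapping.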

If this is right

  • Externally supplied version constraints sharply reduce both CVE exposure and installation failures.
  • Version selection errors are attributable to the chosen releases rather than to flaws in the surrounding code.
  • Dynamic test pass rates drop to 6.49-48.62 percent, confirming that the risky versions actually break functionality.
  • All evaluated models exhibit the same narrow preference for vulnerable releases, pointing to a shared training artifact.
  • Compatibility failures remain high even when the generated code is otherwise correct.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Teams that let LLMs generate dependency files may need automated post-processing to replace risky versions with known-safe alternatives (a sketch follows this list).
  • The same convergence pattern could appear in other languages or ecosystems if the underlying training data favor popular but outdated releases.
  • Prompt engineering that explicitly forbids version numbers or forces latest-stable constraints offers a lightweight mitigation.
  • Long-term, training data curation that down-weights vulnerable historical releases might reduce the bias at its source.
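An editorial sketch of the post-processing idea above. The flagged pins and the table of vetted releases are hypothetical placeholders, not values from the paper; in practice the flags would come from a vulnerability lookup such as the OSV query sketched earlier.

    # Editorial sketch only. FLAGGED stands in for the output of a vulnerability
    # lookup; SAFE_PINS is a hypothetical, human-vetted mapping. Any flagged
    # exact pin in an LLM-generated requirements.txt is rewritten before install.
    from packaging.requirements import InvalidRequirement, Requirement

    FLAGGED = {("numpy", "1.19.5"), ("requests", "2.19.1")}    # hypothetical audit output
    SAFE_PINS = {"numpy": "1.26.4", "requests": "2.32.3"}      # hypothetical vetted releases

    def harden_manifest(manifest_text: str) -> str:
        hardened = []
        for raw in manifest_text.splitlines():
            line = raw.split("#", 1)[0].strip()
            if not line:
                hardened.append(raw)                # keep blanks and comments
                continue
            try:
                req = Requirement(line)
            except InvalidRequirement:
                hardened.append(raw)                # keep prose the model mixed in
                continue
            pin = next((s.version for s in req.specifier if s.operator == "=="), None)
            if pin and (req.name, pin) in FLAGGED and req.name in SAFE_PINS:
                hardened.append(f"{req.name}=={SAFE_PINS[req.name]}")  # swap risky pin
            else:
                hardened.append(raw)
        return "\n".join(hardened)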

Load-bearing premise

The 1,000 Stack Overflow tasks represent typical real-world prompts for library-dependent code, and the automated extraction of versions plus CVE lookup introduces negligible false positives or selection bias.

What would settle it

A follow-up experiment on a fresh set of prompts or with an independent CVE scanner that finds substantially lower vulnerable-version rates or no convergence across models would undermine the systemic-bias conclusion.
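One concrete form the independent check could take, sketched with pip-audit as the second scanner. The CLI flags and JSON layout assumed here belong to current pip-audit releases and may vary; nothing below is a protocol taken from the paper.

    # Sketch of an independent re-scan: run pip-audit over the same generated
    # manifests and compare its vulnerable-dependency counts with the paper's.
    # Assumes pip-audit is installed; flags and JSON schema may differ by version.
    import json
    import subprocess

    def audit_manifest(path: str) -> int:
        """Number of dependencies pip-audit flags as vulnerable in one manifest."""
        proc = subprocess.run(
            ["pip-audit", "-r", path, "-f", "json"],
            capture_output=True, text=True,
        )
        report = json.loads(proc.stdout or "{}")
        return sum(1 for dep in report.get("dependencies", []) if dep.get("vulns"))

    # vulnerable_counts = [audit_manifest(p) for p in generated_manifests]
    # Substantially lower rates here would cut against the paper's headline numbers.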

Figures

Figures reproduced from arXiv: 2605.06279 by Chengjie Wang, Chen Zhao, Jingzheng Wu, Tianyue Luo, Xiang Ling.

Figure 1: A motivating example of LLM-introduced version-level risk.
Figure 2: The analysis pipeline of the measurement study.
Figure 3: Specified version distribution for the top-10 most frequently annotated TPLs under inline mode.
Figure 4: Distribution of release dates of LLM-specified TPL versions per model under inline mode.
Figure 5: Per-model vulnerability exposure profile under inline mode.
Figure 6: Cross-model convergence on the most frequently specified vulnerable versions.
Figure 7: Distribution of the lag between CVE disclosure dates and each model's knowledge cutoff.
Figure 8: Flow from static compatibility outcomes to dynamic execution results across BigCodeBench tasks.
original abstract

Large language models (LLMs) are now largely involved in software development workflows, and the code they generate routinely includes third-party library (TPL) imports annotated with specific version identifiers. These version choices can carry security and compatibility risks, yet they have not been systematically studied. We present the first large-scale measurement study of version-level risk in LLM-generated Python code, evaluating 10 LLMs on PinTrace, a curated benchmark of 1,000 Stack Overflow programming tasks. LLMs tend to specify version identifiers when directly prompted at 26.83%-95.18%, while down to 6.45%-59.19% in creating a manifest file directly. Among the specified versions, 36.70%-55.70% of tasks contain at least one known CVE, and 62.75%-74.51% of them carry Critical or High severity ratings. In 72.27%-91.37% of cases, the associated CVEs were publicly disclosed before the model's knowledge cutoff. The statistics show all models converge on the same small set of risky release versions, indicating a systemic bias rather than isolated model error. Static compatibility rates range from 19.70% to 63.20%, with installation failure as the dominant cause. The dynamic test cases confirm the pattern by 6.49%-48.62% pass rates. Further experiments confirm that these failures are attributable to version selection rather than code quality, and that externally anchored version constraints substantially reduce both vulnerability exposure and compatibility failures. Our findings reveal LLM version selection as a first-class, previously overlooked risk surface in LLM-based development. We disclosed these findings to the community of the evaluated models, and several confirmed the issue. All the code and dataset have been released for open science at https://github.com/dw763j/PinTrace.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper presents the first large-scale measurement study of version-level security and compatibility risks in Python code generated by 10 LLMs, evaluated on the PinTrace benchmark of 1,000 curated Stack Overflow programming tasks. Key findings include LLMs specifying version identifiers at rates of 26.83%-95.18% (lower when generating manifests), with 36.70%-55.70% of tasks containing at least one known CVE among specified versions (62.75%-74.51% rated Critical or High), 72.27%-91.37% of which predate the models' knowledge cutoffs. All models converge on the same small set of risky versions; static compatibility ranges 19.70%-63.20% and dynamic pass rates 6.49%-48.62%, with experiments attributing failures primarily to version selection rather than code quality. Externally anchored constraints reduce risks, and the authors disclose findings to model providers while releasing code and data.

Significance. If the measurements are robust, the work establishes LLM version selection as a systemic, previously unquantified risk surface in LLM-assisted development, with high CVE exposure and compatibility failures that are not isolated errors but convergent behavior across models. The open release of the full dataset, code, and benchmark supports reproducibility and follow-on work; disclosure to vendors adds practical impact. This could inform secure prompting practices and training data curation in software engineering.

major comments (3)
  1. [§3] §3 (Benchmark Construction): The central claim that 36.70%-55.70% of tasks contain known CVEs depends on PinTrace's 1,000 Stack Overflow tasks being representative of real-world LLM prompting for library-dependent code, yet no validation (e.g., diversity metrics, comparison to actual usage logs, or sampling rationale) is reported. This is load-bearing for generalizing the risk surface beyond the benchmark.
  2. [§4] §4 (Version Extraction and CVE Matching): The automated pipeline for extracting version strings from generated code and matching to NVD/CVE databases reports no precision/recall figures, manual validation set, or inter-rater reliability checks. Without these, selection bias or false positives in parsing (e.g., advisory text treated as versions) could directly inflate the CVE percentages and severity distributions, which underpin the main empirical claims.
  3. [§5.2-5.3] §5.2-5.3 (Compatibility Experiments): The attribution of low static (19.70%-63.20%) and dynamic (6.49%-48.62%) compatibility rates to version selection rather than code quality relies on controlled experiments, but the construction of test cases and isolation of confounding factors (e.g., task selection bias) are not detailed sufficiently to rule out alternative explanations for the observed failures.
minor comments (3)
  1. [Abstract] Abstract: The reported ranges (e.g., 36.70%-55.70%) would benefit from explicit mapping to individual models or conditions for immediate interpretability.
  2. [Figures/Tables] Throughout: Some figure captions and table headers could more clearly distinguish 'tasks with specified versions' from 'all tasks' to avoid readers misinterpreting the percentages.
  3. [§6] §6 (Discussion): The claim of 'systemic bias' from convergence on risky versions would be strengthened by a brief comparison to version distributions in non-LLM codebases or training data artifacts.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, providing clarifications on our methodology and indicating where revisions will be made to improve transparency and robustness. Our goal is to ensure the measurements are presented with appropriate caveats while maintaining the core findings on LLM version selection risks.

point-by-point responses
  1. Referee: [§3] §3 (Benchmark Construction): The central claim that 36.70%-55.70% of tasks contain known CVEs depends on PinTrace's 1,000 Stack Overflow tasks being representative of real-world LLM prompting for library-dependent code, yet no validation (e.g., diversity metrics, comparison to actual usage logs, or sampling rationale) is reported. This is load-bearing for generalizing the risk surface beyond the benchmark.

    Authors: PinTrace was constructed by curating 1,000 Stack Overflow tasks that explicitly involve third-party library imports in Python, selected to mirror common developer queries to LLMs. The curation prioritized diversity across domains such as data processing, web development, and scientific computing. We agree that additional details on sampling rationale and task characteristics would strengthen the paper. In revision, we will add a dedicated subsection describing the curation criteria, library distribution statistics, and a limitations discussion on generalizability. Direct comparison to proprietary usage logs from LLM providers is not possible without access to such data, but the observed convergence on risky versions across 10 distinct models supports the systemic nature of the risk independent of any single benchmark. revision: partial

  2. Referee: [§4] §4 (Version Extraction and CVE Matching): The automated pipeline for extracting version strings from generated code and matching to NVD/CVE databases reports no precision/recall figures, manual validation set, or inter-rater reliability checks. Without these, selection bias or false positives in parsing (e.g., advisory text treated as versions) could directly inflate the CVE percentages and severity distributions, which underpin the main empirical claims.

    Authors: Version strings are extracted via a hybrid approach using Python's ast parser for import statements combined with targeted regex for version specifiers (e.g., in requirements or comments). Matching to CVEs uses exact version lookup in the NVD database. We did not include validation metrics in the original submission. In the revised manuscript, we will report results from a manual audit of a random sample of 200 generated outputs, including precision and recall for extraction and matching, along with the validation protocol. No instances of advisory text being misparsed as versions were encountered during pipeline development. revision: yes

  3. Referee: [§5.2-5.3] §5.2-5.3 (Compatibility Experiments): The attribution of low static (19.70%-63.20%) and dynamic (6.49%-48.62%) compatibility rates to version selection rather than code quality relies on controlled experiments, but the construction of test cases and isolation of confounding factors (e.g., task selection bias) are not detailed sufficiently to rule out alternative explanations for the observed failures.

    Authors: The controlled experiments included ablations where the same generated code was tested under different version constraints, and separate runs with externally provided correct versions. Dynamic tests were derived directly from the functional requirements in the original Stack Overflow posts. We acknowledge that more detail on test construction and bias controls is needed. In revision, we will expand the experimental section with additional descriptions of test case derivation, how library-specific functionality was covered, and further results isolating version effects from code quality. These additions will better substantiate the attribution to version selection. revision: partial
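What such an ablation can look like mechanically, as an editorial sketch rather than the authors' harness: the same generated solution is installed twice, once under the LLM-chosen pins and once under externally anchored pins, each in its own virtual environment, and the task's tests are run in both. Paths and the pytest invocation are placeholders.

    # Editorial sketch of the two-condition comparison described in the response.
    # An install failure maps to a static compatibility failure; a failing test
    # run maps to a dynamic failure. Paths and the test command are placeholders.
    import subprocess
    import venv
    from pathlib import Path

    def run_condition(workdir: Path, requirements: Path, test_cmd: list[str]) -> bool:
        env_dir = workdir / "venv"
        venv.create(env_dir, with_pip=True, clear=True)
        pip = env_dir / "bin" / "pip"        # Scripts\pip.exe on Windows
        python = env_dir / "bin" / "python"
        if subprocess.run([str(pip), "install", "-r", str(requirements)]).returncode != 0:
            return False                      # static compatibility failure
        return subprocess.run([str(python), *test_cmd], cwd=workdir).returncode == 0

    # llm_ok      = run_condition(Path("task_042"), Path("task_042/llm_requirements.txt"), ["-m", "pytest"])
    # anchored_ok = run_condition(Path("task_042"), Path("task_042/anchored_requirements.txt"), ["-m", "pytest"])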

Circularity Check

0 steps flagged

No circularity: purely empirical measurement study

full rationale

This paper conducts a large-scale empirical measurement of LLM-generated code for library version specifications, CVE presence, severity, compatibility, and related statistics across 10 models on a fixed benchmark of 1,000 Stack Overflow tasks. It reports observed frequencies (e.g., 36.70%-55.70% tasks with known CVEs) and percentages directly from automated extraction and database matching. No equations, fitted parameters, derivations, predictions, or ansatzes appear in the provided text. Central claims rest on direct counts and comparisons rather than any self-referential reduction or load-bearing self-citation chain. The study is self-contained against external benchmarks (NVD/CVE databases, static/dynamic testing) with no internal circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the representativeness of the PinTrace benchmark drawn from Stack Overflow tasks and the accuracy of external CVE databases for risk labeling; these are standard domain assumptions rather than paper-specific inventions.

axioms (2)
  • domain assumption The 1,000 curated Stack Overflow tasks accurately reflect typical programming scenarios that involve third-party library imports.
    The study uses these tasks as the basis for all measurements of version choice and risk.
  • domain assumption Public CVE databases provide complete and accurate vulnerability information for the library versions identified in generated code.
    Risk statistics are derived directly from matching extracted versions against these databases.

pith-pipeline@v0.9.0 · 5648 in / 1451 out tokens · 66382 ms · 2026-05-08T08:49:48.939351+00:00 · methodology

discussion (0)

