pith. machine review for the scientific record.

arxiv: 2605.06279 · v1 · submitted 2026-05-07 · 💻 cs.SE · cs.AI

Recognition: unknown

Correct Code, Vulnerable Dependencies: A Large Scale Measurement Study of LLM-Specified Library Versions

Chengjie Wang, Chen Zhao, Jingzheng Wu, Tianyue Luo, Xiang Ling

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 08:49 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords LLM code generation · vulnerable dependencies · library version selection · CVE exposure · Python security · compatibility failures · systemic bias · manifest files

The pith

LLMs specify library versions carrying known security vulnerabilities in 37 to 56 percent of generated tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper measures the security and compatibility risks created when large language models write Python code that imports third-party libraries with explicit version numbers. It tests ten different models on one thousand programming tasks drawn from Stack Overflow, tracking how often the models add version identifiers and then checking those versions against known vulnerability databases. When versions are supplied, more than one third of the resulting tasks include at least one publicly documented CVE, and roughly two thirds of those CVEs are rated critical or high severity. Most of the vulnerabilities were disclosed before the models' training data ended, and every model tends to settle on the same narrow set of problematic release numbers. The work treats version choice as a distinct failure mode that persists even when the surrounding code is otherwise correct.
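A minimal sketch of the lookup step described above, assuming the public OSV service (osv.dev) stands in for whichever vulnerability databases the pipeline actually queries; the paper's own tooling may rely on NVD or other sources, and response fields can differ, so this is a reading aid rather than the authors' code.

    # Illustrative only: check one exact (package, version) pin from generated
    # code against the public OSV vulnerability database. Field names follow
    # the OSV API; the paper's pipeline may use other sources.
    import json
    import urllib.request

    OSV_QUERY_URL = "https://api.osv.dev/v1/query"

    def known_vulnerabilities(package: str, version: str) -> list[str]:
        """Return CVE/advisory identifiers affecting an exact PyPI pin."""
        payload = json.dumps({
            "version": version,
            "package": {"name": package, "ecosystem": "PyPI"},
        }).encode()
        req = urllib.request.Request(
            OSV_QUERY_URL, data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=30) as resp:
            result = json.load(resp)
        ids = []
        for vuln in result.get("vulns", []):
            # Prefer CVE aliases when present, otherwise keep the OSV/GHSA id.
            cves = [a for a in vuln.get("aliases", []) if a.startswith("CVE-")]
            ids.extend(cves or [vuln.get("id", "")])
        return ids

    # Example: a deliberately dated pin of the kind the study observes.
    # known_vulnerabilities("requests", "2.19.1")

Whatever scanner is used, a per-task tally of pins with a non-empty result is the shape of the headline CVE-exposure rate.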

Core claim

The authors show that LLMs include version identifiers in 26.83 to 95.18 percent of cases when prompted directly and in 6.45 to 59.19 percent of cases when asked to produce a manifest file. Among those specified versions, 36.70 to 55.70 percent of tasks contain at least one known CVE, and 62.75 to 74.51 percent of the CVEs are critical or high. In 72.27 to 91.37 percent of cases the CVEs were published before the model's knowledge cutoff. All ten models converge on the same small collection of risky release versions rather than distributing errors independently. Static compatibility checks pass in only 19.70 to 63.20 percent of cases, with installation failures as the main cause, and dynamic test pass rates fall to 6.49 to 48.62 percent.

What carries the argument

The PinTrace benchmark of 1,000 Stack Overflow tasks used to prompt models, extract version strings from generated code or manifests, and match them against public CVE records.
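A condensed illustration of what extracting version strings from generated code can look like; the released PinTrace pipeline is the authoritative implementation, so treat this ast-plus-regex sketch as a reading aid only.

    # Reading aid, not the released PinTrace extractor: pull top-level imports
    # from a generated snippet with ast and exact version pins with a regex,
    # so each (package, version) pair can be handed to a CVE lookup.
    import ast
    import re

    PIN_RE = re.compile(r"\b([A-Za-z0-9][A-Za-z0-9._-]*)\s*==\s*([0-9][\w.]*)")

    def imported_modules(code: str) -> set[str]:
        """Top-level module names imported by a generated snippet."""
        names = set()
        for node in ast.walk(ast.parse(code)):
            if isinstance(node, ast.Import):
                names.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                names.add(node.module.split(".")[0])
        return names

    def pinned_versions(text: str) -> dict[str, str]:
        """Exact pkg==x.y.z pins found in comments, manifests, or pip commands."""
        return {name.lower(): ver for name, ver in PIN_RE.findall(text)}

    # snippet = "import numpy as np  # tested with numpy==1.19.5"
    # imported_modules(snippet) -> {"numpy"}; pinned_versions(snippet) -> {"numpy": "1.19.5"}

Import names and PyPI distribution names do not always coincide (cv2 versus opencv-python, for example), which is presumably why a real extractor also needs a package-to-import mapping.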

If this is right

  • Externally supplied version constraints sharply reduce both CVE exposure and installation failures.
  • Version selection errors are attributable to the chosen releases rather than to flaws in the surrounding code.
  • Dynamic test pass rates drop to 6.49-48.62 percent, confirming that the risky versions actually break functionality.
  • All evaluated models exhibit the same narrow preference for vulnerable releases, pointing to a shared training artifact.
  • Compatibility failures remain high even when the generated code is otherwise correct.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Teams that let LLMs generate dependency files may need automated post-processing to replace risky versions with known-safe alternatives (a sketch follows this list).
  • The same convergence pattern could appear in other languages or ecosystems if the underlying training data favor popular but outdated releases.
  • Prompt engineering that explicitly forbids version numbers or forces latest-stable constraints offers a lightweight mitigation.
  • Long-term, training data curation that down-weights vulnerable historical releases might reduce the bias at its source.
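An editorial sketch of the post-processing idea above. The flagged pins and the table of vetted releases are hypothetical placeholders, not values from the paper; in practice the flags would come from a vulnerability lookup such as the OSV query sketched earlier.

    # Editorial sketch only. FLAGGED stands in for the output of a vulnerability
    # lookup; SAFE_PINS is a hypothetical, human-vetted mapping. Any flagged
    # exact pin in an LLM-generated requirements.txt is rewritten before install.
    from packaging.requirements import InvalidRequirement, Requirement

    FLAGGED = {("numpy", "1.19.5"), ("requests", "2.19.1")}    # hypothetical audit output
    SAFE_PINS = {"numpy": "1.26.4", "requests": "2.32.3"}      # hypothetical vetted releases

    def harden_manifest(manifest_text: str) -> str:
        hardened = []
        for raw in manifest_text.splitlines():
            line = raw.split("#", 1)[0].strip()
            if not line:
                hardened.append(raw)                # keep blanks and comments
                continue
            try:
                req = Requirement(line)
            except InvalidRequirement:
                hardened.append(raw)                # keep prose the model mixed in
                continue
            pin = next((s.version for s in req.specifier if s.operator == "=="), None)
            if pin and (req.name, pin) in FLAGGED and req.name in SAFE_PINS:
                hardened.append(f"{req.name}=={SAFE_PINS[req.name]}")  # swap risky pin
            else:
                hardened.append(raw)
        return "\n".join(hardened)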

Load-bearing premise

The 1,000 Stack Overflow tasks represent typical real-world prompts for library-dependent code, and the automated extraction of versions plus CVE lookup introduces negligible false positives or selection bias.

What would settle it

A follow-up experiment on a fresh set of prompts or with an independent CVE scanner that finds substantially lower vulnerable-version rates or no convergence across models would undermine the systemic-bias conclusion.
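One concrete form the independent check could take, sketched with pip-audit as the second scanner. The CLI flags and JSON layout assumed here belong to current pip-audit releases and may vary; nothing below is a protocol taken from the paper.

    # Sketch of an independent re-scan: run pip-audit over the same generated
    # manifests and compare its vulnerable-dependency counts with the paper's.
    # Assumes pip-audit is installed; flags and JSON schema may differ by version.
    import json
    import subprocess

    def audit_manifest(path: str) -> int:
        """Number of dependencies pip-audit flags as vulnerable in one manifest."""
        proc = subprocess.run(
            ["pip-audit", "-r", path, "-f", "json"],
            capture_output=True, text=True,
        )
        report = json.loads(proc.stdout or "{}")
        return sum(1 for dep in report.get("dependencies", []) if dep.get("vulns"))

    # vulnerable_counts = [audit_manifest(p) for p in generated_manifests]
    # Substantially lower rates here would cut against the paper's headline numbers.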

Figures

Figures reproduced from arXiv: 2605.06279 by Chengjie Wang, Chen Zhao, Jingzheng Wu, Tianyue Luo, Xiang Ling.

Figure 1: A motivating example of LLM-introduced version-level risk.
Figure 2: The analysis pipeline of the measurement study.
Figure 3: Specified version distribution for the top-10 most frequently annotated TPLs under inline mode.
Figure 4: Distribution of release dates of LLM-specified TPL versions per model under inline mode.
Figure 5: Per-model vulnerability exposure profile under inline mode.
Figure 6: Cross-model convergence on the most frequently specified vulnerable versions.
Figure 7: Distribution of the lag between CVE disclosure dates and each model's knowledge cutoff.
Figure 8: Flow from static compatibility outcomes to dynamic execution results across BigCodeBench tasks.
original abstract

Large language models (LLMs) are now largely involved in software development workflows, and the code they generate routinely includes third-party library (TPL) imports annotated with specific version identifiers. These version choices can carry security and compatibility risks, yet they have not been systematically studied. We present the first large-scale measurement study of version-level risk in LLM-generated Python code, evaluating 10 LLMs on PinTrace, a curated benchmark of 1,000 Stack Overflow programming tasks. LLMs tend to specify version identifiers when directly prompted at 26.83%-95.18%, while down to 6.45%-59.19% in creating a manifest file directly. Among the specified versions, 36.70%-55.70% of tasks contain at least one known CVE, and 62.75%-74.51% of them carry Critical or High severity ratings. In 72.27%-91.37% of cases, the associated CVEs were publicly disclosed before the model's knowledge cutoff. The statistics show all models converge on the same small set of risky release versions, indicating a systemic bias rather than isolated model error. Static compatibility rates range from 19.70% to 63.20%, with installation failure as the dominant cause. The dynamic test cases confirm the pattern by 6.49%-48.62% pass rates. Further experiments confirm that these failures are attributable to version selection rather than code quality, and that externally anchored version constraints substantially reduce both vulnerability exposure and compatibility failures. Our findings reveal LLM version selection as a first-class, previously overlooked risk surface in LLM-based development. We disclosed these findings to the community of the evaluated models, and several confirmed the issue. All the code and dataset have been released for open science at https://github.com/dw763j/PinTrace.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper presents the first large-scale measurement study of version-level security and compatibility risks in Python code generated by 10 LLMs, evaluated on the PinTrace benchmark of 1,000 curated Stack Overflow programming tasks. Key findings include LLMs specifying version identifiers at rates of 26.83%-95.18% (lower when generating manifests), with 36.70%-55.70% of tasks containing at least one known CVE among specified versions (62.75%-74.51% rated Critical or High), 72.27%-91.37% of which predate the models' knowledge cutoffs. All models converge on the same small set of risky versions; static compatibility ranges 19.70%-63.20% and dynamic pass rates 6.49%-48.62%, with experiments attributing failures primarily to version selection rather than code quality. Externally anchored constraints reduce risks, and the authors disclose findings to model providers while releasing code and data.

Significance. If the measurements are robust, the work establishes LLM version selection as a systemic, previously unquantified risk surface in LLM-assisted development, with high CVE exposure and compatibility failures that are not isolated errors but convergent behavior across models. The open release of the full dataset, code, and benchmark supports reproducibility and follow-on work; disclosure to vendors adds practical impact. This could inform secure prompting practices and training data curation in software engineering.

major comments (3)
  1. [§3] §3 (Benchmark Construction): The central claim that 36.70%-55.70% of tasks contain known CVEs depends on PinTrace's 1,000 Stack Overflow tasks being representative of real-world LLM prompting for library-dependent code, yet no validation (e.g., diversity metrics, comparison to actual usage logs, or sampling rationale) is reported. This is load-bearing for generalizing the risk surface beyond the benchmark.
  2. [§4] §4 (Version Extraction and CVE Matching): The automated pipeline for extracting version strings from generated code and matching to NVD/CVE databases reports no precision/recall figures, manual validation set, or inter-rater reliability checks. Without these, selection bias or false positives in parsing (e.g., advisory text treated as versions) could directly inflate the CVE percentages and severity distributions, which underpin the main empirical claims.
  3. [§5.2-5.3] §5.2-5.3 (Compatibility Experiments): The attribution of low static (19.70%-63.20%) and dynamic (6.49%-48.62%) compatibility rates to version selection rather than code quality relies on controlled experiments, but the construction of test cases and isolation of confounding factors (e.g., task selection bias) are not detailed sufficiently to rule out alternative explanations for the observed failures.
minor comments (3)
  1. [Abstract] Abstract: The reported ranges (e.g., 36.70%-55.70%) would benefit from explicit mapping to individual models or conditions for immediate interpretability.
  2. [Figures/Tables] Throughout: Some figure captions and table headers could more clearly distinguish 'tasks with specified versions' from 'all tasks' to avoid readers misinterpreting the percentages.
  3. [§6] §6 (Discussion): The claim of 'systemic bias' from convergence on risky versions would be strengthened by a brief comparison to version distributions in non-LLM codebases or training data artifacts.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, providing clarifications on our methodology and indicating where revisions will be made to improve transparency and robustness. Our goal is to ensure the measurements are presented with appropriate caveats while maintaining the core findings on LLM version selection risks.

point-by-point responses
  1. Referee: [§3] §3 (Benchmark Construction): The central claim that 36.70%-55.70% of tasks contain known CVEs depends on PinTrace's 1,000 Stack Overflow tasks being representative of real-world LLM prompting for library-dependent code, yet no validation (e.g., diversity metrics, comparison to actual usage logs, or sampling rationale) is reported. This is load-bearing for generalizing the risk surface beyond the benchmark.

    Authors: PinTrace was constructed by curating 1,000 Stack Overflow tasks that explicitly involve third-party library imports in Python, selected to mirror common developer queries to LLMs. The curation prioritized diversity across domains such as data processing, web development, and scientific computing. We agree that additional details on sampling rationale and task characteristics would strengthen the paper. In revision, we will add a dedicated subsection describing the curation criteria, library distribution statistics, and a limitations discussion on generalizability. Direct comparison to proprietary usage logs from LLM providers is not possible without access to such data, but the observed convergence on risky versions across 10 distinct models supports the systemic nature of the risk independent of any single benchmark. revision: partial

  2. Referee: [§4] §4 (Version Extraction and CVE Matching): The automated pipeline for extracting version strings from generated code and matching to NVD/CVE databases reports no precision/recall figures, manual validation set, or inter-rater reliability checks. Without these, selection bias or false positives in parsing (e.g., advisory text treated as versions) could directly inflate the CVE percentages and severity distributions, which underpin the main empirical claims.

    Authors: Version strings are extracted via a hybrid approach using Python's ast parser for import statements combined with targeted regex for version specifiers (e.g., in requirements or comments). Matching to CVEs uses exact version lookup in the NVD database. We did not include validation metrics in the original submission. In the revised manuscript, we will report results from a manual audit of a random sample of 200 generated outputs, including precision and recall for extraction and matching, along with the validation protocol. No instances of advisory text being misparsed as versions were encountered during pipeline development. revision: yes

  3. Referee: [§5.2-5.3] §5.2-5.3 (Compatibility Experiments): The attribution of low static (19.70%-63.20%) and dynamic (6.49%-48.62%) compatibility rates to version selection rather than code quality relies on controlled experiments, but the construction of test cases and isolation of confounding factors (e.g., task selection bias) are not detailed sufficiently to rule out alternative explanations for the observed failures.

    Authors: The controlled experiments included ablations where the same generated code was tested under different version constraints, and separate runs with externally provided correct versions. Dynamic tests were derived directly from the functional requirements in the original Stack Overflow posts. We acknowledge that more detail on test construction and bias controls is needed. In revision, we will expand the experimental section with additional descriptions of test case derivation, how library-specific functionality was covered, and further results isolating version effects from code quality. These additions will better substantiate the attribution to version selection. revision: partial
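What such an ablation can look like mechanically, as an editorial sketch rather than the authors' harness: the same generated solution is installed twice, once under the LLM-chosen pins and once under externally anchored pins, each in its own virtual environment, and the task's tests are run in both. Paths and the pytest invocation are placeholders.

    # Editorial sketch of the two-condition comparison described in the response.
    # An install failure maps to a static compatibility failure; a failing test
    # run maps to a dynamic failure. Paths and the test command are placeholders.
    import subprocess
    import venv
    from pathlib import Path

    def run_condition(workdir: Path, requirements: Path, test_cmd: list[str]) -> bool:
        env_dir = workdir / "venv"
        venv.create(env_dir, with_pip=True, clear=True)
        pip = env_dir / "bin" / "pip"        # Scripts\pip.exe on Windows
        python = env_dir / "bin" / "python"
        if subprocess.run([str(pip), "install", "-r", str(requirements)]).returncode != 0:
            return False                      # static compatibility failure
        return subprocess.run([str(python), *test_cmd], cwd=workdir).returncode == 0

    # llm_ok      = run_condition(Path("task_042"), Path("task_042/llm_requirements.txt"), ["-m", "pytest"])
    # anchored_ok = run_condition(Path("task_042"), Path("task_042/anchored_requirements.txt"), ["-m", "pytest"])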

Circularity Check

0 steps flagged

No circularity: purely empirical measurement study

full rationale

This paper conducts a large-scale empirical measurement of LLM-generated code for library version specifications, CVE presence, severity, compatibility, and related statistics across 10 models on a fixed benchmark of 1,000 Stack Overflow tasks. It reports observed frequencies (e.g., 36.70%-55.70% tasks with known CVEs) and percentages directly from automated extraction and database matching. No equations, fitted parameters, derivations, predictions, or ansatzes appear in the provided text. Central claims rest on direct counts and comparisons rather than any self-referential reduction or load-bearing self-citation chain. The study is self-contained against external benchmarks (NVD/CVE databases, static/dynamic testing) with no internal circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the representativeness of the PinTrace benchmark drawn from Stack Overflow tasks and the accuracy of external CVE databases for risk labeling; these are standard domain assumptions rather than paper-specific inventions.

axioms (2)
  • domain assumption The 1,000 curated Stack Overflow tasks accurately reflect typical programming scenarios that involve third-party library imports.
    The study uses these tasks as the basis for all measurements of version choice and risk.
  • domain assumption Public CVE databases provide complete and accurate vulnerability information for the library versions identified in generated code.
    Risk statistics are derived directly from matching extracted versions against these databases.

pith-pipeline@v0.9.0 · 5648 in / 1451 out tokens · 66382 ms · 2026-05-08T08:49:48.939351+00:00 · methodology

discussion (0)

