arxiv: 2606.29785 · v1 · pith:E3VGX7F4 · submitted 2026-06-29 · 💻 cs.SE

Uncovering Similar but Different Packages in PyPI and Potential Security Threats

Pith reviewed 2026-06-30 05:40 UTC · model grok-4.3

classification 💻 cs.SE
keywords PyPI · package replication · vulnerability propagation · malware distribution · Python ecosystem · software security · code duplication

The pith

Replication in PyPI redistributes code from popular packages, hides vulnerabilities, and enables malware.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines package replication on PyPI, where a package duplicates most of the code of an existing package but is published under a different maintainer. An analysis of 200K packages shows that this happens frequently with popular projects: 1,361 replicated packages were found for the top 3K. The same pattern duplicates vulnerable code that standard tools overlook, yielding 256 previously unknown replicated vulnerable packages. It also lets malicious packages spread by copying popular code with small changes, which led to the identification of seven new replicated malicious packages. A sympathetic reader would care because these replications affect trust and safety in the Python package ecosystem used by millions of developers.

Core claim

By analyzing one-third of the PyPI repository, the study shows that replication frequently redistributes substantial portions of existing packages under different maintainers, creates vulnerability blind spots that current detection tools rarely catch, and serves as an attack vector for malware distribution, as evidenced by 1,361 replicated popular packages, 256 previously unknown replicated vulnerable packages, and 7 new replicated malicious packages.

What carries the argument

Package replication, the duplication of most of the codebase from existing packages under different maintainers.

If this is right

  • Replication of popular packages redistributes substantial portions of existing packages under different maintainers.
  • Replication creates vulnerability blind spots that current detection tools rarely catch.
  • Replication serves as an attack vector for malware distribution through minor modifications and code injection.
  • 4.79 percent of known malicious packages replicated popular ones.
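
The 4.79 percent figure can be checked against the counts given in the abstract (186 of 3,883 known malicious packages). The variable names below are illustrative and do not come from the paper.

```python
# Counts reported in the paper's abstract.
known_malicious = 3883      # known malicious packages analyzed
replicating_popular = 186   # of those, packages that replicated popular ones

share = replicating_popular / known_malicious
print(f"{share:.2%}")  # 4.79%
```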

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Package indexes could add automated similarity checks during upload to reduce developer confusion from near-duplicates.
  • Vulnerability scanners might improve coverage by cross-referencing new packages against known originals rather than analyzing them in isolation.
  • Malware detectors could flag packages that match popular ones except for small injected changes as higher-risk candidates.
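
As a rough illustration of the upload-time similarity check and the "popular code plus small injected change" flag suggested above, here is a minimal sketch. It is not the paper's detection method, which is not described on this page. It assumes a package is given as a mapping from file path to source text, compares whitespace-normalized file hashes with Jaccard similarity, and uses a hypothetical threshold of 0.8. The function names, the threshold, and the sample packages are all assumptions made for illustration.

```python
import hashlib


def content_hash(source: str) -> str:
    """Hash a file after stripping blank lines and surrounding whitespace."""
    lines = [line.strip() for line in source.splitlines() if line.strip()]
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


def check_upload(candidate: dict, original: dict, threshold: float = 0.8) -> dict:
    """Compare a candidate package against a known original (hypothetical check)."""
    cand = {path: content_hash(src) for path, src in candidate.items()}
    orig_hashes = {content_hash(src) for src in original.values()}
    cand_hashes = set(cand.values())

    union = cand_hashes | orig_hashes
    similarity = len(cand_hashes & orig_hashes) / len(union) if union else 0.0

    # Files whose content matches nothing in the original: new or modified code.
    unmatched = sorted(path for path, h in cand.items() if h not in orig_hashes)
    return {
        "similarity": similarity,
        "near_duplicate": similarity >= threshold,
        "unmatched_files": unmatched,
    }


# Hypothetical packages: the candidate copies every file and adds one more.
original = {f"pkg/mod{i}.py": f"def f{i}():\n    return {i}\n" for i in range(5)}
candidate = dict(original)
candidate["pkg/_hook.py"] = "import os\nos.system('echo injected')\n"

result = check_upload(candidate, original)
print(result)
```

A near-duplicate with a short list of unmatched files is the kind of case a reviewer or scanner might look at first. Exact hashing is deliberately crude: a real check would need to tolerate renamed identifiers and reformatted code.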

Load-bearing premise

The criteria and algorithm used to classify packages as replicated rather than independently similar implementations are accurate and were applied consistently.

What would settle it

A manual audit of the reported replicated packages that finds a substantial fraction were developed independently instead of copied would reduce or eliminate the reported counts of security impacts.

Figures

Figures reproduced from arXiv: 2606.29785 by Seunghoon Woo, Soojin Han, Sunha Park.

Figure 1: Overview of the empirical study.
Figure 2: Graphs on name and metadata similarity. Packages maintained by the same maintainer generally …
Figure 3: CDF graphs of code similarity and total downloads. In general, same-maintainer replications exhibit …
Figure 4: Graphs showing the vulnerability types and severities of the identified replicated vulnerable packages.
Figure 5: CDF graphs of days since last release and days from first to last release.
Figure 6: Graphs of name similarity in replicated malicious packages and statistics of suspicious APIs.

Original abstract

In this study, we present a large-scale, in-depth study of package replication in PyPI. As a vital platform, PyPI streamlines Python package distribution for developers. However, beyond small-scale code cloning, we observe that many replicated packages exist on PyPI, which duplicate most of the codebase from existing packages. Such replication not only confuses developers but also propagates known vulnerabilities and enables the creation of new malicious packages. To address this issue, we comprehensively examine the characteristics and potential threats of replicated packages. Using one-third of the entire PyPI repository (200K packages), we investigate replication from three perspectives: replication of popular packages, vulnerable packages, and malicious packages. Our experiments reveal three critical findings about package replication in PyPI: (1) by identifying 1,361 replicated packages of the top 3K popular projects, we show that replication frequently redistributes substantial portions of existing packages under different maintainers; (2) by uncovering 256 previously unknown replicated vulnerable packages, we demonstrate that replication creates vulnerability blind spots that current detection tools rarely catch; (3) by analyzing 3,883 known malicious packages, we found that 186 (4.79%) replicated popular ones, and this pattern further led us to identify seven previously unknown replicated malicious packages, highlighting its role as an attack vector for malware distribution through minor modifications and code injection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript reports a large-scale observational study of package replication in PyPI, analyzing 200K packages (one-third of the repository). It identifies 1,361 replicated packages of the top 3K popular projects and 256 previously unknown replicated vulnerable packages. It also finds that 186 of 3,883 known malicious packages (4.79%) replicated popular ones, a pattern that led to 7 new replicated malicious packages. The manuscript concludes that replication redistributes substantial code under new maintainers, creates vulnerability blind spots missed by current tools, and serves as a malware attack vector via minor modifications and injection.

Significance. If the replication detection procedure is accurate, reproducible, and validated against false positives, the results would provide concrete empirical evidence of a systemic security risk in the Python ecosystem, quantifying how code redistribution can propagate vulnerabilities and enable malware. The scale (200K packages) and specific counts could inform improvements to package managers, vulnerability scanners, and malware detection, representing a meaningful contribution to software supply-chain security research.

major comments (2)
  1. [Methods] Methods section: The similarity metric, threshold, handling of package metadata/dependencies, and validation steps (e.g., precision/recall on a labeled sample or manual audit) used to classify packages as 'replicated' versus independent implementations are not described. All headline counts (1,361 replicated popular packages, 256 replicated vulnerable packages, 7 new malicious packages) rest on this unstated procedure; without it, it is impossible to assess whether the security-threat conclusions follow from the data.
  2. [Results on popular packages] § on replication of popular packages: The claim that replication 'frequently redistributes substantial portions of existing packages' requires supporting quantitative detail such as the distribution of code-overlap percentages or similarity scores across the 1,361 cases; the current presentation leaves open whether many cases are superficially similar rather than true redistributions.
minor comments (2)
  1. [Abstract] Abstract: Clarify the sampling method for the 200K packages (random, time-based, or popularity-stratified) and the exact date or total size of PyPI at the time of data collection.
  2. [Results] The paper would benefit from a table or figure summarizing the replication similarity distribution and any false-positive rate estimates.
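
The validation the referee asks for (precision and recall on a manually labeled sample) could be computed along the following lines. This is only a sketch: the labeled pairs are hypothetical placeholders, not data from the paper, and `precision_recall` is an illustrative helper name.

```python
def precision_recall(pairs):
    """pairs: iterable of (flagged_as_replicated, truly_replicated) booleans."""
    pairs = list(pairs)
    tp = sum(1 for pred, actual in pairs if pred and actual)
    fp = sum(1 for pred, actual in pairs if pred and not actual)
    fn = sum(1 for pred, actual in pairs if not pred and actual)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical manual audit of 20 package pairs (not results from the paper).
labeled = (
    [(True, True)] * 8      # flagged and confirmed as copies
    + [(True, False)] * 2   # flagged but independently developed
    + [(False, True)] * 1   # missed copies
    + [(False, False)] * 9  # correctly ignored
)

precision, recall = precision_recall(labeled)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision is the figure that bears most directly on the headline counts: it estimates what fraction of the reported replicated packages are genuine copies rather than independent implementations.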

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment point by point below and will incorporate revisions to strengthen the presentation of our methods and results.

  1. Referee: [Methods] Methods section: The similarity metric, threshold, handling of package metadata/dependencies, and validation steps (e.g., precision/recall on a labeled sample or manual audit) used to classify packages as 'replicated' versus independent implementations are not described. All headline counts (1,361 replicated popular packages, 256 replicated vulnerable packages, 7 new malicious packages) rest on this unstated procedure; without it, it is impossible to assess whether the security-threat conclusions follow from the data.

    Authors: We agree that the current manuscript does not provide adequate detail on the replication detection procedure. In the revised version, we will expand the Methods section to fully describe the similarity metric employed, the threshold for classifying replication, the handling of package metadata and dependencies, and the validation steps including any precision/recall assessment on a labeled sample or manual audit. This addition will enable readers to evaluate the reliability of the reported counts and the resulting security conclusions.
    Revision: yes

  2. Referee: [Results on popular packages] § on replication of popular packages: The claim that replication 'frequently redistributes substantial portions of existing packages' requires supporting quantitative detail such as the distribution of code-overlap percentages or similarity scores across the 1,361 cases; the current presentation leaves open whether many cases are superficially similar rather than true redistributions.

    Authors: We acknowledge that additional quantitative support is needed to substantiate the claim of substantial code redistribution. In the revision, we will include the distribution of code-overlap percentages and similarity scores across the 1,361 replicated popular packages. This will provide concrete evidence that the identified cases involve meaningful code duplication rather than superficial similarities.
    Revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical counts from repository scan

The paper reports observational counts (1,361 replicated popular packages, 256 vulnerable, 7 malicious) obtained by scanning one-third of PyPI. No equations, fitted parameters, predictions, or self-citations appear in the abstract or described method. The replication classification procedure is a data-processing step whose validity is external to the reported numbers; it does not reduce the outputs to the inputs by construction. This is a standard empirical measurement study whose central claims rest on the accuracy of the (unspecified here) similarity detector rather than on any definitional or self-referential loop.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on an unstated replication detection procedure whose parameters and validation criteria are not provided in the abstract, plus the assumption that the sampled subset represents broader PyPI behavior.

free parameters (1)
  • replication similarity threshold
    The cutoff used to decide whether two packages count as replications is not stated but must exist to produce the reported counts.
axioms (1)
  • domain assumption: The 200K-package sample is representative of replication patterns across the full PyPI repository.
    The study uses one-third of the repository and generalizes the findings to the platform as a whole.
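
To make the role of the free parameter concrete, the sketch below shows how the number of packages counted as replicated moves with the similarity cutoff. The scores and thresholds are hypothetical values chosen for illustration. The paper's actual metric and cutoff are not stated on this page.

```python
def count_replicated(scores, threshold):
    """Count candidate pairs whose similarity meets or exceeds the cutoff."""
    return sum(1 for score in scores if score >= threshold)


# Hypothetical similarity scores for seven candidate pairs.
scores = [0.99, 0.95, 0.91, 0.84, 0.77, 0.62, 0.40]

counts = {t: count_replicated(scores, t) for t in (0.6, 0.8, 0.9)}
for threshold, n in counts.items():
    print(f"threshold={threshold:.1f} -> {n} replicated")
```

With these made-up scores the count falls from 6 to 4 to 3 as the cutoff rises from 0.6 to 0.9, which is why the reported totals cannot be interpreted without knowing the threshold.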

Published in Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE210. Publication date: July 2026. Received 2026-02-25; accepted 2026-03-24.