arxiv: 2606.23445 · v1 · pith:TSSGBYBSnew · submitted 2026-06-22 · 💻 cs.SE

The Prevalence and Impact of Licenses in Open Software Projects

Pith reviewed 2026-06-26 07:23 UTC · model grok-4.3

classification 💻 cs.SE
keywords: open source licenses, permissive licenses, restrictive licenses, software ecosystems, project activity, license changes, programming languages, C language

The pith

Moving from a restrictive to a permissive license is associated with lower activity in C projects but higher activity in Python projects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyzes licenses across more than 100 million open source projects to map their overall distribution and changes within language ecosystems. It reports that most projects have no license at all, permissive licenses make up the majority of those that do and are growing in share over time, while restrictive licenses are more often retained. The key finding is that switching from a restrictive to a permissive license correlates with lower subsequent activity in C-language ecosystems but higher activity in Python. A reader would care because licenses determine reuse rights and can shape who contributes and how active a project becomes. The results show these effects are not uniform but depend on the programming language involved.

Core claim

Most projects contain no license. Among licensed projects, permissive licenses dominate and their share is rising over time, though restrictive licenses are retained more often. Language ecosystems differ sharply, with C strongly favoring restrictive licenses. Comparing activity levels in the year after a license change versus the year before shows that a shift from restrictive to permissive licensing is linked to reduced activity in C ecosystems and increased activity in Python.

What carries the argument

One-year activity comparison before and after license transitions, broken down by language ecosystem.
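
A minimal sketch of such a comparison, assuming each project is summarized by one activity count for the year before the change and one for the year after. The record keys (`ecosystem`, `before`, `after`), the use of a median ratio, and the toy numbers are illustrative assumptions, not the paper's actual metric or data.

```python
from collections import defaultdict
from statistics import median


def activity_ratio_by_ecosystem(projects):
    """Return the median after/before activity ratio per ecosystem.

    Each project is a dict with hypothetical keys:
      'ecosystem' - language label, e.g. 'C' or 'Python'
      'before'    - activity count (e.g. commits) in the year before the change
      'after'     - activity count in the year after the change
    """
    ratios = defaultdict(list)
    for p in projects:
        if p["before"] > 0:  # skip projects with no baseline activity
            ratios[p["ecosystem"]].append(p["after"] / p["before"])
    return {eco: median(vals) for eco, vals in ratios.items()}


# Made-up records, purely for illustration.
toy = [
    {"ecosystem": "C", "before": 100, "after": 80},
    {"ecosystem": "C", "before": 50, "after": 45},
    {"ecosystem": "Python", "before": 40, "after": 60},
    {"ecosystem": "Python", "before": 10, "after": 12},
]
result = activity_ratio_by_ecosystem(toy)
```

A ratio below 1 for an ecosystem indicates lower activity after the change and a ratio above 1 indicates higher activity. The median is used here only because it is less sensitive to a few very large projects.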

If this is right

  • Permissive licenses are becoming more common while restrictive ones persist in certain ecosystems.
  • C-language projects show lower activity after adopting permissive licenses.
  • Python projects show higher activity after adopting permissive licenses.
  • License type prevalence has shifted dramatically over time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Ecosystem maintainers might offer language-specific licensing guidance to projects that are considering a change of terms.
  • Short-term activity metrics could be tracked after license updates to anticipate contributor response.
  • Unlicensed projects may face reuse barriers that licensed ones avoid, potentially limiting their reach.

Load-bearing premise

License detection across 100 million projects is accurate and the one-year activity window measures the effect of the license change without other factors interfering.

What would settle it

Re-running the activity comparison on the same projects and finding no measurable difference between license changers and similar non-changers, or discovering widespread errors in the automated license labels.
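
The comparison against similar non-changers can be sketched as a simple difference-in-differences. This is an illustration under stated assumptions, not the paper's procedure: the `(before, after)` pair representation and the toy numbers are hypothetical, and how "similar" non-changers would be selected is left out.

```python
from statistics import mean


def diff_in_diff(changers, controls):
    """Mean activity change among license changers minus that of controls.

    Both arguments are lists of (before, after) activity pairs. A result
    near zero would suggest no difference attributable to the license change.
    """
    delta_changers = mean(after - before for before, after in changers)
    delta_controls = mean(after - before for before, after in controls)
    return delta_changers - delta_controls


# Made-up pairs, purely for illustration.
toy_changers = [(100, 80), (60, 50)]
toy_controls = [(100, 90), (60, 50)]
estimate = diff_in_diff(toy_changers, toy_controls)
```

Subtracting the controls' change removes trends that affect all comparable projects, which a plain before/after comparison cannot separate from the license effect.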

Figures

Figures reproduced from arXiv: 2606.23445 by Audris Mockus, Bogdan Vasilescu, Mahmoud Jahanshahi.

Figure 1: The distribution and retention rates of the most used licenses.

Figure 2: The distribution of license types across projects.

Figure 3: Proportion of license type over time. Adjacent text from the paper: "latest version. For comparison, only approximately 8% of “copyleft” licenses were changed, which is consistent with our theory (H1b) that ideology-related license choice should be most “sticky”. We also analyze the proportion of adopted licenses at each point in time, categorized by license type, to identify trends in licensing preferences over time. This allows us to o…"

Figure 4: Distribution of license type count in projects.

Figure 5: License change model - odds ratios. Adjacent text from the paper: "To better understand the effect of license change direction on project metrics, we calculate the odds ratios for the change direction predictor and its interaction with language, considering only those predictors that are statistically significant at the 95% confidence level. This ensures that the results focus on meaningful relationships rather than noise"
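
As a hedged illustration of the odds ratios mentioned for Figure 5: assuming the license change model is a logistic regression with a change-direction predictor and a direction-by-language interaction, the odds ratio for a given language is the exponential of the summed log-odds coefficients. The coefficient names and values below are hypothetical and are not taken from the paper.

```python
import math


def odds_ratio(coefs, language):
    """Odds ratio of the change-direction effect for one language.

    coefs holds log-odds coefficients under hypothetical names:
      'direction'            - main effect for the reference language
      'direction:<language>' - interaction term for a non-reference language
    Languages without an interaction term are treated as the reference.
    """
    log_odds = coefs["direction"] + coefs.get(f"direction:{language}", 0.0)
    return math.exp(log_odds)


# Made-up coefficients, purely for illustration.
toy_coefs = {"direction": -0.5, "direction:Python": 1.0}
or_reference = odds_ratio(toy_coefs, "C")
or_python = odds_ratio(toy_coefs, "Python")
```

An odds ratio below 1 corresponds to lower odds of the modeled outcome after the change and a value above 1 to higher odds, so a main effect and an interaction of opposite sign can yield opposite directions across languages.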
Original abstract

The terms of how publicly available source code can be used are dictated by its license. The license (or its absence), in turn, affects what code the project may reuse and how its code can be (re)used and may also affect external participation and overall activity of the project. We aim to better understand the general state of license distribution overall and within language ecosystems and to investigate if license changes are associated with a noticeable variations of project output. To accomplish that we identify licenses and license types for over 100M software projects and find that most do not contain any license, that permissive licenses represent the bulk of most licenses, and that permissive licensing is representing an increasing proportion of all licenses over time. Restrictive licenses are more likely to be retained, however. There is a great variation among language ecosystems with C-language strongly favoring restrictive licenses. The analysis of license change impact comparing activity within one year of the adoption of the initial and final licenses shows that the change from restrictive to permissive license varies with the ecosystem. C-language ecosystems show reduced activity while Python shows increased activity when comparing restrictive to permissive license transition. Our results demonstrate dramatic changes in license type prevalence over time and find that the effects of license changes may have opposite effects depending on the language ecosystem.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper analyzes license distribution and changes across over 100 million open source projects. It reports that most projects lack any license, permissive licenses form the majority and are increasing over time while restrictive licenses are more often retained, with substantial variation by language ecosystem (e.g., C strongly favoring restrictive licenses). The central empirical claim is that transitions from restrictive to permissive licenses are associated with ecosystem-dependent activity changes: reduced activity in C-language projects and increased activity in Python projects, based on comparing project output one year before versus after the license change.

Significance. The scale of the study (100M+ projects) offers potentially useful descriptive data on license prevalence trends if the detection methods are validated. The ecosystem-specific impact findings, if they survive controls for confounders, would be relevant to OSS governance and license choice. However, the before-after design without matching or regression controls for project-level factors limits the strength of causal inferences about license effects.

major comments (1)
  1. [License change impact analysis] The one-year before/after activity comparison (as described in the abstract) reports reduced activity for C ecosystems and increased activity for Python after restrictive-to-permissive transitions, but the paper provides no matching, regression controls, or stratification for project age, size, contributor count, or concurrent events at the transition time. These factors plausibly differ systematically across language ecosystems and could produce the observed activity differences independently of the license change.
minor comments (2)
  1. The abstract states that licenses were identified for over 100M projects but does not specify the data source, license detection algorithm, or accuracy validation; these details are required to assess the reliability of the prevalence statistics.
  2. The claim that 'permissive licensing is representing an increasing proportion of all licenses over time' would benefit from explicit time-series figures or tables showing the trend with confidence intervals or sample sizes per year.
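
One way to attach the confidence intervals requested in minor comment 2 is a Wilson score interval on the yearly permissive share. This is a sketch, not the authors' method: the function name and the example counts are hypothetical, and `z=1.96` assumes a 95% level.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion, e.g. permissive share in a year.

    successes - number of licensed projects with a permissive license
    n         - number of licensed projects observed that year
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Made-up counts for a single year, purely for illustration.
low, high = wilson_interval(50, 100)
```

Reporting `n` alongside each interval also covers the request for per-year sample sizes. With counts in the millions the intervals would be very narrow, so the sample sizes may be the more informative number.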

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on our large-scale empirical study of licenses in open source projects. We address the major comment below and will make revisions to strengthen the manuscript accordingly.

Point-by-point responses
  1. Referee: License change impact analysis (as described in the abstract): the one-year before/after activity comparison reports reduced activity for C ecosystems and increased activity for Python after restrictive-to-permissive transitions, but provides no matching, regression controls, or stratification for project age, size, contributor count, or concurrent events at the transition time. These factors plausibly differ systematically across language ecosystems and could produce the observed activity differences independently of the license change.

    Authors: We agree that the before-after comparison provides associations rather than causal estimates and does not include matching or regression controls for project-level factors. The manuscript frames the results using 'associated with' and 'varies with the ecosystem' to reflect its observational nature, and the key contribution is documenting the opposite directional patterns across ecosystems (reduced activity in C, increased in Python) as a descriptive finding. We will revise the discussion and limitations sections to explicitly note the absence of controls for age, size, contributor count, and concurrent events, and to state that these ecosystem differences warrant further controlled studies. Full matching or stratification was not performed due to the scale of the dataset (>100M projects) and the focus on broad prevalence trends, but we accept that adding such caveats improves the interpretation.

    revision: partial

Circularity Check

0 steps flagged

No circularity: purely observational empirical analysis

Full rationale

The paper reports license detection across >100M projects and before/after activity comparisons within language ecosystems. No equations, fitted parameters, predictions, ansatzes, or uniqueness theorems appear. All claims rest on direct data aggregation and simple temporal comparisons; no step reduces by construction to its own inputs or to a self-citation chain. The analysis is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Purely empirical observational study with no axioms, free parameters, or invented entities; relies on data collection and classification methods not detailed here.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. File-Level Copying Is an Implicit Dependency in Open Source

    cs.SE · 2026-07 · unverdicted · novelty 6.0

    File-level copying acts as an implicit dependency in open source, removing provenance signals and concentrating security risks in vendored copies and license risks in direct source reuse.
