Open but Incompatible: A License Compatibility Analysis of Corpora for Low-Resource African Languages
Pith reviewed 2026-06-30 09:47 UTC · model grok-4.3
The pith
Many African NLP corpora cannot be legally combined or modified due to incompatible or misrepresented licenses.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that license compatibility problems are widespread in corpora used for low-resource African languages. It shows this through a six-tier matrix applied to Kituba/Munukutuba, Zarma, and Moore, and through four documented failure modes: outright prohibition (JW300 removed after legal audit), composite license misrepresentation (WAXAL), a NoDerivs clause hidden behind a CC-BY label (Tanzil), and data persistence failure (Congolese Radio Corpus with most source URLs dead).
What carries the argument
A six-tier compatibility matrix that classifies how different licenses interact when corpora are combined or modified for NLP tasks.
If this is right
- CC-BY-SA and CC-BY-NC licenses cannot be combined into one published dataset.
- A NoDerivs clause prohibits tokenisation and annotation even when the label appears to allow it.
- Some corpora have been removed from public repositories after license violations were confirmed.
- Source URLs for corpora frequently become unavailable, resulting in permanent data loss.
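The first two consequences follow from how the ShareAlike (SA), NonCommercial (NC), and NoDerivs (ND) clauses interact. A minimal Python sketch of that pairwise logic follows. It is a simplification based on the general Creative Commons remix rules, not the paper's six-tier matrix, whose derivation rules are not given here. The function names `parse_clauses` and `combined_license` are hypothetical, and the result is not legal advice.

```python
from typing import FrozenSet, Optional

KNOWN_CLAUSES = {"BY", "NC", "SA", "ND"}

def parse_clauses(label: str) -> FrozenSet[str]:
    """Turn a label such as 'CC-BY-NC' into its clause set, e.g. {'BY', 'NC'}.
    'CC0' yields the empty set (no conditions)."""
    return frozenset(p for p in label.upper().split("-") if p in KNOWN_CLAUSES)

def combined_license(a: str, b: str) -> Optional[str]:
    """Return a license label for a dataset that merges sources a and b,
    or None when no single license can satisfy both (simplified rules)."""
    ca, cb = parse_clauses(a), parse_clauses(b)
    # ND forbids distributing adaptations, so it cannot enter a merged dataset.
    if "ND" in ca or "ND" in cb:
        return None
    if "SA" in ca and "SA" in cb:
        # Two ShareAlike sources must carry exactly the same terms.
        out = ca if ca == cb else None
    elif "SA" in ca:
        # The output must be a's license; b must not add conditions a lacks.
        out = ca if cb <= ca else None
    elif "SA" in cb:
        out = cb if ca <= cb else None
    else:
        # Without SA or ND, the output carries the union of conditions.
        out = ca | cb
    if out is None:
        return None
    if not out:
        return "CC0"
    return "CC-" + "-".join(c for c in ("BY", "NC", "SA") if c in out)

print(combined_license("CC-BY-SA", "CC-BY-NC"))  # None: SA output may not add NC
print(combined_license("CC-BY", "CC-BY-NC"))     # CC-BY-NC
print(combined_license("CC-BY-ND", "CC0"))       # None: ND blocks adaptations
```

The sketch treats every merge as an adaptation and ignores license versions, jurisdiction, and non-CC terms such as a website's Terms of Service, all of which a real audit would need to consider.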
Where Pith is reading between the lines
- Similar license audits applied to corpora for other low-resource languages could reveal comparable hidden incompatibilities.
- Following the pre-annotation checklist could reduce legal exposure when researchers create or enrich datasets.
- Identifying and combining only the legally clean corpora might enable larger, usable resources for African NLP work.
Load-bearing premise
The legal interpretations of the license terms stated on the original corpus sources and dataset cards are accurate and binding, and the audited corpora are representative of broader use in African NLP.
What would settle it
A detailed legal review that concludes the four documented failure modes do not actually block legal combination, tokenisation, or annotation of the listed corpora.
Original abstract
Creative Commons licenses dominate African NLP corpus releases, but their compatibility rules are rarely applied. CC-BY-SA and CC-BY-NC cannot be combined in a single published dataset; a NoDerivs clause silently prohibits tokenisation and annotation. This paper audits the license provenance of over twenty corpus families used in African NLP, constructs a six-tier compatibility matrix, and applies it to three case-study languages: Kituba/Munukutuba, Zarma, and Moore. Four failure modes are documented with primary-source evidence: outright prohibition (JW300, removed from OPUS after a legal audit confirmed Terms of Service violation); composite license misrepresentation (WAXAL, whose CC-BY 4.0 claim is contradicted by its own HuggingFace dataset card); a NoDerivs clause hidden behind a CC-BY label (Tanzil); and data persistence failure (the Congolese Radio Corpus, where 402 of 405 source URLs are now dead). A pre-annotation due diligence checklist and a survey of legally clean enrichment opportunities close the paper.
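The data persistence figure (402 of 405 source URLs dead) is the kind of number a small audit script can reproduce. The sketch below is an illustration only and not the authors' tooling: `audit_links` is a hypothetical helper, the liveness check is passed in as a callable so that no particular HTTP library or YouTube API is assumed, and the IDs are placeholders.

```python
from typing import Callable, Dict, Iterable

def audit_links(ids: Iterable[str], is_alive: Callable[[str], bool]) -> Dict[str, object]:
    """Count how many source IDs fail a caller-supplied liveness check."""
    id_list = list(ids)
    dead_ids = [i for i in id_list if not is_alive(i)]
    total = len(id_list)
    return {
        "total": total,
        "dead": len(dead_ids),
        "dead_fraction": len(dead_ids) / total if total else 0.0,
        "dead_ids": dead_ids,
    }

# Placeholder data mirroring the reported proportion: 405 IDs, 3 still alive.
placeholder_ids = [f"id{n}" for n in range(405)]
alive = set(placeholder_ids[:3])
report = audit_links(placeholder_ids, lambda i: i in alive)
print(f"{report['dead']} of {report['total']} dead ({report['dead_fraction']:.1%})")
# prints: 402 of 405 dead (99.3%)
```

In a real audit, `is_alive` would wrap an HTTP request or API lookup, and the run date would be recorded next to the result because link status changes over time.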
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript audits license provenance and compatibility for over twenty corpus families used in African NLP, constructs a six-tier compatibility matrix based on Creative Commons rules, applies it via case studies to Kituba/Munukutuba, Zarma, and Moore, and documents four failure modes with primary-source evidence: outright prohibition (JW300), composite license misrepresentation (WAXAL), hidden NoDerivs (Tanzil), and data persistence failure (Congolese Radio Corpus). It closes with a pre-annotation due diligence checklist and survey of clean enrichment opportunities.
Significance. If the primary-source readings hold, the work is significant for exposing under-appreciated legal barriers to data combination and reuse in low-resource African language NLP; the concrete examples, matrix, and checklist provide actionable value to practitioners and could reduce downstream incompatibility risks. The empirical audit approach with named sources is a clear strength.
major comments (2)
- [Case studies section] Failure modes: The incompatibility conclusions for WAXAL (CC-BY 4.0 claim contradicted by dataset card), Tanzil (NoDerivs hidden behind CC-BY), and JW300 (ToS violation) rest entirely on the authors' parsing of primary sources and dataset cards. No reference to CC legal compatibility charts, jurisdiction notes, or external verification is provided, yet these readings are load-bearing for the central claim that the cases constitute documented failure modes.
- [Compatibility matrix] Methods / matrix construction: The six-tier compatibility matrix is applied to the case studies, but the manuscript does not detail the exact derivation rules from CC license clauses (e.g., how SA/NC/ND interact across the tiers or handling of composite licenses), preventing independent assessment of whether the matrix accurately reflects CC rules.
minor comments (2)
- [Abstract] The abstract states 'over twenty corpus families' but provides no table or appendix listing them with their licenses; adding this would improve traceability of the audit scope.
- [Checklist section] The pre-annotation checklist is mentioned as closing the paper but its items are not enumerated in the provided abstract; ensuring it is explicitly listed would aid usability.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment below and commit to revisions that strengthen the evidentiary basis and reproducibility of the analysis.
Point-by-point responses
- Referee: [Case studies section] Failure modes: The incompatibility conclusions for WAXAL (CC-BY 4.0 claim contradicted by dataset card), Tanzil (NoDerivs hidden behind CC-BY), and JW300 (ToS violation) rest entirely on the authors' parsing of primary sources and dataset cards. No reference to CC legal compatibility charts, jurisdiction notes, or external verification is provided, yet these readings are load-bearing for the central claim that the cases constitute documented failure modes.
  Authors: We agree that explicit cross-references to CC compatibility resources would strengthen the presentation. While the claims derive directly from the cited primary license texts and dataset cards (which remain the authoritative sources), the revised manuscript will add citations to the official Creative Commons compatibility charts, license compatibility tables, and any available jurisdiction notes. This addition will provide the requested external verification layer without altering the core findings.
  Revision: yes
- Referee: [Compatibility matrix] Methods / matrix construction: The six-tier compatibility matrix is applied to the case studies, but the manuscript does not detail the exact derivation rules from CC license clauses (e.g., how SA/NC/ND interact across the tiers or handling of composite licenses), preventing independent assessment of whether the matrix accurately reflects CC rules.
  Authors: We accept this point. The revised Methods section will include an expanded subsection that explicitly derives each tier from the relevant CC license clauses. This will document the interaction rules for ShareAlike, NonCommercial, and NoDerivatives conditions, as well as the treatment of composite or multi-license corpora, enabling independent verification of the matrix.
  Revision: yes
Circularity Check
Empirical audit with no derivation chain or fitted inputs
full rationale
The paper performs a primary-source audit of license terms on existing corpora, documents four failure modes with direct evidence from dataset cards and URLs, and applies a compatibility matrix to case studies. No equations, parameters, or predictions are present that could reduce outputs to inputs by construction. The matrix and checklist are presented as tools derived from standard CC rules rather than self-referential fits. Self-citations are absent from the provided text. This matches the default expectation of a non-circular empirical study.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Creative Commons license terms as stated on corpus sources and dataset cards are the authoritative and enforceable versions.
Reference graph
Works this paper leans on
- [1] Introduction. NLP researchers are not lawyers. For high-resource languages, this is ordinarily not a problem. Corpora in common use have been legally vetted over decades of practice. For low-resource African languages, neither condition holds. Since 2019, parallel corpora, NER datasets, sentiment benchmarks, and speech resources have been published for dozens...
- [2] Related Work. Legal scholarship on open licensing. Creative Commons licensing has attracted sustained scholarly critique. Katz (2006) identifies two structural problems: variant proliferation creates user confusion, and ShareAlike terms create compatibility deadlocks that prevent legal distribution of derivatives...
  arXiv:2606.28867v1 [cs.CL] 27 Jun 2026
- [3] License Taxonomy. I define six license tiers for African NLP text corpora, ordered from least to most restrictive. For non-specialist readers: NC (Non-Commercial) means the resource may not be used for revenue-generating purposes as defined by the license; what constitutes commercial use is context-dependent and jurisdiction-sensitive, but publishing an annotated ...
- [4] License Compatibility Matrix. Table 2 shows the legally valid output license when two corpus sources are combined. “×” denotes an incompatible combination: no single license can satisfy both sources’ requirements simultaneously. A note on provenance quality independent of license tier: the compatibility matrix captures output license requirements, not th...
- [5] uses Wikipedia text in its annotation pipeline. Note that MasakhaNER’s HuggingFace dataset card lists CC BY-NC 4.0 for the dataset release; the source-text licensing is heterogeneous. Practitioners should verify the specific version they use. The important point is that license decisions must be made before annotation begins. Choosing CC-BY-SA forecloses...
- [6] Various open data
  African NLP Corpus Survey. Table 4 (Appendix A) surveys the corpus families used in African NLP with their tier assignments; an asterisk (*) marks web-mined corpora where the dataset license covers packaging or database rights, not rights in the underlying text. 5.1. Common Corpus: African Language Representation. We streamed the full Common Corpus training...
- [7] are language-identification errors on French archival documents; none contains usable African-language text. Common Corpus is not an independent African-language source: it repackages the same Wikipedia dumps audited in Table 4, with worse provenance metadata. Researchers should not count both as separate entries. The ratio is approximately 200:1 (En...
[8]
This dataset does not include JW300-derived text or derivatives thereof
Four Failure Modes 6.1. Prohibition: JW300 JW300(AgićandVulić,2019)wasaparallelcorpus covering 300+ languages, built from the Jehovah’s Witnesses websitejw.org. It was widely used in African NLP from 2019 onward due to coverage of languages with no other parallel text. The legal problem is straightforward. The jw.orgTermsofServiceexplicitlyprohibittextand...
2019
-
[9]
The legal constraints are severe, several doc- umented corpora contain license problems, and authentic open-license text for under-resourced African languages is limited
Enrichment Opportunities Within the Open-License Landscape The foregoing analysis could be read pessimisti- cally. The legal constraints are severe, several doc- umented corpora contain license problems, and authentic open-license text for under-resourced African languages is limited. The opposite reading is more productive: identifying the legal constrai...
2012
- [10] Avoid or flag: CCMatrix (no stated license), TED2020 (T4b), JW300 (T5), bible-uedin (CC0 claimed; verify per-translation rights)
  A Legal Due Diligence Checklist. Four steps before annotation begins: Step 1: Inventory sources. Consult: Wikipedia, Leipzig, UDHR, Tatoeba, FLORES-200, FLEURS, WAXAL (per-provider), WURA, eBible.org (per-translation), African Storybook (per-story), TICO-19, Common Voice (per-subset), MT560/HuggingFace (with provenance caveat), OPUS (excluding JW300-deriv...
- [11] Discussion. None of the errors documented here was wilful; each was a legal assumption that NLP practice gave no reason to question explicitly. The compatibility matrix (Table 2) requires no legal expertise: it is a lookup table. Tier assignments require a one-time provenance check per corpus. The checklist requires discipline. A common concern is that Wiki...
- [12] Conclusion. The four case studies share a pattern: a legal assumption was made implicitly that would not have survived explicit examination. JW300 was used because it existed and seemed open. Tanzil was treated as CC-BY because that is what the label said. WAXAL’s per-provider terms were not traced because the arXiv paper did not prompt it. The CRC’s You...
- [13] Bibliographical References. David Ifeoluwa Adelani, Jade Abbott, et al. 2021. MasakhaNER: Named entity recognition for African languages. In Transactions of the Association for Computational Linguistics, volume 9, pages 1116–1131. MIT Press. David Ifeoluwa Adelani, Graham Carr, et al. 2022. MasakhaNER 2.0: Africa-centric transfer learning for named enti...
- [14] arXiv preprint arXiv:2602.02734
  Association for Computational Linguistics. Željko Agić and Ivan Vulić. 2019. JW300: A wide-coverage parallel corpus for low-resource languages. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3204–3210. Association for Computational Linguistics. Marta Bañón, Pinzhen Chen, Barry Haddow, Kenneth Heafiel...
- [15] NTREX-128 – news test references for MT evaluation of 128 languages. In Proceedings of the First Workshop on Systematic Biases in MT Research. Timnit Gebru et al. 2021. Datasheets for datasets. Communications of the ACM, 64(12):86–92. GoingDutch.ai. 2024. GEITje takedown. https://goingdutch.ai/nl/posts/geitje-takedown/. Accessed February 2026. Dirk Goldha...
- [16] Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training
  Data governance in the age of large-scale data-driven language technology. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 2206–2222. Zachary Katz. 2006. Pitfalls of open licensing: An analysis of creative commons licensing. IDEA: The Intellectual Property Law Review, 46(3). Julia Kreutzer et al. 2022. Quali...
- [17] Language Resource References. Wheatley, Julian and others. 2020. Congolese Radio Corpus (CRC) for Lingala. Data persistence failure. Originally claimed hundreds of hours of YouTube broadcast audio. Audit conducted February 2026 found 402 of 405 YouTube IDs dead (404 errors). Reproducible content: approximately 8.3 hours elicited LRSC speech (IPA-transcribed) + a...
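The "avoid or flag" line of the due diligence checklist excerpted in [10] lends itself to a simple screening step. The sketch below is hypothetical: the names `FLAGGED_SOURCES` and `screen_sources` are illustrative, the flagged entries and reasons are copied from that excerpt, and the rest of the checklist is truncated in this page, so it is not represented.

```python
from typing import Dict, Iterable, List, Tuple

# Entries and reasons taken from the checklist excerpt; T4b and T5 are the
# paper's tier labels, whose full definitions are not reproduced here.
FLAGGED_SOURCES: Dict[str, str] = {
    "CCMatrix": "no stated license",
    "TED2020": "tier T4b",
    "JW300": "tier T5",
    "bible-uedin": "CC0 claimed; verify per-translation rights",
}

def screen_sources(sources: Iterable[str]) -> Tuple[List[str], Dict[str, str]]:
    """Split an inventory into unflagged names and flagged names with reasons.
    Matching is by exact name, so aliases need normalising beforehand."""
    unflagged: List[str] = []
    flagged: Dict[str, str] = {}
    for name in sources:
        if name in FLAGGED_SOURCES:
            flagged[name] = FLAGGED_SOURCES[name]
        else:
            unflagged.append(name)
    return unflagged, flagged

unflagged, flagged = screen_sources(["Tatoeba", "JW300", "CCMatrix", "FLEURS"])
print(unflagged)  # ['Tatoeba', 'FLEURS']
print(flagged)    # {'JW300': 'tier T5', 'CCMatrix': 'no stated license'}
```

An unflagged name is not cleared by this step. Several sources in the checklist carry per-provider, per-translation, or per-story caveats that still require an individual provenance check.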