Open but Incompatible: A License Compatibility Analysis of Corpora for Low-Resource African Languages

Ernst van Gassen

arxiv: 2606.28867 · v1 · pith:A7O7LSONnew · submitted 2026-06-27 · 💻 cs.CL

Pith reviewed 2026-06-30 09:47 UTC · model grok-4.3

classification 💻 cs.CL

keywords license compatibility · African NLP · corpus licensing · Creative Commons · low-resource languages · data provenance · dataset audit · compatibility matrix

The pith

Many African NLP corpora cannot be legally combined or modified due to incompatible or misrepresented licenses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Creative Commons licenses dominate releases of corpora for African languages in NLP, but rules on sharing, derivatives, and commercial use are rarely checked or followed. The paper audits more than twenty corpus families, builds a six-tier compatibility matrix, and applies it to three case-study languages. It documents four concrete failure modes with primary-source evidence, ranging from outright prohibition to hidden restrictive clauses and lost data links. A sympathetic reader would care because these issues can make published datasets unusable for tokenisation, annotation, or combination, risking legal violations in model training. The paper closes with a due-diligence checklist to avoid such problems.

Core claim

The paper claims that license compatibility problems are widespread in corpora used for low-resource African languages. It shows this through a six-tier matrix applied to Kituba/Munukutuba, Zarma, and Moore, and through four documented failure modes: outright prohibition (JW300 removed after legal audit), composite license misrepresentation (WAXAL), a NoDerivs clause hidden behind a CC-BY label (Tanzil), and data persistence failure (Congolese Radio Corpus with most source URLs dead).

What carries the argument

A six-tier compatibility matrix that classifies how different licenses interact when corpora are combined or modified for NLP tasks.

If this is right

CC-BY-SA and CC-BY-NC licenses cannot be combined into one published dataset.
A NoDerivs clause prohibits tokenisation and annotation even when the label appears to allow it.
Some corpora have been removed from public repositories after license violations were confirmed.
Source URLs for corpora frequently become unavailable, resulting in permanent data loss.
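
The first two consequences above can be sketched as a pairwise lookup. This is a minimal illustration under stated assumptions, not the paper's six-tier matrix: the license names, the `CLAUSES` table, and the `combined_license` helper are hypothetical, and the sketch treats any merged dataset as an adaptation, so a NoDerivs source always blocks the merge.

```python
from typing import Optional

# Illustrative clause sets for common CC licenses (assumption: versions ignored).
CLAUSES = {
    "CC0": frozenset(),
    "CC-BY": frozenset({"BY"}),
    "CC-BY-SA": frozenset({"BY", "SA"}),
    "CC-BY-NC": frozenset({"BY", "NC"}),
    "CC-BY-NC-SA": frozenset({"BY", "NC", "SA"}),
    "CC-BY-ND": frozenset({"BY", "ND"}),
    "CC-BY-NC-ND": frozenset({"BY", "NC", "ND"}),
}


def combined_license(a: str, b: str) -> Optional[str]:
    """Return a license the merged dataset could carry, or None if incompatible."""
    ca, cb = CLAUSES[a], CLAUSES[b]
    # Assumption: merging counts as an adaptation, which NoDerivs forbids sharing.
    if "ND" in ca or "ND" in cb:
        return None
    # ShareAlike: the output must keep that exact license, so the other
    # source may not add any condition the ShareAlike license lacks.
    for sa_side, other in ((ca, cb), (cb, ca)):
        if "SA" in sa_side and not other <= sa_side:
            return None
    union = ca | cb
    for name, clauses in CLAUSES.items():
        if clauses == union:
            return name
    return None


if __name__ == "__main__":
    pairs = [("CC-BY", "CC-BY-SA"), ("CC-BY-SA", "CC-BY-NC"), ("CC-BY", "CC-BY-ND")]
    for left, right in pairs:
        print(left, "+", right, "->", combined_license(left, right))
```

The ShareAlike check is deliberately strict: CC-BY-SA combined with CC-BY-NC returns `None` because no single output license can both stay CC-BY-SA and preserve the NonCommercial condition.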

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar license audits applied to corpora for other low-resource languages could reveal comparable hidden incompatibilities.
Following the pre-annotation checklist could reduce legal exposure when researchers create or enrich datasets.
Identifying and combining only the legally clean corpora might enable larger, usable resources for African NLP work.

Load-bearing premise

The legal interpretations of the license terms stated on the original corpus sources and dataset cards are accurate and binding, and the audited corpora are representative of broader use in African NLP.

What would settle it

A detailed legal review that concludes the four documented failure modes do not actually block legal combination, tokenisation, or annotation of the listed corpora.

Original abstract

Creative Commons licenses dominate African NLP corpus releases, but their compatibility rules are rarely applied. CC-BY-SA and CC-BY-NC cannot be combined in a single published dataset; a NoDerivs clause silently prohibits tokenisation and annotation. This paper audits the license provenance of over twenty corpus families used in African NLP, constructs a six-tier compatibility matrix, and applies it to three case-study languages: Kituba/Munukutuba, Zarma, and Moore. Four failure modes are documented with primary-source evidence: outright prohibition (JW300, removed from OPUS after a legal audit confirmed Terms of Service violation); composite license misrepresentation (WAXAL, whose CC-BY 4.0 claim is contradicted by its own HuggingFace dataset card); a NoDerivs clause hidden behind a CC-BY label (Tanzil); and data persistence failure (the Congolese Radio Corpus, where 402 of 405 source URLs are now dead). A pre-annotation due diligence checklist and a survey of legally clean enrichment opportunities close the paper.
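
The data persistence figure in the abstract can be reproduced with a short tally. The sketch below assumes a list of recorded HTTP status codes per source URL; the `link_rot` helper, the `DEAD_STATUSES` set, and the synthetic `recorded` list are hypothetical stand-ins, not the paper's audit data or tooling.

```python
from typing import Iterable, Tuple

# Assumption: these HTTP status codes mean the source is permanently gone.
DEAD_STATUSES = {404, 410}


def link_rot(statuses: Iterable[int]) -> Tuple[int, int, float]:
    """Return (dead, total, dead share) for recorded HTTP status codes."""
    codes = list(statuses)
    if not codes:
        raise ValueError("no status codes recorded")
    dead = sum(1 for code in codes if code in DEAD_STATUSES)
    return dead, len(codes), dead / len(codes)


# Synthetic stand-in matching the counts quoted in the abstract (402 of 405).
recorded = [404] * 402 + [200] * 3
dead, total, share = link_rot(recorded)
print(f"{dead} of {total} source URLs dead ({share:.1%})")
# prints: 402 of 405 source URLs dead (99.3%)
```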

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This audit flags real license conflicts in African NLP datasets with primary-source examples, but the legal readings lack outside confirmation.

The core point is that this paper runs a targeted audit of license compatibility across corpora used in African NLP and turns up four documented failure modes that actually block reuse. The six-tier matrix and the case studies on Kituba/Munukutuba, Zarma, and Moore give a practical way to check combinations before merging data.

What the work does well is stick to named examples with direct references to original corpus pages, Hugging Face cards, and prior removals like JW300 from OPUS. The composite misrepresentation in WAXAL, the hidden NoDerivs in Tanzil, and the 402 dead URLs in the Congolese Radio Corpus are all tied to specific sources rather than general claims. The pre-annotation checklist at the end is a usable output that dataset builders could apply without extra theory.

The soft spot is the reliance on the authors' own parsing of CC terms and ToS without external legal review or jurisdiction checks. The stress-test concern holds here: if fair-use carve-outs, dataset-card errors, or local rules change the picture, the incompatibility conclusions weaken. The paper supplies the source links so others can verify, but that step is left to the reader.

The citation pattern draws on existing license-compatibility literature and applies it to this subfield, which is appropriate for an audit. No equations or fitted models are involved, so the circularity burden stays low.

This paper is for people working on low-resource African languages who need to combine or release data legally. Dataset curators and researchers planning enrichment pipelines will get the most direct value. It deserves peer review because the empirical cases are grounded and the topic affects equitable data access, even if the legal analysis would benefit from added verification in revision.

Referee Report

2 major / 2 minor

Summary. The manuscript audits license provenance and compatibility for over twenty corpus families used in African NLP, constructs a six-tier compatibility matrix based on Creative Commons rules, applies it via case studies to Kituba/Munukutuba, Zarma, and Moore, and documents four failure modes with primary-source evidence: outright prohibition (JW300), composite license misrepresentation (WAXAL), hidden NoDerivs (Tanzil), and data persistence failure (Congolese Radio Corpus). It closes with a pre-annotation due diligence checklist and survey of clean enrichment opportunities.

Significance. If the primary-source readings hold, the work is significant for exposing under-appreciated legal barriers to data combination and reuse in low-resource African language NLP; the concrete examples, matrix, and checklist provide actionable value to practitioners and could reduce downstream incompatibility risks. The empirical audit approach with named sources is a clear strength.

major comments (2)

[Case studies section] Case studies section (failure modes): The incompatibility conclusions for WAXAL (CC-BY 4.0 claim contradicted by dataset card), Tanzil (NoDerivs hidden behind CC-BY), and JW300 (ToS violation) rest entirely on the authors' parsing of primary sources and cards; no reference to CC legal compatibility charts, jurisdiction notes, or external verification is provided, which is load-bearing for the central claim that these constitute documented failure modes.
[Compatibility matrix] Methods / matrix construction: The six-tier compatibility matrix is applied to the case studies, but the manuscript does not detail the exact derivation rules from CC license clauses (e.g., how SA/NC/ND interact across the tiers or handling of composite licenses), preventing independent assessment of whether the matrix accurately reflects CC rules.

minor comments (2)

[Abstract] The abstract states 'over twenty corpus families' but provides no table or appendix listing them with their licenses; adding this would improve traceability of the audit scope.
[Checklist section] The pre-annotation checklist is mentioned as closing the paper but its items are not enumerated in the provided abstract; ensuring it is explicitly listed would aid usability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and commit to revisions that strengthen the evidentiary basis and reproducibility of the analysis.

Referee: [Case studies section] Case studies section (failure modes): The incompatibility conclusions for WAXAL (CC-BY 4.0 claim contradicted by dataset card), Tanzil (NoDerivs hidden behind CC-BY), and JW300 (ToS violation) rest entirely on the authors' parsing of primary sources and cards; no reference to CC legal compatibility charts, jurisdiction notes, or external verification is provided, which is load-bearing for the central claim that these constitute documented failure modes.

Authors: We agree that explicit cross-references to CC compatibility resources would strengthen the presentation. While the claims derive directly from the cited primary license texts and dataset cards (which remain the authoritative sources), the revised manuscript will add citations to the official Creative Commons compatibility charts, license compatibility tables, and any available jurisdiction notes. This addition will provide the requested external verification layer without altering the core findings.
revision: yes

Referee: [Compatibility matrix] Methods / matrix construction: The six-tier compatibility matrix is applied to the case studies, but the manuscript does not detail the exact derivation rules from CC license clauses (e.g., how SA/NC/ND interact across the tiers or handling of composite licenses), preventing independent assessment of whether the matrix accurately reflects CC rules.

Authors: We accept this point. The revised Methods section will include an expanded subsection that explicitly derives each tier from the relevant CC license clauses. This will document the interaction rules for ShareAlike, NonCommercial, and NoDerivatives conditions, as well as the treatment of composite or multi-license corpora, enabling independent verification of the matrix.
revision: yes
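
The clause-level derivation requested here could start from something like the sketch below, which splits a license identifier into its ShareAlike, NonCommercial, and NoDerivatives conditions. It is an assumption-laden illustration: `CCTerms`, `parse_cc`, and `allows_annotation` are hypothetical names, license versions and jurisdictions are ignored, and the sketch follows the paper's reading that tokenisation and annotation count as derivatives.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class CCTerms:
    attribution: bool
    share_alike: bool
    non_commercial: bool
    no_derivatives: bool


def parse_cc(identifier: str) -> CCTerms:
    """Split an identifier such as 'CC BY-NC-ND 4.0' into clause flags."""
    tokens = set(re.split(r"[\s\-_]+", identifier.upper()))
    if "CC0" in tokens:
        return CCTerms(False, False, False, False)
    if "CC" not in tokens:
        raise ValueError(f"not a Creative Commons identifier: {identifier!r}")
    return CCTerms(
        attribution="BY" in tokens,
        share_alike="SA" in tokens,
        non_commercial="NC" in tokens,
        no_derivatives="ND" in tokens,
    )


def allows_annotation(terms: CCTerms) -> bool:
    # Assumption (the paper's reading): annotation creates a derivative work.
    return not terms.no_derivatives


for label in ("CC-BY 4.0", "CC BY-NC-ND 4.0", "CC-BY-SA"):
    terms = parse_cc(label)
    print(label, "->", terms, "| annotation allowed:", allows_annotation(terms))
```

Keeping the clause flags separate from the identifier string makes each tier rule checkable on its own, which is the kind of independent verification the referee asks for.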

Circularity Check

0 steps flagged

Empirical audit with no derivation chain or fitted inputs

The paper performs a primary-source audit of license terms on existing corpora, documents four failure modes with direct evidence from dataset cards and URLs, and applies a compatibility matrix to case studies. No equations, parameters, or predictions are present that could reduce outputs to inputs by construction. The matrix and checklist are presented as tools derived from standard CC rules rather than self-referential fits. Self-citations are absent from the provided text. This matches the default expectation of a non-circular empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

No free parameters, invented entities, or ad-hoc axioms are introduced; the work rests on standard legal interpretations of Creative Commons license compatibility rules as domain assumptions.

axioms (1)

domain assumption: Creative Commons license terms as stated on corpus sources and dataset cards are the authoritative and enforceable versions.
Invoked when classifying licenses and identifying contradictions such as the WAXAL and Tanzil cases.
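
A contradiction of this kind, where one source states a different license from another, can be caught mechanically once the labels are collected. The sketch below is a hedged illustration: `normalise`, `find_mismatches`, the corpus names, and the license values in `records` are all hypothetical placeholders and do not reproduce the actual WAXAL or Tanzil labels.

```python
import re
from typing import Dict, List


def normalise(label: str) -> str:
    """Lower-case a license label and collapse punctuation and spaces."""
    return re.sub(r"[^a-z0-9]+", "-", label.lower()).strip("-")


def find_mismatches(records: Dict[str, Dict[str, str]]) -> List[str]:
    """Return corpora whose sources state different licenses after normalising."""
    flagged = []
    for corpus, labels in records.items():
        if len({normalise(value) for value in labels.values()}) > 1:
            flagged.append(corpus)
    return flagged


# Placeholder records: where each license label was found, per corpus.
records = {
    "corpus-a": {"paper": "CC-BY 4.0", "dataset_card": "cc-by-4.0"},
    "corpus-b": {"paper": "CC-BY 4.0", "dataset_card": "CC BY-NC 4.0"},
}
print(find_mismatches(records))
# prints: ['corpus-b']
```

A flagged corpus still needs a manual read of the primary sources; the check only shows that the labels disagree, not which one is binding.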

pith-pipeline@v0.9.1-grok · 5707 in / 1247 out tokens · 29728 ms · 2026-06-30T09:47:30.985299+00:00 · methodology

Reference graph

Works this paper leans on

17 extracted references · 3 canonical work pages · 2 internal anchors

[1]

For high-resource languages, this is ordinarily not a problem

Introduction NLP researchers are not lawyers. For high-resource languages, this is ordinarily not a problem. Corpora in common use have been legally vetted over decades of practice. For low-resource African languages, neither condition holds. Since 2019, parallel corpora, NER datasets, sentiment benchmarks, and speech resources have been published for dozens...

2019
[2]

Related Work Legal scholarship on open licensing. Creative Commons licensing has attracted sustained scholarly critique. Katz (2006) identifies two structural problems: variant proliferation creates user confusion, and ShareAlike terms create compatibility deadlocks that prevent legal distribution of derivatives...

work page internal anchor Pith review Pith/arXiv arXiv 2006
[3]

License Taxonomy I define six license tiers for African NLP text corpora, ordered from least to most restrictive. For non-specialist readers: NC (Non-Commercial) means the resource may not be used for revenue-generating purposes as defined by the license; what constitutes commercial use is context-dependent and jurisdiction-sensitive, but publishing an annotated ...
[4]

“×” denotes an incompatible combination: no single license can satisfy both sources’ requirements simultaneously

License Compatibility Matrix Table 2 shows the legally valid output license when two corpus sources are combined. “×” denotes an incompatible combination: no single license can satisfy both sources’ requirements simultaneously. A note on provenance quality independent of license tier: the compatibility matrix captures out- put license requirements, not th...

2020
[5]

Note that MasakhaNER’s HuggingFace dataset card lists CC BY-NC 4.0 for the dataset release; the source-text licensing is heterogeneous

uses Wikipedia text in its annotation pipeline. Note that MasakhaNER’s HuggingFace dataset card lists CC BY-NC 4.0 for the dataset release; the source-text licensing is heterogeneous. Practitioners should verify the specific version they use. The important point is that license decisions must be made before annotation begins. Choosing CC-BY-SA forecloses...
[6]

Various open data

African NLP Corpus Survey Table 4 (Appendix A) surveys the corpus families used in African NLP with their tier assignments; an asterisk (*) marks web-mined corpora where the dataset license covers packaging or database rights, not rights in the underlying text. 5.1. Common Corpus: African Language Representation We streamed the full Common Corpus training...

2025
[7]

Common Corpus is not an independent African-language source: it repackages the same Wikipedia dumps audited in Table 4, with worse provenance metadata

are language-identification errors on French archival documents; none contains usable African-language text. Common Corpus is not an independent African-language source: it repackages the same Wikipedia dumps audited in Table 4, with worse provenance metadata. Researchers should not count both as separate entries. The ratio is approximately 200:1 (En...
[8]

This dataset does not include JW300-derived text or derivatives thereof

Four Failure Modes 6.1. Prohibition: JW300 JW300 (Agić and Vulić, 2019) was a parallel corpus covering 300+ languages, built from the Jehovah’s Witnesses website jw.org. It was widely used in African NLP from 2019 onward due to coverage of languages with no other parallel text. The legal problem is straightforward. The jw.org Terms of Service explicitly prohibit text and...

2019
[9]

The legal constraints are severe, several documented corpora contain license problems, and authentic open-license text for under-resourced African languages is limited

Enrichment Opportunities Within the Open-License Landscape The foregoing analysis could be read pessimistically. The legal constraints are severe, several documented corpora contain license problems, and authentic open-license text for under-resourced African languages is limited. The opposite reading is more productive: identifying the legal constrai...

2012
[10]

Avoid or flag: CCMatrix (no stated license), TED2020 (T4b), JW300 (T5), bible-uedin (CC0 claimed; verify per-translation rights)

A Legal Due Diligence Checklist Four steps before annotation begins: Step 1: Inventory sources. Consult: Wikipedia, Leipzig, UDHR, Tatoeba, FLORES-200, FLEURS, WAXAL (per-provider), WURA, eBible.org (per-translation), African Storybook (per-story), TICO-19, Common Voice (per-subset), MT560/HuggingFace (with provenance caveat), OPUS (excluding JW300-deriv...
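
The inventory step and the "avoid or flag" list quoted above can be sketched as a simple screen over a project's source list. The corpus names and reasons come from the excerpt; the `FLAGGED` table, the `screen_inventory` helper, and the sample `inventory` are hypothetical, and the sketch covers only this one step, since the rest of the checklist is truncated in the excerpt.

```python
from typing import Dict, Iterable

# Corpora the excerpt says to avoid or flag, with the stated reason.
FLAGGED = {
    "ccmatrix": "no stated license",
    "ted2020": "tier T4b",
    "jw300": "tier T5",
    "bible-uedin": "CC0 claimed; verify per-translation rights",
}


def screen_inventory(sources: Iterable[str]) -> Dict[str, str]:
    """Map each flagged source in the inventory to the reason it was flagged."""
    hits = {}
    for name in sources:
        key = name.strip().lower()
        if key in FLAGGED:
            hits[name] = FLAGGED[key]
    return hits


# Hypothetical project inventory.
inventory = ["Wikipedia", "Tatoeba", "JW300", "CCMatrix", "FLORES-200"]
for name, reason in screen_inventory(inventory).items():
    print(f"{name}: {reason}")
```

Sources that pass this screen are not thereby cleared; several entries in the consult list carry per-provider or per-translation caveats that still need a manual check.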
[11]

The compatibility matrix (Table 2) requires no legal expertise: it is a lookup table

Discussion None of the errors documented here was wilful; each was a legal assumption that NLP practice gave no reason to question explicitly. The compatibility matrix (Table 2) requires no legal expertise: it is a lookup table. Tier assignments require a one-time provenance check per corpus. The checklist requires discipline. A common concern is that Wiki...
[12]

JW300 was used because it existed and seemed open

Conclusion The four case studies share a pattern: a legal assumption was made implicitly that would not have survived explicit examination. JW300 was used because it existed and seemed open. Tanzil was treated as CC-BY because that is what the label said. WAXAL’s per-provider terms were not traced because the arXiv paper did not prompt it. The CRC’s You...

2026
[13]

Bibliographical References David Ifeoluwa Adelani, Jade Abbott, et al. 2021. MasakhaNER: Named entity recognition for African languages. In Transactions of the Association for Computational Linguistics, volume 9, pages 1116–1131. MIT Press. David Ifeoluwa Adelani, Graham Carr, et al. 2022. MasakhaNER 2.0: Africa-centric transfer learning for named enti...

2021
[14]

arXiv preprint arXiv:2602.02734

Association for Computational Linguistics. Željko Agić and Ivan Vulić. 2019. JW300: A wide-coverage parallel corpus for low-resource languages. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3204–3210. Association for Computational Linguistics. Marta Bañón, Pinzhen Chen, Barry Haddow, Kenneth Heafiel...

work page arXiv 2019
[15]

In Proceedings of the First Workshop on Systematic Biases in MT Research

NTREX-128 – news test references for MT evaluation of 128 languages. In Proceedings of the First Workshop on Systematic Biases in MT Research. Timnit Gebru et al. 2021. Datasheets for datasets. Communications of the ACM, 64(12):86–92. GoingDutch.ai. 2024. GEITje takedown. https://goingdutch.ai/nl/posts/geitje-takedown/. Accessed February 2026. Dirk Goldha...

2021
[16]

Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training

Data governance in the age of large-scale data-driven language technology. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 2206–2222. Zachary Katz. 2006. Pitfalls of open licensing: An analysis of creative commons licensing. IDEA: The Intellectual Property Law Review, 46(3). Julia Kreutzer et al. 2022. Quali...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[17]

2020. Congolese Radio Corpus (CRC) for Lingala. Data persistence failure. Originally claimed hundreds of hours of YouTube broadcast audio

Language Resource References Wheatley, Julian and others. 2020. Congolese Radio Corpus (CRC) for Lingala. Data persistence failure. Originally claimed hundreds of hours of YouTube broadcast audio. Audit conducted February 2026 found 402 of 405 YouTube IDs dead (404 errors). Reproducible content: approximately 8.3 hours elicited LRSC speech (IPA-transcribed) + a...

2020
