pith. machine review for the scientific record.

arxiv: 2604.08448 · v1 · submitted 2026-04-09 · 💻 cs.CL

Recognition: 1 theorem link · Lean Theorem

AfriVoices-KE: A Multilingual Speech Dataset for Kenyan Languages

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:58 UTC · model grok-4.3

classification 💻 cs.CL
keywords multilingual speech dataset · Kenyan languages · automatic speech recognition · text-to-speech · low-resource languages · speech data collection · Dholuo · Kikuyu · Kalenjin · Maasai · Somali

The pith

AfriVoices-KE supplies about 3,000 hours of audio across five Kenyan languages to support speech technology.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AfriVoices-KE as a large multilingual speech dataset covering Dholuo, Kikuyu, Kalenjin, Maasai, and Somali. Data comes from 4,777 native speakers and splits into 750 hours of scripted speech drawn from text sources plus 2,250 hours of spontaneous speech gathered via prompts. Collection used a mobile app and multiple quality checks to handle low-resource conditions such as unreliable infrastructure and trust barriers. The work aims to reduce the underrepresentation of these languages in automatic speech recognition and text-to-speech systems while supporting digital preservation of Kenyan linguistic heritage.

Core claim

AfriVoices-KE is a multilingual speech dataset comprising approximately 3,000 hours of audio across five Kenyan languages, with 750 hours scripted and 2,250 hours spontaneous, collected from 4,777 speakers through dual methodologies and multi-layer quality assurance to serve as a foundational resource for inclusive speech technologies.

What carries the argument

Dual collection methodology that pairs scripted recordings from compiled text corpora with spontaneous speech elicited by textual and image prompts, supported by a mobile app and automated plus human quality validation.
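The automated quality gate in this pipeline can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation: the paper states only that signal-to-noise validation runs before recording, so the noise-window layout and the 15 dB threshold here are assumptions.

```python
import numpy as np

def snr_db(signal: np.ndarray, noise: np.ndarray) -> float:
    """Estimate signal-to-noise ratio in dB from a speech segment and a noise-only segment."""
    p_signal = np.mean(signal.astype(np.float64) ** 2)
    p_noise = np.mean(noise.astype(np.float64) ** 2)
    return 10.0 * np.log10(p_signal / p_noise)

def passes_noise_check(waveform: np.ndarray, noise_window: int = 1600,
                       threshold_db: float = 15.0) -> bool:
    """Gate a recording on estimated SNR, mirroring an automated pre-recording noise check.

    Assumes the first `noise_window` samples (0.1 s at 16 kHz) capture room
    ambience before the speaker starts; both numbers are illustrative.
    """
    noise = waveform[:noise_window]
    speech = waveform[noise_window:]
    return snr_db(speech, noise) >= threshold_db
```

A gate like this rejects recordings made next to a running generator or a busy road before any human reviewer sees them, which is why the human layer can focus on content accuracy.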

If this is right

  • Automatic speech recognition systems can be developed for Dholuo, Kikuyu, Kalenjin, Maasai, and Somali.
  • Text-to-speech tools become feasible for everyday Kenyan communication needs.
  • The resource supports study and preservation of dialectal differences in natural speech.
  • Future work can extend the same dual-method approach to other underrepresented languages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The spontaneous portion may help models handle informal or accented speech better than scripted-only data.
  • Local partnerships used for collection could serve as a template for community-driven datasets elsewhere.
  • The eleven domain areas in the scripted texts may allow targeted applications in health, agriculture, or education.

Load-bearing premise

The recordings and annotations capture real linguistic variation and dialectal nuance at high enough quality to train useful speech systems despite collection challenges.

What would settle it

Training an automatic speech recognizer on this dataset and comparing its word error rate against models trained on prior, smaller Kenyan-language corpora: a clear improvement would support the core claim, while a word error rate no better than those baselines would undercut it.
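That comparison turns on word error rate. A minimal WER implementation for orientation only; real evaluations also normalize text and handle the code-switched terms the dataset annotates:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count,
    computed as Levenshtein distance over whitespace-split words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[-1][-1] / max(len(ref), 1)
```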

Figures

Figures reproduced from arXiv: 2604.08448 by Alfred Omondi Otom, Andrew Kipkebut, Angela Wambui Kanyi, Brian Gichana Omwenga, Ciira wa Maina, Cynthia Amol, Edward Ombui, Edwin Onkoba, Hope Kerubo, Ian Ndung'u Kang'ethe, Joseph Muguro, Leila Misula, Lilian Wanzare, Nelson Odhiambo, Rennish Mboya, Vivian Oloo, zekiel Maina.

Figure 1
Figure 1: Geographical distribution of the five Kenyan languages covered in AfriVoices-KE. Dholuo (ISO 639-3: luo) is a River-Lake Nilotic language spoken by over 4.2 million people in western Kenya, with additional speakers in northern Tanzania (Linguistic Data Consortium, 2020). The language exhibits two principal dialects, Milambo (southern) and Nyandwat (northern), reflecting geographical variation across coun…
Figure 2
Figure 2: The Custom Voice Collection App: scripted speech recording workflow, illustrating sentence selection, noise check, recording and playback, and submission confirmation.
Figure 3
Figure 3: The Custom Voice Collection App: unscripted speech recording workflow, illustrating prompt selection, noise check, recording and playback, and submission confirmation.
Figure 4
Figure 4: Transcription workflow in the Custom App.
Figure 5
Figure 5: Distribution of language contributors across counties in Kenya.
Original abstract

AfriVoices-KE is a large-scale multilingual speech dataset comprising approximately 3,000 hours of audio across five Kenyan languages: Dholuo, Kikuyu, Kalenjin, Maasai, and Somali. The dataset includes 750 hours of scripted speech and 2,250 hours of spontaneous speech, collected from 4,777 native speakers across diverse regions and demographics. This work addresses the critical underrepresentation of African languages in speech technology by providing a high-quality, linguistically diverse resource. Data collection followed a dual methodology: scripted recordings drew from compiled text corpora, translations, and domain-specific generated sentences spanning eleven domains relevant to the Kenyan context, while unscripted speech was elicited through textual and image prompts to capture natural linguistic variation and dialectal nuances. A customized mobile application enabled contributors to record using smartphones. Quality assurance operated at multiple layers, encompassing automated signal-to-noise ratio validation prior to recording and human review for content accuracy. Though the project encountered challenges common to low-resource settings, including unreliable infrastructure, device compatibility issues, and community trust barriers, these were mitigated through local mobilizers, stakeholder partnerships, and adaptive training protocols. AfriVoices-KE provides a foundational resource for developing inclusive automatic speech recognition and text-to-speech systems, while advancing the digital preservation of Kenya's linguistic heritage.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces AfriVoices-KE, a multilingual speech dataset of approximately 3,000 hours of audio across five Kenyan languages (Dholuo, Kikuyu, Kalenjin, Maasai, Somali) collected from 4,777 native speakers. It describes a dual methodology of scripted recordings from compiled corpora/translations/domain-specific sentences and spontaneous speech elicited via textual/image prompts, implemented through a mobile app, with multi-layer quality assurance via automated SNR validation and human content review. The work addresses underrepresentation of these languages in speech technology and discusses mitigation of low-resource challenges such as infrastructure and trust barriers.

Significance. If the quality and representativeness claims hold, the dataset would be a valuable addition to low-resource speech resources, given its scale, inclusion of spontaneous speech for natural variation, and focus on Kenyan languages. Open release of such data supports development of inclusive ASR and TTS systems and aids digital preservation efforts. The direct data-collection focus is a strength for reproducibility in the field.

major comments (3)
  1. Abstract: The claim that the dual collection methodology and multi-layer QA produced high-quality, representative data capturing linguistic variation is not supported by any quantitative outcomes such as SNR distributions, rejection rates, inter-reviewer agreement, or post-QA verified hour counts per language.
  2. Data Collection section (or equivalent): No per-language breakdown of the 750 scripted vs. 2,250 spontaneous hours is provided, nor details on how prompt-induced artifacts were avoided in spontaneous recordings, which is load-bearing for the claim of capturing dialectal nuances.
  3. Quality Assurance and Challenges sections: Mitigation strategies for infrastructure/trust issues are described but without evidence of effectiveness (e.g., participation rates, before/after metrics, or demographic tables showing age/gender/region coverage per language), leaving the representativeness assertion unverified.
minor comments (2)
  1. Add explicit references to comparable African speech datasets (e.g., in Related Work) to better situate the contribution.
  2. Clarify the exact number of domains covered in scripted text and any domain-specific statistics.
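The inter-reviewer agreement the referee asks for is typically reported as Cohen's kappa over accept/reject decisions. A minimal sketch; the two-reviewer setup and the label vocabulary are assumptions, since the paper does not specify its review protocol:

```python
from collections import Counter

def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Chance-corrected agreement between two reviewers over the same items."""
    n = len(labels_a)
    assert n == len(labels_b) and n > 0
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement if both reviewers labelled independently at their base rates.
    expected = sum(counts_a[k] * counts_b[k]
                   for k in counts_a.keys() | counts_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used a single label throughout
    return (observed - expected) / (1.0 - expected)
```

Raw percent agreement overstates reliability when most clips are accepted; kappa discounts the agreement two reviewers would reach by chance alone.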

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough and constructive review. The comments identify key areas where additional quantitative evidence would strengthen the manuscript's claims about data quality and representativeness. We address each major comment below and outline the revisions we will make.

Point-by-point responses
  1. Referee: Abstract: The claim that the dual collection methodology and multi-layer QA produced high-quality, representative data capturing linguistic variation is not supported by any quantitative outcomes such as SNR distributions, rejection rates, inter-reviewer agreement, or post-QA verified hour counts per language.

    Authors: We agree that the abstract asserts high quality without sufficient supporting metrics in the current version. In the revised manuscript, we will tone down the abstract language slightly and add a new subsection in Quality Assurance that reports SNR distributions (mean and range per language), automated and human rejection rates, inter-reviewer agreement (where multiple reviewers were used), and post-QA verified hour counts per language. These figures are available from our internal logs and will be included to substantiate the claims. revision: yes

  2. Referee: Data Collection section (or equivalent): No per-language breakdown of the 750 scripted vs. 2,250 spontaneous hours is provided, nor details on how prompt-induced artifacts were avoided in spontaneous recordings, which is load-bearing for the claim of capturing dialectal nuances.

    Authors: The manuscript currently reports only aggregate hours. We will add a table in the Data Collection section providing the scripted and spontaneous hour counts for each of the five languages. We will also expand the spontaneous speech subsection to describe the prompt design (culturally appropriate open-ended textual and image prompts) and the verification steps taken to reduce artifacts, including manual review for naturalness and dialectal fidelity. These additions will directly address the concern about capturing dialectal nuances. revision: yes

  3. Referee: Quality Assurance and Challenges sections: Mitigation strategies for infrastructure/trust issues are described but without evidence of effectiveness (e.g., participation rates, before/after metrics, or demographic tables showing age/gender/region coverage per language), leaving the representativeness assertion unverified.

    Authors: We acknowledge that the description of mitigation strategies lacks supporting evidence. In revision we will insert a demographic table (age, gender, region) broken down by language and report participation rates achieved via local mobilizers and partnerships. Before/after quantitative metrics for trust-building are not available from our field process; we will instead provide qualitative evidence from project reports on how these strategies enabled collection. The table and rates will be added to the Challenges section. revision: partial
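The per-language rejection rates promised in these responses reduce to a simple aggregation over QA logs. A hypothetical sketch, assuming each log record is a (language, accepted) pair; the actual log schema is not described in the paper:

```python
from collections import defaultdict

def rejection_rates(qa_log) -> dict:
    """Per-language rejection rate from an iterable of (language, accepted) QA records."""
    totals = defaultdict(int)
    rejected = defaultdict(int)
    for language, accepted in qa_log:
        totals[language] += 1
        if not accepted:
            rejected[language] += 1
    return {lang: rejected[lang] / totals[lang] for lang in totals}
```

Run over the automated and human QA stages separately, this yields exactly the per-language, per-stage table the referee's first major comment asks for.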

Circularity Check

0 steps flagged

No circularity: purely descriptive dataset paper with no derivations or predictions

Full rationale

The paper is a direct description of data collection methodology, speaker recruitment, dual scripted/spontaneous protocols, mobile app usage, and multi-layer QA for a new speech corpus. It contains no equations, no fitted parameters, no predictions of downstream performance, and no load-bearing self-citations or uniqueness theorems. All claims about scale (~3000 hours), quality, and diversity are presented as direct outcomes of the described process rather than derived results that reduce to the inputs by construction. This matches the default non-circular case for resource papers.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard domain assumptions about data collection validity in low-resource environments and the effectiveness of described mitigations; no free parameters or new entities are introduced.

axioms (2)
  • domain assumption Mobile-app recordings from prompted native speakers produce representative samples of natural speech and dialectal variation
    Invoked in the description of spontaneous speech elicitation and demographic coverage.
  • domain assumption Automated SNR validation plus human review sufficiently ensures content accuracy and signal quality
    Stated as the multi-layer quality assurance process.

pith-pipeline@v0.9.0 · 5616 in / 1317 out tokens · 59195 ms · 2026-05-10T17:58:48.340524+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

17 extracted references · 3 canonical work pages · 2 internal anchors

  1. [1]

    Introduction The rapid advancement of speech technologies, such as Automatic Speech Recognition (ASR) and Text-To-Speech (TTS) systems, has revolutionized human-computer interaction, enabling applications in healthcare, education, agriculture, and financial services. However, these technologies remain heavily skewed towards high-resource languages like ...

  2. [2]

    Compile a comprehensive text corpus for scripted recordings by collating, translating, and generating sentences in the target languages

  3. [3]

    Develop diverse prompts (text, visual, audio) to elicit spontaneous speech, capturing natural language use and dialectal variations

  4. [4]

    Conduct large-scale data collection using a customized mobile app

  5. [5]

    Transcribe all recordings with high accuracy, annotating code-switched terms and preserving dialectal nuances to enhance dataset utility. By leveraging a crowd-sourced mobile app, rigorous speech quality assessment, and native-speaker transcription with code-switching annotations, AfriVoices-KE surpasses existing Kenyan datasets in volume, divers...

  6. [6]

    Related Work With over 2,000 languages spoken across the continent (Eberhard et al., 2025), African linguistic diversity is vast, yet the lack of robust datasets limits the development of inclusive language technology tools. Data used for training most language models are often sourced from web-crawled sources (Penedo et al., 2023) such as CommonCra...

  7. [7]

    (2024), careful and systematic planning, including ethical considerations, are essential in the design of any data collection initiative

    Preliminary guiding decisions As highlighted by Okorie et al. (2024), careful and systematic planning, including ethical considerations, are essential in the design of any data collection initiative. Four critical decisions were identified as foundational to the process: the selection of languages to be included in the data collection, the appropriate mo...

  8. [8]

    In this photo

    Dataset Curation This section covers the process of curating the dataset from development of the tool, scripted and unscripted data collection and quality control, highlighting the challenges and opportunities. (Footnotes 1–3: https://www.karya.in/, https://digitalumuganda.com/, https://commonvoice.mozilla.org/) 4.1. Data Collection Tool The Custom Voice Collection ...

  9. [9]

    Scripted recordings account for 669 hours (22.3%) and unscripted recordings for 2,336 hours (77.7%), yielding an average unscripted-to-scripted ratio of approximately 3.5:1

    Dataset Description The AfriVoices-KE dataset comprises about 3,000 hours of audio across five languages, collected from 4,677 contributors. Scripted recordings account for 669 hours (22.3%) and unscripted recordings for 2,336 hours (77.7%), yielding an average unscripted-to-scripted ratio of approximately 3.5:1. As shown in Table 1, Kikuyu contributed the hi...

  10. [10]

    Ethical Considerations Ethical practices, including informed consent, participant anonymity, and cultural sensitivity, guided all activities throughout the project. 6.1. Ethical Approval and Consent The AfriVoices-KE project received formal approval from the host institution Review Board and the National level research permit, ensuring compliance. All...

  11. [11]

    The dataset is continuously updated, and users are advised to cite the latest version in publications. (Footnote 4: An exchange rate of approximately Ksh 130 per USD is used throughout)

    Dataset Release and Use The AfriVoices-KE dataset is publicly available on Hugging Face under a CC BY 4.0 license, with access managed through a request form to track usage. The dataset is continuously updated, and users are advised to cite the latest version in publications. (Footnotes 4–5: An exchange rate of approximately Ksh 130 per USD is used throughout; https://huggingface.co/A...)

  12. [12]

    Acknowledgements The authors gratefully acknowledge the contributions of the project staff, language leads, resource persons, respondents, annotators, and linguists, whose collective expertise and dedication were instrumental in the conceptualization, data collection, and successful implementation of this project. We further extend our sincere appreci...

  13. [13]

    References Tejumade Afonja, Chinwe Mbataku, Anuoluwapo Malomo, Opeoluwa Okubadejo, Lucky Francis, Marvelous Nwadike, and Iroro Orife. 2021. SautiDB: Nigerian accent dataset collection. arXiv preprint arXiv:2112.06199. Cynthia Jayne Amol, Everlyn Asiko Chimoto, Rose Delilah Gesicho, Antony M Gitau, Naome A Etori, Caringtone Kinyanjui, Steven Ndung'u, Lawrence M...

  14. [14]

    In Proceedings of the Language Resources and Evaluation Conference (LREC), pages 3148–3155

    BibleTTS: a large, high-fidelity, multi-speaker and multi-lingual corpus of Bible readings for speech synthesis. In Proceedings of the Language Resources and Evaluation Conference (LREC), pages 3148–3155. H. Nigatu, Solomon Teferra Abate, and Martha Yifiru. 2024. The digital presence of African languages: Assessing the gap in large-scale web corpo...

  15. [15]

    In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 9296–9303

    ÌròyìnSpeech: A multi-purpose Yorùbá speech corpus. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 9296–9303. Gold Nmesoma Okorie, Chioma Ann Udeh, Ejuma Martha Adaga, Obinna Donald DaraOjimba, and Osato Itohan Oriekhoe. 2024. Ethical considerations in ...

  16. [16]

    AfriSpeech-200: Pan-African accented speech dataset for clinical and general domain ASR. Transactions of the Association for Computational Linguistics, 11:1669–1685. P. Owego et al. 2025. A comparative study of Dholuo dialects: Kisumu South Nyanza vs Boro-Ukwala. Journal of East African Studies, 19(2):210–225. Vassil Panayotov, Guoguo Chen, et al. 2020. M...

  17. [17]

    In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 7277–7283

    BembaSpeech: A speech recognition corpus for the Bemba language. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 7277–7283. Barack Wanjawa, Lilian Wanzare, Florence Indede, Owen McOnyango, Edward Ombui, and Lawrence Muchemi. 2023. KenCorpus: A Kenyan language corpus of Swahili, Dholuo and Luhya for natural language pro...