NaijaS2ST: A Multi-Accent Benchmark for Speech-to-Speech Translation in Low-Resource Nigerian Languages
Pith reviewed 2026-05-10 07:06 UTC · model grok-4.3
The pith
NaijaS2ST supplies parallel speech data for English and four Nigerian languages to compare translation models under real accent variation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
NaijaS2ST demonstrates that audio LLMs prompted with a few examples outperform fine-tuned cascaded and end-to-end models on speech-to-text translation for these languages, yet cascaded and audio-LLM systems produce comparable results on speech-to-speech translation, leaving clear room for improvement in targeted speech-to-speech architectures.
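The few-shot setup this claim refers to can be illustrated schematically. A minimal sketch, assuming a chat-style audio-LLM interface that accepts interleaved audio references and text; the message schema, file names, and the `build_fewshot_prompt` helper are hypothetical illustrations, not the paper's actual implementation:

```python
# Hypothetical sketch of few-shot prompting for speech-to-text translation.
# The message schema and audio-reference format are illustrative only.

def build_fewshot_prompt(examples, query_audio, src_lang="Igbo", tgt_lang="English"):
    """Interleave (audio, translation) example pairs before the query audio."""
    messages = [{
        "role": "system",
        "content": f"Translate the {src_lang} speech into {tgt_lang} text.",
    }]
    for audio_path, translation in examples:
        messages.append({"role": "user", "content": {"audio": audio_path}})
        messages.append({"role": "assistant", "content": translation})
    # The query utterance comes last; the model continues with its translation.
    messages.append({"role": "user", "content": {"audio": query_audio}})
    return messages

prompt = build_fewshot_prompt(
    examples=[("ex1.wav", "Good morning."), ("ex2.wav", "How is the market?")],
    query_audio="query.wav",
)
```

The contrast with the fine-tuned baselines is that nothing here updates model weights; the target-language examples enter only through the prompt.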
What carries the argument
The NaijaS2ST dataset itself, which supplies multi-speaker, multi-accent parallel speech pairs across the four target languages and enables controlled comparisons among cascaded, end-to-end, and few-shot audio-LLM translation paradigms.
If this is right
- Few-shot audio LLMs become a practical starting point for speech-to-text translation in other low-resource African languages.
- Direct speech-to-speech systems require new training objectives or architectures beyond current cascaded or LLM routes.
- The multi-accent nature of the data supports development of translation models that generalize across regional varieties within each language.
- Future work can use the same splits to isolate the effect of accent diversity on translation quality.
Where Pith is reading between the lines
- The benchmark could be extended to zero-shot or cross-lingual prompting to test how far the audio-LLM advantage generalizes without any target-language examples.
- Similar datasets for other language families might reveal whether the observed gap between speech-to-text and speech-to-speech performance is specific to tonal or pitch-accent languages.
- The dataset size and speaker diversity make it suitable for studying how accent mismatch between training and test speakers affects each modeling approach.
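The accent-mismatch study suggested above amounts to grouping per-utterance scores by whether the test speaker's accent appeared in training. A minimal stdlib sketch under that assumption; the records, accent labels, and scores are invented for illustration and do not come from the paper:

```python
from collections import defaultdict
from statistics import mean

# Invented per-utterance results; in practice these would come from the
# benchmark's evaluation output, with real accent labels and metric scores.
results = [
    {"accent": "Anambra", "seen_in_training": True,  "score": 31.2},
    {"accent": "Anambra", "seen_in_training": True,  "score": 28.7},
    {"accent": "Enugu",   "seen_in_training": False, "score": 22.4},
    {"accent": "Enugu",   "seen_in_training": False, "score": 25.1},
]

def score_by_condition(records):
    """Average metric score for seen- vs unseen-accent test utterances."""
    buckets = defaultdict(list)
    for r in records:
        buckets["seen" if r["seen_in_training"] else "unseen"].append(r["score"])
    return {cond: round(mean(scores), 2) for cond, scores in buckets.items()}

gap = score_by_condition(results)
# A large seen-minus-unseen gap would indicate accent-mismatch sensitivity.
```

Running the same aggregation separately for each modeling paradigm would show which approach degrades most on unseen accents.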
Load-bearing premise
The collected recordings and speaker splits are assumed to capture realistic accent and speaker variation without introducing hidden biases that would favor one modeling paradigm over another.
What would settle it
Re-running the same model comparisons on a fresh test set of speakers and accents drawn entirely from outside the original NaijaS2ST collection and measuring whether the performance ordering between audio LLMs and cascaded systems reverses.
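That check can be made concrete with a paired bootstrap over per-utterance scores on the fresh test set: resample utterances with replacement and count how often the audio-LLM system still beats the cascaded one. A minimal sketch, assuming paired per-utterance scores are available for both systems; the scores below are invented:

```python
import random
from statistics import mean

def ordering_stability(scores_a, scores_b, n_boot=2000, seed=0):
    """Fraction of paired bootstrap resamples in which system A's mean
    score exceeds system B's. Values near 1.0 mean the ordering is stable;
    values near 0.5 mean it could easily reverse on a new sample."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # paired resample
        if mean(scores_a[i] for i in idx) > mean(scores_b[i] for i in idx):
            wins += 1
    return wins / n_boot

# Invented per-utterance scores on a hypothetical held-out speaker set.
audio_llm = [27.0, 31.5, 24.8, 29.9, 26.3, 30.1]
cascaded  = [25.2, 28.9, 25.5, 27.4, 24.0, 28.8]
stability = ordering_stability(audio_llm, cascaded)
```

A stability well below 1.0 on out-of-collection speakers would be the reversal signal the question asks about.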
Original abstract
Speech translation for low-resource languages remains fundamentally limited by the scarcity of high-quality, diverse parallel speech data, a challenge that is especially pronounced in African linguistic contexts. To address this, we introduce NaijaS2ST, a parallel speech translation dataset spanning Igbo, Hausa, Yor\`ub\'a, and Nigerian Pidgin paired with English. The dataset comprises approximately 50 hours of speech per language and captures substantial variation in speakers and accents, reflecting realistic multilingual and multi-accent conditions. With NaijaS2ST, we conduct a comprehensive benchmark of cascaded, end-to-end (E2E), and AudioLLM-based approaches across bidirectional translation settings. Our results show that audio LLMs with few-shot examples are more effective for speech-to-text translation than cascaded and end-to-end methods trained on fine-tuned data. However, for speech-to-speech translation, the cascaded and audio LLM paradigms yield comparable performance, indicating that there is still considerable room for improvement in developing targeted, task-specific models for this setting. By providing both a high-quality dataset and a systematic benchmark, we hope that NaijaS2ST will serve as a strong foundation for advancing research in low-resource, multilingual speech translation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces NaijaS2ST, a parallel speech translation dataset covering Igbo, Hausa, Yorùbá, and Nigerian Pidgin paired with English, comprising approximately 50 hours of speech per language with substantial speaker and accent variation to reflect realistic multilingual conditions. It conducts a benchmark of cascaded, end-to-end, and AudioLLM-based approaches for bidirectional speech-to-text and speech-to-speech translation, reporting that few-shot AudioLLMs outperform fine-tuned cascaded and E2E methods on speech-to-text while cascaded and AudioLLM paradigms yield comparable results on speech-to-speech, with room for improvement in task-specific models.
Significance. If the benchmark outcomes prove robust upon verification of methods and splits, the work would be significant for low-resource speech translation: it supplies a high-quality, multi-accent dataset for underrepresented African languages and provides empirical evidence on the relative strengths of AudioLLMs in few-shot versus fine-tuned settings. Its empirical focus, free of circular reasoning or invented parameters, together with the cautious phrasing of the speech-to-speech results, strengthens its value as a foundation for future targeted research.
minor comments (2)
- Abstract: The rendering of 'Yor`ub'a' with backticks appears to be a LaTeX artifact; replace with proper diacritics (Yorùbá) for consistency in the published version.
- Abstract: The claim of 'bidirectional translation settings' would be clearer if the exact language pairs and directions evaluated were enumerated, even at a high level.
Simulated Author's Rebuttal
We thank the referee for their positive summary, recognition of the dataset's significance for low-resource African languages, and recommendation for minor revision. The assessment accurately captures our contributions regarding NaijaS2ST and the benchmark results on AudioLLMs versus cascaded/E2E approaches. No major comments were raised in the report.
Circularity Check
No significant circularity identified
full rationale
The paper is entirely empirical: it introduces the NaijaS2ST parallel speech dataset (approximately 50 hours per language with speaker/accent variation) and reports benchmark results comparing cascaded, end-to-end, and AudioLLM-based translation systems on bidirectional tasks. No equations, derivations, fitted parameters, or self-referential claims appear in the abstract or described content. The strongest claim (AudioLLMs with few-shot examples outperforming fine-tuned methods for speech-to-text, with comparable results for speech-to-speech) rests directly on the experimental outcomes rather than any reduction to inputs by construction. No load-bearing steps match any of the enumerated circularity patterns.