TaigiSpeech: A Low-Resource Real-World Speech Intent Dataset and Preliminary Results with Scalable Data Mining In-the-Wild

Chien-Cheng Chen; Huang-Cheng Chou; Hung-yi Lee; James Glass; Kai-Wei Chang; Shrikanth Narayanan; Wenze Ren; Yi-Cheng Lin; Yuan-Fu Liao; Yu-Han Huang

arxiv: 2603.21478 · v2 · pith:4OAJCZXPnew · submitted 2026-03-23 · 💻 cs.CL · cs.LG · eess.AS


Kai-Wei Chang, Yi-Cheng Lin, Huang-Cheng Chou, Wenze Ren, Yu-Han Huang, Yun-Shao Tsai, Chien-Cheng Chen, Yu Tsao, Yuan-Fu Liao, Shrikanth Narayanan, James Glass, Hung-yi Lee


classification 💻 cs.CL cs.LG eess.AS

keywords dataset, data, low-resource, taigispeech, intent, languages, mining, speech



Speech technologies have advanced rapidly and serve diverse populations worldwide. However, many languages remain underrepresented due to limited resources. In this paper, we introduce TaigiSpeech, a real-world speech intent dataset in Taiwanese Taigi (also known as Taiwanese Hokkien or Southern Min), a low-resource and primarily spoken language. The dataset is collected from older adults and comprises 21 speakers with a total of 3k utterances. It is designed for practical intent detection scenarios, including healthcare and home assistant applications. To address the scarcity of labeled data, we explore two data mining strategies with two levels of supervision: (1) keyword-match data mining with LLM pseudo-labeling via an intermediate language, and (2) an audio-visual framework that leverages multimodal cues with minimal textual supervision. This design enables scalable dataset construction for low-resource and unwritten spoken languages. TaigiSpeech will be released under the CC BY 4.0 license to facilitate broad adoption and research on low-resource and unwritten languages. The project website and the dataset are available at https://kwchang.org/taigispeech.
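The first mining strategy named in the abstract (keyword matching followed by LLM pseudo-labeling via an intermediate language) can be pictured as a two-stage filter. The sketch below is only an illustration under stated assumptions, not the authors' pipeline: the `Segment` type, the `INTENT_KEYWORDS` table, the intent names, and the `first_hit_labeler` stand-in are all hypothetical, and the abstract does not specify how intermediate-language transcripts are obtained or which LLM is used.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Segment:
    audio_id: str    # hypothetical identifier of a mined audio clip
    transcript: str  # text in the intermediate language (assumed to be available)


# Hypothetical intent -> keyword table; the entries are illustrative, not from the paper.
INTENT_KEYWORDS: Dict[str, List[str]] = {
    "call_for_help": ["help", "emergency"],
    "lights_on": ["turn on the light"],
}

# A labeler takes (transcript, candidate intents) and returns one intent or None to reject.
Labeler = Callable[[str, List[str]], Optional[str]]


def keyword_candidates(
    segments: List[Segment], keywords: Dict[str, List[str]]
) -> List[Tuple[Segment, List[str]]]:
    """Stage 1: keep segments whose transcript contains at least one keyword."""
    candidates = []
    for seg in segments:
        text = seg.transcript.lower()
        hits = [
            intent
            for intent, kws in keywords.items()
            if any(kw in text for kw in kws)
        ]
        if hits:
            candidates.append((seg, hits))
    return candidates


def pseudo_label(
    candidates: List[Tuple[Segment, List[str]]], labeler: Labeler
) -> Dict[str, str]:
    """Stage 2: let a labeler (a stand-in for an LLM call) confirm or reject each candidate."""
    labeled = {}
    for seg, hits in candidates:
        label = labeler(seg.transcript, hits)
        if label is not None:
            labeled[seg.audio_id] = label
    return labeled


def first_hit_labeler(transcript: str, hits: List[str]) -> Optional[str]:
    """Placeholder for an LLM: accepts the first keyword-matched intent."""
    return hits[0] if hits else None
```

Keeping the labeler as a plain callable means the keyword stage can be exercised without any model, and a real LLM client could be substituted later without changing the filtering logic.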



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models
eess.AS 2026-04 unverdicted novelty 7.0

Semantic-level and verification-based uncertainty methods outperform token-level baselines for audio reasoning in ALLMs, but their relative performance on hallucination and unanswerable-question benchmarks is model- a...