Spoken SQuAD: A Study of Mitigating the Impact of Speech Recognition Errors on Listening Comprehension

Chia-Hsuan Li; Chi-Liang Liu; Hung-yi Lee; Szu-Lin Wu

arxiv: 1804.00320 · v1 · pith:NTZJQ42Fnew · submitted 2018-04-01 · 💻 cs.CL

Chia-Hsuan Li, Szu-Lin Wu, Chi-Liang Liu, Hung-yi Lee

classification 💻 cs.CL

keywords: comprehension, spoken, content, impact, SQuAD, errors, listening, machine

Reading comprehension has been widely studied. One of the most representative reading comprehension tasks is the Stanford Question Answering Dataset (SQuAD), on which machines are already comparable with humans. On the other hand, accessing large collections of multimedia or spoken content is much more difficult and time-consuming for humans than accessing plain text. It is therefore highly attractive to develop machines that can automatically understand spoken content. In this paper, we propose a new listening comprehension task, Spoken SQuAD. On the new task, we find that speech recognition errors have a catastrophic impact on machine comprehension, and we propose several approaches to mitigate that impact.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ProactiveLLM: Learning Active Interaction for Streaming Large Language Models
cs.CL 2026-05 unverdicted novelty 6.0

ProactiveLLM enables active interaction in streaming LLMs by learning semantic sufficiency cues from partial inputs through mask-based modeling and synchronized privileged self-distillation without external supervision.

End-to-end Contrastive Language-Speech Pretraining Model For Long-form Spoken Question Answering
cs.SD 2025-11 unverdicted novelty 5.0

CLSR is an end-to-end contrastive language-speech retriever using an intermediate text-like conversion step to improve retrieval of relevant segments from long audio for spoken question answering.