
arxiv: 2309.08105 · v2 · pith:K5HL2TGKnew · submitted 2023-09-15 · 📡 eess.AS · cs.SD

Libriheavy: a 50,000 hours ASR corpus with punctuation casing and context


In this paper, we introduce Libriheavy, a large-scale ASR corpus consisting of 50,000 hours of read English speech derived from LibriVox. To the best of our knowledge, Libriheavy is the largest freely available corpus of speech with supervisions. Unlike other open-source datasets that provide only normalized transcriptions, Libriheavy contains richer information such as punctuation, casing, and text context, which brings more flexibility for system building. Specifically, we propose a general and efficient pipeline to locate, align, and segment the audio in the previously published Librilight to its corresponding texts. Like Librilight, Libriheavy has three training subsets, small, medium, and large, of 500, 5,000, and 50,000 hours respectively. We also extract the dev and test evaluation sets from the aligned audio and guarantee that they share no speakers or books with the training sets. Baseline systems are built on the popular CTC-Attention and transducer models. Additionally, we open-source our dataset creation pipeline, which can also be used for other audio alignment tasks.
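
To illustrate the flexibility the abstract mentions, the sketch below shows how a normalized transcript could be derived from a cased, punctuated one, whereas the reverse is not possible without the original text. The function name `normalize_transcript` and the normalization rule (upper-case, keep letters, digits, apostrophes, and spaces) are assumptions made for illustration; the abstract does not specify the corpus's own normalization.

```python
import re


def normalize_transcript(text: str) -> str:
    """Hypothetical normalizer: upper-case the text and drop punctuation.

    Keeps A-Z, digits, apostrophes, and spaces; everything else becomes
    a space, and runs of whitespace are collapsed afterwards.
    """
    text = text.upper()
    # Replace every run of disallowed characters with a single space.
    text = re.sub(r"[^A-Z0-9' ]+", " ", text)
    # Collapse repeated whitespace and trim the ends.
    return " ".join(text.split())


if __name__ == "__main__":
    cased = "Hello, Mr. Brown! It's nearly five o'clock."
    print(normalize_transcript(cased))
```

A system trained on normalized text can apply such a step to the richer transcripts, while a system that predicts punctuation and casing can use them unchanged.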


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. When Misinformation Speaks and Converses: Rethinking Fact-Checking in Audio Platforms

    cs.CL 2026-04 unverdicted novelty 4.0

    Audio misinformation requires rethinking fact-checking pipelines due to its spoken and conversational properties that traditional text-based methods overlook.