pith. machine review for the scientific record. sign in

arxiv: 1808.10583 · v2 · submitted 2018-08-31 · 💻 cs.CL

Recognition: unknown

AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale

Hui Bu, Jiayu Du, Xingyu Na, Xuechen Liu

classification 💻 cs.CL
keywords aishell-2corpusindustrialmandarinresearchcontainingdatahope
0
0 comments X
read the original abstract

AISHELL-1 is by far the largest open-source speech corpus available for Mandarin speech recognition research. It was released with a baseline system containing solid training and testing pipelines for Mandarin ASR. In AISHELL-2, 1000 hours of clean read-speech data from iOS is published, which is free for academic usage. On top of AISHELL-2 corpus, an improved recipe is developed and released, containing key components for industrial applications, such as Chinese word segmentation, flexible vocabulary expension and phone set transformation etc. Pipelines support various state-of-the-art techniques, such as time-delayed neural networks and Lattic-Free MMI objective funciton. In addition, we also release dev and test data from other channels(Android and Mic). For research community, we hope that AISHELL-2 corpus can be a solid resource for topics like transfer learning and robust ASR. For industry, we hope AISHELL-2 recipe can be a helpful reference for building meaningful industrial systems and products.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

    cs.CL 2026-05 unverdicted novelty 7.0

    VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conve...

  2. MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction

    cs.CL 2026-04 unverdicted novelty 6.0

    MiniCPM-o 4.5 uses the Omni-Flow streaming framework to deliver real-time full-duplex omni-modal interaction with proactive behavior in a 9B model that approaches Gemini 2.5 Flash performance.

  3. Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs

    eess.AS 2026-04 unverdicted novelty 6.0

    A multi-stage training method for LLM-based ASR uses new entropy allocation metrics to achieve competitive benchmark performance with 2.3B parameters while mitigating hallucinations via better encoder-LLM decoupling.

  4. FastTurn: Unifying Acoustic and Streaming Semantic Cues for Low-Latency and Robust Turn Detection

    cs.SD 2026-04 unverdicted novelty 6.0

    FastTurn unifies acoustic features and streaming CTC decoding for low-latency, robust turn detection in full-duplex dialogue systems and releases a realistic human-dialogue test set.

  5. Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

    eess.AS 2023-11 unverdicted novelty 6.0

    Qwen-Audio trains a unified model on diverse audio and tasks with hierarchical tags to enable strong zero-shot performance on audio understanding benchmarks and multi-turn audio chat.

  6. Kimi-Audio Technical Report

    eess.AS 2025-04 unverdicted novelty 5.0

    Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million ho...

  7. NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR

    eess.AS 2026-04 unverdicted novelty 4.0

    NIM4-ASR delivers SOTA ASR performance on public benchmarks using a 2.3B-parameter LLM with multi-stage training, real-time streaming, and million-scale hotword customization via RAG.

  8. Qwen2-Audio Technical Report

    eess.AS 2024-07 unverdicted novelty 4.0

    Qwen2-Audio is an open-source audio-language model that outperforms prior systems such as Gemini-1.5-pro on audio-centric instruction-following benchmarks after simplified prompt-based pre-training and expanded data.