pith. sign in

arxiv: 2606.01016 · v1 · pith:NDYZDGHBnew · submitted 2026-05-31 · 💻 cs.CL · cs.AI· eess.AS

PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects

Pith reviewed 2026-06-28 17:42 UTC · model grok-4.3

classification 💻 cs.CL cs.AIeess.AS
keywords speech understandingmultilingual benchmarkdialectslow-resource languagesend-to-end modelschain-of-thoughtparalinguistic cuessynthetic speech
0
0 comments X

The pith

PolySpeech-100 benchmark reveals open-source E2E speech models outperform cascade systems on heavy dialects by preserving paralinguistic cues.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a new benchmark covering 110 language variants to test speech understanding beyond simple transcription. It evaluates 22 models and finds that direct end-to-end audio processing helps open-source models handle dialects better than pipelines that first convert speech to text. The work also shows open-source models drop sharply on low-resource languages while commercial ones hold up, and that chain-of-thought prompting often hurts results in zero-shot speech tasks. A sympathetic reader cares because current evaluation methods miss these modality-specific strengths and weaknesses. The benchmark uses a hybrid pipeline of real recordings plus synthetic speech to reach this scale.

Core claim

PolySpeech-100 is a benchmark for native-level speech comprehension across 110 linguistic variants that uses a hybrid construction pipeline to combine gold-standard human recordings with instruction-driven synthetic speech. Evaluation of 22 models shows open-source E2E models outperform cascade ASR-plus-LLM systems on heavy dialects because direct audio processing retains prosodic features and paralinguistic cues lost in transcription; open-source models exhibit catastrophic degradation on low-resource languages while commercial models remain robust; and chain-of-thought prompting degrades performance under standard zero-shot settings, indicating a modality alignment gap.

What carries the argument

The hybrid construction pipeline that augments human recordings with synthetic speech to reach 110 variants while targeting native-level semantic comprehension.

If this is right

  • Direct audio input is required to capture dialect-specific intonation and stress that text transcription discards.
  • Open-source speech models need targeted improvements for low-resource languages to close the gap with commercial systems.
  • Standard chain-of-thought prompting should be avoided in zero-shot speech understanding tasks because it reduces accuracy.
  • Future model development must prioritize architectures that align audio and language modalities without relying on intermediate text.
  • Benchmarks limited to high-resource languages and ASR miss critical failure modes in real-world multilingual use.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training regimes that explicitly optimize for prosody retention could widen the advantage of E2E models on dialects.
  • The observed degradation on low-resource languages suggests that scaling laws for speech may require language-specific data balancing rather than uniform scaling.
  • If the modality alignment gap persists, hybrid systems may need separate audio-native reasoning heads instead of borrowing from text LLMs.
  • Public release of the benchmark data allows direct testing of whether new models close the low-resource gap without synthetic artifacts.

Load-bearing premise

The synthetic speech added to human recordings accurately represents native comprehension without introducing artifacts that change how models rank against each other.

What would settle it

If re-running the same models on an all-human-recorded subset of the same tasks reverses the ranking between E2E and cascade systems on dialects, the claim that direct audio processing is the decisive factor would not hold.

Figures

Figures reproduced from arXiv: 2606.01016 by Lu Fan, Shiwei Wu, Shulan Ruan, Sicheng Yang, You He, Yu Liu, Zhi Li.

Figure 1
Figure 1. Figure 1: Geographic distribution of PolySpeech-100. The points represent specific languages and dialects included in the [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The data construction pipeline of PolySpeech-100. The framework consists of three stages: multi-source data aggrega [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Performance benchmarking of various audio models across diverse linguistic scenarios. The radar charts illustrate [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: While a general negative correlation exists between audio [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 4
Figure 4. Figure 4: Performance analysis of 4 Speech-LLMs across vary [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Analysis of prediction bias and prompt sensitivity across three linguistic groups. The bar charts display the selection [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: The t-SNE visualization of multilingual speech em [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: The user interface of the PolySpeech-100 interactive [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: The standard text system prompt used for models [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: The Chain-of-Thought (CoT) system prompt dis [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Confusion matrices of 4 models across all test samples. The matrices are row-normalized to show the recall rate [PITH_FULL_IMAGE:figures/full_fig_p015_10.png] view at source ↗
Figure 12
Figure 12. Figure 12: The model successfully deduces that A is NOT the [PITH_FULL_IMAGE:figures/full_fig_p016_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Despite perfectly sound logic that explicitly vali [PITH_FULL_IMAGE:figures/full_fig_p016_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Comparison of Ground Truth and ASR Transcrip [PITH_FULL_IMAGE:figures/full_fig_p018_14.png] view at source ↗
read the original abstract

While End-to-End (E2E) Speech-Large Language Models (Speech-LLMs) are rapidly evolving, their evaluation methodologies remain limited to the era of simple transcription. Existing benchmarks suffer from three critical limitations: a pronounced bias towards high-resource languages, a focus on low-level recognition (ASR) rather than semantic reasoning, and a neglect of regional dialects. To bridge this gap, we introduce PolySpeech-100, a massive-scale benchmark designed to assess `native-level' speech comprehension across 110 linguistic variants. We employ a novel hybrid construction pipeline that augments gold-standard human recordings with instruction-driven synthetic speech, allowing us to cover 19 distinct Chinese dialects and over 80 low-resource languages. Extensive evaluation of 22 state-of-the-art models (including Gemini-3, GPT-Audio, and Qwen2.5-Omni) yields pivotal insights. First, we demonstrate that open-source E2E models outperform Cascade (ASR+LLM) systems on heavy dialects, proving that direct audio processing preserves critical paralinguistic cues and prosodic features (e.g., intonation, stress) that are often lost in standard transcription. Second, we reveal a significant performance gap: while commercial models maintain robustness, open-source models suffer catastrophic degradation on low-resource languages. Finally, counter-intuitively, we observe that under standard zero-shot settings, Chain-of-Thought prompting frequently degrades speech understanding performance for most evaluated models, revealing a potential modality alignment gap in current architectures. PolySpeech-100 establishes a rigorous standard for the next generation of inclusive, omni-capable Speech-LLMs. The data, demo, and code are publicly available at https://github.com/YoungSeng/PolySpeech-100.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces PolySpeech-100, a benchmark for native-level speech comprehension across 110 linguistic variants (19 Chinese dialects + 80+ low-resource languages). It employs a hybrid construction pipeline combining gold-standard human recordings with instruction-driven synthetic speech, evaluates 22 models (including commercial and open-source E2E Speech-LLMs and cascades), and reports three main findings: open-source E2E models outperform ASR+LLM cascades on heavy dialects because direct audio processing preserves paralinguistic and prosodic cues; commercial models are robust while open-source models show catastrophic degradation on low-resource languages; and Chain-of-Thought prompting degrades zero-shot performance, suggesting a modality alignment gap. Data, demo, and code are released publicly.

Significance. If the benchmark construction is shown to be free of systematic artifacts, the work would be significant for highlighting performance gaps in current Speech-LLMs on underrepresented languages and dialects, and for providing evidence that E2E architectures can retain prosodic information lost in transcription-based cascades. The public release of the dataset and evaluation code is a clear strength that enables reproducibility and follow-on work.

major comments (1)
  1. [Abstract / Benchmark Construction] Abstract (hybrid construction pipeline) and associated methods: The headline claims that E2E models outperform cascades on heavy dialects (due to preserved paralinguistic cues) and that open-source models suffer catastrophic degradation on low-resource languages rest on the assumption that instruction-driven synthetic speech faithfully represents native prosody, intonation, and stress without introducing artifacts that differentially impact E2E audio encoders versus ASR components in cascades. No validation (e.g., prosodic feature comparisons, human perceptual tests, or ablation on synthesis quality per dialect/language) is described to support this; if synthesis quality varies systematically, the reported gaps could be benchmark artifacts rather than model properties.
minor comments (2)
  1. Clarify the exact total (110 variants vs. title's 100+) and provide a breakdown table of human vs. synthetic portions per language group.
  2. [Abstract] The abstract states 'native-level' comprehension but does not define the metric or human baseline used to establish this threshold.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on PolySpeech-100. The major comment highlights an important aspect of benchmark validity regarding the synthetic speech component. We address this point below and commit to revisions that strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract / Benchmark Construction] Abstract (hybrid construction pipeline) and associated methods: The headline claims that E2E models outperform cascades on heavy dialects (due to preserved paralinguistic cues) and that open-source models suffer catastrophic degradation on low-resource languages rest on the assumption that instruction-driven synthetic speech faithfully represents native prosody, intonation, and stress without introducing artifacts that differentially impact E2E audio encoders versus ASR components in cascades. No validation (e.g., prosodic feature comparisons, human perceptual tests, or ablation on synthesis quality per dialect/language) is described to support this; if synthesis quality varies systematically, the reported gaps could be benchmark artifacts rather than model properties.

    Authors: We agree that explicit validation of the synthetic speech is necessary to rule out systematic artifacts. The hybrid pipeline prioritizes gold-standard human recordings where available and uses instruction-driven synthesis only to extend coverage to underrepresented variants; however, the original submission did not report prosodic analyses, perceptual tests, or per-language synthesis ablations. In the revised version we will add: (1) quantitative comparison of prosodic features (F0, duration, energy, rhythm) between synthetic and human speech on a stratified sample of 10 dialects/languages; (2) human perceptual ratings (naturalness, intelligibility, prosodic fidelity) collected from native speakers; and (3) an ablation that evaluates model performance on the human-recorded subset alone versus the full hybrid set. These additions will directly test whether the reported E2E advantages and degradation patterns persist when synthesis quality is controlled. revision: yes

Circularity Check

0 steps flagged

No significant circularity; benchmark evaluations are externally grounded

full rationale

The paper presents a new benchmark constructed via a hybrid human+synthetic pipeline and reports empirical results from evaluating 22 independent external models (Gemini-3, GPT-Audio, Qwen2.5-Omni, etc.). No equations, fitted parameters, or derived quantities are defined in terms of the target claims. No self-citations are invoked to justify uniqueness or load-bearing premises. All headline observations (E2E vs cascade gaps on dialects, open-source degradation on low-resource languages, CoT degradation) are presented as direct measurements against third-party systems and are therefore falsifiable outside the paper's own data construction. This matches the default expectation of a non-circular empirical benchmark paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the work relies on standard practices for data collection and model evaluation.

pith-pipeline@v0.9.1-grok · 5874 in / 1114 out tokens · 25026 ms · 2026-06-28T17:42:29.522659+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

117 extracted references · 68 canonical work pages · 21 internal anchors

  1. [1]

    Inclusion AI, Biao Gong, Cheng Zou, et al. 2025. Ming-Omni: A Unified Mul- timodal Model for Perception and Generation.CoRRabs/2506.09344 (2025). https://doi.org/10.48550/arXiv.2506.09344

  2. [2]

    Keyu An, Qian Chen, Chong Deng, et al . 2024. FunAudioLLM: Voice Un- derstanding and Generation Foundation Models for Natural Interaction Be- tween Humans and LLMs.CoRRabs/2407.04051 (2024). arXiv:2407.04051 doi:10.48550/ARXIV.2407.04051

  3. [3]

    Philip Anastassiou, Jiawei Chen, Jitong Chen, et al. 2024. Seed-TTS: A Family of High-Quality Versatile Speech Generation Models.CoRRabs/2406.02430 (2024). arXiv:2406.02430 doi:10.48550/ARXIV.2406.02430

  4. [4]

    Rosana Ardila, Megan Branson, Kelly Davis, et al . 2020. Common Voice: A Massively-Multilingual Speech Corpus. InLanguage Resources and Evaluation Conference, LREC 2020. European Language Resources Association, 4218–4222. https://aclanthology.org/2020.lrec-1.520/

  5. [5]

    Siddhant Arora, Kai-Wei Chang, Chung-Ming Chien, et al . 2025. On The Landscape of Spoken Language Models: A Comprehensive Survey.CoRR abs/2504.08528 (2025). arXiv:2504.08528 doi:10.48550/ARXIV.2504.08528

  6. [6]

    Lucas Bandarkar, Davis Liang, Benjamin Muller, et al. 2024. The Belebele Bench- mark: a Parallel Reading Comprehension Dataset in 122 Language Variants. In Meeting of the Association for Computational Linguistics, ACL 2024. Association for Computational Linguistics, 749–775. doi:10.18653/V1/2024.ACL-LONG.44

  7. [7]

    Martijn Bartelds, Nay San, Bradley McDonnell, et al . 2023. Making More of Little Data: Improving Low-Resource Automatic Speech Recognition Using Data Augmentation. InMeeting of the Association for Computational Linguistics, ACL

  8. [8]

    Goanta, N

    Association for Computational Linguistics, 715–729. doi:10.18653/V1/2023. ACL-LONG.42

  9. [9]

    Emanuele Bastianelli, Andrea Vanzo, Pawel Swietojanski, et al. 2020. SLURP: A Spoken Language Understanding Resource Package. In2020 Conference on Em- pirical Methods in Natural Language Processing. Association for Computational Linguistics, 7252–7262. doi:10.18653/V1/2020.EMNLP-MAIN.588

  10. [10]

    Hui Bu, Jiayu Du, Xingyu Na, et al. 2017. AISHELL-1: An Open-Source Mandarin Speech Corpus and A Speech Recognition Baseline.CoRRabs/1709.05522 (2017). arXiv:1709.05522 http://arxiv.org/abs/1709.05522

  11. [11]

    Carlos Busso, Murtaza Bulut, Chi-Chun Lee, et al. 2008. IEMOCAP: interactive emotional dyadic motion capture database.Lang. Resour. Evaluation42, 4 (2008), 335–359. doi:10.1007/S10579-008-9076-6

  12. [12]

    ByteDance Seed Team. 2026. Introducing Seed Full-Duplex Speech LLM: Attentive Listening, Robust Interference Suppression, Enabling More Natural Interaction. https://seed.bytedance.com/en/blog/introducing-seed-full-duplex- speech-llm-attentive-listening-robust-interference-suppression-enabling- more-natural-interaction. Project Page: https://seed.bytedance...

  13. [13]

    Guoguo Chen, Shuzhou Chai, Guan-Bo Wang, et al . 2021. GigaSpeech: An Evolving, Multi-Domain ASR Corpus with 10, 000 Hours of Transcribed Audio. In22nd Annual Conference of the International Speech Communication Association, Interspeech 2021. ISCA, 3670–3674. doi:10.21437/INTERSPEECH.2021-1965

  14. [14]

    Hongjie Chen, Zehan Li, Guangmin Xia, et al. 2024. Telespeechpt: Large-scale chinese multi-dialect and multi-accent speech pre-training. InNational Confer- ence on Man-Machine Speech Communication. 183–190

  15. [15]

    Junjie Chen, Yao Hu, Junjie Li, et al. 2025. FireRedChat: A Pluggable, Full-Duplex Voice Interaction System with Cascaded and Semi-Cascaded Implementations. CoRRabs/2509.06502 (2025). arXiv:2509.06502 doi:10.48550/ARXIV.2509.06502

  16. [16]

    Qian Chen, Yafeng Chen, Yanni Chen, et al. 2025. MinMo: A Multimodal Large Language Model for Seamless Voice Interaction.CoRRabs/2501.06282 (2025). arXiv:2501.06282 doi:10.48550/ARXIV.2501.06282

  17. [17]

    Wenxi Chen, Ziyang Ma, Ruiqi Yan, et al . 2025. SLAM-Omni: Timbre- Controllable Voice Interaction System with Single-Stage Training. InFindings of the Association for Computational Linguistics, ACL 2025, Vol. ACL 2025. As- sociation for Computational Linguistics, 2262–2282. https://aclanthology.org/ 2025.findings-acl.115/

  18. [18]

    Yiming Chen, Xianghu Yue, Chen Zhang, et al. 2024. VoiceBench: Benchmarking LLM-Based Voice Assistants.CoRRabs/2410.17196 (2024). arXiv:2410.17196 doi:10.48550/ARXIV.2410.17196

  19. [19]

    Ziqi Chen, Gongyu Chen, Yihua Wang, et al . 2025. DiaMoE-TTS: A Unified IPA-Based Dialect TTS Framework with Mixture-of-Experts and Parameter- Efficient Zero-Shot Adaptation.CoRRabs/2509.22727 (2025). arXiv:2509.22727 doi:10.48550/ARXIV.2509.22727

  20. [20]

    Yunfei Chu, Jin Xu, Qian Yang, et al . 2024. Qwen2-Audio Technical Report. CoRRabs/2407.10759 (2024). arXiv:2407.10759 doi:10.48550/ARXIV.2407.10759

  21. [21]

    Christopher Cieri, David Graff, Owen Kimball, et al . [n. d.].Fisher English Training Speech Part 1 Speech. doi:10.35111/da4a-se30 KDD ’26, August 09–13, 2026, Jeju Island, Republic of Korea Sicheng Yang et al

  22. [22]

    Alexis Conneau, Min Ma, Simran Khanuja, et al . 2022. FLEURS: FEW-Shot Learning Evaluation of Universal Representations of Speech. InIEEE Spoken Language Technology Workshop, SLT 2022. IEEE, 798–805. doi:10.1109/SLT54892. 2023.10023141

  23. [23]

    Costa-jussà, Bokai Yu, Pierre Andrews, et al

    Marta R. Costa-jussà, Bokai Yu, Pierre Andrews, et al . 2025. 2M-BELEBELE: Highly Multilingual Speech and American Sign Language Comprehension Dataset Download PDF. InFindings of the Association for Computational Lin- guistics, ACL 2025, Vol. ACL 2025. Association for Computational Linguistics, 10893–10904. https://aclanthology.org/2025.findings-acl.569/

  24. [24]

    Junbo Cui, Bokai Xu, Chongyi Wang, et al . 2026. MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction.CoRRabs/2604.27393 (2026). arXiv:2604.27393 doi:10.48550/ARXIV.2604.27393

  25. [25]

    Wenqian Cui, Dianzhi Yu, Xiaoqi Jiao, et al. 2025. Recent Advances in Speech Language Models: A Survey. InMeeting of the Association for Computational Linguistics, ACL 2025. Association for Computational Linguistics, 13943–13970. https://aclanthology.org/2025.acl-long.682/

  26. [26]

    Yuhang Dai, Ziyu Zhang, Shuai Wang, et al. 2025. WenetSpeech-Chuan: A Large- Scale Sichuanese Corpus with Rich Annotation for Dialectal Speech Processing. CoRRabs/2509.18004 (2025). arXiv:2509.18004 doi:10.48550/ARXIV.2509.18004

  27. [27]

    Alexandre Défossez, Laurent Mazaré, Manu Orsini, et al. 2024. Moshi: a speech- text foundation model for real-time dialogue.CoRRabs/2410.00037 (2024). arXiv:2410.00037 doi:10.48550/ARXIV.2410.00037

  28. [28]

    Kaustubh Deshpande, Ved Sirdeshmukh, Johannes Baptist Mols, et al . 2025. MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs. InFindings of the Association for Computational Linguistics, ACL 2025, Vol. ACL 2025. Association for Computational Linguistics, 18632–18702. https://aclanthology.org/2025.findings...

  29. [29]

    Xinhan Di, Zihao Chen, Yunming Liang, et al . 2024. Bailing-TTS: Chinese Dialectal Speech Synthesis Towards Human-like Spontaneous Representation. CoRRabs/2408.00284 (2024). arXiv:2408.00284 doi:10.48550/ARXIV.2408.00284

  30. [30]

    Seungheon Doh, Keunwoo Choi, Jongpil Lee, et al. 2023. LP-MusicCaps: LLM- Based Pseudo Music Captioning. InInternational Society for Music Information Retrieval Conference, ISMIR 2023. 409–416. doi:10.5281/ZENODO.10265311

  31. [31]

    Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. 2020. Clotho: an Audio Captioning Dataset. In2020 IEEE International Conference on Acoustics. IEEE, 736–740. doi:10.1109/ICASSP40776.2020.9052990

  32. [32]

    Jiayu Du, Xingyu Na, Xuechen Liu, et al . 2018. AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale.CoRRabs/1808.10583 (2018). arXiv:1808.10583 http://arxiv.org/abs/1808.10583

  33. [33]

    Zhihao Du, Changfeng Gao, Yuxuan Wang, et al . 2025. CosyVoice 3: To- wards In-the-wild Speech Generation via Scaling-up and Post-training.CoRR abs/2505.17589 (2025). arXiv:2505.17589 doi:10.48550/ARXIV.2505.17589

  34. [34]

    Qingkai Fang, Shoutao Guo, Yan Zhou, et al . 2025. LLaMA-Omni: Seamless Speech Interaction with Large Language Models. InThe Thirteenth International Conference on Learning Representations, ICLR 2025. OpenReview.net. https: //openreview.net/forum?id=PYmrUQmMEw

  35. [35]

    Qingkai Fang, Yan Zhou, Shoutao Guo, et al. 2025. LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis. CoRRabs/2505.02625 (2025). arXiv:2505.02625 doi:10.48550/ARXIV.2505.02625

  36. [36]

    Yuan Gong, Jin Yu, and James R. Glass. 2022. Vocalsound: A Dataset for Improving Human Vocal Sounds Recognition. InIEEE International Confer- ence on Acoustics, Speech and Signal Processing, ICASSP 2022. IEEE, 151–155. doi:10.1109/ICASSP43922.2022.9746828

  37. [37]

    Google DeepMind. 2025. Gemini 3 Flash Model Card. https://storage.googleapis. com/deepmind-media/Model-Cards/Gemini-3-Flash-Model-Card.pdf

  38. [38]

    Jack Hong, Shilin Yan, Jiayin Cai, et al. 2025. WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs.CoRRabs/2502.04326 (2025). arXiv:2502.04326 doi:10.48550/ARXIV.2502.04326

  39. [39]

    Chien-Yu Huang, Ke-Han Lu, Shih-Heng Wang, et al. 2024. Dynamic-Superb: Towards a Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark For Speech. InIEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2024. IEEE, 12136–12140. doi:10.1109/ICASSP48485. 2024.10448257

  40. [40]

    Hunt and Alan W

    Andrew J. Hunt and Alan W. Black. 1996. Unit selection in a concatenative speech synthesis system using a large speech database. In1996 IEEE International Conference on Acoustics. IEEE Computer Society, 373–376. doi:10.1109/ICASSP. 1996.541110

  41. [41]

    Katsumi Ibaraki and David Chiang. 2025. Frustratingly Easy Data Augmentation for Low-Resource ASR.CoRRabs/2509.15373 (2025). arXiv:2509.15373 doi:10. 48550/ARXIV.2509.15373

  42. [42]

    Shengpeng Ji, Yifu Chen, Minghui Fang, et al . 2024. WavChat: A Survey of Spoken Dialogue Models.CoRRabs/2411.13577 (2024). arXiv:2411.13577 doi:10. 48550/ARXIV.2411.13577

  43. [43]

    Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, et al . 2019. Audio- Caps: Generating Captions for Audios in The Wild. In2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 119–132. doi:10.18653/V1/N19-1011

  44. [44]

    Jiyeon Kim, Mehul Kumar, Dhananjaya Gowda, et al. 2021. Semi-Supervised Transfer Learning for Language Expansion of End-to-End Speech Recognition Models to Low-Resource Languages. InIEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021. IEEE, 984–988. doi:10.1109/ASRU51503. 2021.9688019

  45. [45]

    KimiTeam, Ding Ding, Zeqian Ju, et al . 2025. Kimi-Audio Technical Report. CoRRabs/2504.18425 (2025). arXiv:2504.18425 doi:10.48550/ARXIV.2504.18425

  46. [46]

    Chia-Hsuan Li, Szu-Lin Wu, Chi-Liang Liu, et al. 2018. Spoken SQuAD: A Study of Mitigating the Impact of Speech Recognition Errors on Listening Comprehen- sion. In19th Annual Conference of the International Speech Communication Associ- ation, Interspeech 2018. ISCA, 3459–3463. doi:10.21437/INTERSPEECH.2018-1714

  47. [47]

    Guojian Li, Chengyou Wang, Hongfei Xue, et al . 2025. Easy Turn: Integrat- ing Acoustic and Linguistic Modalities for Robust Turn-Taking in Full-Duplex Spoken Dialogue Systems.CoRRabs/2509.23938 (2025). arXiv:2509.23938 doi:10.48550/ARXIV.2509.23938

  48. [48]

    Longhao Li, Zhao Guo, Hongjie Chen, et al. 2025. WenetSpeech-Yue: A Large- scale Cantonese Speech Corpus with Multi-dimensional Annotation.CoRR abs/2509.03959 (2025). arXiv:2509.03959 doi:10.48550/ARXIV.2509.03959

  49. [49]

    Tianpeng Li, Jun Liu, Tao Zhang, et al . 2025. Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction.CoRRabs/2502.17239 (2025). arXiv:2502.17239 doi:10.48550/ARXIV.2502.17239

  50. [50]

    Xuechen Li, Tianyi Zhang, Yann Dubois, et al. 2023. AlpacaEval: An Automatic Evaluator of Instruction-following Models

  51. [51]

    Yizhi Li, Ge Zhang, Yinghao Ma, et al . 2024. OmniBench: Towards The Future of Universal Omni-Language Models.CoRRabs/2409.15272 (2024). arXiv:2409.15272 doi:10.48550/ARXIV.2409.15272

  52. [52]

    Zhanxun Liu, Yifan Duan, Mengmeng Wang, et al. 2025. X-Talk: On the Un- derestimated Potential of Modular Speech-to-Speech Dialogue System.CoRR abs/2512.18706 (2025). arXiv:2512.18706 doi:10.48550/ARXIV.2512.18706

  53. [53]

    Ziyang Ma, Yinghao Ma, Yanqiao Zhu, et al . 2025. MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix.CoRR abs/2505.13032 (2025). arXiv:2505.13032 doi:10.48550/ARXIV.2505.13032

  54. [54]

    Ziyang Ma, Guanrou Yang, Wenxi Chen, et al. 2026. SLAM-LLM: A Modular, Open-Source Multimodal Large Language Model Framework and Best Practice for Speech, Language, Audio and Music Processing.IEEE Journal of Selected Topics in Signal Processing(2026)

  55. [55]

    Xinhao Mei, Chutong Meng, Haohe Liu, et al . 2024. WavCaps: A ChatGPT- Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Mul- timodal Research.IEEE ACM Trans. Audio Speech Lang. Process.32 (2024), 3339–3354. doi:10.1109/TASLP.2024.3419446

  56. [56]

    Yangyang Meng, Jinpeng Li, Guodong Lin, et al. 2025. Dolphin: A Large-Scale Au- tomatic Speech Recognition Model for Eastern Languages.CoRRabs/2503.20212 (2025). arXiv:2503.20212 doi:10.48550/ARXIV.2503.20212

  57. [57]

    OpenAI. 2024. GPT-4o System Card.CoRRabs/2410.21276 (2024). arXiv:2410.21276 doi:10.48550/ARXIV.2410.21276

  58. [58]

    OpenAI. 2025. gpt-audio-mini. https://platform.openai.com/docs/models/gpt- audio-mini

  59. [59]

    Vassil Panayotov, Guoguo Chen, Daniel Povey, et al . 2015. Librispeech: An ASR corpus based on public domain audio books. In2015 IEEE International Conference on Acoustics. IEEE, 5206–5210. doi:10.1109/ICASSP.2015.7178964

  60. [60]

    Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, et al. 2019. MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations. In Conference of the Association for Computational Linguistics, ACL 2019. Associa- tion for Computational Linguistics, 527–536. doi:10.18653/V1/P19-1050

  61. [61]

    Vineel Pratap, Andros Tjandra, Bowen Shi, et al. 2024. Scaling Speech Tech- nology to 1, 000+ Languages.J. Mach. Learn. Res.25 (2024), 97:1–97:52. https://jmlr.org/papers/v25/23-1318.html

  62. [62]

    Vineel Pratap, Qiantong Xu, Anuroop Sriram, et al. 2020. MLS: A Large-Scale Multilingual Dataset for Speech Research. In21st Annual Conference of the International Speech Communication Association, Interspeech 2020. ISCA, 2757–

  63. [63]

    doi:10.21437/INTERSPEECH.2020-2826

  64. [64]

    Alec Radford, Jong Wook Kim, Tao Xu, et al. 2023. Robust Speech Recognition via Large-Scale Weak Supervision. InInternational Conference on Machine Learning, ICML 2023, Vol. 202. PMLR, 28492–28518. https://proceedings.mlr.press/v202/ radford23a.html

  65. [65]

    Rajarshi Roy, Jonathan Raiman, Sang-Gil Lee, et al. 2026. PersonaPlex: Voice and Role Control for Full Duplex Conversational Speech Models.CoRR abs/2602.06053 (2026). arXiv:2602.06053 doi:10.48550/ARXIV.2602.06053

  66. [66]

    Shulan Ruan, Huijie Liu, Zhao Chen, Bin Feng, Kun Zhang, Caleb Chen Cao, Enhong Chen, and Lei Chen. 2025. CPWS: Confident programmatic weak supervision for high-quality data labeling.ACM Transactions on Information Systems43, 4 (2025), 1–26

  67. [67]

    Sakshi, Utkarsh Tyagi, Sonal Kumar, et al

    S. Sakshi, Utkarsh Tyagi, Sonal Kumar, et al. 2025. MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark. InThe Thirteenth International Conference on Learning Representations, ICLR 2025. OpenReview.net. https: //openreview.net/forum?id=TeVAZXr3yv

  68. [68]

    Jun Shi, Shulan Ruan, Ziqi Zhu, Minfan Zhao, Hong An, Xudong Xue, and Bing Yan. 2024. Predictive accuracy-based active learning for medical image PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects KDD ’26, August 09–13, 2026, Jeju Island, Republic of Korea segmentation. InProceedings of the Thirty-Third Int...

  69. [69]

    Qundong Shi, Jie Zhou, Biyuan Lin, et al . 2026. UltraEval-Audio: A Unified Framework for Comprehensive Evaluation of Audio Foundation Models.arXiv preprint arXiv:2601.01373(2026)

  70. [70]

    Xian Shi, Qiangze Feng, and Lei Xie. 2020. The ASRU 2019 Mandarin-English Code-Switching Speech Recognition Challenge: Open Datasets, Tracks, Methods and Results.CoRRabs/2007.05916 (2020). arXiv:2007.05916 https://arxiv.org/ abs/2007.05916

  71. [71]

    Xian Shi, Xiong Wang, Zhifang Guo, et al. 2026. Qwen3-ASR Technical Report. arXiv preprint arXiv:2601.21337(2026)

  72. [72]

    Shuzheng Si, Wentao Ma, Haoyu Gao, et al. 2023. SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented Dialogue Agents. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023. http://papers.nips.cc/ paper_files/paper/2023/hash/7b16688a2b053a1b01474ab5c78ce662-Abstract- Data...

  73. [73]

    StepFun. 2026. StepAudio 2.5 TTS: Contextual TTS. https://platform.stepfun. com/docs/zh/guides/models/stepaudio-2.5-tts

  74. [74]

    Tianxiang Sun, Xiaotian Zhang, Zhengfu He, et al . 2024. MOSS: An Open Conversational Large Language Model.Mach. Intell. Res.21, 5 (2024), 888–905. doi:10.1007/S11633-024-1502-8

  75. [75]

    Zhiyuan Tang, Dong Wang, Yanguang Xu, et al. 2021. KeSpeech: An Open Source Speech Dataset of Mandarin and Its Eight Subdialects. InNeural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021. https://datasets-benchmarks-proceedings.neurips.cc/paper/ 2021/hash/0336dcbab05b9d5ad24f4333c7658a0e-Abstract-round2.html

  76. [76]

    Llama Team. 2024. The Llama 3 Herd of Models.CoRRabs/2407.21783 (2024). arXiv:2407.21783 doi:10.48550/ARXIV.2407.21783

  77. [77]

    Qwen Team. 2026. Qwen3.5-Omni Technical Report.CoRRabs/2604.15804 (2026). arXiv:2604.15804 doi:10.48550/ARXIV.2604.15804

  78. [78]

    Tongyi Fun Team, Qian Chen, Luyao Cheng, et al . 2025. Fun-Audio-Chat Technical Report.CoRRabs/2512.20156 (2025). arXiv:2512.20156 doi:10.48550/ ARXIV.2512.20156

  79. [79]

    Zeyue Tian, Binxin Yang, Zhaoyang Liu, et al. 2026. Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing.CoRR abs/2604.10708 (2026). arXiv:2604.10708 doi:10.48550/ARXIV.2604.10708

  80. [80]

    Changhan Wang, Morgane Rivière, Ann Lee, et al. 2021. VoxPopuli: A Large- Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation. InMeeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021. Association for Computat...

Showing first 80 references.