PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects

Lu Fan; Shiwei Wu; Shulan Ruan; Sicheng Yang; You He; Yu Liu; Zhi Li

arxiv: 2606.01016 · v1 · pith:NDYZDGHBnew · submitted 2026-05-31 · 💻 cs.CL · cs.AI· eess.AS

PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects

Sicheng Yang , Shulan Ruan , Shiwei Wu , Yu Liu , Lu Fan , Zhi Li , You He This is my paper

Pith reviewed 2026-06-28 17:42 UTC · model grok-4.3

classification 💻 cs.CL cs.AIeess.AS

keywords speech understandingmultilingual benchmarkdialectslow-resource languagesend-to-end modelschain-of-thoughtparalinguistic cuessynthetic speech

0 comments

The pith

PolySpeech-100 benchmark reveals open-source E2E speech models outperform cascade systems on heavy dialects by preserving paralinguistic cues.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a new benchmark covering 110 language variants to test speech understanding beyond simple transcription. It evaluates 22 models and finds that direct end-to-end audio processing helps open-source models handle dialects better than pipelines that first convert speech to text. The work also shows open-source models drop sharply on low-resource languages while commercial ones hold up, and that chain-of-thought prompting often hurts results in zero-shot speech tasks. A sympathetic reader cares because current evaluation methods miss these modality-specific strengths and weaknesses. The benchmark uses a hybrid pipeline of real recordings plus synthetic speech to reach this scale.

Core claim

PolySpeech-100 is a benchmark for native-level speech comprehension across 110 linguistic variants that uses a hybrid construction pipeline to combine gold-standard human recordings with instruction-driven synthetic speech. Evaluation of 22 models shows open-source E2E models outperform cascade ASR-plus-LLM systems on heavy dialects because direct audio processing retains prosodic features and paralinguistic cues lost in transcription; open-source models exhibit catastrophic degradation on low-resource languages while commercial models remain robust; and chain-of-thought prompting degrades performance under standard zero-shot settings, indicating a modality alignment gap.

What carries the argument

The hybrid construction pipeline that augments human recordings with synthetic speech to reach 110 variants while targeting native-level semantic comprehension.

If this is right

Direct audio input is required to capture dialect-specific intonation and stress that text transcription discards.
Open-source speech models need targeted improvements for low-resource languages to close the gap with commercial systems.
Standard chain-of-thought prompting should be avoided in zero-shot speech understanding tasks because it reduces accuracy.
Future model development must prioritize architectures that align audio and language modalities without relying on intermediate text.
Benchmarks limited to high-resource languages and ASR miss critical failure modes in real-world multilingual use.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training regimes that explicitly optimize for prosody retention could widen the advantage of E2E models on dialects.
The observed degradation on low-resource languages suggests that scaling laws for speech may require language-specific data balancing rather than uniform scaling.
If the modality alignment gap persists, hybrid systems may need separate audio-native reasoning heads instead of borrowing from text LLMs.
Public release of the benchmark data allows direct testing of whether new models close the low-resource gap without synthetic artifacts.

Load-bearing premise

The synthetic speech added to human recordings accurately represents native comprehension without introducing artifacts that change how models rank against each other.

What would settle it

If re-running the same models on an all-human-recorded subset of the same tasks reverses the ranking between E2E and cascade systems on dialects, the claim that direct audio processing is the decisive factor would not hold.

Figures

Figures reproduced from arXiv: 2606.01016 by Lu Fan, Shiwei Wu, Shulan Ruan, Sicheng Yang, You He, Yu Liu, Zhi Li.

**Figure 1.** Figure 1: Geographic distribution of PolySpeech-100. The points represent specific languages and dialects included in the [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗

**Figure 2.** Figure 2: The data construction pipeline of PolySpeech-100. The framework consists of three stages: multi-source data aggrega [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Performance benchmarking of various audio models across diverse linguistic scenarios. The radar charts illustrate [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: While a general negative correlation exists between audio [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 4.** Figure 4: Performance analysis of 4 Speech-LLMs across vary [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Analysis of prediction bias and prompt sensitivity across three linguistic groups. The bar charts display the selection [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: The t-SNE visualization of multilingual speech em [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗

**Figure 7.** Figure 7: The user interface of the PolySpeech-100 interactive [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗

**Figure 8.** Figure 8: The standard text system prompt used for models [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗

**Figure 9.** Figure 9: The Chain-of-Thought (CoT) system prompt dis [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗

**Figure 10.** Figure 10: Confusion matrices of 4 models across all test samples. The matrices are row-normalized to show the recall rate [PITH_FULL_IMAGE:figures/full_fig_p015_10.png] view at source ↗

**Figure 12.** Figure 12: The model successfully deduces that A is NOT the [PITH_FULL_IMAGE:figures/full_fig_p016_12.png] view at source ↗

**Figure 13.** Figure 13: Despite perfectly sound logic that explicitly vali [PITH_FULL_IMAGE:figures/full_fig_p016_13.png] view at source ↗

**Figure 14.** Figure 14: Comparison of Ground Truth and ASR Transcrip [PITH_FULL_IMAGE:figures/full_fig_p018_14.png] view at source ↗

read the original abstract

While End-to-End (E2E) Speech-Large Language Models (Speech-LLMs) are rapidly evolving, their evaluation methodologies remain limited to the era of simple transcription. Existing benchmarks suffer from three critical limitations: a pronounced bias towards high-resource languages, a focus on low-level recognition (ASR) rather than semantic reasoning, and a neglect of regional dialects. To bridge this gap, we introduce PolySpeech-100, a massive-scale benchmark designed to assess `native-level' speech comprehension across 110 linguistic variants. We employ a novel hybrid construction pipeline that augments gold-standard human recordings with instruction-driven synthetic speech, allowing us to cover 19 distinct Chinese dialects and over 80 low-resource languages. Extensive evaluation of 22 state-of-the-art models (including Gemini-3, GPT-Audio, and Qwen2.5-Omni) yields pivotal insights. First, we demonstrate that open-source E2E models outperform Cascade (ASR+LLM) systems on heavy dialects, proving that direct audio processing preserves critical paralinguistic cues and prosodic features (e.g., intonation, stress) that are often lost in standard transcription. Second, we reveal a significant performance gap: while commercial models maintain robustness, open-source models suffer catastrophic degradation on low-resource languages. Finally, counter-intuitively, we observe that under standard zero-shot settings, Chain-of-Thought prompting frequently degrades speech understanding performance for most evaluated models, revealing a potential modality alignment gap in current architectures. PolySpeech-100 establishes a rigorous standard for the next generation of inclusive, omni-capable Speech-LLMs. The data, demo, and code are publicly available at https://github.com/YoungSeng/PolySpeech-100.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PolySpeech-100 gives a useful scale-up on multilingual speech benchmarks but its hybrid real-synthetic pipeline needs direct validation before the dialect and low-resource claims can be trusted.

read the letter

The paper's main contribution is a benchmark with 110 variants that tries to test semantic speech understanding rather than just ASR, covering 19 Chinese dialects and 80+ low-resource languages via a hybrid pipeline of human recordings plus instruction-driven synthetic speech. They run 22 models including commercial and open-source ones and report three observations: open-source E2E models beat cascades on heavy dialects, open-source models drop hard on low-resource languages, and zero-shot CoT often hurts rather than helps.

The scale and the public release of data, demo, and code are the clear positives. Existing speech-LLM evals really are skewed toward high-resource languages and transcription tasks, so a broader test set is worth having even if the numbers need checking.

The soft spot is the construction pipeline. The headline claims rest on the idea that E2E models preserve paralinguistic and prosodic cues that cascades lose, yet the synthetic portion for dialects and low-resource items has no reported checks against native intonation, stress, or artifacts. If the TTS introduces uniform distortions or varies in quality, the E2E advantage and the size of the low-resource gaps could be benchmark artifacts rather than model properties. The abstract also uses strong phrasing like "catastrophic degradation" without the supporting tables or error breakdowns here.

This is for groups building or evaluating speech-LLMs who care about coverage beyond English and Mandarin. It is worth sending to peer review because the benchmark scope is large enough to matter and the public artifacts let others test the pipeline directly, even though the data validation step will need work.

Referee Report

1 major / 2 minor

Summary. The paper introduces PolySpeech-100, a benchmark for native-level speech comprehension across 110 linguistic variants (19 Chinese dialects + 80+ low-resource languages). It employs a hybrid construction pipeline combining gold-standard human recordings with instruction-driven synthetic speech, evaluates 22 models (including commercial and open-source E2E Speech-LLMs and cascades), and reports three main findings: open-source E2E models outperform ASR+LLM cascades on heavy dialects because direct audio processing preserves paralinguistic and prosodic cues; commercial models are robust while open-source models show catastrophic degradation on low-resource languages; and Chain-of-Thought prompting degrades zero-shot performance, suggesting a modality alignment gap. Data, demo, and code are released publicly.

Significance. If the benchmark construction is shown to be free of systematic artifacts, the work would be significant for highlighting performance gaps in current Speech-LLMs on underrepresented languages and dialects, and for providing evidence that E2E architectures can retain prosodic information lost in transcription-based cascades. The public release of the dataset and evaluation code is a clear strength that enables reproducibility and follow-on work.

major comments (1)

[Abstract / Benchmark Construction] Abstract (hybrid construction pipeline) and associated methods: The headline claims that E2E models outperform cascades on heavy dialects (due to preserved paralinguistic cues) and that open-source models suffer catastrophic degradation on low-resource languages rest on the assumption that instruction-driven synthetic speech faithfully represents native prosody, intonation, and stress without introducing artifacts that differentially impact E2E audio encoders versus ASR components in cascades. No validation (e.g., prosodic feature comparisons, human perceptual tests, or ablation on synthesis quality per dialect/language) is described to support this; if synthesis quality varies systematically, the reported gaps could be benchmark artifacts rather than model properties.

minor comments (2)

Clarify the exact total (110 variants vs. title's 100+) and provide a breakdown table of human vs. synthetic portions per language group.
[Abstract] The abstract states 'native-level' comprehension but does not define the metric or human baseline used to establish this threshold.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on PolySpeech-100. The major comment highlights an important aspect of benchmark validity regarding the synthetic speech component. We address this point below and commit to revisions that strengthen the manuscript.

read point-by-point responses

Referee: [Abstract / Benchmark Construction] Abstract (hybrid construction pipeline) and associated methods: The headline claims that E2E models outperform cascades on heavy dialects (due to preserved paralinguistic cues) and that open-source models suffer catastrophic degradation on low-resource languages rest on the assumption that instruction-driven synthetic speech faithfully represents native prosody, intonation, and stress without introducing artifacts that differentially impact E2E audio encoders versus ASR components in cascades. No validation (e.g., prosodic feature comparisons, human perceptual tests, or ablation on synthesis quality per dialect/language) is described to support this; if synthesis quality varies systematically, the reported gaps could be benchmark artifacts rather than model properties.

Authors: We agree that explicit validation of the synthetic speech is necessary to rule out systematic artifacts. The hybrid pipeline prioritizes gold-standard human recordings where available and uses instruction-driven synthesis only to extend coverage to underrepresented variants; however, the original submission did not report prosodic analyses, perceptual tests, or per-language synthesis ablations. In the revised version we will add: (1) quantitative comparison of prosodic features (F0, duration, energy, rhythm) between synthetic and human speech on a stratified sample of 10 dialects/languages; (2) human perceptual ratings (naturalness, intelligibility, prosodic fidelity) collected from native speakers; and (3) an ablation that evaluates model performance on the human-recorded subset alone versus the full hybrid set. These additions will directly test whether the reported E2E advantages and degradation patterns persist when synthesis quality is controlled. revision: yes

Circularity Check

0 steps flagged

No significant circularity; benchmark evaluations are externally grounded

full rationale

The paper presents a new benchmark constructed via a hybrid human+synthetic pipeline and reports empirical results from evaluating 22 independent external models (Gemini-3, GPT-Audio, Qwen2.5-Omni, etc.). No equations, fitted parameters, or derived quantities are defined in terms of the target claims. No self-citations are invoked to justify uniqueness or load-bearing premises. All headline observations (E2E vs cascade gaps on dialects, open-source degradation on low-resource languages, CoT degradation) are presented as direct measurements against third-party systems and are therefore falsifiable outside the paper's own data construction. This matches the default expectation of a non-circular empirical benchmark paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the work relies on standard practices for data collection and model evaluation.

pith-pipeline@v0.9.1-grok · 5874 in / 1114 out tokens · 25026 ms · 2026-06-28T17:42:29.522659+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

117 extracted references · 68 canonical work pages · 21 internal anchors

[1]

Inclusion AI, Biao Gong, Cheng Zou, et al. 2025. Ming-Omni: A Unified Mul- timodal Model for Perception and Generation.CoRRabs/2506.09344 (2025). https://doi.org/10.48550/arXiv.2506.09344

work page doi:10.48550/arxiv.2506.09344 2025
[2]

Keyu An, Qian Chen, Chong Deng, et al . 2024. FunAudioLLM: Voice Un- derstanding and Generation Foundation Models for Natural Interaction Be- tween Humans and LLMs.CoRRabs/2407.04051 (2024). arXiv:2407.04051 doi:10.48550/ARXIV.2407.04051

work page doi:10.48550/arxiv.2407.04051 2024
[3]

Philip Anastassiou, Jiawei Chen, Jitong Chen, et al. 2024. Seed-TTS: A Family of High-Quality Versatile Speech Generation Models.CoRRabs/2406.02430 (2024). arXiv:2406.02430 doi:10.48550/ARXIV.2406.02430

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2406.02430 2024
[4]

Rosana Ardila, Megan Branson, Kelly Davis, et al . 2020. Common Voice: A Massively-Multilingual Speech Corpus. InLanguage Resources and Evaluation Conference, LREC 2020. European Language Resources Association, 4218–4222. https://aclanthology.org/2020.lrec-1.520/

2020
[5]

Siddhant Arora, Kai-Wei Chang, Chung-Ming Chien, et al . 2025. On The Landscape of Spoken Language Models: A Comprehensive Survey.CoRR abs/2504.08528 (2025). arXiv:2504.08528 doi:10.48550/ARXIV.2504.08528

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2504.08528 2025
[6]

Lucas Bandarkar, Davis Liang, Benjamin Muller, et al. 2024. The Belebele Bench- mark: a Parallel Reading Comprehension Dataset in 122 Language Variants. In Meeting of the Association for Computational Linguistics, ACL 2024. Association for Computational Linguistics, 749–775. doi:10.18653/V1/2024.ACL-LONG.44

work page doi:10.18653/v1/2024.acl-long.44 2024
[7]

Martijn Bartelds, Nay San, Bradley McDonnell, et al . 2023. Making More of Little Data: Improving Low-Resource Automatic Speech Recognition Using Data Augmentation. InMeeting of the Association for Computational Linguistics, ACL

2023
[8]

Goanta, N

Association for Computational Linguistics, 715–729. doi:10.18653/V1/2023. ACL-LONG.42

work page doi:10.18653/v1/2023 2023
[9]

Emanuele Bastianelli, Andrea Vanzo, Pawel Swietojanski, et al. 2020. SLURP: A Spoken Language Understanding Resource Package. In2020 Conference on Em- pirical Methods in Natural Language Processing. Association for Computational Linguistics, 7252–7262. doi:10.18653/V1/2020.EMNLP-MAIN.588

work page doi:10.18653/v1/2020.emnlp-main.588 2020
[10]

Hui Bu, Jiayu Du, Xingyu Na, et al. 2017. AISHELL-1: An Open-Source Mandarin Speech Corpus and A Speech Recognition Baseline.CoRRabs/1709.05522 (2017). arXiv:1709.05522 http://arxiv.org/abs/1709.05522

Pith/arXiv arXiv 2017
[11]

Carlos Busso, Murtaza Bulut, Chi-Chun Lee, et al. 2008. IEMOCAP: interactive emotional dyadic motion capture database.Lang. Resour. Evaluation42, 4 (2008), 335–359. doi:10.1007/S10579-008-9076-6

work page doi:10.1007/s10579-008-9076-6 2008
[12]

ByteDance Seed Team. 2026. Introducing Seed Full-Duplex Speech LLM: Attentive Listening, Robust Interference Suppression, Enabling More Natural Interaction. https://seed.bytedance.com/en/blog/introducing-seed-full-duplex- speech-llm-attentive-listening-robust-interference-suppression-enabling- more-natural-interaction. Project Page: https://seed.bytedance...

2026
[13]

Guoguo Chen, Shuzhou Chai, Guan-Bo Wang, et al . 2021. GigaSpeech: An Evolving, Multi-Domain ASR Corpus with 10, 000 Hours of Transcribed Audio. In22nd Annual Conference of the International Speech Communication Association, Interspeech 2021. ISCA, 3670–3674. doi:10.21437/INTERSPEECH.2021-1965

work page doi:10.21437/interspeech.2021-1965 2021
[14]

Hongjie Chen, Zehan Li, Guangmin Xia, et al. 2024. Telespeechpt: Large-scale chinese multi-dialect and multi-accent speech pre-training. InNational Confer- ence on Man-Machine Speech Communication. 183–190

2024
[15]

Junjie Chen, Yao Hu, Junjie Li, et al. 2025. FireRedChat: A Pluggable, Full-Duplex Voice Interaction System with Cascaded and Semi-Cascaded Implementations. CoRRabs/2509.06502 (2025). arXiv:2509.06502 doi:10.48550/ARXIV.2509.06502

work page doi:10.48550/arxiv.2509.06502 2025
[16]

Qian Chen, Yafeng Chen, Yanni Chen, et al. 2025. MinMo: A Multimodal Large Language Model for Seamless Voice Interaction.CoRRabs/2501.06282 (2025). arXiv:2501.06282 doi:10.48550/ARXIV.2501.06282

work page doi:10.48550/arxiv.2501.06282 2025
[17]

Wenxi Chen, Ziyang Ma, Ruiqi Yan, et al . 2025. SLAM-Omni: Timbre- Controllable Voice Interaction System with Single-Stage Training. InFindings of the Association for Computational Linguistics, ACL 2025, Vol. ACL 2025. As- sociation for Computational Linguistics, 2262–2282. https://aclanthology.org/ 2025.findings-acl.115/

2025
[18]

Yiming Chen, Xianghu Yue, Chen Zhang, et al. 2024. VoiceBench: Benchmarking LLM-Based Voice Assistants.CoRRabs/2410.17196 (2024). arXiv:2410.17196 doi:10.48550/ARXIV.2410.17196

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2410.17196 2024
[19]

Ziqi Chen, Gongyu Chen, Yihua Wang, et al . 2025. DiaMoE-TTS: A Unified IPA-Based Dialect TTS Framework with Mixture-of-Experts and Parameter- Efficient Zero-Shot Adaptation.CoRRabs/2509.22727 (2025). arXiv:2509.22727 doi:10.48550/ARXIV.2509.22727

work page doi:10.48550/arxiv.2509.22727 2025
[20]

Yunfei Chu, Jin Xu, Qian Yang, et al . 2024. Qwen2-Audio Technical Report. CoRRabs/2407.10759 (2024). arXiv:2407.10759 doi:10.48550/ARXIV.2407.10759

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2407.10759 2024
[21]

Christopher Cieri, David Graff, Owen Kimball, et al . [n. d.].Fisher English Training Speech Part 1 Speech. doi:10.35111/da4a-se30 KDD ’26, August 09–13, 2026, Jeju Island, Republic of Korea Sicheng Yang et al

work page doi:10.35111/da4a-se30 2026
[22]

Alexis Conneau, Min Ma, Simran Khanuja, et al . 2022. FLEURS: FEW-Shot Learning Evaluation of Universal Representations of Speech. InIEEE Spoken Language Technology Workshop, SLT 2022. IEEE, 798–805. doi:10.1109/SLT54892. 2023.10023141

work page doi:10.1109/slt54892 2022
[23]

Costa-jussà, Bokai Yu, Pierre Andrews, et al

Marta R. Costa-jussà, Bokai Yu, Pierre Andrews, et al . 2025. 2M-BELEBELE: Highly Multilingual Speech and American Sign Language Comprehension Dataset Download PDF. InFindings of the Association for Computational Lin- guistics, ACL 2025, Vol. ACL 2025. Association for Computational Linguistics, 10893–10904. https://aclanthology.org/2025.findings-acl.569/

2025
[24]

Junbo Cui, Bokai Xu, Chongyi Wang, et al . 2026. MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction.CoRRabs/2604.27393 (2026). arXiv:2604.27393 doi:10.48550/ARXIV.2604.27393

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2604.27393 2026
[25]

Wenqian Cui, Dianzhi Yu, Xiaoqi Jiao, et al. 2025. Recent Advances in Speech Language Models: A Survey. InMeeting of the Association for Computational Linguistics, ACL 2025. Association for Computational Linguistics, 13943–13970. https://aclanthology.org/2025.acl-long.682/

2025
[26]

Yuhang Dai, Ziyu Zhang, Shuai Wang, et al. 2025. WenetSpeech-Chuan: A Large- Scale Sichuanese Corpus with Rich Annotation for Dialectal Speech Processing. CoRRabs/2509.18004 (2025). arXiv:2509.18004 doi:10.48550/ARXIV.2509.18004

work page doi:10.48550/arxiv.2509.18004 2025
[27]

Alexandre Défossez, Laurent Mazaré, Manu Orsini, et al. 2024. Moshi: a speech- text foundation model for real-time dialogue.CoRRabs/2410.00037 (2024). arXiv:2410.00037 doi:10.48550/ARXIV.2410.00037

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2410.00037 2024
[28]

Kaustubh Deshpande, Ved Sirdeshmukh, Johannes Baptist Mols, et al . 2025. MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs. InFindings of the Association for Computational Linguistics, ACL 2025, Vol. ACL 2025. Association for Computational Linguistics, 18632–18702. https://aclanthology.org/2025.findings...

2025
[29]

Xinhan Di, Zihao Chen, Yunming Liang, et al . 2024. Bailing-TTS: Chinese Dialectal Speech Synthesis Towards Human-like Spontaneous Representation. CoRRabs/2408.00284 (2024). arXiv:2408.00284 doi:10.48550/ARXIV.2408.00284

work page doi:10.48550/arxiv.2408.00284 2024
[30]

Seungheon Doh, Keunwoo Choi, Jongpil Lee, et al. 2023. LP-MusicCaps: LLM- Based Pseudo Music Captioning. InInternational Society for Music Information Retrieval Conference, ISMIR 2023. 409–416. doi:10.5281/ZENODO.10265311

work page doi:10.5281/zenodo.10265311 2023
[31]

Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. 2020. Clotho: an Audio Captioning Dataset. In2020 IEEE International Conference on Acoustics. IEEE, 736–740. doi:10.1109/ICASSP40776.2020.9052990

work page doi:10.1109/icassp40776.2020.9052990 2020
[32]

Jiayu Du, Xingyu Na, Xuechen Liu, et al . 2018. AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale.CoRRabs/1808.10583 (2018). arXiv:1808.10583 http://arxiv.org/abs/1808.10583

Pith/arXiv arXiv 2018
[33]

Zhihao Du, Changfeng Gao, Yuxuan Wang, et al . 2025. CosyVoice 3: To- wards In-the-wild Speech Generation via Scaling-up and Post-training.CoRR abs/2505.17589 (2025). arXiv:2505.17589 doi:10.48550/ARXIV.2505.17589

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2505.17589 2025
[34]

Qingkai Fang, Shoutao Guo, Yan Zhou, et al . 2025. LLaMA-Omni: Seamless Speech Interaction with Large Language Models. InThe Thirteenth International Conference on Learning Representations, ICLR 2025. OpenReview.net. https: //openreview.net/forum?id=PYmrUQmMEw

2025
[35]

Qingkai Fang, Yan Zhou, Shoutao Guo, et al. 2025. LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis. CoRRabs/2505.02625 (2025). arXiv:2505.02625 doi:10.48550/ARXIV.2505.02625

work page doi:10.48550/arxiv.2505.02625 2025
[36]

Yuan Gong, Jin Yu, and James R. Glass. 2022. Vocalsound: A Dataset for Improving Human Vocal Sounds Recognition. InIEEE International Confer- ence on Acoustics, Speech and Signal Processing, ICASSP 2022. IEEE, 151–155. doi:10.1109/ICASSP43922.2022.9746828

work page doi:10.1109/icassp43922.2022.9746828 2022
[37]

Google DeepMind. 2025. Gemini 3 Flash Model Card. https://storage.googleapis. com/deepmind-media/Model-Cards/Gemini-3-Flash-Model-Card.pdf

2025
[38]

Jack Hong, Shilin Yan, Jiayin Cai, et al. 2025. WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs.CoRRabs/2502.04326 (2025). arXiv:2502.04326 doi:10.48550/ARXIV.2502.04326

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2502.04326 2025
[39]

Chien-Yu Huang, Ke-Han Lu, Shih-Heng Wang, et al. 2024. Dynamic-Superb: Towards a Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark For Speech. InIEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2024. IEEE, 12136–12140. doi:10.1109/ICASSP48485. 2024.10448257

work page doi:10.1109/icassp48485 2024
[40]

Hunt and Alan W

Andrew J. Hunt and Alan W. Black. 1996. Unit selection in a concatenative speech synthesis system using a large speech database. In1996 IEEE International Conference on Acoustics. IEEE Computer Society, 373–376. doi:10.1109/ICASSP. 1996.541110

work page doi:10.1109/icassp 1996
[41]

Katsumi Ibaraki and David Chiang. 2025. Frustratingly Easy Data Augmentation for Low-Resource ASR.CoRRabs/2509.15373 (2025). arXiv:2509.15373 doi:10. 48550/ARXIV.2509.15373

arXiv 2025
[42]

Shengpeng Ji, Yifu Chen, Minghui Fang, et al . 2024. WavChat: A Survey of Spoken Dialogue Models.CoRRabs/2411.13577 (2024). arXiv:2411.13577 doi:10. 48550/ARXIV.2411.13577

arXiv 2024
[43]

Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, et al . 2019. Audio- Caps: Generating Captions for Audios in The Wild. In2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 119–132. doi:10.18653/V1/N19-1011

work page doi:10.18653/v1/n19-1011 2019
[44]

Jiyeon Kim, Mehul Kumar, Dhananjaya Gowda, et al. 2021. Semi-Supervised Transfer Learning for Language Expansion of End-to-End Speech Recognition Models to Low-Resource Languages. InIEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021. IEEE, 984–988. doi:10.1109/ASRU51503. 2021.9688019

work page doi:10.1109/asru51503 2021
[45]

KimiTeam, Ding Ding, Zeqian Ju, et al . 2025. Kimi-Audio Technical Report. CoRRabs/2504.18425 (2025). arXiv:2504.18425 doi:10.48550/ARXIV.2504.18425

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2504.18425 2025
[46]

Chia-Hsuan Li, Szu-Lin Wu, Chi-Liang Liu, et al. 2018. Spoken SQuAD: A Study of Mitigating the Impact of Speech Recognition Errors on Listening Comprehen- sion. In19th Annual Conference of the International Speech Communication Associ- ation, Interspeech 2018. ISCA, 3459–3463. doi:10.21437/INTERSPEECH.2018-1714

work page doi:10.21437/interspeech.2018-1714 2018
[47]

Guojian Li, Chengyou Wang, Hongfei Xue, et al . 2025. Easy Turn: Integrat- ing Acoustic and Linguistic Modalities for Robust Turn-Taking in Full-Duplex Spoken Dialogue Systems.CoRRabs/2509.23938 (2025). arXiv:2509.23938 doi:10.48550/ARXIV.2509.23938

work page doi:10.48550/arxiv.2509.23938 2025
[48]

Longhao Li, Zhao Guo, Hongjie Chen, et al. 2025. WenetSpeech-Yue: A Large- scale Cantonese Speech Corpus with Multi-dimensional Annotation.CoRR abs/2509.03959 (2025). arXiv:2509.03959 doi:10.48550/ARXIV.2509.03959

work page doi:10.48550/arxiv.2509.03959 2025
[49]

Tianpeng Li, Jun Liu, Tao Zhang, et al . 2025. Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction.CoRRabs/2502.17239 (2025). arXiv:2502.17239 doi:10.48550/ARXIV.2502.17239

work page doi:10.48550/arxiv.2502.17239 2025
[50]

Xuechen Li, Tianyi Zhang, Yann Dubois, et al. 2023. AlpacaEval: An Automatic Evaluator of Instruction-following Models

2023
[51]

Yizhi Li, Ge Zhang, Yinghao Ma, et al . 2024. OmniBench: Towards The Future of Universal Omni-Language Models.CoRRabs/2409.15272 (2024). arXiv:2409.15272 doi:10.48550/ARXIV.2409.15272

work page doi:10.48550/arxiv.2409.15272 2024
[52]

Zhanxun Liu, Yifan Duan, Mengmeng Wang, et al. 2025. X-Talk: On the Un- derestimated Potential of Modular Speech-to-Speech Dialogue System.CoRR abs/2512.18706 (2025). arXiv:2512.18706 doi:10.48550/ARXIV.2512.18706

work page doi:10.48550/arxiv.2512.18706 2025
[53]

Ziyang Ma, Yinghao Ma, Yanqiao Zhu, et al . 2025. MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix.CoRR abs/2505.13032 (2025). arXiv:2505.13032 doi:10.48550/ARXIV.2505.13032

work page doi:10.48550/arxiv.2505.13032 2025
[54]

Ziyang Ma, Guanrou Yang, Wenxi Chen, et al. 2026. SLAM-LLM: A Modular, Open-Source Multimodal Large Language Model Framework and Best Practice for Speech, Language, Audio and Music Processing.IEEE Journal of Selected Topics in Signal Processing(2026)

2026
[55]

Xinhao Mei, Chutong Meng, Haohe Liu, et al . 2024. WavCaps: A ChatGPT- Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Mul- timodal Research.IEEE ACM Trans. Audio Speech Lang. Process.32 (2024), 3339–3354. doi:10.1109/TASLP.2024.3419446

work page doi:10.1109/taslp.2024.3419446 2024
[56]

Yangyang Meng, Jinpeng Li, Guodong Lin, et al. 2025. Dolphin: A Large-Scale Au- tomatic Speech Recognition Model for Eastern Languages.CoRRabs/2503.20212 (2025). arXiv:2503.20212 doi:10.48550/ARXIV.2503.20212

work page doi:10.48550/arxiv.2503.20212 2025
[57]

OpenAI. 2024. GPT-4o System Card.CoRRabs/2410.21276 (2024). arXiv:2410.21276 doi:10.48550/ARXIV.2410.21276

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2410.21276 2024
[58]

OpenAI. 2025. gpt-audio-mini. https://platform.openai.com/docs/models/gpt- audio-mini

2025
[59]

Vassil Panayotov, Guoguo Chen, Daniel Povey, et al . 2015. Librispeech: An ASR corpus based on public domain audio books. In2015 IEEE International Conference on Acoustics. IEEE, 5206–5210. doi:10.1109/ICASSP.2015.7178964

work page doi:10.1109/icassp.2015.7178964 2015
[60]

Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, et al. 2019. MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations. In Conference of the Association for Computational Linguistics, ACL 2019. Associa- tion for Computational Linguistics, 527–536. doi:10.18653/V1/P19-1050

work page doi:10.18653/v1/p19-1050 2019
[61]

Vineel Pratap, Andros Tjandra, Bowen Shi, et al. 2024. Scaling Speech Tech- nology to 1, 000+ Languages.J. Mach. Learn. Res.25 (2024), 97:1–97:52. https://jmlr.org/papers/v25/23-1318.html

2024
[62]

Vineel Pratap, Qiantong Xu, Anuroop Sriram, et al. 2020. MLS: A Large-Scale Multilingual Dataset for Speech Research. In21st Annual Conference of the International Speech Communication Association, Interspeech 2020. ISCA, 2757–

2020
[63]

doi:10.21437/INTERSPEECH.2020-2826

work page doi:10.21437/interspeech.2020-2826 2020
[64]

Alec Radford, Jong Wook Kim, Tao Xu, et al. 2023. Robust Speech Recognition via Large-Scale Weak Supervision. InInternational Conference on Machine Learning, ICML 2023, Vol. 202. PMLR, 28492–28518. https://proceedings.mlr.press/v202/ radford23a.html

2023
[65]

Rajarshi Roy, Jonathan Raiman, Sang-Gil Lee, et al. 2026. PersonaPlex: Voice and Role Control for Full Duplex Conversational Speech Models.CoRR abs/2602.06053 (2026). arXiv:2602.06053 doi:10.48550/ARXIV.2602.06053

work page doi:10.48550/arxiv.2602.06053 2026
[66]

Shulan Ruan, Huijie Liu, Zhao Chen, Bin Feng, Kun Zhang, Caleb Chen Cao, Enhong Chen, and Lei Chen. 2025. CPWS: Confident programmatic weak supervision for high-quality data labeling.ACM Transactions on Information Systems43, 4 (2025), 1–26

2025
[67]

Sakshi, Utkarsh Tyagi, Sonal Kumar, et al

S. Sakshi, Utkarsh Tyagi, Sonal Kumar, et al. 2025. MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark. InThe Thirteenth International Conference on Learning Representations, ICLR 2025. OpenReview.net. https: //openreview.net/forum?id=TeVAZXr3yv

2025
[68]

Jun Shi, Shulan Ruan, Ziqi Zhu, Minfan Zhao, Hong An, Xudong Xue, and Bing Yan. 2024. Predictive accuracy-based active learning for medical image PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects KDD ’26, August 09–13, 2026, Jeju Island, Republic of Korea segmentation. InProceedings of the Thirty-Third Int...

2024
[69]

Qundong Shi, Jie Zhou, Biyuan Lin, et al . 2026. UltraEval-Audio: A Unified Framework for Comprehensive Evaluation of Audio Foundation Models.arXiv preprint arXiv:2601.01373(2026)

arXiv 2026
[70]

Xian Shi, Qiangze Feng, and Lei Xie. 2020. The ASRU 2019 Mandarin-English Code-Switching Speech Recognition Challenge: Open Datasets, Tracks, Methods and Results.CoRRabs/2007.05916 (2020). arXiv:2007.05916 https://arxiv.org/ abs/2007.05916

arXiv 2020
[71]

Xian Shi, Xiong Wang, Zhifang Guo, et al. 2026. Qwen3-ASR Technical Report. arXiv preprint arXiv:2601.21337(2026)

Pith/arXiv arXiv 2026
[72]

Shuzheng Si, Wentao Ma, Haoyu Gao, et al. 2023. SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented Dialogue Agents. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023. http://papers.nips.cc/ paper_files/paper/2023/hash/7b16688a2b053a1b01474ab5c78ce662-Abstract- Data...

2023
[73]

StepFun. 2026. StepAudio 2.5 TTS: Contextual TTS. https://platform.stepfun. com/docs/zh/guides/models/stepaudio-2.5-tts

2026
[74]

Tianxiang Sun, Xiaotian Zhang, Zhengfu He, et al . 2024. MOSS: An Open Conversational Large Language Model.Mach. Intell. Res.21, 5 (2024), 888–905. doi:10.1007/S11633-024-1502-8

work page doi:10.1007/s11633-024-1502-8 2024
[75]

Zhiyuan Tang, Dong Wang, Yanguang Xu, et al. 2021. KeSpeech: An Open Source Speech Dataset of Mandarin and Its Eight Subdialects. InNeural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021. https://datasets-benchmarks-proceedings.neurips.cc/paper/ 2021/hash/0336dcbab05b9d5ad24f4333c7658a0e-Abstract-round2.html

2021
[76]

Llama Team. 2024. The Llama 3 Herd of Models.CoRRabs/2407.21783 (2024). arXiv:2407.21783 doi:10.48550/ARXIV.2407.21783

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2407.21783 2024
[77]

Qwen Team. 2026. Qwen3.5-Omni Technical Report.CoRRabs/2604.15804 (2026). arXiv:2604.15804 doi:10.48550/ARXIV.2604.15804

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2604.15804 2026
[78]

Tongyi Fun Team, Qian Chen, Luyao Cheng, et al . 2025. Fun-Audio-Chat Technical Report.CoRRabs/2512.20156 (2025). arXiv:2512.20156 doi:10.48550/ ARXIV.2512.20156

arXiv 2025
[79]

Zeyue Tian, Binxin Yang, Zhaoyang Liu, et al. 2026. Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing.CoRR abs/2604.10708 (2026). arXiv:2604.10708 doi:10.48550/ARXIV.2604.10708

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2604.10708 2026
[80]

Changhan Wang, Morgane Rivière, Ann Lee, et al. 2021. VoxPopuli: A Large- Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation. InMeeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021. Association for Computat...

2021

Showing first 80 references.

[1] [1]

Inclusion AI, Biao Gong, Cheng Zou, et al. 2025. Ming-Omni: A Unified Mul- timodal Model for Perception and Generation.CoRRabs/2506.09344 (2025). https://doi.org/10.48550/arXiv.2506.09344

work page doi:10.48550/arxiv.2506.09344 2025

[2] [2]

Keyu An, Qian Chen, Chong Deng, et al . 2024. FunAudioLLM: Voice Un- derstanding and Generation Foundation Models for Natural Interaction Be- tween Humans and LLMs.CoRRabs/2407.04051 (2024). arXiv:2407.04051 doi:10.48550/ARXIV.2407.04051

work page doi:10.48550/arxiv.2407.04051 2024

[3] [3]

Philip Anastassiou, Jiawei Chen, Jitong Chen, et al. 2024. Seed-TTS: A Family of High-Quality Versatile Speech Generation Models.CoRRabs/2406.02430 (2024). arXiv:2406.02430 doi:10.48550/ARXIV.2406.02430

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2406.02430 2024

[4] [4]

Rosana Ardila, Megan Branson, Kelly Davis, et al . 2020. Common Voice: A Massively-Multilingual Speech Corpus. InLanguage Resources and Evaluation Conference, LREC 2020. European Language Resources Association, 4218–4222. https://aclanthology.org/2020.lrec-1.520/

2020

[5] [5]

Siddhant Arora, Kai-Wei Chang, Chung-Ming Chien, et al . 2025. On The Landscape of Spoken Language Models: A Comprehensive Survey.CoRR abs/2504.08528 (2025). arXiv:2504.08528 doi:10.48550/ARXIV.2504.08528

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2504.08528 2025

[6] [6]

Lucas Bandarkar, Davis Liang, Benjamin Muller, et al. 2024. The Belebele Bench- mark: a Parallel Reading Comprehension Dataset in 122 Language Variants. In Meeting of the Association for Computational Linguistics, ACL 2024. Association for Computational Linguistics, 749–775. doi:10.18653/V1/2024.ACL-LONG.44

work page doi:10.18653/v1/2024.acl-long.44 2024

[7] [7]

Martijn Bartelds, Nay San, Bradley McDonnell, et al . 2023. Making More of Little Data: Improving Low-Resource Automatic Speech Recognition Using Data Augmentation. InMeeting of the Association for Computational Linguistics, ACL

2023

[8] [8]

Goanta, N

Association for Computational Linguistics, 715–729. doi:10.18653/V1/2023. ACL-LONG.42

work page doi:10.18653/v1/2023 2023

[9] [9]

Emanuele Bastianelli, Andrea Vanzo, Pawel Swietojanski, et al. 2020. SLURP: A Spoken Language Understanding Resource Package. In2020 Conference on Em- pirical Methods in Natural Language Processing. Association for Computational Linguistics, 7252–7262. doi:10.18653/V1/2020.EMNLP-MAIN.588

work page doi:10.18653/v1/2020.emnlp-main.588 2020

[10] [10]

Hui Bu, Jiayu Du, Xingyu Na, et al. 2017. AISHELL-1: An Open-Source Mandarin Speech Corpus and A Speech Recognition Baseline.CoRRabs/1709.05522 (2017). arXiv:1709.05522 http://arxiv.org/abs/1709.05522

Pith/arXiv arXiv 2017

[11] [11]

Carlos Busso, Murtaza Bulut, Chi-Chun Lee, et al. 2008. IEMOCAP: interactive emotional dyadic motion capture database.Lang. Resour. Evaluation42, 4 (2008), 335–359. doi:10.1007/S10579-008-9076-6

work page doi:10.1007/s10579-008-9076-6 2008

[12] [12]

ByteDance Seed Team. 2026. Introducing Seed Full-Duplex Speech LLM: Attentive Listening, Robust Interference Suppression, Enabling More Natural Interaction. https://seed.bytedance.com/en/blog/introducing-seed-full-duplex- speech-llm-attentive-listening-robust-interference-suppression-enabling- more-natural-interaction. Project Page: https://seed.bytedance...

2026

[13] [13]

Guoguo Chen, Shuzhou Chai, Guan-Bo Wang, et al . 2021. GigaSpeech: An Evolving, Multi-Domain ASR Corpus with 10, 000 Hours of Transcribed Audio. In22nd Annual Conference of the International Speech Communication Association, Interspeech 2021. ISCA, 3670–3674. doi:10.21437/INTERSPEECH.2021-1965

work page doi:10.21437/interspeech.2021-1965 2021

[14] [14]

Hongjie Chen, Zehan Li, Guangmin Xia, et al. 2024. Telespeechpt: Large-scale chinese multi-dialect and multi-accent speech pre-training. InNational Confer- ence on Man-Machine Speech Communication. 183–190

2024

[15] [15]

Junjie Chen, Yao Hu, Junjie Li, et al. 2025. FireRedChat: A Pluggable, Full-Duplex Voice Interaction System with Cascaded and Semi-Cascaded Implementations. CoRRabs/2509.06502 (2025). arXiv:2509.06502 doi:10.48550/ARXIV.2509.06502

work page doi:10.48550/arxiv.2509.06502 2025

[16] [16]

Qian Chen, Yafeng Chen, Yanni Chen, et al. 2025. MinMo: A Multimodal Large Language Model for Seamless Voice Interaction.CoRRabs/2501.06282 (2025). arXiv:2501.06282 doi:10.48550/ARXIV.2501.06282

work page doi:10.48550/arxiv.2501.06282 2025

[17] [17]

Wenxi Chen, Ziyang Ma, Ruiqi Yan, et al . 2025. SLAM-Omni: Timbre- Controllable Voice Interaction System with Single-Stage Training. InFindings of the Association for Computational Linguistics, ACL 2025, Vol. ACL 2025. As- sociation for Computational Linguistics, 2262–2282. https://aclanthology.org/ 2025.findings-acl.115/

2025

[18] [18]

Yiming Chen, Xianghu Yue, Chen Zhang, et al. 2024. VoiceBench: Benchmarking LLM-Based Voice Assistants.CoRRabs/2410.17196 (2024). arXiv:2410.17196 doi:10.48550/ARXIV.2410.17196

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2410.17196 2024

[19] [19]

Ziqi Chen, Gongyu Chen, Yihua Wang, et al . 2025. DiaMoE-TTS: A Unified IPA-Based Dialect TTS Framework with Mixture-of-Experts and Parameter- Efficient Zero-Shot Adaptation.CoRRabs/2509.22727 (2025). arXiv:2509.22727 doi:10.48550/ARXIV.2509.22727

work page doi:10.48550/arxiv.2509.22727 2025

[20] [20]

Yunfei Chu, Jin Xu, Qian Yang, et al . 2024. Qwen2-Audio Technical Report. CoRRabs/2407.10759 (2024). arXiv:2407.10759 doi:10.48550/ARXIV.2407.10759

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2407.10759 2024

[21] [21]

Christopher Cieri, David Graff, Owen Kimball, et al . [n. d.].Fisher English Training Speech Part 1 Speech. doi:10.35111/da4a-se30 KDD ’26, August 09–13, 2026, Jeju Island, Republic of Korea Sicheng Yang et al

work page doi:10.35111/da4a-se30 2026

[22] [22]

Alexis Conneau, Min Ma, Simran Khanuja, et al . 2022. FLEURS: FEW-Shot Learning Evaluation of Universal Representations of Speech. InIEEE Spoken Language Technology Workshop, SLT 2022. IEEE, 798–805. doi:10.1109/SLT54892. 2023.10023141

work page doi:10.1109/slt54892 2022

[23] [23]

Costa-jussà, Bokai Yu, Pierre Andrews, et al

Marta R. Costa-jussà, Bokai Yu, Pierre Andrews, et al . 2025. 2M-BELEBELE: Highly Multilingual Speech and American Sign Language Comprehension Dataset Download PDF. InFindings of the Association for Computational Lin- guistics, ACL 2025, Vol. ACL 2025. Association for Computational Linguistics, 10893–10904. https://aclanthology.org/2025.findings-acl.569/

2025

[24] [24]

Junbo Cui, Bokai Xu, Chongyi Wang, et al . 2026. MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction.CoRRabs/2604.27393 (2026). arXiv:2604.27393 doi:10.48550/ARXIV.2604.27393

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2604.27393 2026

[25] [25]

Wenqian Cui, Dianzhi Yu, Xiaoqi Jiao, et al. 2025. Recent Advances in Speech Language Models: A Survey. InMeeting of the Association for Computational Linguistics, ACL 2025. Association for Computational Linguistics, 13943–13970. https://aclanthology.org/2025.acl-long.682/

2025

[26] [26]

Yuhang Dai, Ziyu Zhang, Shuai Wang, et al. 2025. WenetSpeech-Chuan: A Large- Scale Sichuanese Corpus with Rich Annotation for Dialectal Speech Processing. CoRRabs/2509.18004 (2025). arXiv:2509.18004 doi:10.48550/ARXIV.2509.18004

work page doi:10.48550/arxiv.2509.18004 2025

[27] [27]

Alexandre Défossez, Laurent Mazaré, Manu Orsini, et al. 2024. Moshi: a speech- text foundation model for real-time dialogue.CoRRabs/2410.00037 (2024). arXiv:2410.00037 doi:10.48550/ARXIV.2410.00037

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2410.00037 2024

[28] [28]

Kaustubh Deshpande, Ved Sirdeshmukh, Johannes Baptist Mols, et al . 2025. MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs. InFindings of the Association for Computational Linguistics, ACL 2025, Vol. ACL 2025. Association for Computational Linguistics, 18632–18702. https://aclanthology.org/2025.findings...

2025

[29] [29]

Xinhan Di, Zihao Chen, Yunming Liang, et al . 2024. Bailing-TTS: Chinese Dialectal Speech Synthesis Towards Human-like Spontaneous Representation. CoRRabs/2408.00284 (2024). arXiv:2408.00284 doi:10.48550/ARXIV.2408.00284

work page doi:10.48550/arxiv.2408.00284 2024

[30] [30]

Seungheon Doh, Keunwoo Choi, Jongpil Lee, et al. 2023. LP-MusicCaps: LLM- Based Pseudo Music Captioning. InInternational Society for Music Information Retrieval Conference, ISMIR 2023. 409–416. doi:10.5281/ZENODO.10265311

work page doi:10.5281/zenodo.10265311 2023

[31] [31]

Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. 2020. Clotho: an Audio Captioning Dataset. In2020 IEEE International Conference on Acoustics. IEEE, 736–740. doi:10.1109/ICASSP40776.2020.9052990

work page doi:10.1109/icassp40776.2020.9052990 2020

[32] [32]

Jiayu Du, Xingyu Na, Xuechen Liu, et al . 2018. AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale.CoRRabs/1808.10583 (2018). arXiv:1808.10583 http://arxiv.org/abs/1808.10583

Pith/arXiv arXiv 2018

[33] [33]

Zhihao Du, Changfeng Gao, Yuxuan Wang, et al . 2025. CosyVoice 3: To- wards In-the-wild Speech Generation via Scaling-up and Post-training.CoRR abs/2505.17589 (2025). arXiv:2505.17589 doi:10.48550/ARXIV.2505.17589

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2505.17589 2025

[34] [34]

Qingkai Fang, Shoutao Guo, Yan Zhou, et al . 2025. LLaMA-Omni: Seamless Speech Interaction with Large Language Models. InThe Thirteenth International Conference on Learning Representations, ICLR 2025. OpenReview.net. https: //openreview.net/forum?id=PYmrUQmMEw

2025

[35] [35]

Qingkai Fang, Yan Zhou, Shoutao Guo, et al. 2025. LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis. CoRRabs/2505.02625 (2025). arXiv:2505.02625 doi:10.48550/ARXIV.2505.02625

work page doi:10.48550/arxiv.2505.02625 2025

[36] [36]

Yuan Gong, Jin Yu, and James R. Glass. 2022. Vocalsound: A Dataset for Improving Human Vocal Sounds Recognition. InIEEE International Confer- ence on Acoustics, Speech and Signal Processing, ICASSP 2022. IEEE, 151–155. doi:10.1109/ICASSP43922.2022.9746828

work page doi:10.1109/icassp43922.2022.9746828 2022

[37] [37]

Google DeepMind. 2025. Gemini 3 Flash Model Card. https://storage.googleapis. com/deepmind-media/Model-Cards/Gemini-3-Flash-Model-Card.pdf

2025

[38] [38]

Jack Hong, Shilin Yan, Jiayin Cai, et al. 2025. WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs.CoRRabs/2502.04326 (2025). arXiv:2502.04326 doi:10.48550/ARXIV.2502.04326

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2502.04326 2025

[39] [39]

Chien-Yu Huang, Ke-Han Lu, Shih-Heng Wang, et al. 2024. Dynamic-Superb: Towards a Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark For Speech. InIEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2024. IEEE, 12136–12140. doi:10.1109/ICASSP48485. 2024.10448257

work page doi:10.1109/icassp48485 2024

[40] [40]

Hunt and Alan W

Andrew J. Hunt and Alan W. Black. 1996. Unit selection in a concatenative speech synthesis system using a large speech database. In1996 IEEE International Conference on Acoustics. IEEE Computer Society, 373–376. doi:10.1109/ICASSP. 1996.541110

work page doi:10.1109/icassp 1996

[41] [41]

Katsumi Ibaraki and David Chiang. 2025. Frustratingly Easy Data Augmentation for Low-Resource ASR.CoRRabs/2509.15373 (2025). arXiv:2509.15373 doi:10. 48550/ARXIV.2509.15373

arXiv 2025

[42] [42]

Shengpeng Ji, Yifu Chen, Minghui Fang, et al . 2024. WavChat: A Survey of Spoken Dialogue Models.CoRRabs/2411.13577 (2024). arXiv:2411.13577 doi:10. 48550/ARXIV.2411.13577

arXiv 2024

[43] [43]

Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, et al . 2019. Audio- Caps: Generating Captions for Audios in The Wild. In2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 119–132. doi:10.18653/V1/N19-1011

work page doi:10.18653/v1/n19-1011 2019

[44] [44]

Jiyeon Kim, Mehul Kumar, Dhananjaya Gowda, et al. 2021. Semi-Supervised Transfer Learning for Language Expansion of End-to-End Speech Recognition Models to Low-Resource Languages. InIEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021. IEEE, 984–988. doi:10.1109/ASRU51503. 2021.9688019

work page doi:10.1109/asru51503 2021

[45] [45]

KimiTeam, Ding Ding, Zeqian Ju, et al . 2025. Kimi-Audio Technical Report. CoRRabs/2504.18425 (2025). arXiv:2504.18425 doi:10.48550/ARXIV.2504.18425

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2504.18425 2025

[46] [46]

Chia-Hsuan Li, Szu-Lin Wu, Chi-Liang Liu, et al. 2018. Spoken SQuAD: A Study of Mitigating the Impact of Speech Recognition Errors on Listening Comprehen- sion. In19th Annual Conference of the International Speech Communication Associ- ation, Interspeech 2018. ISCA, 3459–3463. doi:10.21437/INTERSPEECH.2018-1714

work page doi:10.21437/interspeech.2018-1714 2018

[47] [47]

Guojian Li, Chengyou Wang, Hongfei Xue, et al . 2025. Easy Turn: Integrat- ing Acoustic and Linguistic Modalities for Robust Turn-Taking in Full-Duplex Spoken Dialogue Systems.CoRRabs/2509.23938 (2025). arXiv:2509.23938 doi:10.48550/ARXIV.2509.23938

work page doi:10.48550/arxiv.2509.23938 2025

[48] [48]

Longhao Li, Zhao Guo, Hongjie Chen, et al. 2025. WenetSpeech-Yue: A Large- scale Cantonese Speech Corpus with Multi-dimensional Annotation.CoRR abs/2509.03959 (2025). arXiv:2509.03959 doi:10.48550/ARXIV.2509.03959

work page doi:10.48550/arxiv.2509.03959 2025

[49] [49]

Tianpeng Li, Jun Liu, Tao Zhang, et al . 2025. Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction.CoRRabs/2502.17239 (2025). arXiv:2502.17239 doi:10.48550/ARXIV.2502.17239

work page doi:10.48550/arxiv.2502.17239 2025

[50] [50]

Xuechen Li, Tianyi Zhang, Yann Dubois, et al. 2023. AlpacaEval: An Automatic Evaluator of Instruction-following Models

2023

[51] [51]

Yizhi Li, Ge Zhang, Yinghao Ma, et al . 2024. OmniBench: Towards The Future of Universal Omni-Language Models.CoRRabs/2409.15272 (2024). arXiv:2409.15272 doi:10.48550/ARXIV.2409.15272

work page doi:10.48550/arxiv.2409.15272 2024

[52] [52]

Zhanxun Liu, Yifan Duan, Mengmeng Wang, et al. 2025. X-Talk: On the Un- derestimated Potential of Modular Speech-to-Speech Dialogue System.CoRR abs/2512.18706 (2025). arXiv:2512.18706 doi:10.48550/ARXIV.2512.18706

work page doi:10.48550/arxiv.2512.18706 2025

[53] [53]

Ziyang Ma, Yinghao Ma, Yanqiao Zhu, et al . 2025. MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix.CoRR abs/2505.13032 (2025). arXiv:2505.13032 doi:10.48550/ARXIV.2505.13032

work page doi:10.48550/arxiv.2505.13032 2025

[54] [54]

Ziyang Ma, Guanrou Yang, Wenxi Chen, et al. 2026. SLAM-LLM: A Modular, Open-Source Multimodal Large Language Model Framework and Best Practice for Speech, Language, Audio and Music Processing.IEEE Journal of Selected Topics in Signal Processing(2026)

2026

[55] [55]

Xinhao Mei, Chutong Meng, Haohe Liu, et al . 2024. WavCaps: A ChatGPT- Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Mul- timodal Research.IEEE ACM Trans. Audio Speech Lang. Process.32 (2024), 3339–3354. doi:10.1109/TASLP.2024.3419446

work page doi:10.1109/taslp.2024.3419446 2024

[56] [56]

Yangyang Meng, Jinpeng Li, Guodong Lin, et al. 2025. Dolphin: A Large-Scale Au- tomatic Speech Recognition Model for Eastern Languages.CoRRabs/2503.20212 (2025). arXiv:2503.20212 doi:10.48550/ARXIV.2503.20212

work page doi:10.48550/arxiv.2503.20212 2025

[57] [57]

OpenAI. 2024. GPT-4o System Card.CoRRabs/2410.21276 (2024). arXiv:2410.21276 doi:10.48550/ARXIV.2410.21276

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2410.21276 2024

[58] [58]

OpenAI. 2025. gpt-audio-mini. https://platform.openai.com/docs/models/gpt- audio-mini

2025

[59] [59]

Vassil Panayotov, Guoguo Chen, Daniel Povey, et al . 2015. Librispeech: An ASR corpus based on public domain audio books. In2015 IEEE International Conference on Acoustics. IEEE, 5206–5210. doi:10.1109/ICASSP.2015.7178964

work page doi:10.1109/icassp.2015.7178964 2015

[60] [60]

Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, et al. 2019. MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations. In Conference of the Association for Computational Linguistics, ACL 2019. Associa- tion for Computational Linguistics, 527–536. doi:10.18653/V1/P19-1050

work page doi:10.18653/v1/p19-1050 2019

[61] [61]

Vineel Pratap, Andros Tjandra, Bowen Shi, et al. 2024. Scaling Speech Tech- nology to 1, 000+ Languages.J. Mach. Learn. Res.25 (2024), 97:1–97:52. https://jmlr.org/papers/v25/23-1318.html

2024

[62] [62]

Vineel Pratap, Qiantong Xu, Anuroop Sriram, et al. 2020. MLS: A Large-Scale Multilingual Dataset for Speech Research. In21st Annual Conference of the International Speech Communication Association, Interspeech 2020. ISCA, 2757–

2020

[63] [63]

doi:10.21437/INTERSPEECH.2020-2826

work page doi:10.21437/interspeech.2020-2826 2020

[64] [64]

Alec Radford, Jong Wook Kim, Tao Xu, et al. 2023. Robust Speech Recognition via Large-Scale Weak Supervision. InInternational Conference on Machine Learning, ICML 2023, Vol. 202. PMLR, 28492–28518. https://proceedings.mlr.press/v202/ radford23a.html

2023

[65] [65]

Rajarshi Roy, Jonathan Raiman, Sang-Gil Lee, et al. 2026. PersonaPlex: Voice and Role Control for Full Duplex Conversational Speech Models.CoRR abs/2602.06053 (2026). arXiv:2602.06053 doi:10.48550/ARXIV.2602.06053

work page doi:10.48550/arxiv.2602.06053 2026

[66] [66]

Shulan Ruan, Huijie Liu, Zhao Chen, Bin Feng, Kun Zhang, Caleb Chen Cao, Enhong Chen, and Lei Chen. 2025. CPWS: Confident programmatic weak supervision for high-quality data labeling.ACM Transactions on Information Systems43, 4 (2025), 1–26

2025

[67] [67]

Sakshi, Utkarsh Tyagi, Sonal Kumar, et al

S. Sakshi, Utkarsh Tyagi, Sonal Kumar, et al. 2025. MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark. InThe Thirteenth International Conference on Learning Representations, ICLR 2025. OpenReview.net. https: //openreview.net/forum?id=TeVAZXr3yv

2025

[68] [68]

Jun Shi, Shulan Ruan, Ziqi Zhu, Minfan Zhao, Hong An, Xudong Xue, and Bing Yan. 2024. Predictive accuracy-based active learning for medical image PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects KDD ’26, August 09–13, 2026, Jeju Island, Republic of Korea segmentation. InProceedings of the Thirty-Third Int...

2024

[69] [69]

Qundong Shi, Jie Zhou, Biyuan Lin, et al . 2026. UltraEval-Audio: A Unified Framework for Comprehensive Evaluation of Audio Foundation Models.arXiv preprint arXiv:2601.01373(2026)

arXiv 2026

[70] [70]

Xian Shi, Qiangze Feng, and Lei Xie. 2020. The ASRU 2019 Mandarin-English Code-Switching Speech Recognition Challenge: Open Datasets, Tracks, Methods and Results.CoRRabs/2007.05916 (2020). arXiv:2007.05916 https://arxiv.org/ abs/2007.05916

arXiv 2020

[71] [71]

Xian Shi, Xiong Wang, Zhifang Guo, et al. 2026. Qwen3-ASR Technical Report. arXiv preprint arXiv:2601.21337(2026)

Pith/arXiv arXiv 2026

[72] [72]

Shuzheng Si, Wentao Ma, Haoyu Gao, et al. 2023. SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented Dialogue Agents. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023. http://papers.nips.cc/ paper_files/paper/2023/hash/7b16688a2b053a1b01474ab5c78ce662-Abstract- Data...

2023

[73] [73]

StepFun. 2026. StepAudio 2.5 TTS: Contextual TTS. https://platform.stepfun. com/docs/zh/guides/models/stepaudio-2.5-tts

2026

[74] [74]

Tianxiang Sun, Xiaotian Zhang, Zhengfu He, et al . 2024. MOSS: An Open Conversational Large Language Model.Mach. Intell. Res.21, 5 (2024), 888–905. doi:10.1007/S11633-024-1502-8

work page doi:10.1007/s11633-024-1502-8 2024

[75] [75]

Zhiyuan Tang, Dong Wang, Yanguang Xu, et al. 2021. KeSpeech: An Open Source Speech Dataset of Mandarin and Its Eight Subdialects. InNeural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021. https://datasets-benchmarks-proceedings.neurips.cc/paper/ 2021/hash/0336dcbab05b9d5ad24f4333c7658a0e-Abstract-round2.html

2021

[76] [76]

Llama Team. 2024. The Llama 3 Herd of Models.CoRRabs/2407.21783 (2024). arXiv:2407.21783 doi:10.48550/ARXIV.2407.21783

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2407.21783 2024

[77] [77]

Qwen Team. 2026. Qwen3.5-Omni Technical Report.CoRRabs/2604.15804 (2026). arXiv:2604.15804 doi:10.48550/ARXIV.2604.15804

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2604.15804 2026

[78] [78]

Tongyi Fun Team, Qian Chen, Luyao Cheng, et al . 2025. Fun-Audio-Chat Technical Report.CoRRabs/2512.20156 (2025). arXiv:2512.20156 doi:10.48550/ ARXIV.2512.20156

arXiv 2025

[79] [79]

Zeyue Tian, Binxin Yang, Zhaoyang Liu, et al. 2026. Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing.CoRR abs/2604.10708 (2026). arXiv:2604.10708 doi:10.48550/ARXIV.2604.10708

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2604.10708 2026

[80] [80]

Changhan Wang, Morgane Rivière, Ann Lee, et al. 2021. VoxPopuli: A Large- Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation. InMeeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021. Association for Computat...

2021