DM-ASR: Diarization-aware Multi-speaker ASR with Large Language Models
Pith reviewed 2026-05-08 09:14 UTC · model grok-4.3
The pith
DM-ASR reformulates multi-speaker ASR as a sequence of speaker- and time-conditioned queries to large language models using diarization outputs as priors.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DM-ASR decomposes a multi-speaker audio chunk into a series of structured queries, each conditioned on one speaker identity and time segment from the diarization output, and generates the corresponding transcription as a dialogue turn; an optional interleaving of word tokens with timestamp tokens further enriches the output while improving overall transcription quality.
What carries the argument
The diarization-aware multi-turn dialogue formulation that converts the audio into per-speaker, per-segment queries so the LLM handles only linguistic generation.
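The decomposition can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the `DiarSegment` type, the prompt wording, and the `build_dialogue_turns` helper are all assumptions about how per-speaker, per-segment queries might be assembled into a multi-turn dialogue.

```python
from dataclasses import dataclass

@dataclass
class DiarSegment:
    speaker: str   # speaker label from the diarization system
    start: float   # segment start time, in seconds
    end: float     # segment end time, in seconds

def build_dialogue_turns(segments):
    """Turn diarization output into one query per (speaker, segment).

    Each query conditions the LLM on who speaks and when, so the model
    only has to generate the words for that turn.
    """
    turns = []
    for seg in sorted(segments, key=lambda s: s.start):
        query = (f"Transcribe speaker {seg.speaker} "
                 f"from {seg.start:.2f}s to {seg.end:.2f}s.")
        turns.append({"role": "user", "content": query})
        # In the real system the assistant turn is generated by the LLM;
        # here it is left as a placeholder.
        turns.append({"role": "assistant", "content": "<transcript>"})
    return turns

# Two overlapping speakers in one chunk become two ordered queries.
diar = [DiarSegment("S1", 0.0, 2.4), DiarSegment("S2", 1.9, 4.1)]
dialogue = build_dialogue_turns(diar)
```

The point of the sketch is the separation of concerns: speaker identity and timing arrive pre-resolved in the query, and the LLM's output space is purely lexical.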
If this is right
- Diarization systems supply reliable structure while LLMs supply linguistic modeling, demonstrating complementary roles.
- Interleaving word and timestamp tokens produces richer outputs and measurably better transcription quality.
- The framework reaches strong performance on Mandarin and English benchmarks with smaller models and limited training data.
- It remains competitive with or exceeds existing unified models that jointly learn everything from scratch.
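The interleaving idea in the second bullet can be made concrete with a small sketch. The `<t>` token format below is hypothetical; the paper specifies only that word tokens alternate with timestamp tokens, not the exact encoding.

```python
def interleave_timestamps(words, times):
    """Interleave word tokens with start-time tokens.

    Produces e.g. ['<0.00>', 'hello', '<0.42>', 'world'];
    the angle-bracket format is illustrative, not the paper's.
    """
    out = []
    for word, t in zip(words, times):
        out.append(f"<{t:.2f}>")  # timestamp token precedes its word
        out.append(word)
    return out

seq = interleave_timestamps(["hello", "world"], [0.0, 0.42])
# seq == ['<0.00>', 'hello', '<0.42>', 'world']
```

Training on such sequences forces the model to commit to word-level alignments, which is one plausible mechanism for the reported quality gain.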
Where Pith is reading between the lines
- Hybrid pipelines that pre-extract structure may require less data and compute than fully end-to-end learned systems for conversational audio tasks.
- The same query-based decomposition could be applied to related problems such as meeting summarization or speaker-specific information extraction.
- Evaluating performance across a range of diarization error rates would quantify the minimum reliability needed from the prior.
Load-bearing premise
Diarization systems supply accurate enough speaker labels and segment boundaries to serve as reliable structural priors that separate timing and identity from word content.
What would settle it
Running the system on the same benchmarks but with deliberately degraded diarization inputs and observing that transcription accuracy falls below strong unified baselines would show the priors are not sufficiently robust.
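The proposed stress test could be implemented by perturbing diarization output before it reaches the ASR stage. The sketch below is an assumption about how such an ablation might look; the function name, tuple layout, and error rates are illustrative.

```python
import random

def perturb_diarization(segments, swap_prob=0.1, max_shift=0.3, seed=0):
    """Degrade diarization output to probe robustness of the prior.

    - With probability `swap_prob`, relabel a segment with another
      speaker seen in the chunk (a speaker-swap error).
    - Shift each boundary by up to `max_shift` seconds in either
      direction (a boundary error).
    Segments are (speaker, start, end) tuples.
    """
    rng = random.Random(seed)
    speakers = sorted({spk for spk, _, _ in segments})
    degraded = []
    for spk, start, end in segments:
        if len(speakers) > 1 and rng.random() < swap_prob:
            spk = rng.choice([s for s in speakers if s != spk])
        start = max(0.0, start + rng.uniform(-max_shift, max_shift))
        end = max(start + 0.05, end + rng.uniform(-max_shift, max_shift))
        degraded.append((spk, start, end))
    return degraded

clean = [("S1", 0.0, 2.4), ("S2", 2.3, 4.1)]
noisy = perturb_diarization(clean, swap_prob=0.5, max_shift=0.3)
```

Sweeping `swap_prob` and `max_shift` while tracking speaker-attributed WER would trace out exactly the robustness curve the review asks for.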
Original abstract
Multi-speaker automatic speech recognition (ASR) aims to transcribe conversational speech involving multiple speakers, requiring the model to capture not only what was said, but also who said it and sometimes when it was spoken. Recent Speech-LLM approaches have shown the potential of unified modeling for this task, but jointly learning speaker attribution, temporal structure, and lexical recognition remains difficult and data-intensive. At the current stage, leveraging reliable speaker diarization as an explicit structural prior provides a practical and efficient way to simplify this task. To effectively exploit such priors, we propose DM-ASR, a diarization-aware multi-speaker ASR framework that reformulates the task as a multi-turn dialogue generation process. Given an audio chunk and diarization results, DM-ASR decomposes transcription into a sequence of speaker- and time-conditioned queries, each corresponding to one speaker in one time segment. This formulation converts multi-speaker recognition into a series of structured sub-tasks, explicitly decoupling speaker-temporal structure from linguistic content and enabling effective integration of diarization cues with the reasoning capability of large language models. We further introduce an optional word-level timestamp prediction mechanism that interleaves word and timestamp tokens, yielding richer structured outputs and better transcription quality. Our analysis shows that diarization systems provide more reliable speaker identities and segment-level boundaries, while LLMs excel at modeling linguistic content and long-range dependencies, demonstrating their complementary strengths. Experiments on Mandarin and English benchmarks show that the proposed approach achieves strong performance with relatively small models and training data, while remaining competitive with or outperforming existing unified approaches.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes DM-ASR, a diarization-aware multi-speaker ASR framework that reformulates the task as a multi-turn dialogue generation process for LLMs. Given audio chunks and diarization outputs, it decomposes transcription into speaker- and time-conditioned queries, explicitly decoupling speaker-temporal structure from linguistic content. An optional word-level timestamp prediction mechanism interleaves word and timestamp tokens. The authors claim that this leverages complementary strengths of diarization (reliable identities/boundaries) and LLMs (linguistic modeling), achieving strong performance on Mandarin and English benchmarks with relatively small models and limited training data while remaining competitive with or outperforming existing unified approaches.
Significance. If the empirical claims hold, the work offers a practical, modular alternative to fully joint Speech-LLM modeling for multi-speaker ASR by exploiting off-the-shelf diarization as an explicit structural prior. This could reduce the data and compute burden of training while enabling richer structured outputs via timestamp interleaving. The formulation highlights a clean separation of concerns that aligns with real-world pipelines where diarization is already performed upstream.
Major comments (1)
- [Abstract] The central empirical claim that the approach 'achieves strong performance with relatively small models and training data, while remaining competitive with or outperforming existing unified approaches' rests on the premise that diarization supplies reliable speaker identities and segment-level boundaries. However, no quantitative analysis of error propagation is described (e.g., performance under speaker swaps or boundary shifts of a few hundred milliseconds that would misalign the multi-turn queries), leaving the robustness of the reported gains untested on the same Mandarin and English test sets.
Minor comments (1)
- The abstract would be strengthened by including at least one key quantitative result (e.g., WER or speaker-attributed WER on a named benchmark) to ground the qualitative claims of 'strong performance' and competitiveness.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the robustness of our empirical claims. We address the concern regarding error propagation from diarization below.
Point-by-point responses
- Referee: [Abstract] The central empirical claim that the approach 'achieves strong performance with relatively small models and training data, while remaining competitive with or outperforming existing unified approaches' rests on the premise that diarization supplies reliable speaker identities and segment-level boundaries. However, no quantitative analysis of error propagation is described (e.g., performance under speaker swaps or boundary shifts of a few hundred milliseconds that would misalign the multi-turn queries), leaving the robustness of the reported gains untested on the same Mandarin and English test sets.
  Authors: We agree that a dedicated quantitative analysis of diarization error propagation would strengthen the manuscript. Our current evaluations rely on off-the-shelf diarization systems applied to the standard Mandarin and English benchmarks, which already incorporate realistic error patterns, and the reported gains hold under these conditions. To directly address the referee's point, we will add controlled experiments in the revision that inject simulated speaker swaps and boundary shifts (e.g., perturbations of 200-500 ms) into the same test sets, measuring the resulting impact on transcription accuracy and demonstrating the framework's tolerance to typical upstream diarization inaccuracies. Revision planned: yes.
Circularity Check
No circularity: architectural reformulation and empirical claims rest on external components
Full rationale
The paper proposes DM-ASR as a framework that reformulates multi-speaker ASR as multi-turn LLM dialogue conditioned on external diarization outputs. No equations, parameter fittings, or derivations appear in the provided text. The central mechanism (decomposing into speaker/time-conditioned queries) is a design choice, not a self-referential definition or fitted prediction. Claims of performance rely on benchmarks and the external premise that diarization supplies reliable priors; this assumption is stated but not derived internally or justified via self-citation chains. No load-bearing self-citations, ansatzes, or renamings of known results are present. The derivation chain is self-contained as an engineering integration of independent modules (diarization + LLM), with no reduction of outputs to inputs by construction.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: Diarization systems provide reliable speaker identities and segment-level boundaries.
- Domain assumption: Large language models can integrate diarization cues with linguistic reasoning to produce accurate transcripts.