DM-ASR: Diarization-aware Multi-speaker ASR with Large Language Models
Pith reviewed 2026-05-08 09:14 UTC · model grok-4.3
The pith
DM-ASR reformulates multi-speaker ASR as a sequence of speaker- and time-conditioned queries to large language models using diarization outputs as priors.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DM-ASR decomposes a multi-speaker audio chunk into a series of structured queries, each conditioned on one speaker identity and time segment from the diarization output, and generates the corresponding transcription as a dialogue turn; an optional interleaving of word tokens with timestamp tokens further enriches the output while improving overall transcription quality.
What carries the argument
The diarization-aware multi-turn dialogue formulation that converts the audio into per-speaker, per-segment queries so the LLM handles only linguistic generation.
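The decomposition can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the `DiarSegment` type, the prompt wording, and the `build_dialogue_turns` helper are all assumptions about how per-speaker, per-segment queries might be assembled into a multi-turn dialogue.

```python
from dataclasses import dataclass

@dataclass
class DiarSegment:
    speaker: str   # speaker label from the diarization system
    start: float   # segment start time, in seconds
    end: float     # segment end time, in seconds

def build_dialogue_turns(segments):
    """Turn diarization output into one query per (speaker, segment).

    Each query conditions the LLM on who speaks and when, so the model
    only has to generate the words for that turn.
    """
    turns = []
    for seg in sorted(segments, key=lambda s: s.start):
        query = (f"Transcribe speaker {seg.speaker} "
                 f"from {seg.start:.2f}s to {seg.end:.2f}s.")
        turns.append({"role": "user", "content": query})
        # In the real system the assistant turn is generated by the LLM;
        # here it is left as a placeholder.
        turns.append({"role": "assistant", "content": "<transcript>"})
    return turns

# Two overlapping speakers in one chunk become two ordered queries.
diar = [DiarSegment("S1", 0.0, 2.4), DiarSegment("S2", 1.9, 4.1)]
dialogue = build_dialogue_turns(diar)
```

The point of the sketch is the separation of concerns: speaker identity and timing arrive pre-resolved in the query, and the LLM's output space is purely lexical.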
If this is right
- Diarization systems supply reliable structure while LLMs supply linguistic modeling, demonstrating complementary roles.
- Interleaving word and timestamp tokens produces richer outputs and measurably better transcription quality.
- The framework reaches strong performance on Mandarin and English benchmarks with smaller models and limited training data.
- It remains competitive with or exceeds existing unified models that jointly learn everything from scratch.
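The interleaving idea in the second bullet can be made concrete with a small sketch. The `<t>` token format below is hypothetical; the paper specifies only that word tokens alternate with timestamp tokens, not the exact encoding.

```python
def interleave_timestamps(words, times):
    """Interleave word tokens with start-time tokens.

    Produces e.g. ['<0.00>', 'hello', '<0.42>', 'world'];
    the angle-bracket format is illustrative, not the paper's.
    """
    out = []
    for word, t in zip(words, times):
        out.append(f"<{t:.2f}>")  # timestamp token precedes its word
        out.append(word)
    return out

seq = interleave_timestamps(["hello", "world"], [0.0, 0.42])
# seq == ['<0.00>', 'hello', '<0.42>', 'world']
```

Training on such sequences forces the model to commit to word-level alignments, which is one plausible mechanism for the reported quality gain.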
Where Pith is reading between the lines
- Hybrid pipelines that pre-extract structure may require less data and compute than fully end-to-end learned systems for conversational audio tasks.
- The same query-based decomposition could be applied to related problems such as meeting summarization or speaker-specific information extraction.
- Evaluating performance across a range of diarization error rates would quantify the minimum reliability needed from the prior.
Load-bearing premise
Diarization systems supply accurate enough speaker labels and segment boundaries to serve as reliable structural priors that separate timing and identity from word content.
What would settle it
Running the system on the same benchmarks but with deliberately degraded diarization inputs and observing that transcription accuracy falls below strong unified baselines would show the priors are not sufficiently robust.
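The proposed stress test could be implemented by perturbing diarization output before it reaches the ASR stage. The sketch below is an assumption about how such an ablation might look; the function name, tuple layout, and error rates are illustrative.

```python
import random

def perturb_diarization(segments, swap_prob=0.1, max_shift=0.3, seed=0):
    """Degrade diarization output to probe robustness of the prior.

    - With probability `swap_prob`, relabel a segment with another
      speaker seen in the chunk (a speaker-swap error).
    - Shift each boundary by up to `max_shift` seconds in either
      direction (a boundary error).
    Segments are (speaker, start, end) tuples.
    """
    rng = random.Random(seed)
    speakers = sorted({spk for spk, _, _ in segments})
    degraded = []
    for spk, start, end in segments:
        if len(speakers) > 1 and rng.random() < swap_prob:
            spk = rng.choice([s for s in speakers if s != spk])
        start = max(0.0, start + rng.uniform(-max_shift, max_shift))
        end = max(start + 0.05, end + rng.uniform(-max_shift, max_shift))
        degraded.append((spk, start, end))
    return degraded

clean = [("S1", 0.0, 2.4), ("S2", 2.3, 4.1)]
noisy = perturb_diarization(clean, swap_prob=0.5, max_shift=0.3)
```

Sweeping `swap_prob` and `max_shift` while tracking speaker-attributed WER would trace out exactly the robustness curve the review asks for.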
Original abstract
Multi-speaker automatic speech recognition (ASR) aims to transcribe conversational speech involving multiple speakers, requiring the model to capture not only what was said, but also who said it and sometimes when it was spoken. Recent Speech-LLM approaches have shown the potential of unified modeling for this task, but jointly learning speaker attribution, temporal structure, and lexical recognition remains difficult and data-intensive. At the current stage, leveraging reliable speaker diarization as an explicit structural prior provides a practical and efficient way to simplify this task. To effectively exploit such priors, we propose DM-ASR, a diarization-aware multi-speaker ASR framework that reformulates the task as a multi-turn dialogue generation process. Given an audio chunk and diarization results, DM-ASR decomposes transcription into a sequence of speaker- and time-conditioned queries, each corresponding to one speaker in one time segment. This formulation converts multi-speaker recognition into a series of structured sub-tasks, explicitly decoupling speaker-temporal structure from linguistic content and enabling effective integration of diarization cues with the reasoning capability of large language models. We further introduce an optional word-level timestamp prediction mechanism that interleaves word and timestamp tokens, yielding richer structured outputs and better transcription quality. Our analysis shows that diarization systems provide more reliable speaker identities and segment-level boundaries, while LLMs excel at modeling linguistic content and long-range dependencies, demonstrating their complementary strengths. Experiments on Mandarin and English benchmarks show that the proposed approach achieves strong performance with relatively small models and training data, while remaining competitive with or outperforming existing unified approaches.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes DM-ASR, a diarization-aware multi-speaker ASR framework that reformulates the task as a multi-turn dialogue generation process for LLMs. Given audio chunks and diarization outputs, it decomposes transcription into speaker- and time-conditioned queries, explicitly decoupling speaker-temporal structure from linguistic content. An optional word-level timestamp prediction mechanism interleaves word and timestamp tokens. The authors claim that this leverages complementary strengths of diarization (reliable identities/boundaries) and LLMs (linguistic modeling), achieving strong performance on Mandarin and English benchmarks with relatively small models and limited training data while remaining competitive with or outperforming existing unified approaches.
Significance. If the empirical claims hold, the work offers a practical, modular alternative to fully joint Speech-LLM modeling for multi-speaker ASR by exploiting off-the-shelf diarization as an explicit structural prior. This could reduce the data and compute burden of training while enabling richer structured outputs via timestamp interleaving. The formulation highlights a clean separation of concerns that aligns with real-world pipelines where diarization is already performed upstream.
Major comments (1)
- [Abstract] The central empirical claim that the approach 'achieves strong performance with relatively small models and training data, while remaining competitive with or outperforming existing unified approaches' rests on the premise that diarization supplies reliable speaker identities and segment-level boundaries. However, no quantitative analysis of error propagation is described (e.g., performance under speaker swaps or boundary shifts of a few hundred milliseconds that would misalign the multi-turn queries), leaving the robustness of the reported gains untested on the same Mandarin and English test sets.
Minor comments (1)
- The abstract would be strengthened by including at least one key quantitative result (e.g., WER or speaker-attributed WER on a named benchmark) to ground the qualitative claims of 'strong performance' and competitiveness.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the robustness of our empirical claims. We address the concern regarding error propagation from diarization below.
Point-by-point responses
- Referee: [Abstract] The central empirical claim that the approach 'achieves strong performance with relatively small models and training data, while remaining competitive with or outperforming existing unified approaches' rests on the premise that diarization supplies reliable speaker identities and segment-level boundaries. However, no quantitative analysis of error propagation is described (e.g., performance under speaker swaps or boundary shifts of a few hundred milliseconds that would misalign the multi-turn queries), leaving the robustness of the reported gains untested on the same Mandarin and English test sets.
  Authors: We agree that a dedicated quantitative analysis of diarization error propagation would strengthen the manuscript. Our current evaluations rely on off-the-shelf diarization systems applied to the standard Mandarin and English benchmarks, which already incorporate realistic error patterns, and the reported gains hold under these conditions. To directly address the referee's point, we will add controlled experiments in the revision that inject simulated speaker swaps and boundary shifts (e.g., perturbations of 200-500 ms) into the same test sets, measuring the resulting impact on transcription accuracy and demonstrating the framework's tolerance to typical upstream diarization inaccuracies. Revision planned: yes.
Circularity Check
No circularity: architectural reformulation and empirical claims rest on external components
Full rationale
The paper proposes DM-ASR as a framework that reformulates multi-speaker ASR as multi-turn LLM dialogue conditioned on external diarization outputs. No equations, parameter fittings, or derivations appear in the provided text. The central mechanism (decomposing into speaker/time-conditioned queries) is a design choice, not a self-referential definition or fitted prediction. Claims of performance rely on benchmarks and the external premise that diarization supplies reliable priors; this assumption is stated but not derived internally or justified via self-citation chains. No load-bearing self-citations, ansatzes, or renamings of known results are present. The derivation chain is self-contained as an engineering integration of independent modules (diarization + LLM), with no reduction of outputs to inputs by construction.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: Diarization systems provide reliable speaker identities and segment-level boundaries.
- Domain assumption: Large language models can integrate diarization cues with linguistic reasoning to produce accurate transcripts.