Train short, infer long: Speech-llm enables zero-shot streamable joint asr and di- arization on long audio

· 2025 · arXiv 2511.16046

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

read on arXiv browse 3 citing papers

citation-role summary

method 1

citation-polarity summary

use method 1

representative citing papers

Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR

eess.AS · 2026-04-03 · unverdicted · novelty 7.0

Speaker-Reasoner is an end-to-end speech LLM that iteratively analyzes audio structure, predicts temporal boundaries, and jointly models speaker identity, gender, timestamps, and transcription using a speaker-aware cache for long audio.

DM-ASR: Diarization-aware Multi-speaker ASR with Large Language Models

eess.AS · 2026-04-24 · unverdicted · novelty 6.0

DM-ASR reformulates multi-speaker ASR as multi-turn dialogue generation conditioned on diarization results, achieving competitive benchmark performance with relatively small models and limited data.

SoulX-Transcriber: A Robust End-to-End Framework for Multi-Speaker Speech Transcription

eess.AS · 2026-06-01 · unverdicted · novelty 4.0

SoulX-Transcriber is a unified LLM framework for end-to-end multi-speaker transcription using two-stage training (speaker-aware pre-training then supervised fine-tuning) that reports strong results on AliMeeting, AISHELL-4, and AMI.

citing papers explorer

Showing 3 of 3 citing papers.

Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR eess.AS · 2026-04-03 · unverdicted · none · ref 23
Speaker-Reasoner is an end-to-end speech LLM that iteratively analyzes audio structure, predicts temporal boundaries, and jointly models speaker identity, gender, timestamps, and transcription using a speaker-aware cache for long audio.
DM-ASR: Diarization-aware Multi-speaker ASR with Large Language Models eess.AS · 2026-04-24 · unverdicted · none · ref 48
DM-ASR reformulates multi-speaker ASR as multi-turn dialogue generation conditioned on diarization results, achieving competitive benchmark performance with relatively small models and limited data.
SoulX-Transcriber: A Robust End-to-End Framework for Multi-Speaker Speech Transcription eess.AS · 2026-06-01 · unverdicted · none · ref 1
SoulX-Transcriber is a unified LLM framework for end-to-end multi-speaker transcription using two-stage training (speaker-aware pre-training then supervised fine-tuning) that reports strong results on AliMeeting, AISHELL-4, and AMI.

Train short, infer long: Speech-llm enables zero-shot streamable joint asr and di- arization on long audio

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer