Train short, infer long: Speech-LLM enables zero-shot streamable joint ASR and diarization on long audio

· 2025 · arXiv 2511.16046

5 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

method 1

citation-polarity summary

use method 1

representative citing papers

Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR

eess.AS · 2026-04-03 · unverdicted · novelty 7.0

Speaker-Reasoner is an end-to-end speech LLM that iteratively analyzes audio structure, predicts temporal boundaries, and jointly models speaker identity, gender, timestamps, and transcription using a speaker-aware cache for long audio.

DM-ASR: Diarization-aware Multi-speaker ASR with Large Language Models

eess.AS · 2026-04-24 · unverdicted · novelty 6.0

DM-ASR reformulates multi-speaker ASR as multi-turn dialogue generation conditioned on diarization results, achieving competitive benchmark performance with relatively small models and limited data.

Rethinking Speech-LLM Integration for ASR: Effective Joint Speech-Text Training by Interleaving

cs.CL · 2026-07-02 · unverdicted · novelty 4.0

JSTIP interleaves speech and text sequences during pretraining on 38k hours of ASR data to improve entity accuracy over ASR-only and simple joint-training baselines while matching performance from domain text.

Balancing ASR and diarization in end-to-end LLMs for multi-talker speech recognition

eess.AS · 2026-06-11 · unverdicted · novelty 4.0

LLM-based multi-talker ASR with dual-encoder, feature interleaving, length-aware speaker loss, and adaptive ASR threshold achieves 18% and 24% relative gains over baselines on AliMeeting and AISHELL-4.

SoulX-Transcriber: A Robust End-to-End Framework for Multi-Speaker Speech Transcription

eess.AS · 2026-06-01 · unverdicted · novelty 4.0

SoulX-Transcriber is a unified LLM framework for end-to-end multi-speaker transcription using two-stage training (speaker-aware pre-training then supervised fine-tuning) that reports strong results on AliMeeting, AISHELL-4, and AMI.
