VIBEVOICE-ASR technical report
8 Pith papers cite this work. Polarity classification is still indexing.
Citation-role and citation-polarity summary
Years: 2026 (8)
Roles: background (1)
Polarities: background (1)
Representative citing papers
- Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR
  Speaker-Reasoner is an end-to-end speech LLM that iteratively analyzes audio structure, predicts temporal boundaries, and jointly models speaker identity, gender, timestamps, and transcription using a speaker-aware cache for long audio.
- MURMUR: An Efficient Inference System for Long-Form ASR
  Murmur matches single-pass long-context ASR accuracy on AMI-IHM while cutting latency 4.2x by tuning chunk size and using intra-chunk attention sparsity via KV eviction.
- DM-ASR: Diarization-aware Multi-speaker ASR with Large Language Models
  DM-ASR reformulates multi-speaker ASR as multi-turn dialogue generation conditioned on diarization results, achieving competitive benchmark performance with relatively small models and limited data.
- A Unified and Reproducible Experimentation Framework for Speech Understanding
  SURE is a new standardized framework for evaluating and training speech foundation models and Speech LLMs to improve comparability and reproducibility under realistic conditions.
- StepAudio 2.5 Technical Report
  StepAudio 2.5 is a unified audio-language foundation model that reaches state-of-the-art results on ASR, TTS, and realtime interaction by using task-tailored RLHF on a shared backbone.
- SoulX-Transcriber: A Robust End-to-End Framework for Multi-Speaker Speech Transcription
  SoulX-Transcriber is a unified LLM framework for end-to-end multi-speaker transcription using two-stage training (speaker-aware pre-training then supervised fine-tuning) that reports strong results on AliMeeting, AISHELL-4, and AMI.
- Report of the 5th PVUW Challenge: Towards More Diverse Modalities in Pixel-Level Understanding
  The 2026 PVUW Challenge introduces a new audio track and evaluates top multimodal methods on challenging video datasets for pixel-level understanding.
- APRVOS: 1st Place Winner of 5th PVUW MeViS-Audio Track
  A staged pipeline using ASR transcription, visual existence verification, Sa2VA coarse segmentation, and agent-guided SAM3 refinement won first place in the PVUW MeViS-Audio track by decomposing audio-conditioned Ref-VOS into sequential verification and refinement steps.