VIBEVOICE-ASR technical report

· 2026 · arXiv 2601.18184

4 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR

eess.AS · 2026-04-03 · unverdicted · novelty 7.0

Speaker-Reasoner is an end-to-end speech LLM that iteratively analyzes audio structure, predicts temporal boundaries, and jointly models speaker identity, gender, timestamps, and transcription using a speaker-aware cache for long audio.

DM-ASR: Diarization-aware Multi-speaker ASR with Large Language Models

eess.AS · 2026-04-24 · unverdicted · novelty 6.0

DM-ASR reformulates multi-speaker ASR as multi-turn dialogue generation conditioned on diarization results, achieving competitive benchmark performance with relatively small models and limited data.

Report of the 5th PVUW Challenge: Towards More Diverse Modalities in Pixel-Level Understanding

cs.CV · 2026-04-28 · unverdicted · novelty 4.0

The 2026 PVUW Challenge introduces a new audio track and evaluates top multimodal methods on challenging video datasets for pixel-level understanding.

APRVOS: 1st Place Winner of 5th PVUW MeViS-Audio Track

cs.SD · 2026-04-20 · unverdicted · novelty 3.0

A staged pipeline using ASR transcription, visual existence verification, Sa2VA coarse segmentation, and agent-guided SAM3 refinement won first place in the PVUW MeViS-Audio track by decomposing audio-conditioned Ref-VOS into sequential verification and refinement steps.

citing papers explorer

Showing 4 of 4 citing papers.

Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR
eess.AS · 2026-04-03 · unverdicted · none · ref 21
Speaker-Reasoner is an end-to-end speech LLM that iteratively analyzes audio structure, predicts temporal boundaries, and jointly models speaker identity, gender, timestamps, and transcription using a speaker-aware cache for long audio.
DM-ASR: Diarization-aware Multi-speaker ASR with Large Language Models
eess.AS · 2026-04-24 · unverdicted · none · ref 45
DM-ASR reformulates multi-speaker ASR as multi-turn dialogue generation conditioned on diarization results, achieving competitive benchmark performance with relatively small models and limited data.
Report of the 5th PVUW Challenge: Towards More Diverse Modalities in Pixel-Level Understanding
cs.CV · 2026-04-28 · unverdicted · none · ref 24
The 2026 PVUW Challenge introduces a new audio track and evaluates top multimodal methods on challenging video datasets for pixel-level understanding.
APRVOS: 1st Place Winner of 5th PVUW MeViS-Audio Track
cs.SD · 2026-04-20 · unverdicted · none · ref 18
A staged pipeline using ASR transcription, visual existence verification, Sa2VA coarse segmentation, and agent-guided SAM3 refinement won first place in the PVUW MeViS-Audio track by decomposing audio-conditioned Ref-VOS into sequential verification and refinement steps.
