WhisperX: Time-accurate speech transcription of long-form audio

2023 · cs.SD · arXiv 2303.00747

19 Pith papers cite this work. Polarity classification is still indexing.

abstract

Large-scale, weakly-supervised speech recognition models, such as Whisper, have demonstrated impressive results on speech recognition across domains and languages. However, their application to long audio transcription via buffered or sliding-window approaches is prone to drifting, hallucination, and repetition, and it prohibits batched transcription because these approaches are sequential. Further, the timestamps corresponding to each utterance are prone to inaccuracies, and word-level timestamps are not available out of the box. To overcome these challenges, we present WhisperX, a time-accurate speech recognition system with word-level timestamps utilising voice activity detection and forced phoneme alignment. In doing so, we demonstrate state-of-the-art performance on long-form transcription and word segmentation benchmarks. Additionally, we show that pre-segmenting audio with our proposed VAD Cut & Merge strategy improves transcription quality and enables a twelve-fold transcription speedup via batched inference.
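The abstract names a VAD Cut & Merge pre-segmentation step but does not spell out its details here. The following is a minimal sketch of the general idea, not the authors' implementation. It assumes the voice activity detector has already produced (start, end) speech segments in seconds, that over-long segments are cut into equal pieces (a simplification; the real cut point may be chosen differently), and that neighbouring pieces are merged greedily up to a hypothetical max_len of 30 seconds. The function name cut_and_merge and all parameters are illustrative.

```python
import math
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def cut_and_merge(segments: List[Segment], max_len: float = 30.0) -> List[Segment]:
    """Cut speech segments longer than max_len, then merge short neighbours.

    Illustrative only: equal-length cuts and greedy merging are assumptions,
    not the strategy specified by the paper.
    """
    # Cut: split any segment longer than max_len into equal-length pieces.
    pieces: List[Segment] = []
    for start, end in sorted(segments):
        duration = end - start
        n_pieces = max(1, math.ceil(duration / max_len))
        step = duration / n_pieces
        for i in range(n_pieces):
            pieces.append((start + i * step, start + (i + 1) * step))

    # Merge: extend the current chunk while its total span stays within max_len.
    # The span includes any silence between the pieces being merged.
    merged: List[Segment] = []
    for start, end in pieces:
        if merged and end - merged[-1][0] <= max_len:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    vad_output = [(0.0, 5.0), (6.0, 12.0), (40.0, 100.0)]
    for chunk in cut_and_merge(vad_output):
        print(chunk)
```

Because every resulting chunk has a bounded duration and no chunk depends on the transcript of the previous one, chunks of this kind could in principle be padded to a common length and transcribed together in a batch, which is the property the abstract credits for the speedup.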

citation-role summary

background: 1 · method: 1

citation-polarity summary

background: 1 · use method: 1

representative citing papers

RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees

cs.CV · 2026-04-17 · unverdicted · novelty 8.0

RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.

Audio-Oscar: A Multi-Agent System for Complex Audio Scene Generation, Orchestration, and Refinement

cs.SD · 2026-06-05 · unverdicted · novelty 7.0

Audio-Oscar is a multi-agent system that coordinates specialist agents for generating audio from complex scene descriptions and introduces the ASG-Bench benchmark for evaluation.

A Benchmark and Multi-Agent System for Instruction-driven Cinematic Video Compilation

cs.CV · 2026-04-12 · unverdicted · novelty 7.0

CineAgents is a multi-agent system that builds hierarchical narrative memory via script reverse-engineering and uses iterative planning to produce instruction-driven cinematic video compilations with better coherence than prior methods.

AI-Driven Analytics of Team-Teaching Talk: Acoustic Patterns across Experience, Cohorts and the Learning Design

cs.HC · 2026-04-19 · conditional · novelty 6.0

Automated acoustic analysis of 36 team-teaching sessions reveals systematic loudness variation differences across teacher experience, student cohorts, and learning task design.

UnityShots: Memory-Driven Multi-Shot Audio-Video Generation with Boundary-Aware Gating

cs.CV · 2026-06-19 · unverdicted · novelty 6.0

UnityShots uses fixed LTM and STM memory slots with boundary-conditioned gating and speaker tokens to achieve coherent multi-shot audio-video generation, leading open-source baselines on cross-shot coherence metrics.

SwanVoice: Expressive Long-Form Zero-Shot Speech Synthesis for Both Monologue and Dialogue

eess.AS · 2026-05-29 · unverdicted · novelty 6.0

SwanVoice is a zero-shot TTS system for 1-4 speakers that reports higher richness and hierarchy scores than open-source baselines on monologue and dialogue tasks via mixed training and DiffusionNFT post-training.

SurgOnAir: Hierarchy-Aware Real-Time Surgical Video Commentary

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

SurgOnAir introduces a streaming vision-language model trained on a hierarchical surgical dataset to generate real-time, multi-level narrations with explicit transition tokens.

From Text Metrics to Model Internals: A Study of Whisper ASR Hallucination Detection

cs.SD · 2026-06-22 · unverdicted · novelty 5.0

Internal decoder probing of Whisper yields strongest hallucination detection without references, with late fusion of text and internal features performing best overall.

Deviance from a pink noise regime in the temporal organization of semantic relations in psychosis

cond-mat.stat-mech · 2026-06-22 · unverdicted · novelty 5.0

Patients with psychosis exhibit elevated DFA scaling exponents in BERT-derived semantic similarity time series from transcripts, indicating excessive persistence in semantic fluctuations.

Towards Effective Long-Video Event Prediction via Multi-Level Event Semantics Mining

cs.CV · 2026-05-29 · unverdicted · novelty 5.0

VISTA mines multi-level event semantics via visual prompts, knowledge-enhanced retrieval, and proposal integration to improve long-video event prediction over existing LVLMs.

Beyond Acoustic Emotion Recognition: Multimodal Pathos Analysis in Political Speech Using LLM-Based and Acoustic Emotion Models

cs.AI · 2026-05-21 · unverdicted · novelty 5.0

Multimodal LLM analysis correlates better with TRUST-Pathos than acoustic SER models in a case study of one Bundestag speech, while acoustic features help with arousal.

WhisperPipe: A Resource-Efficient Streaming Architecture for Real-Time Automatic Speech Recognition

cs.CL · 2026-04-28 · unverdicted · novelty 5.0

WhisperPipe delivers 89 ms median latency and 48% lower peak GPU memory than standard Whisper while keeping word error rate within 2% of the offline model.

Scaling Properties of Continuous Diffusion Spoken Language Models

cs.CL · 2026-04-27 · unverdicted · novelty 5.0

Continuous diffusion spoken language models follow scaling laws for loss and phoneme divergence and generate emotive multi-speaker speech at 16B scale, though long-form coherence stays difficult.

Human-1 by Josh Talks: A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations

cs.CL · 2026-04-25 · conditional · novelty 5.0

Adapting Moshi to Hindi with a custom tokenizer and 26k hours of real conversations yields the first open full-duplex spoken dialogue system for an Indian language.

AudioKV: KV Cache Eviction in Efficient Large Audio Language Models

cs.SD · 2026-04-08 · unverdicted · novelty 5.0

AudioKV prioritizes audio-critical attention heads identified via ASR analysis and applies spectral score smoothing to evict KV cache tokens, achieving high compression with minimal accuracy loss in LALMs.

Montreal Forced Aligner and the state of speech-to-text alignment in 2026

cs.CL · 2026-06-16 · unverdicted · novelty 4.0

MFA version 3.0 reaches state-of-the-art or near state-of-the-art results on forced alignment benchmarks for English, Japanese, and Korean with average boundary errors under 15 milliseconds.

Audio-Mind: An Auditable Agentic Framework for Audio Understanding

eess.AS · 2026-05-27 · unverdicted · novelty 4.0

Audio-Mind introduces a conditional, auditable agentic framework for audio understanding that preserves frontend judgment and acquires bounded external evidence only when needed, reporting 80.4% on MMAR and 82.8% on MSU-Bench.

MedASR: An Open-Source Model for High-Accuracy Medical Dictation

eess.AS · 2026-05-15 · unverdicted · novelty 4.0

MedASR is an open-source 105M-parameter ASR model achieving 58% relative WER reduction versus Whisper Large-v3 on medical dictation.

Quantifying the Cost of Manual Navigation: A Comparison of Gesture-Based Magnification versus Direct Access Reading in Digital Layout-based Documents

cs.HC · 2026-04-29 · unverdicted · novelty 4.0

Large-print editions of layout-based documents outperform gesture-based magnification by 18% in reading speed and 30% in target location speed while restoring natural reading strategies and reducing workload.
