pith. sign in

hub

WhisperX: Time-accurate speech transcription of long-form audio

19 Pith papers cite this work. Polarity classification is still indexing.

19 Pith papers citing it
abstract

Large-scale, weakly-supervised speech recognition models, such as Whisper, have demonstrated impressive results on speech recognition across domains and languages. However, their application to long audio transcription via buffered or sliding window approaches is prone to drifting, hallucination & repetition; and prohibits batched transcription due to their sequential nature. Further, timestamps corresponding each utterance are prone to inaccuracies and word-level timestamps are not available out-of-the-box. To overcome these challenges, we present WhisperX, a time-accurate speech recognition system with word-level timestamps utilising voice activity detection and forced phoneme alignment. In doing so, we demonstrate state-of-the-art performance on long-form transcription and word segmentation benchmarks. Additionally, we show that pre-segmenting audio with our proposed VAD Cut & Merge strategy improves transcription quality and enables a twelve-fold transcription speedup via batched inference.

hub tools

citation-role summary

background 1 method 1

citation-polarity summary

years

2026 19

clear filters

representative citing papers

Audio-Mind: An Auditable Agentic Framework for Audio Understanding

eess.AS · 2026-05-27 · unverdicted · novelty 4.0

Audio-Mind introduces a conditional, auditable agentic framework for audio understanding that preserves frontend judgment and acquires bounded external evidence only when needed, reporting 80.4% on MMAR and 82.8% on MSU-Bench.

citing papers explorer

Showing 0 of 0 citing papers after filters.

No citing papers match the current filters.