RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.
WhisperX: Time-accurate speech transcription of long-form audio
13 Pith papers cite this work.
years: 2026 (13)
representative citing papers
CineAgents is a multi-agent system that builds hierarchical narrative memory via script reverse-engineering and uses iterative planning to produce instruction-driven cinematic video compilations with better coherence than prior methods.
SurgOnAir introduces a streaming vision-language model trained on a hierarchical surgical dataset to generate real-time, multi-level narrations with explicit transition tokens.
Patients with psychosis exhibit elevated DFA scaling exponents in BERT-derived semantic similarity time series from transcripts, indicating excessive persistence in semantic fluctuations.
VISTA mines multi-level event semantics via visual prompts, knowledge-enhanced retrieval, and proposal integration to improve long-video event prediction over existing LVLMs.
Multimodal LLM analysis correlates better with TRUST-Pathos than acoustic SER models in a case study of one Bundestag speech, while acoustic features help with arousal.
WhisperPipe delivers 89 ms median latency and 48% lower peak GPU memory than standard Whisper while keeping word error rate within 2% of the offline model.
Continuous diffusion spoken language models follow scaling laws for loss and phoneme divergence and generate emotive multi-speaker speech at 16B scale, though long-form coherence stays difficult.
AudioKV prioritizes audio-critical attention heads identified via ASR analysis and applies spectral score smoothing to evict KV cache tokens, achieving high compression with minimal accuracy loss in LALMs.
Audio-Mind introduces a conditional, auditable agentic framework for audio understanding that preserves frontend judgment and acquires bounded external evidence only when needed, reporting 80.4% on MMAR and 82.8% on MSU-Bench.
MedASR is an open-source 105M-parameter ASR model achieving 58% relative WER reduction versus Whisper Large-v3 on medical dictation.
Large-print editions of layout-based documents outperform gesture-based magnification by 18% in reading speed and 30% in target location speed while restoring natural reading strategies and reducing workload.