AudioCLIP: Extending CLIP to Image, Text and Audio
15 Pith papers cite this work.
Citing papers
- ReasonAudio: A Benchmark for Evaluating Reasoning Beyond Matching in Text-Audio Retrieval
ReasonAudio benchmark reveals that state-of-the-art text-audio retrieval models struggle with reasoning tasks like negation and duration, and multimodal LLMs lose reasoning ability after contrastive fine-tuning.
- Transition-Matrix Regularization for Next Dialogue Act Prediction in Counselling Conversations
KL regularization aligning model predictions with empirical transition patterns improves macro-F1 by 9-42% in next dialogue act prediction on German counselling data and transfers to other datasets.
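The regularizer described in the entry above can be sketched as a KL term pulling the model's next-act distribution toward the empirical transition row for the previous act. This is a minimal illustration under that reading; all function and variable names are my own, not the paper's.

```python
import numpy as np

def kl_div(p, q, eps=1e-9):
    """KL(p || q) between two discrete distributions."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def transition_matrix(act_sequences, n_acts):
    """Row-normalized empirical transition counts between dialogue acts."""
    counts = np.zeros((n_acts, n_acts))
    for seq in act_sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev, nxt] += 1
    counts += 1e-9  # smooth so empty rows still normalize
    return counts / counts.sum(axis=1, keepdims=True)

def regularized_loss(ce_loss, pred_probs, prev_act, T, lam=0.5):
    """Cross-entropy plus a KL penalty toward the empirical
    transition row for the previous dialogue act."""
    return ce_loss + lam * kl_div(T[prev_act], pred_probs)
```

The weight `lam` trades off fitting the labels against matching corpus-level transition statistics.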
- Listening Deepfake Detection: A New Perspective Beyond Speaking-Centric Forgery Analysis
This work introduces the LDD task, the ListenForge dataset built from five listening head generation methods, and the MANet model, which detects listening forgeries via motion inconsistencies guided by audio semantics.
- DialToM: A Theory of Mind Benchmark for Forecasting State-Driven Dialogue Trajectories
LLMs identify mental states in dialogues well but, with the exception of Gemini 3 Pro, mostly fail to forecast state-consistent future trajectories, and their predictions show only weak overlap with human inferences.
- Emotion-Cause Pair Extraction in Conversations via Semantic Decoupling and Graph Alignment
SCALE disentangles emotion and cause representations in conversations and uses optimal transport for many-to-many global alignment, achieving SOTA on ECPEC benchmarks.
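Optimal transport for many-to-many global alignment, as in the entry above, is commonly computed with entropic regularization via Sinkhorn iterations. This is a generic sketch of that primitive, not SCALE's actual formulation.

```python
import numpy as np

def sinkhorn(cost, a, b, reg=0.1, n_iters=200):
    """Entropy-regularized optimal transport (Sinkhorn iterations).

    cost: (n, m) cost matrix between two sets of representations.
    a, b: marginal weights over the rows and columns.
    Returns a soft many-to-many transport plan whose rows sum to a
    and columns sum to b.
    """
    K = np.exp(-cost / reg)          # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)            # match column marginals
        u = a / (K @ v)              # match row marginals
    return u[:, None] * K * v[None, :]
```

Lower `reg` sharpens the plan toward a hard matching; higher values give smoother many-to-many couplings.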
- PRISM-CTG: A Foundation Model for Cardiotocography Analysis with Multi-View SSL
PRISM-CTG is the first large-scale foundation model for cardiotocography that uses multi-view self-supervised learning on unlabeled data to learn transferable representations, outperforming baselines on seven downstream tasks with external validation.
- Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
Qwen-Audio trains a unified model on diverse audio types and tasks with hierarchical tags, enabling strong zero-shot performance on audio understanding benchmarks and in multi-turn audio chat.
- RADAR Challenge 2026: Robust Audio Deepfake Recognition under Media Transformations
The RADAR Challenge 2026 provides a multilingual benchmark for audio deepfake detection under media transformations and finds that robust performance remains an open problem.
- WorldSpeech: A Multilingual Speech Corpus from Around the World
WorldSpeech supplies 65k hours of multilingual aligned speech data across 76 languages and delivers 63.5% average relative WER reduction after fine-tuning ASR models on 11 typologically diverse languages.
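The 63.5% figure above is an average of per-language relative WER reductions; a minimal sketch of that metric (function names are my own, and the numbers in the usage below are made up for illustration):

```python
def relative_wer_reduction(baseline_wer, finetuned_wer):
    """Relative WER reduction in percent: the fraction of the
    baseline error rate removed by fine-tuning."""
    return 100.0 * (baseline_wer - finetuned_wer) / baseline_wer

def average_relative_reduction(pairs):
    """Macro-average relative reduction over (baseline, finetuned)
    WER pairs, e.g. one pair per evaluation language."""
    vals = [relative_wer_reduction(b, f) for b, f in pairs]
    return sum(vals) / len(vals)

# Illustrative: two languages with baselines 30% and 20% WER
avg = average_relative_reduction([(30.0, 12.0), (20.0, 6.0)])
```

Note this macro-average weights every language equally, regardless of test-set size.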
- Audio Spoof Detection with GaborNet
GaborNet replaces sinc functions with Gabor filters in raw-audio neural networks and is tested for audio spoof detection with augmentations in RawNet2 and RawGAT-ST.
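A 1-D Gabor filter is a Gaussian envelope modulating a cosine carrier, in contrast to the band-pass sinc kernels of SincNet-style front ends. The sketch below shows the general construction; the exact parameterization GaborNet learns may differ.

```python
import numpy as np

def gabor_filter(length, center_freq, bandwidth, fs=16000):
    """1-D Gabor filter: Gaussian envelope times a cosine carrier.

    center_freq and bandwidth are in Hz; length is the kernel size
    in samples at sampling rate fs. Returned filter is L2-normalized.
    """
    t = (np.arange(length) - (length - 1) / 2) / fs
    sigma = 1.0 / (2 * np.pi * bandwidth)   # time-domain width from bandwidth
    envelope = np.exp(-0.5 * (t / sigma) ** 2)
    carrier = np.cos(2 * np.pi * center_freq * t)
    g = envelope * carrier
    return g / np.linalg.norm(g)

# A small bank of such filters, as would replace sinc band-passes
bank = np.stack([gabor_filter(401, f, 100.0) for f in (300.0, 1000.0, 3000.0)])
```

In a trainable front end, `center_freq` and `bandwidth` become learnable parameters and the filters are applied as a 1-D convolution over raw audio.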
- R-FLoRA: Residual-Statistic-Gated Low-Rank Adaptation for Single-Image Face Morphing Attack Detection
R-FLoRA combines Laplacian residual statistics with a frozen vision transformer via gated low-rank adapters, residual fusion, and contrastive alignment to achieve better accuracy and generalization than prior single-image face morphing attack detectors.
- Qwen3.5-Omni Technical Report
Qwen3.5-Omni scales an omnimodal model to hundreds of billions of parameters with 256k context, introduces ARIA for stable speech synthesis, and reports SOTA performance on 215 audio-visual benchmarks while adding multilingual and audio-visual coding capabilities.
- Kimi-Audio Technical Report
Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million hours of speech, sound, and music data.
- Qwen2-Audio Technical Report
Qwen2-Audio is an open-source audio-language model that outperforms prior systems such as Gemini-1.5-pro on audio-centric instruction-following benchmarks after simplified prompt-based pre-training and expanded data.
- TB-AVA: Text as a Semantic Bridge for Audio-Visual Parameter Efficient Finetuning