Qwen2-Audio Technical Report
58 Pith papers cite this work.
abstract
We introduce the latest progress of Qwen-Audio, a large-scale audio-language model called Qwen2-Audio, which is capable of accepting various audio signal inputs and performing audio analysis or direct textual responses with regard to speech instructions. In contrast to complex hierarchical tags, we have simplified the pre-training process by utilizing natural language prompts for different data and tasks, and have further expanded the data volume. We have boosted the instruction-following capability of Qwen2-Audio and implemented two distinct audio interaction modes for voice chat and audio analysis. In the voice chat mode, users can freely engage in voice interactions with Qwen2-Audio without text input. In the audio analysis mode, users can provide audio and text instructions for analysis during the interaction. Note that we do not use any system prompts to switch between the voice chat and audio analysis modes. Qwen2-Audio is capable of intelligently comprehending the content within audio and following voice commands to respond appropriately. For instance, in an audio segment that simultaneously contains sounds, multi-speaker conversations, and a voice command, Qwen2-Audio can directly understand the command and provide an interpretation and response to the audio. Additionally, DPO (Direct Preference Optimization) has optimized the model's performance in terms of factuality and adherence to desired behavior. According to the evaluation results from AIR-Bench, Qwen2-Audio outperformed previous SOTAs, such as Gemini-1.5-pro, in tests focused on audio-centric instruction-following capabilities. Qwen2-Audio is open-sourced with the aim of fostering the advancement of the multi-modal language community.
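To make the two modes concrete, here is a minimal sketch of the audio analysis mode through the open-sourced Hugging Face checkpoint (Qwen/Qwen2-Audio-7B-Instruct): audio and a text instruction are passed in a single user turn, with no system prompt selecting the mode. The audio URL and question are placeholders, and the exact processor keywords follow the release-era transformers API, so they may vary slightly across versions.

```python
# Minimal sketch of Qwen2-Audio's audio analysis mode (Hugging Face
# transformers). The audio URL is a placeholder; keyword names follow
# the public release and may differ in newer transformers versions.
from io import BytesIO
from urllib.request import urlopen

import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-Audio-7B-Instruct", device_map="auto"
)

audio_url = "https://example.com/glass_breaking.wav"  # placeholder
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio_url": audio_url},
            {"type": "text", "text": "What sound is this, and what likely caused it?"},
        ],
    }
]

# Render the chat template to text; the processor inserts audio placeholders.
text = processor.apply_chat_template(
    conversation, add_generation_prompt=True, tokenize=False
)

# Load the waveform at the feature extractor's expected sampling rate (16 kHz).
audio, _ = librosa.load(
    BytesIO(urlopen(audio_url).read()),
    sr=processor.feature_extractor.sampling_rate,
)

inputs = processor(
    text=text, audios=[audio], return_tensors="pt", padding=True
).to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)

# Drop the prompt tokens so only the model's reply is decoded.
reply = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(reply)
```

Per the abstract, voice chat mode needs no special setup: the same call with the text entry omitted from the user turn lets the model respond to the voice command in the audio itself.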
hub tools
claims ledger
co-cited works
representative citing papers
citing papers explorer
- TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos
  TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
- ReasonAudio: A Benchmark for Evaluating Reasoning Beyond Matching in Text-Audio Retrieval
  The ReasonAudio benchmark reveals that state-of-the-art text-audio retrieval models struggle with reasoning tasks like negation and duration, and that multimodal LLMs lose reasoning ability after contrastive fine-tuning.
- HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models
  HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-language models.
- DialBGM: A Benchmark for Background Music Recommendation from Everyday Multi-Turn Dialogues
  DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.
- SpurAudio: A Benchmark for Studying Shortcut Learning in Few-Shot Audio Classification
  The SpurAudio benchmark shows state-of-the-art few-shot audio classifiers suffer large performance drops when background correlations are disrupted, even in large pretrained models.
- NAACA: Training-Free NeuroAuditory Attentive Cognitive Architecture with Oscillatory Working Memory for Salience-Driven Attention Gating
  NAACA uses a neuro-inspired oscillatory working memory to gate attention in audio language models, raising AudioQwen's average precision from 53.5% to 70.6% on XD-Violence while cutting unnecessary calls.
- How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue
  Channel fusion gives better semantic grounding and QA performance in full-duplex LLM dialogue but is vulnerable to context corruption during interruptions, while cross-attention routing is more robust at the cost of weaker integration.
- MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes
  MIST is a new synthetic speech-based tool-calling dataset for IoT devices that exposes performance gaps between open- and closed-weight multimodal LLMs.
- VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing
  VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.
- Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization
  Sparse selection of high-gradient-energy audio tokens suffices for effective jailbreaking of audio language models with minimal drop in attack success rate.
- Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages
  Introduces the Indic-CodecFake dataset for Indic codec deepfakes and SATYAM, a novel hyperbolic ALM that outperforms baselines through dual-stage semantic-prosodic fusion using Bhattacharyya distance.
- Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection
  AudioHijack generates imperceptible adversarial audio via gradient estimation, attention supervision, and reverberation blending to hijack 13 LALMs with 79-96% success on unseen contexts and real commercial agents.
- Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning
  RoleJudge is a multidimensional evaluation framework for speech-character alignment in audio LLMs, backed by the RoleChat dataset and multi-stage RL training with standard alignment to reduce reward issues.
- HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models
  HumDial-EIBench is a new benchmark using real human dialogues to evaluate audio language models on emotional intelligence tasks including multi-turn tracking, causal reasoning, empathy generation, and acoustic-semantic conflict resolution.
- Ti-Audio: The First Multi-Dialectal End-to-End Speech LLM for Tibetan
  Ti-Audio is the first multi-dialectal end-to-end Speech-LLM for Tibetan that achieves state-of-the-art performance on ASR and speech translation benchmarks via a Dynamic Q-Former Adapter and cross-dialect cooperation.
- Unified Multimodal Uncertain Inference
  Introduces the UMUI task for fine-grained multimodal probabilistic inference and the CLUE calibration method, with which a 3B model matches larger baselines.
- Jamendo-MT-QA: A Benchmark for Multi-Track Comparative Music Question Answering
  Jamendo-MT-QA is a new dataset and benchmark for multi-track comparative music question answering, constructed via an LLM-assisted pipeline from Creative Commons Jamendo tracks and used to evaluate audio-language models.
- Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR
  Speaker-Reasoner is an end-to-end speech LLM that iteratively analyzes audio structure, predicts temporal boundaries, and jointly models speaker identity, gender, timestamps, and transcription using a speaker-aware cache for long audio.
- KoALa-Bench: Evaluating Large Audio Language Models on Korean Speech Understanding and Faithfulness
  KoALa-Bench is a new public benchmark with six tasks that tests Korean speech recognition, translation, question answering, instruction following, and faithfulness in large audio language models.
- Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark
  SceneBench shows VLMs lose accuracy on scene-level questions in long videos due to forgetting, and Scene-RAG retrieval improves performance by 2.5%.
- Mind the Pause: Disfluency-Aware Objective Tuning for Multilingual Speech Correction with LLMs
  A sequence-tagger-guided LLM with a contrastive objective corrects disfluencies in Hindi, Bengali, and Marathi ASR transcripts, outperforming removal-only baselines.
- Towards Fine-Grained Multi-Dimensional Speech Understanding: Data Pipeline, Benchmark, and Model
  A data pipeline, 14-dimension benchmark, and decoupled fine-tuning model are presented to advance fine-grained multi-dimensional speech understanding in LLMs.
- VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models
  VocalParse applies interleaved and Chain-of-Thought prompting to a Large Audio Language Model to jointly transcribe lyrics, melody, and word-note alignments, achieving state-of-the-art results on multiple singing datasets.
- JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions
  JASTIN is an instruction-driven audio evaluation system that achieves state-of-the-art correlation with human ratings on speech, sound, music, and out-of-domain tasks without task-specific retraining.
- When Audio-Language Models Fail to Leverage Multimodal Context for Dysarthric Speech Recognition
  Current audio-language models fail to use clinical multimodal context for dysarthric speech recognition, but context-aware LoRA fine-tuning delivers large accuracy gains on the SAP dataset.
- Mitigating Multimodal LLMs Hallucinations via Relevance Propagation at Inference Time
  LIME reduces hallucinations in multimodal LLMs by using LRP to boost perceptual modality contributions through inference-time KV updates.
- EmoMM: Benchmarking and Steering MLLM for Multimodal Emotion Recognition under Conflict and Missingness
  The EmoMM benchmark reveals Video Contribution Collapse in MLLMs for emotion recognition under modality conflict and missingness, mitigated by CHASE head-level attention steering.
- All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation
  Audio-language models retain 60-72% of benchmark scores without audio, and most audio-dependent items can be solved from short fragments rather than full clips.
- Beyond Acoustic Sparsity and Linguistic Bias: A Prompt-Free Paradigm for Mispronunciation Detection and Diagnosis
  CROTTC-IF is a prompt-free MDD system with monotonic frame-level alignment and implicit knowledge transfer that reaches 71.77% F1 on L2-ARCTIC and 71.70% on Iqra'Eval2.
- MoDAl: Self-Supervised Neural Modality Discovery via Decorrelation for Speech Neuroprosthesis
  MoDAl discovers complementary neurolinguistic modalities via contrastive-decorrelation objectives, cutting brain-to-text word error rate from 26.3% to 21.6% by incorporating area 44 signals.
- Hard to Be Heard: Phoneme-Level ASR Analysis of Phonologically Complex, Low-Resource Endangered Languages
  Phoneme-level analysis of ASR on Archi and Rutul shows data scarcity explains recognition errors better than phonological complexity, with language-specific adaptations improving wav2vec2 performance.
- VIBE: Voice-Induced open-ended Bias Evaluation for Large Audio-Language Models via Real-World Speech
  VIBE evaluates generative biases in large audio-language models with real-world speech and open-ended tasks, showing that gender cues produce larger distributional shifts than accent cues across 11 tested models.
- SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding
  SpotSound adds a hallucination-suppressing objective and a needle-in-haystack benchmark to audio-language models, reaching state-of-the-art temporal grounding while preserving general task performance.
- Why Your Tokenizer Fails in Information Fusion: A Timing-Aware Pre-Quantization Fusion for Video-Enhanced Audio Tokenization
  A timing-aware pre-quantization fusion approach integrates visual cues into audio tokenizers along the temporal axis, maintaining reconstruction quality while outperforming audio-only and prior multimodal baselines on downstream tasks.
- LaDA-Band: Language Diffusion Models for Vocal-to-Accompaniment Generation
  LaDA-Band applies discrete masked diffusion with dual-track conditioning and progressive training to generate vocal-to-accompaniment tracks that improve acoustic authenticity, global coherence, and dynamic orchestration over prior baselines.
- ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models
  ASPIRin decouples speaking timing from token content via binary action space projection and applies GRPO with rule-based rewards to optimize interactivity in SLMs without semantic collapse or repetition.
- GRM: Utility-Aware Jailbreak Attacks on Audio LLMs via Gradient-Ratio Masking
  GRM ranks Mel bands by attack contribution versus utility sensitivity, perturbs a subset, and learns a universal perturbation to reach an 88.46% average jailbreak success rate with an improved attack-utility trade-off on four audio LLMs.
- Noise-Aware In-Context Learning for Hallucination Mitigation in ALLMs
  NAICL reduces hallucination rates in ALLMs from 26.53% to 16.98% via noise priors in context and introduces the Clotho-1K benchmark with four hallucination types.
- Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs
  A multi-stage training method for LLM-based ASR uses new entropy allocation metrics to achieve competitive benchmark performance with 2.3B parameters while mitigating hallucinations via better encoder-LLM decoupling.
- RASR: Retrieval-Augmented Semantic Reasoning for Fake News Video Detection
  RASR retrieves cross-instance semantic evidence and uses domain priors to drive multimodal LLM reasoning for improved fake news video detection on the FakeSV and FakeTT datasets.
- FastTurn: Unifying Acoustic and Streaming Semantic Cues for Low-Latency and Robust Turn Detection
  FastTurn unifies acoustic features and streaming CTC decoding for low-latency, robust turn detection in full-duplex dialogue systems and releases a realistic human-dialogue test set.
- Qwen3-Omni Technical Report
  Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22, without degrading performance on text, image, or video relative to single-modal Qwen counterparts.
- Task-Aware Answer Preservation under Audio Compression for Large Audio Language Models
  A statistical sign-off protocol for audio compressors ensures worst-case answer preservation across query families in LALMs.
- Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM
  TextPro-SLM reduces the speech-text modality gap by feeding an LLM backbone with synchronized text tokens and prosody embeddings from WhisperPro, achieving the lowest gap scores at 3B/7B scales with roughly 1,000 hours of audio.
- AUDITA: A New Dataset to Audit Humans vs. AI Skill at Audio QA
  AUDITA is a challenging audio QA benchmark where humans score 32% accuracy on average while state-of-the-art models score below 9%, using IRT to reveal systematic model deficiencies.
- Detecting Hallucinations in SpeechLLMs at Inference Time Using Attention Maps
  Four attention metrics enable logistic regression classifiers that detect hallucinations in SpeechLLMs with up to +0.23 PR-AUC gains over baselines on ASR and translation tasks.
- Audio-DeepThinker: Progressive Reasoning-Aware Reinforcement Learning for High-Quality Chain-of-Thought Emergence in Audio Language Models
  A hybrid-reward progressive RL curriculum enables high-quality chain-of-thought to emerge in audio language models without prior supervised CoT training, yielding SOTA results on the MMAR, MMAU, and MMSU benchmarks.
- FreezeEmpath: Efficient Training for Empathetic Spoken Chatbots with Frozen LLMs
  FreezeEmpath achieves emotionally expressive speech output and strong performance on empathetic dialogue, speech emotion recognition, and spoken QA tasks by training with a frozen LLM on existing speech datasets.
- TinyMU: A Compact Audio-Language Model for Music Understanding
  TinyMU is a 229M-parameter compact music understanding model that achieves 82% of state-of-the-art large audio-language model performance on the MuChoMusic benchmark while being 35 times smaller.
- Qwen3.5-Omni Technical Report
  Qwen3.5-Omni scales an omnimodal model to hundreds of billions of parameters with 256k context, introduces ARIA for stable speech synthesis, and reports SOTA performance on 215 audio-visual benchmarks while adding multilingual and audio-visual coding capabilities.