arxiv: 2407.10759 · v1 · submitted 2024-07-15 · 📡 eess.AS · cs.CL· cs.LG

Recognition: 1 theorem link

· Lean Theorem

Qwen2-Audio Technical Report

Chang Zhou, Haojie Wei, Jingren Zhou, Jin Xu, Jinzheng He, Junyang Lin, Qian Yang, Xipin Wei, Yichong Leng, Yuanjun Lv, Yunfei Chu, Zhifang Guo

Authors on Pith no claims yet

Pith reviewed 2026-05-11 02:10 UTC · model grok-4.3

classification 📡 eess.AS cs.CLcs.LG

keywords audio-language modelinstruction followingvoice chataudio analysismultimodal AIQwen2-AudioAIR-Benchopen-source model

0 comments

The pith

Qwen2-Audio processes mixed audio inputs like sounds and conversations while following spoken commands, outperforming prior models such as Gemini-1.5-pro on audio instruction benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Qwen2-Audio as a large audio-language model that takes various audio signals and produces direct text responses to speech instructions. Training uses natural language prompts for different tasks instead of complex tags and draws on more data overall. This setup creates two interaction styles: free voice chat without any text and combined audio-plus-text analysis, with no system prompts needed to switch between them. The model understands overlapping elements in audio such as background sounds, multiple speakers, and embedded commands, then answers appropriately. After additional optimization for accuracy and behavior, benchmark results place it ahead of earlier leading systems on tests of audio-focused instruction following, and the full model is released openly.

Core claim

Qwen2-Audio accepts diverse audio inputs and responds to speech instructions in either voice-chat or audio-analysis mode without requiring explicit system prompts to change behavior. It directly interprets commands embedded in complex audio containing sounds and multi-speaker dialogue, delivering relevant interpretations and replies. Training simplification through natural language prompts across expanded datasets, followed by DPO tuning for factuality, produces stronger instruction adherence than earlier top models on AIR-Bench audio-centric evaluations.

What carries the argument

The dual interaction capability of Qwen2-Audio, where natural language prompts during training enable seamless handling of voice chat and audio analysis without mode-switching prompts.

If this is right

Users can speak freely to the model and receive context-aware replies even when audio contains overlapping speech and noises.
The same model instance supports both casual voice dialogue and detailed audio examination in one session.
Expanded prompt-based training data improves the model's ability to follow instructions across varied audio scenarios.
DPO tuning raises factuality so replies stay closer to actual audio content and avoid unwanted behaviors.
Open release allows others to test and extend the model for new audio-language tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The removal of mode-switching prompts may point to a general pattern where models learn to infer intent from raw input combinations alone.
If the training simplification works here, similar prompt-only methods could shorten development cycles for other audio or video models.
Widespread use of such open audio models could improve voice interfaces in devices that must handle noisy or multi-speaker environments.
Future benchmarks might need to include live, unscripted audio to check whether the reported gains hold outside controlled test sets.

Load-bearing premise

The AIR-Bench tests used to measure outperformance accurately capture real-world audio instruction following without biases from how the tests were built or which data was chosen.

What would settle it

A controlled comparison on new audio recordings that mix sounds, conversations, and commands, showing Qwen2-Audio does not exceed Gemini-1.5-pro accuracy or relevance in user-rated responses.

read the original abstract

We introduce the latest progress of Qwen-Audio, a large-scale audio-language model called Qwen2-Audio, which is capable of accepting various audio signal inputs and performing audio analysis or direct textual responses with regard to speech instructions. In contrast to complex hierarchical tags, we have simplified the pre-training process by utilizing natural language prompts for different data and tasks, and have further expanded the data volume. We have boosted the instruction-following capability of Qwen2-Audio and implemented two distinct audio interaction modes for voice chat and audio analysis. In the voice chat mode, users can freely engage in voice interactions with Qwen2-Audio without text input. In the audio analysis mode, users could provide audio and text instructions for analysis during the interaction. Note that we do not use any system prompts to switch between voice chat and audio analysis modes. Qwen2-Audio is capable of intelligently comprehending the content within audio and following voice commands to respond appropriately. For instance, in an audio segment that simultaneously contains sounds, multi-speaker conversations, and a voice command, Qwen2-Audio can directly understand the command and provide an interpretation and response to the audio. Additionally, DPO has optimized the model's performance in terms of factuality and adherence to desired behavior. According to the evaluation results from AIR-Bench, Qwen2-Audio outperformed previous SOTAs, such as Gemini-1.5-pro, in tests focused on audio-centric instruction-following capabilities. Qwen2-Audio is open-sourced with the aim of fostering the advancement of the multi-modal language community.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Qwen2-Audio is an incremental scaling step from the prior version with added modes and open release, but the AIR-Bench outperformance claim is hard to assess without evaluation protocol details.

read the letter

Qwen2-Audio is an incremental update to the first Qwen-Audio model. The main changes are expanding the training data, switching to natural language prompts instead of complex tags, adding DPO for factuality, and introducing two audio interaction modes that switch without system prompts. The report does well by open-sourcing the model and describing practical voice chat and analysis capabilities that handle mixed audio inputs like sounds, conversations, and commands in one go. The soft spots are clear in the evaluation claims. The headline result that Qwen2-Audio beats previous SOTAs including Gemini-1.5-pro on audio-centric instruction-following from AIR-Bench lacks supporting information. The report supplies no model sizes, no training data specifics, no description of the AIR-Bench subset, and no account of how responses were elicited from the closed models. Without those controls, the performance gap could stem from evaluation differences instead of genuine advances. The use of DPO is mentioned but not analyzed in depth. This work is for researchers and developers building open audio-language systems who need a new baseline or starting point for voice applications. It won't shift core theory but provides another data point in the open model space. If the full paper includes the missing methodological details, it should go to peer review. Otherwise the current version reads as a model release note more than a complete technical paper.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces Qwen2-Audio, a large-scale audio-language model extending prior Qwen-Audio work. It accepts diverse audio inputs and generates direct textual responses to speech instructions. Pre-training is simplified via natural language prompts across tasks and data, with expanded data volume. The model supports two interaction modes—voice chat (no text input required) and audio analysis (audio plus text instructions)—switched without system prompts. DPO is applied to improve factuality and behavioral adherence. The central claim is that Qwen2-Audio outperforms prior SOTAs including Gemini-1.5-pro on AIR-Bench for audio-centric instruction-following, and the model is open-sourced.

Significance. If the performance claims are substantiated with full evaluation details, the work would advance multi-modal language modeling by demonstrating effective audio instruction-following in open-source form. Open-sourcing is a clear strength that supports community reproducibility and further development of audio-language systems.

major comments (1)

[Abstract] Abstract: the claim that Qwen2-Audio 'outperformed previous SOTAs, such as Gemini-1.5-pro' on AIR-Bench audio-centric instruction-following is load-bearing for the paper's primary contribution yet supplies no information on model size, training data composition, the exact AIR-Bench subset or prompt templates used, the procedure for querying closed models such as Gemini-1.5-pro, or any statistical significance testing. This absence prevents verification that the reported gap reflects intrinsic capability rather than differences in evaluation protocol.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract and the recommendation for major revision. We agree that additional context would strengthen verifiability of the performance claims and will revise the manuscript to address this.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that Qwen2-Audio 'outperformed previous SOTAs, such as Gemini-1.5-pro' on AIR-Bench audio-centric instruction-following is load-bearing for the paper's primary contribution yet supplies no information on model size, training data composition, the exact AIR-Bench subset or prompt templates used, the procedure for querying closed models such as Gemini-1.5-pro, or any statistical significance testing. This absence prevents verification that the reported gap reflects intrinsic capability rather than differences in evaluation protocol.

Authors: We acknowledge that the abstract is concise and omits these specifics. In the revised version we will expand the abstract to note the base model size, the expanded audio-text training data relative to Qwen-Audio, the audio-centric instruction-following subset of AIR-Bench, and the use of standard prompt templates. For closed models we will state that official APIs were used with identical instructions to those given to Qwen2-Audio. Full experimental protocols, data composition, and prompt details already appear in Sections 3 and 4 of the manuscript; we will add a cross-reference in the abstract. We did not perform formal statistical significance testing, as the observed gaps were large and consistent across evaluation runs, but we can add a clarifying sentence to this effect. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper describes model architecture, training with natural language prompts on expanded data, two interaction modes, DPO optimization, and empirical results on the external AIR-Bench benchmark showing outperformance versus Gemini-1.5-pro. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the provided text or abstract. The central claim rests on benchmark comparisons that are independent of the model's internal construction, making the derivation self-contained rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical technical report on model training and evaluation; no mathematical axioms, free parameters, or invented entities are invoked in the abstract.

pith-pipeline@v0.9.0 · 5625 in / 997 out tokens · 38908 ms · 2026-05-11T02:10:11.307790+00:00 · methodology

discussion (0)

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos
cs.CV 2026-05 unverdicted novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
ReasonAudio: A Benchmark for Evaluating Reasoning Beyond Matching in Text-Audio Retrieval
cs.AI 2026-05 unverdicted novelty 8.0

ReasonAudio benchmark reveals that state-of-the-art text-audio retrieval models struggle with reasoning tasks like negation and duration, and multimodal LLMs lose reasoning ability after contrastive fine-tuning.
HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models
cs.SD 2026-04 unverdicted novelty 8.0

HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-lang...
DialBGM: A Benchmark for Background Music Recommendation from Everyday Multi-Turn Dialogues
cs.AI 2026-04 unverdicted novelty 8.0

DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.
SpurAudio: A Benchmark for Studying Shortcut Learning in Few-Shot Audio Classification
cs.CV 2026-05 unverdicted novelty 7.0

SpurAudio benchmark shows state-of-the-art few-shot audio classifiers suffer large performance drops when background correlations are disrupted, even in large pretrained models.
NAACA: Training-Free NeuroAuditory Attentive Cognitive Architecture with Oscillatory Working Memory for Salience-Driven Attention Gating
cs.SD 2026-05 unverdicted novelty 7.0

NAACA uses a neuro-inspired oscillatory working memory to gate attention in audio language models, raising AudioQwen's average precision from 53.5% to 70.6% on XD-Violence while cutting unnecessary calls.
How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue
cs.CL 2026-05 unverdicted novelty 7.0

Channel fusion gives better semantic grounding and QA performance in full-duplex LLM dialogue but is vulnerable to context corruption during interruptions, while cross-attention routing is more robust at the cost of w...
MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes
cs.CL 2026-05 unverdicted novelty 7.0

MIST is a new synthetic speech-based tool-calling dataset for IoT devices that exposes performance gaps between open- and closed-weight multimodal LLMs.
VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing
cs.CL 2026-05 unverdicted novelty 7.0

VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conve...
Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM
cs.CL 2026-05 unverdicted novelty 7.0

TextPro-SLM minimizes the speech-text modality gap from the input side via a prosody-aware unified encoder, delivering the lowest gap and strong performance at 3B/7B scales with only ~1000 hours of audio.
Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization
cs.CR 2026-05 conditional novelty 7.0

Sparse selection of high-gradient-energy audio tokens suffices for effective jailbreaking of audio language models with minimal drop in attack success rate.
ReasonAudio: A Benchmark for Evaluating Reasoning Beyond Matching in Text-Audio Retrieval
cs.AI 2026-05 unverdicted novelty 7.0

ReasonAudio benchmark shows current text-audio retrieval models fail at reasoning tasks like negation and duration discrimination beyond simple semantic matching.
Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages
eess.AS 2026-04 unverdicted novelty 7.0

Introduces the Indic-CodecFake dataset for Indic codec deepfakes and SATYAM, a novel hyperbolic ALM that outperforms baselines through dual-stage semantic-prosodic fusion using Bhattacharya distance.
Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection
cs.CR 2026-04 unverdicted novelty 7.0

AudioHijack generates imperceptible adversarial audio via gradient estimation, attention supervision, and reverberation blending to hijack 13 LALMs with 79-96% success on unseen contexts and real commercial agents.
Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 7.0

RoleJudge is a multidimensional evaluation framework for speech-character alignment in audio LLMs, backed by the RoleChat dataset and multi-stage RL training with standard alignment to reduce reward issues.
HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models
eess.AS 2026-04 unverdicted novelty 7.0

HumDial-EIBench is a new benchmark using real human dialogues to evaluate audio language models on emotional intelligence tasks including multi-turn tracking, causal reasoning, empathy generation, and acoustic-semanti...
Ti-Audio: The First Multi-Dialectal End-to-End Speech LLM for Tibetan
cs.SD 2026-04 unverdicted novelty 7.0

Ti-Audio is the first multi-dialectal end-to-end Speech-LLM for Tibetan that achieves state-of-the-art performance on ASR and speech translation benchmarks via a Dynamic Q-Former Adapter and cross-dialect cooperation.
Unified Multimodal Uncertain Inference
cs.CV 2026-04 unverdicted novelty 7.0

Introduces UMUI task for fine-grained multimodal probabilistic inference and CLUE calibration method, where a 3B model matches larger baselines.
Jamendo-MT-QA: A Benchmark for Multi-Track Comparative Music Question Answering
cs.IR 2026-04 unverdicted novelty 7.0

Jamendo-MT-QA is a new dataset and benchmark for multi-track comparative music question answering, constructed via an LLM-assisted pipeline from Creative Commons Jamendo tracks and used to evaluate audio-language models.
Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR
eess.AS 2026-04 unverdicted novelty 7.0

Speaker-Reasoner is an end-to-end speech LLM that iteratively analyzes audio structure, predicts temporal boundaries, and jointly models speaker identity, gender, timestamps, and transcription using a speaker-aware ca...
KoALa-Bench: Evaluating Large Audio Language Models on Korean Speech Understanding and Faithfulness
cs.CL 2026-03 unverdicted novelty 7.0

KoALa-Bench is a new public benchmark with six tasks that tests Korean speech recognition, translation, question answering, instruction following, and faithfulness in large audio language models.
Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark
cs.CV 2026-03 unverdicted novelty 7.0

SceneBench shows VLMs lose accuracy on scene-level questions in long videos due to forgetting, and Scene-RAG retrieval improves performance by 2.5%.
Mind the Pause: Disfluency-Aware Objective Tuning for Multilingual Speech Correction with LLMs
cs.CL 2026-05 unverdicted novelty 6.0

A sequence-tagger-guided LLM with contrastive objective corrects disfluencies in Hindi, Bengali, and Marathi ASR transcripts, outperforming removal-only baselines.
Towards Fine-Grained Multi-Dimensional Speech Understanding: Data Pipeline, Benchmark, and Model
eess.AS 2026-05 unverdicted novelty 6.0

A data pipeline, 14-dimension benchmark, and decoupled fine-tuning model are presented to advance fine-grained multi-dimensional speech understanding in LLMs.
VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models
cs.SD 2026-05 unverdicted novelty 6.0

VocalParse applies interleaved and Chain-of-Thought prompting to a Large Audio Language Model to jointly transcribe lyrics, melody and word-note alignments, achieving state-of-the-art results on multiple singing datasets.
JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions
eess.AS 2026-05 unverdicted novelty 6.0

JASTIN is an instruction-driven audio evaluation system that achieves state-of-the-art correlation with human ratings on speech, sound, music, and out-of-domain tasks without task-specific retraining.
When Audio-Language Models Fail to Leverage Multimodal Context for Dysarthric Speech Recognition
cs.AI 2026-05 unverdicted novelty 6.0

Current audio-language models fail to use clinical multimodal context for dysarthric speech recognition, but context-aware LoRA fine-tuning delivers large accuracy gains on the SAP dataset.
Mitigating Multimodal LLMs Hallucinations via Relevance Propagation at Inference Time
cs.LG 2026-05 unverdicted novelty 6.0

LIME reduces hallucinations in multimodal LLMs by using LRP to boost perceptual modality contributions through inference-time KV updates.
EmoMM: Benchmarking and Steering MLLM for Multimodal Emotion Recognition under Conflict and Missingness
cs.CV 2026-05 unverdicted novelty 6.0

EmoMM benchmark reveals Video Contribution Collapse in MLLMs for emotion recognition under modality conflict and missingness, mitigated by CHASE head-level attention steering.
All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation
cs.SD 2026-04 unverdicted novelty 6.0

Audio-language models retain 60-72% of benchmark scores without audio, and most audio-dependent items can be solved from short fragments rather than full clips.
Beyond Acoustic Sparsity and Linguistic Bias: A Prompt-Free Paradigm for Mispronunciation Detection and Diagnosis
eess.AS 2026-04 unverdicted novelty 6.0

CROTTC-IF is a prompt-free MDD system with monotonic frame-level alignment and implicit knowledge transfer that reaches 71.77% F1 on L2-ARCTIC and 71.70% on Iqra'Eval2.
MoDAl: Self-Supervised Neural Modality Discovery via Decorrelation for Speech Neuroprosthesis
q-bio.NC 2026-04 unverdicted novelty 6.0

MoDAl discovers complementary neurolinguistic modalities via contrastive-decorrelation objectives, cutting brain-to-text word error rate from 26.3% to 21.6% by incorporating area 44 signals.
Hard to Be Heard: Phoneme-Level ASR Analysis of Phonologically Complex, Low-Resource Endangered Languages
cs.CL 2026-04 conditional novelty 6.0

Phoneme-level analysis of ASR on Archi and Rutul shows data scarcity explains recognition errors better than phonological complexity, with language-specific adaptations improving wav2vec2 performance.
VIBE: Voice-Induced open-ended Bias Evaluation for Large Audio-Language Models via Real-World Speech
eess.AS 2026-04 unverdicted novelty 6.0

VIBE evaluates generative biases in large audio-language models with real-world speech and open-ended tasks, showing that gender cues produce larger distributional shifts than accent cues across 11 tested models.
SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding
cs.SD 2026-04 unverdicted novelty 6.0

SpotSound adds a hallucination-suppressing objective and a needle-in-haystack benchmark to audio-language models, reaching state-of-the-art temporal grounding while keeping general task performance.
Why Your Tokenizer Fails in Information Fusion: A Timing-Aware Pre-Quantization Fusion for Video-Enhanced Audio Tokenization
eess.AS 2026-04 unverdicted novelty 6.0

A timing-aware pre-quantization fusion approach integrates visual cues into audio tokenizers along the temporal axis, maintaining reconstruction quality while outperforming audio-only and prior multimodal baselines on...
LaDA-Band: Language Diffusion Models for Vocal-to-Accompaniment Generation
cs.SD 2026-04 unverdicted novelty 6.0

LaDA-Band applies discrete masked diffusion with dual-track conditioning and progressive training to generate vocal-to-accompaniment tracks that improve acoustic authenticity, global coherence, and dynamic orchestrati...
ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models
cs.CL 2026-04 unverdicted novelty 6.0

ASPIRin decouples speaking timing from token content via binary action space projection and applies GRPO with rule-based rewards to optimize interactivity in SLMs without semantic collapse or repetition.
GRM: Utility-Aware Jailbreak Attacks on Audio LLMs via Gradient-Ratio Masking
cs.SD 2026-04 unverdicted novelty 6.0

GRM ranks Mel bands by attack contribution versus utility sensitivity, perturbs a subset, and learns a universal perturbation to reach 88.46% average jailbreak success rate with improved attack-utility trade-off on fo...
Noise-Aware In-Context Learning for Hallucination Mitigation in ALLMs
cs.SD 2026-04 unverdicted novelty 6.0

NAICL reduces hallucination rates in ALLMs from 26.53% to 16.98% via noise priors in context and introduces the Clotho-1K benchmark with four hallucination types.
Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs
eess.AS 2026-04 unverdicted novelty 6.0

A multi-stage training method for LLM-based ASR uses new entropy allocation metrics to achieve competitive benchmark performance with 2.3B parameters while mitigating hallucinations via better encoder-LLM decoupling.
RASR: Retrieval-Augmented Semantic Reasoning for Fake News Video Detection
cs.CV 2026-04 unverdicted novelty 6.0

RASR retrieves cross-instance semantic evidence and uses domain priors to drive multimodal LLM reasoning for improved fake news video detection on FakeSV and FakeTT datasets.
FastTurn: Unifying Acoustic and Streaming Semantic Cues for Low-Latency and Robust Turn Detection
cs.SD 2026-04 unverdicted novelty 6.0

FastTurn unifies acoustic features and streaming CTC decoding for low-latency, robust turn detection in full-duplex dialogue systems and releases a realistic human-dialogue test set.
Qwen3-Omni Technical Report
cs.CL 2025-09 unverdicted novelty 6.0

Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-mo...
Task-Aware Answer Preservation under Audio Compression for Large Audio Language Models
eess.AS 2026-05 unverdicted novelty 5.0

A statistical sign-off protocol for audio compressors ensures worst-case answer preservation across query families in LALMs.
Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM
cs.CL 2026-05 unverdicted novelty 5.0

TextPro-SLM reduces the speech-text modality gap by feeding an LLM backbone with synchronized text tokens and prosody embeddings from WhisperPro, achieving lowest gap scores at 3B/7B scales with roughly 1,000 hours of audio.
AUDITA: A New Dataset to Audit Humans vs. AI Skill at Audio QA
cs.CL 2026-04 unverdicted novelty 5.0

AUDITA is a challenging audio QA benchmark where humans score 32% accuracy on average while state-of-the-art models score below 9%, using IRT to reveal systematic model deficiencies.
Detecting Hallucinations in SpeechLLMs at Inference Time Using Attention Maps
cs.CL 2026-04 unverdicted novelty 5.0

Four attention metrics enable logistic regression classifiers that detect hallucinations in SpeechLLMs with up to +0.23 PR-AUC gains over baselines on ASR and translation tasks.
Audio-DeepThinker: Progressive Reasoning-Aware Reinforcement Learning for High-Quality Chain-of-Thought Emergence in Audio Language Models
cs.SD 2026-04 unverdicted novelty 5.0

A hybrid-reward progressive RL curriculum enables high-quality chain-of-thought to emerge in audio language models without prior supervised CoT training, yielding SOTA results on MMAR, MMAU, and MMSU benchmarks.
FreezeEmpath: Efficient Training for Empathetic Spoken Chatbots with Frozen LLMs
cs.CL 2026-04 unverdicted novelty 5.0

FreezeEmpath achieves emotionally expressive speech output and strong performance on empathetic dialogue, speech emotion recognition, and spoken QA tasks by training with a frozen LLM on existing speech datasets.
TinyMU: A Compact Audio-Language Model for Music Understanding
cs.SD 2026-04 unverdicted novelty 5.0

TinyMU is a 229M-parameter compact music understanding model that achieves 82% of state-of-the-art large audio-language model performance on the MuChoMusic benchmark while being 35 times smaller.
Qwen3.5-Omni Technical Report
cs.CL 2026-04 unverdicted novelty 5.0

Qwen3.5-Omni scales an omnimodal model to hundreds of billions of parameters with 256k context, introduces ARIA for stable speech synthesis, and reports SOTA performance on 215 audio-visual benchmarks while adding mul...
Towards Fine-grained Temporal Perception: Post-Training Large Audio-Language Models with Audio-Side Time Prompt
cs.SD 2026-04 unverdicted novelty 5.0

TimePro-RL interleaves timestamp embeddings in audio sequences and applies RL post-SFT to boost temporal alignment in LALMs, yielding gains on grounding, event detection, and dense captioning.
Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models
eess.AS 2026-04 unverdicted novelty 5.0

Audio-Cogito is an open-source LALM using Cogito-pipe data curation and self-distillation to achieve leading open-source performance on audio reasoning benchmarks.
Do We Need Distinct Representations for Every Speech Token? Unveiling and Exploiting Redundancy in Large Speech Language Models
cs.CL 2026-04 unverdicted novelty 5.0

Deep layers of speech language models show high token redundancy that can be compressed via training-free similarity pooling, reducing prefilling costs by 27% while preserving task performance.
Kimi-Audio Technical Report
eess.AS 2025-04 unverdicted novelty 5.0

Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million ho...
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
cs.CL 2025-03 unverdicted novelty 5.0

Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.
Step-Audio-R1.5 Technical Report
eess.AS 2026-04 unverdicted novelty 4.0

Step-Audio-R1.5 applies RLHF to audio reasoning models to maintain analytical performance while improving prosodic naturalness and immersion in extended spoken interactions.
Robust Audio-Text Retrieval via Cross-Modal Attention and Hybrid Loss
cs.CL 2026-04 unverdicted novelty 4.0

A cross-modal attention refinement module plus hybrid loss improves robustness of audio-text retrieval on noisy and long-form audio.
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
cs.CV 2024-06 unverdicted novelty 4.0

VideoLLaMA 2 improves video LLMs via a new STC connector for spatial-temporal dynamics and joint audio training, reaching competitive results on video QA and captioning benchmarks.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · cited by 58 Pith papers · 4 internal anchors

[1]

MusicLM: Generating Music From Text

Andrea Agostinelli, Timo I Denk, Zalán Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang,ArenJansen,AdamRoberts,MarcoTagliasacchi,etal. Musiclm: Generatingmusicfromtext. arXiv preprint arXiv:2301.11325,

work page internal anchor Pith review arXiv
[2]

JunyiAo,RuiWang,LongZhou,ChengyiWang,ShuoRen,YuWu,ShujieLiu,TomKo,QingLi,YuZhang,etal

JunyiAo,RuiWang,LongZhou,ChengyiWang,ShuoRen,YuWu,ShujieLiu,TomKo,QingLi,YuZhang,etal. Speecht5: Unified-modal encoder-decoder pre-training for spoken language processing.arXiv:2110.07205,

work page arXiv
[3]

Ardila, M

R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber. Common voice: A massively-multilingual speech corpus. InProceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 4211–4215,

work page 2020
[4]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Speechnet: A universal modularized model for speech processing tasks.arXiv:2105.03070,

Yi-Chen Chen, Po-Han Chi, Shu-wen Yang, Kai-Wei Chang, Jheng-hao Lin, Sung-Feng Huang, Da-Rong Liu, Chi-Liang Liu, Cheng-Kuang Lee, and Hung-yi Lee. Speechnet: A universal modularized model for speech processing tasks.arXiv:2105.03070,

work page arXiv
[6]

Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models. arXiv preprint arXiv:2311.07919,

work page internal anchor Pith review arXiv
[7]

Fleurs: Few-shot learning evaluation of universal representations of speech

Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara Rivera, and Ankur Bapna. Fleurs: Few-shot learning evaluation of universal representations of speech. 2022 IEEE Spoken Language T echnology Workshop (SLT) , pages 798–805,

work page 2022
[8]

Nilaksh Das, Saket Dingliwal, Srikanth Ronanki, Rohit Paturi, Zhaocheng Huang, Prashant Mathur, Jie Yuan, Dhanush Bekal, Xing Niu, Sai Muralidhar Jayanthi, et al

URLhttps: //api.semanticscholar.org/CorpusID:249062909. Nilaksh Das, Saket Dingliwal, Srikanth Ronanki, Rohit Paturi, David Huang, Prashant Mathur, Jie Yuan, Dhanush Bekal, Xing Niu, Sai Muralidhar Jayanthi, et al. Speechverse: A large-scale generalizable audio language model.arXiv preprint arXiv:2405.08295,

work page arXiv
[9]

Clotho: an audio captioning dataset

Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. Clotho: an audio captioning dataset. In2020 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2020, Barcelona, Spain, May 4-8,

work page 2020
[10]

Aishell-2: Transform- ing mandarin asr research into industrial scale,

Jiayu Du, Xingyu Na, Xuechen Liu, and Hui Bu. AISHELL-2: transforming mandarin ASR research into industrial scale. abs/1808.10583,

work page arXiv
[11]

CLAP: learning audio concepts from natural language supervision

14 Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, and Huaming Wang. CLAP: learning audio concepts from natural language supervision. abs/2206.04769,

work page arXiv
[12]

Funasr: A fundamental end-to-end speech recognition toolkit

Zhifu Gao, Zerui Li, Jiaming Wang, Haoneng Luo, Xian Shi, Mengzhe Chen, Yabin Li, Lingyun Zuo, Zhihao Du, Zhangyu Xiao, and Shiliang Zhang. Funasr: A fundamental end-to-end speech recognition toolkit. CoRR, abs/2305.11013,

work page arXiv
[13]

Vocalsound: Adatasetforimprovinghumanvocalsoundsrecognition

YuanGong,JinYu,andJamesR.Glass. Vocalsound: Adatasetforimprovinghumanvocalsoundsrecognition. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022, Virtual and Singapore, 23-27 May 2022, pages 151–155. IEEE,

work page 2022
[14]

Audioclip: Extending clip to image, text and audio

doi: 10.1109/ICASSP43922.2022.9746828. URLhttps://doi. org/10.1109/ICASSP43922.2022.9746828. Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim. Audiocaps: Generating captions for audios in the wild. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language T echnologie...

work page doi:10.1109/icassp43922.2022.9746828 2022
[15]

Audio flamingo: A novel audio language model with few-shot learning and dialogue abilities,

Zhifeng Kong, Arushi Goel, Rohan Badlani, Wei Ping, Rafael Valle, and Bryan Catanzaro. Audio flamingo: A novel audio language model with few-shot learning and dialogue abilities.arXiv preprint arXiv:2402.01831,

work page arXiv
[16]

arXiv preprint arXiv:2306.09093 , year=

Chenyang Lyu, Minghao Wu, Longyue Wang, Xinting Huang, Bingshuai Liu, Zefeng Du, Shuming Shi, and Zhaopeng Tu. Macaw-llm: Multi-modal language modeling with image, audio, video, and text integration. CoRR, abs/2306.09093,

work page arXiv
[17]

VassilPanayotov,GuoguoChen,DanielPovey,andSanjeevKhudanpur

URLhttps://openai.com/index/hello-gpt-4o/. VassilPanayotov,GuoguoChen,DanielPovey,andSanjeevKhudanpur. Librispeech: AnASRcorpusbasedon public domain audio books. In2015 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2015, South Brisbane, Queensland, Australia, April 19-24, 2015 . IEEE,

work page 2015
[18]

MELD: A multimodal multi-party dataset for emotion recognition in conversations

Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. MELD: A multimodal multi-party dataset for emotion recognition in conversations. InProceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, V olume 1: Long Papers. Association f...

work page 2019
[19]

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever

URL https://github.com/QwenLM/Qwen-7B. Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. InInternational Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA,

work page 2023
[20]

Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, RaduSoricut,AngelikiLazaridou,OrhanFirat,JulianSchrittwieser,etal.Gemini1.5: Unlockingmultimodal understanding across millions of tokens of context.arXiv preprint arXiv:2403.05530,

work page internal anchor Pith review Pith/arXiv arXiv
[21]

Spokenwoz: A large-scale speech-text benchmark for spoken task-oriented dialogue in multiple domains

ShuzhengSi,WentaoMa,YuchuanWu,YinpeiDai, HaoyuGao,Ting-EnLin, HangyuLi,RuiYan, FeiHuang, and Yongbin Li. Spokenwoz: A large-scale speech-text benchmark for spoken task-oriented dialogue in multiple domains. arXiv preprint arXiv:2305.13040,

work page arXiv
[22]

Pandagpt: One model to instruction-follow them all.arXiv preprint arXiv:2305.16355, 2023

15 Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai. Pandagpt: One model to instruction- follow them all.arXiv:2305.16355,

work page arXiv
[24]

ChenWang,MinpengLiao,ZhongqiangHuang,JinliangLu,JunhongWu,YuchenLiu,ChengqingZong,and Jiajun Zhang

URLhttps://arxiv.org/abs/2007.10310. ChenWang,MinpengLiao,ZhongqiangHuang,JinliangLu,JunhongWu,YuchenLiu,ChengqingZong,and Jiajun Zhang. Blsp: Bootstrapping language-speech pre-training via behavior alignment of continuation writing. arXiv:2309.00916, 2023a. Mingqiu Wang, Wei Han, Izhak Shafran, Zelin Wu, Chung-Cheng Chiu, Yuan Cao, Yongqiang Wang, Nanxin...

work page arXiv 2007
[25]

Speechgpt: Em- powering large language models with intrinsic cross-modal conversational abilities.CoRR, abs/2305.11000, 2023a

Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, and Xipeng Qiu. Speechgpt: Em- powering large language models with intrinsic cross-modal conversational abilities.CoRR, abs/2305.11000,

work page arXiv
[26]

Mmspeech: Multi-modal multi-task encoder-decoder pre-training for speech recognition

Xiaohuan Zhou, Jiaming Wang, Zeyu Cui, Shiliang Zhang, Zhijie Yan, Jingren Zhou, and Chang Zhou. Mmspeech: Multi-modal multi-task encoder-decoder pre-training for speech recognition. abs/2212.00500,

work page arXiv