Recognition: 1 theorem link
· Lean TheoremQwen2-Audio Technical Report
Pith reviewed 2026-05-11 02:10 UTC · model grok-4.3
The pith
Qwen2-Audio processes mixed audio inputs like sounds and conversations while following spoken commands, outperforming prior models such as Gemini-1.5-pro on audio instruction benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Qwen2-Audio accepts diverse audio inputs and responds to speech instructions in either voice-chat or audio-analysis mode without requiring explicit system prompts to change behavior. It directly interprets commands embedded in complex audio containing sounds and multi-speaker dialogue, delivering relevant interpretations and replies. Training simplification through natural language prompts across expanded datasets, followed by DPO tuning for factuality, produces stronger instruction adherence than earlier top models on AIR-Bench audio-centric evaluations.
What carries the argument
The dual interaction capability of Qwen2-Audio, where natural language prompts during training enable seamless handling of voice chat and audio analysis without mode-switching prompts.
If this is right
- Users can speak freely to the model and receive context-aware replies even when audio contains overlapping speech and noises.
- The same model instance supports both casual voice dialogue and detailed audio examination in one session.
- Expanded prompt-based training data improves the model's ability to follow instructions across varied audio scenarios.
- DPO tuning raises factuality so replies stay closer to actual audio content and avoid unwanted behaviors.
- Open release allows others to test and extend the model for new audio-language tasks.
Where Pith is reading between the lines
- The removal of mode-switching prompts may point to a general pattern where models learn to infer intent from raw input combinations alone.
- If the training simplification works here, similar prompt-only methods could shorten development cycles for other audio or video models.
- Widespread use of such open audio models could improve voice interfaces in devices that must handle noisy or multi-speaker environments.
- Future benchmarks might need to include live, unscripted audio to check whether the reported gains hold outside controlled test sets.
Load-bearing premise
The AIR-Bench tests used to measure outperformance accurately capture real-world audio instruction following without biases from how the tests were built or which data was chosen.
What would settle it
A controlled comparison on new audio recordings that mix sounds, conversations, and commands, showing Qwen2-Audio does not exceed Gemini-1.5-pro accuracy or relevance in user-rated responses.
read the original abstract
We introduce the latest progress of Qwen-Audio, a large-scale audio-language model called Qwen2-Audio, which is capable of accepting various audio signal inputs and performing audio analysis or direct textual responses with regard to speech instructions. In contrast to complex hierarchical tags, we have simplified the pre-training process by utilizing natural language prompts for different data and tasks, and have further expanded the data volume. We have boosted the instruction-following capability of Qwen2-Audio and implemented two distinct audio interaction modes for voice chat and audio analysis. In the voice chat mode, users can freely engage in voice interactions with Qwen2-Audio without text input. In the audio analysis mode, users could provide audio and text instructions for analysis during the interaction. Note that we do not use any system prompts to switch between voice chat and audio analysis modes. Qwen2-Audio is capable of intelligently comprehending the content within audio and following voice commands to respond appropriately. For instance, in an audio segment that simultaneously contains sounds, multi-speaker conversations, and a voice command, Qwen2-Audio can directly understand the command and provide an interpretation and response to the audio. Additionally, DPO has optimized the model's performance in terms of factuality and adherence to desired behavior. According to the evaluation results from AIR-Bench, Qwen2-Audio outperformed previous SOTAs, such as Gemini-1.5-pro, in tests focused on audio-centric instruction-following capabilities. Qwen2-Audio is open-sourced with the aim of fostering the advancement of the multi-modal language community.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Qwen2-Audio, a large-scale audio-language model extending prior Qwen-Audio work. It accepts diverse audio inputs and generates direct textual responses to speech instructions. Pre-training is simplified via natural language prompts across tasks and data, with expanded data volume. The model supports two interaction modes—voice chat (no text input required) and audio analysis (audio plus text instructions)—switched without system prompts. DPO is applied to improve factuality and behavioral adherence. The central claim is that Qwen2-Audio outperforms prior SOTAs including Gemini-1.5-pro on AIR-Bench for audio-centric instruction-following, and the model is open-sourced.
Significance. If the performance claims are substantiated with full evaluation details, the work would advance multi-modal language modeling by demonstrating effective audio instruction-following in open-source form. Open-sourcing is a clear strength that supports community reproducibility and further development of audio-language systems.
major comments (1)
- [Abstract] Abstract: the claim that Qwen2-Audio 'outperformed previous SOTAs, such as Gemini-1.5-pro' on AIR-Bench audio-centric instruction-following is load-bearing for the paper's primary contribution yet supplies no information on model size, training data composition, the exact AIR-Bench subset or prompt templates used, the procedure for querying closed models such as Gemini-1.5-pro, or any statistical significance testing. This absence prevents verification that the reported gap reflects intrinsic capability rather than differences in evaluation protocol.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract and the recommendation for major revision. We agree that additional context would strengthen verifiability of the performance claims and will revise the manuscript to address this.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim that Qwen2-Audio 'outperformed previous SOTAs, such as Gemini-1.5-pro' on AIR-Bench audio-centric instruction-following is load-bearing for the paper's primary contribution yet supplies no information on model size, training data composition, the exact AIR-Bench subset or prompt templates used, the procedure for querying closed models such as Gemini-1.5-pro, or any statistical significance testing. This absence prevents verification that the reported gap reflects intrinsic capability rather than differences in evaluation protocol.
Authors: We acknowledge that the abstract is concise and omits these specifics. In the revised version we will expand the abstract to note the base model size, the expanded audio-text training data relative to Qwen-Audio, the audio-centric instruction-following subset of AIR-Bench, and the use of standard prompt templates. For closed models we will state that official APIs were used with identical instructions to those given to Qwen2-Audio. Full experimental protocols, data composition, and prompt details already appear in Sections 3 and 4 of the manuscript; we will add a cross-reference in the abstract. We did not perform formal statistical significance testing, as the observed gaps were large and consistent across evaluation runs, but we can add a clarifying sentence to this effect. revision: yes
Circularity Check
No significant circularity in derivation chain
full rationale
The paper describes model architecture, training with natural language prompts on expanded data, two interaction modes, DPO optimization, and empirical results on the external AIR-Bench benchmark showing outperformance versus Gemini-1.5-pro. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the provided text or abstract. The central claim rests on benchmark comparisons that are independent of the model's internal construction, making the derivation self-contained rather than tautological.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 60 Pith papers
-
TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos
TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
-
ReasonAudio: A Benchmark for Evaluating Reasoning Beyond Matching in Text-Audio Retrieval
ReasonAudio benchmark reveals that state-of-the-art text-audio retrieval models struggle with reasoning tasks like negation and duration, and multimodal LLMs lose reasoning ability after contrastive fine-tuning.
-
HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models
HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-lang...
-
DialBGM: A Benchmark for Background Music Recommendation from Everyday Multi-Turn Dialogues
DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.
-
SpurAudio: A Benchmark for Studying Shortcut Learning in Few-Shot Audio Classification
SpurAudio benchmark shows state-of-the-art few-shot audio classifiers suffer large performance drops when background correlations are disrupted, even in large pretrained models.
-
NAACA: Training-Free NeuroAuditory Attentive Cognitive Architecture with Oscillatory Working Memory for Salience-Driven Attention Gating
NAACA uses a neuro-inspired oscillatory working memory to gate attention in audio language models, raising AudioQwen's average precision from 53.5% to 70.6% on XD-Violence while cutting unnecessary calls.
-
How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue
Channel fusion gives better semantic grounding and QA performance in full-duplex LLM dialogue but is vulnerable to context corruption during interruptions, while cross-attention routing is more robust at the cost of w...
-
MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes
MIST is a new synthetic speech-based tool-calling dataset for IoT devices that exposes performance gaps between open- and closed-weight multimodal LLMs.
-
VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing
VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conve...
-
Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM
TextPro-SLM minimizes the speech-text modality gap from the input side via a prosody-aware unified encoder, delivering the lowest gap and strong performance at 3B/7B scales with only ~1000 hours of audio.
-
Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization
Sparse selection of high-gradient-energy audio tokens suffices for effective jailbreaking of audio language models with minimal drop in attack success rate.
-
ReasonAudio: A Benchmark for Evaluating Reasoning Beyond Matching in Text-Audio Retrieval
ReasonAudio benchmark shows current text-audio retrieval models fail at reasoning tasks like negation and duration discrimination beyond simple semantic matching.
-
Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages
Introduces the Indic-CodecFake dataset for Indic codec deepfakes and SATYAM, a novel hyperbolic ALM that outperforms baselines through dual-stage semantic-prosodic fusion using Bhattacharya distance.
-
Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection
AudioHijack generates imperceptible adversarial audio via gradient estimation, attention supervision, and reverberation blending to hijack 13 LALMs with 79-96% success on unseen contexts and real commercial agents.
-
Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning
RoleJudge is a multidimensional evaluation framework for speech-character alignment in audio LLMs, backed by the RoleChat dataset and multi-stage RL training with standard alignment to reduce reward issues.
-
HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models
HumDial-EIBench is a new benchmark using real human dialogues to evaluate audio language models on emotional intelligence tasks including multi-turn tracking, causal reasoning, empathy generation, and acoustic-semanti...
-
Ti-Audio: The First Multi-Dialectal End-to-End Speech LLM for Tibetan
Ti-Audio is the first multi-dialectal end-to-end Speech-LLM for Tibetan that achieves state-of-the-art performance on ASR and speech translation benchmarks via a Dynamic Q-Former Adapter and cross-dialect cooperation.
-
Unified Multimodal Uncertain Inference
Introduces UMUI task for fine-grained multimodal probabilistic inference and CLUE calibration method, where a 3B model matches larger baselines.
-
Jamendo-MT-QA: A Benchmark for Multi-Track Comparative Music Question Answering
Jamendo-MT-QA is a new dataset and benchmark for multi-track comparative music question answering, constructed via an LLM-assisted pipeline from Creative Commons Jamendo tracks and used to evaluate audio-language models.
-
Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR
Speaker-Reasoner is an end-to-end speech LLM that iteratively analyzes audio structure, predicts temporal boundaries, and jointly models speaker identity, gender, timestamps, and transcription using a speaker-aware ca...
-
KoALa-Bench: Evaluating Large Audio Language Models on Korean Speech Understanding and Faithfulness
KoALa-Bench is a new public benchmark with six tasks that tests Korean speech recognition, translation, question answering, instruction following, and faithfulness in large audio language models.
-
Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark
SceneBench shows VLMs lose accuracy on scene-level questions in long videos due to forgetting, and Scene-RAG retrieval improves performance by 2.5%.
-
Mind the Pause: Disfluency-Aware Objective Tuning for Multilingual Speech Correction with LLMs
A sequence-tagger-guided LLM with contrastive objective corrects disfluencies in Hindi, Bengali, and Marathi ASR transcripts, outperforming removal-only baselines.
-
Towards Fine-Grained Multi-Dimensional Speech Understanding: Data Pipeline, Benchmark, and Model
A data pipeline, 14-dimension benchmark, and decoupled fine-tuning model are presented to advance fine-grained multi-dimensional speech understanding in LLMs.
-
VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models
VocalParse applies interleaved and Chain-of-Thought prompting to a Large Audio Language Model to jointly transcribe lyrics, melody and word-note alignments, achieving state-of-the-art results on multiple singing datasets.
-
JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions
JASTIN is an instruction-driven audio evaluation system that achieves state-of-the-art correlation with human ratings on speech, sound, music, and out-of-domain tasks without task-specific retraining.
-
When Audio-Language Models Fail to Leverage Multimodal Context for Dysarthric Speech Recognition
Current audio-language models fail to use clinical multimodal context for dysarthric speech recognition, but context-aware LoRA fine-tuning delivers large accuracy gains on the SAP dataset.
-
Mitigating Multimodal LLMs Hallucinations via Relevance Propagation at Inference Time
LIME reduces hallucinations in multimodal LLMs by using LRP to boost perceptual modality contributions through inference-time KV updates.
-
EmoMM: Benchmarking and Steering MLLM for Multimodal Emotion Recognition under Conflict and Missingness
EmoMM benchmark reveals Video Contribution Collapse in MLLMs for emotion recognition under modality conflict and missingness, mitigated by CHASE head-level attention steering.
-
All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation
Audio-language models retain 60-72% of benchmark scores without audio, and most audio-dependent items can be solved from short fragments rather than full clips.
-
Beyond Acoustic Sparsity and Linguistic Bias: A Prompt-Free Paradigm for Mispronunciation Detection and Diagnosis
CROTTC-IF is a prompt-free MDD system with monotonic frame-level alignment and implicit knowledge transfer that reaches 71.77% F1 on L2-ARCTIC and 71.70% on Iqra'Eval2.
-
MoDAl: Self-Supervised Neural Modality Discovery via Decorrelation for Speech Neuroprosthesis
MoDAl discovers complementary neurolinguistic modalities via contrastive-decorrelation objectives, cutting brain-to-text word error rate from 26.3% to 21.6% by incorporating area 44 signals.
-
Hard to Be Heard: Phoneme-Level ASR Analysis of Phonologically Complex, Low-Resource Endangered Languages
Phoneme-level analysis of ASR on Archi and Rutul shows data scarcity explains recognition errors better than phonological complexity, with language-specific adaptations improving wav2vec2 performance.
-
VIBE: Voice-Induced open-ended Bias Evaluation for Large Audio-Language Models via Real-World Speech
VIBE evaluates generative biases in large audio-language models with real-world speech and open-ended tasks, showing that gender cues produce larger distributional shifts than accent cues across 11 tested models.
-
SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding
SpotSound adds a hallucination-suppressing objective and a needle-in-haystack benchmark to audio-language models, reaching state-of-the-art temporal grounding while keeping general task performance.
-
Why Your Tokenizer Fails in Information Fusion: A Timing-Aware Pre-Quantization Fusion for Video-Enhanced Audio Tokenization
A timing-aware pre-quantization fusion approach integrates visual cues into audio tokenizers along the temporal axis, maintaining reconstruction quality while outperforming audio-only and prior multimodal baselines on...
-
LaDA-Band: Language Diffusion Models for Vocal-to-Accompaniment Generation
LaDA-Band applies discrete masked diffusion with dual-track conditioning and progressive training to generate vocal-to-accompaniment tracks that improve acoustic authenticity, global coherence, and dynamic orchestrati...
-
ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models
ASPIRin decouples speaking timing from token content via binary action space projection and applies GRPO with rule-based rewards to optimize interactivity in SLMs without semantic collapse or repetition.
-
GRM: Utility-Aware Jailbreak Attacks on Audio LLMs via Gradient-Ratio Masking
GRM ranks Mel bands by attack contribution versus utility sensitivity, perturbs a subset, and learns a universal perturbation to reach 88.46% average jailbreak success rate with improved attack-utility trade-off on fo...
-
Noise-Aware In-Context Learning for Hallucination Mitigation in ALLMs
NAICL reduces hallucination rates in ALLMs from 26.53% to 16.98% via noise priors in context and introduces the Clotho-1K benchmark with four hallucination types.
-
Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs
A multi-stage training method for LLM-based ASR uses new entropy allocation metrics to achieve competitive benchmark performance with 2.3B parameters while mitigating hallucinations via better encoder-LLM decoupling.
-
RASR: Retrieval-Augmented Semantic Reasoning for Fake News Video Detection
RASR retrieves cross-instance semantic evidence and uses domain priors to drive multimodal LLM reasoning for improved fake news video detection on FakeSV and FakeTT datasets.
-
FastTurn: Unifying Acoustic and Streaming Semantic Cues for Low-Latency and Robust Turn Detection
FastTurn unifies acoustic features and streaming CTC decoding for low-latency, robust turn detection in full-duplex dialogue systems and releases a realistic human-dialogue test set.
-
Qwen3-Omni Technical Report
Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-mo...
-
Task-Aware Answer Preservation under Audio Compression for Large Audio Language Models
A statistical sign-off protocol for audio compressors ensures worst-case answer preservation across query families in LALMs.
-
Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM
TextPro-SLM reduces the speech-text modality gap by feeding an LLM backbone with synchronized text tokens and prosody embeddings from WhisperPro, achieving lowest gap scores at 3B/7B scales with roughly 1,000 hours of audio.
-
AUDITA: A New Dataset to Audit Humans vs. AI Skill at Audio QA
AUDITA is a challenging audio QA benchmark where humans score 32% accuracy on average while state-of-the-art models score below 9%, using IRT to reveal systematic model deficiencies.
-
Detecting Hallucinations in SpeechLLMs at Inference Time Using Attention Maps
Four attention metrics enable logistic regression classifiers that detect hallucinations in SpeechLLMs with up to +0.23 PR-AUC gains over baselines on ASR and translation tasks.
-
Audio-DeepThinker: Progressive Reasoning-Aware Reinforcement Learning for High-Quality Chain-of-Thought Emergence in Audio Language Models
A hybrid-reward progressive RL curriculum enables high-quality chain-of-thought to emerge in audio language models without prior supervised CoT training, yielding SOTA results on MMAR, MMAU, and MMSU benchmarks.
-
FreezeEmpath: Efficient Training for Empathetic Spoken Chatbots with Frozen LLMs
FreezeEmpath achieves emotionally expressive speech output and strong performance on empathetic dialogue, speech emotion recognition, and spoken QA tasks by training with a frozen LLM on existing speech datasets.
-
TinyMU: A Compact Audio-Language Model for Music Understanding
TinyMU is a 229M-parameter compact music understanding model that achieves 82% of state-of-the-art large audio-language model performance on the MuChoMusic benchmark while being 35 times smaller.
-
Qwen3.5-Omni Technical Report
Qwen3.5-Omni scales an omnimodal model to hundreds of billions of parameters with 256k context, introduces ARIA for stable speech synthesis, and reports SOTA performance on 215 audio-visual benchmarks while adding mul...
-
Towards Fine-grained Temporal Perception: Post-Training Large Audio-Language Models with Audio-Side Time Prompt
TimePro-RL interleaves timestamp embeddings in audio sequences and applies RL post-SFT to boost temporal alignment in LALMs, yielding gains on grounding, event detection, and dense captioning.
-
Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models
Audio-Cogito is an open-source LALM using Cogito-pipe data curation and self-distillation to achieve leading open-source performance on audio reasoning benchmarks.
-
Do We Need Distinct Representations for Every Speech Token? Unveiling and Exploiting Redundancy in Large Speech Language Models
Deep layers of speech language models show high token redundancy that can be compressed via training-free similarity pooling, reducing prefilling costs by 27% while preserving task performance.
-
Kimi-Audio Technical Report
Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million ho...
-
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.
-
Step-Audio-R1.5 Technical Report
Step-Audio-R1.5 applies RLHF to audio reasoning models to maintain analytical performance while improving prosodic naturalness and immersion in extended spoken interactions.
-
Robust Audio-Text Retrieval via Cross-Modal Attention and Hybrid Loss
A cross-modal attention refinement module plus hybrid loss improves robustness of audio-text retrieval on noisy and long-form audio.
-
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
VideoLLaMA 2 improves video LLMs via a new STC connector for spatial-temporal dynamics and joint audio training, reaching competitive results on video QA and captioning benchmarks.
Reference graph
Works this paper leans on
-
[1]
MusicLM: Generating Music From Text
Andrea Agostinelli, Timo I Denk, Zalán Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang,ArenJansen,AdamRoberts,MarcoTagliasacchi,etal. Musiclm: Generatingmusicfromtext. arXiv preprint arXiv:2301.11325,
work page internal anchor Pith review arXiv
-
[2]
JunyiAo,RuiWang,LongZhou,ChengyiWang,ShuoRen,YuWu,ShujieLiu,TomKo,QingLi,YuZhang,etal
JunyiAo,RuiWang,LongZhou,ChengyiWang,ShuoRen,YuWu,ShujieLiu,TomKo,QingLi,YuZhang,etal. Speecht5: Unified-modal encoder-decoder pre-training for spoken language processing.arXiv:2110.07205,
-
[3]
R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber. Common voice: A massively-multilingual speech corpus. InProceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 4211–4215,
work page 2020
-
[4]
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609,
work page internal anchor Pith review Pith/arXiv arXiv
-
[5]
Speechnet: A universal modularized model for speech processing tasks.arXiv:2105.03070,
Yi-Chen Chen, Po-Han Chi, Shu-wen Yang, Kai-Wei Chang, Jheng-hao Lin, Sung-Feng Huang, Da-Rong Liu, Chi-Liang Liu, Cheng-Kuang Lee, and Hung-yi Lee. Speechnet: A universal modularized model for speech processing tasks.arXiv:2105.03070,
-
[6]
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models. arXiv preprint arXiv:2311.07919,
work page internal anchor Pith review arXiv
-
[7]
Fleurs: Few-shot learning evaluation of universal representations of speech
Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara Rivera, and Ankur Bapna. Fleurs: Few-shot learning evaluation of universal representations of speech. 2022 IEEE Spoken Language T echnology Workshop (SLT) , pages 798–805,
work page 2022
-
[8]
URLhttps: //api.semanticscholar.org/CorpusID:249062909. Nilaksh Das, Saket Dingliwal, Srikanth Ronanki, Rohit Paturi, David Huang, Prashant Mathur, Jie Yuan, Dhanush Bekal, Xing Niu, Sai Muralidhar Jayanthi, et al. Speechverse: A large-scale generalizable audio language model.arXiv preprint arXiv:2405.08295,
-
[9]
Clotho: an audio captioning dataset
Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. Clotho: an audio captioning dataset. In2020 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2020, Barcelona, Spain, May 4-8,
work page 2020
-
[10]
Aishell-2: Transform- ing mandarin asr research into industrial scale,
Jiayu Du, Xingyu Na, Xuechen Liu, and Hui Bu. AISHELL-2: transforming mandarin ASR research into industrial scale. abs/1808.10583,
-
[11]
CLAP: learning audio concepts from natural language supervision
14 Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, and Huaming Wang. CLAP: learning audio concepts from natural language supervision. abs/2206.04769,
-
[12]
Funasr: A fundamental end-to-end speech recognition toolkit
Zhifu Gao, Zerui Li, Jiaming Wang, Haoneng Luo, Xian Shi, Mengzhe Chen, Yabin Li, Lingyun Zuo, Zhihao Du, Zhangyu Xiao, and Shiliang Zhang. Funasr: A fundamental end-to-end speech recognition toolkit. CoRR, abs/2305.11013,
-
[13]
Vocalsound: Adatasetforimprovinghumanvocalsoundsrecognition
YuanGong,JinYu,andJamesR.Glass. Vocalsound: Adatasetforimprovinghumanvocalsoundsrecognition. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022, Virtual and Singapore, 23-27 May 2022, pages 151–155. IEEE,
work page 2022
-
[14]
Audioclip: Extending clip to image, text and audio
doi: 10.1109/ICASSP43922.2022.9746828. URLhttps://doi. org/10.1109/ICASSP43922.2022.9746828. Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim. Audiocaps: Generating captions for audios in the wild. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language T echnologie...
-
[15]
Audio flamingo: A novel audio language model with few-shot learning and dialogue abilities,
Zhifeng Kong, Arushi Goel, Rohan Badlani, Wei Ping, Rafael Valle, and Bryan Catanzaro. Audio flamingo: A novel audio language model with few-shot learning and dialogue abilities.arXiv preprint arXiv:2402.01831,
-
[16]
arXiv preprint arXiv:2306.09093 , year=
Chenyang Lyu, Minghao Wu, Longyue Wang, Xinting Huang, Bingshuai Liu, Zefeng Du, Shuming Shi, and Zhaopeng Tu. Macaw-llm: Multi-modal language modeling with image, audio, video, and text integration. CoRR, abs/2306.09093,
-
[17]
VassilPanayotov,GuoguoChen,DanielPovey,andSanjeevKhudanpur
URLhttps://openai.com/index/hello-gpt-4o/. VassilPanayotov,GuoguoChen,DanielPovey,andSanjeevKhudanpur. Librispeech: AnASRcorpusbasedon public domain audio books. In2015 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2015, South Brisbane, Queensland, Australia, April 19-24, 2015 . IEEE,
work page 2015
-
[18]
MELD: A multimodal multi-party dataset for emotion recognition in conversations
Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. MELD: A multimodal multi-party dataset for emotion recognition in conversations. InProceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, V olume 1: Long Papers. Association f...
work page 2019
-
[19]
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever
URL https://github.com/QwenLM/Qwen-7B. Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. InInternational Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA,
work page 2023
-
[20]
Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, RaduSoricut,AngelikiLazaridou,OrhanFirat,JulianSchrittwieser,etal.Gemini1.5: Unlockingmultimodal understanding across millions of tokens of context.arXiv preprint arXiv:2403.05530,
work page internal anchor Pith review Pith/arXiv arXiv
-
[21]
Spokenwoz: A large-scale speech-text benchmark for spoken task-oriented dialogue in multiple domains
ShuzhengSi,WentaoMa,YuchuanWu,YinpeiDai, HaoyuGao,Ting-EnLin, HangyuLi,RuiYan, FeiHuang, and Yongbin Li. Spokenwoz: A large-scale speech-text benchmark for spoken task-oriented dialogue in multiple domains. arXiv preprint arXiv:2305.13040,
-
[22]
Pandagpt: One model to instruction-follow them all.arXiv preprint arXiv:2305.16355, 2023
15 Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai. Pandagpt: One model to instruction- follow them all.arXiv:2305.16355,
-
[24]
ChenWang,MinpengLiao,ZhongqiangHuang,JinliangLu,JunhongWu,YuchenLiu,ChengqingZong,and Jiajun Zhang
URLhttps://arxiv.org/abs/2007.10310. ChenWang,MinpengLiao,ZhongqiangHuang,JinliangLu,JunhongWu,YuchenLiu,ChengqingZong,and Jiajun Zhang. Blsp: Bootstrapping language-speech pre-training via behavior alignment of continuation writing. arXiv:2309.00916, 2023a. Mingqiu Wang, Wei Han, Izhak Shafran, Zelin Wu, Chung-Cheng Chiu, Yuan Cao, Yongqiang Wang, Nanxin...
-
[25]
Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, and Xipeng Qiu. Speechgpt: Em- powering large language models with intrinsic cross-modal conversational abilities.CoRR, abs/2305.11000,
-
[26]
Mmspeech: Multi-modal multi-task encoder-decoder pre-training for speech recognition
Xiaohuan Zhou, Jiaming Wang, Zeyu Cui, Shiliang Zhang, Zhijie Yan, Jingren Zhou, and Chang Zhou. Mmspeech: Multi-modal multi-task encoder-decoder pre-training for speech recognition. abs/2212.00500,
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.