arxiv: 2311.07919 · v2 · submitted 2023-11-14 · 📡 eess.AS · cs.CL· cs.LG

Recognition: 2 theorem links

Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

Chang Zhou, Jingren Zhou, Jin Xu, Qian Yang, Shiliang Zhang, Xiaohuan Zhou, Yunfei Chu, Zhijie Yan

Authors on Pith no claims yet

Pith reviewed 2026-05-12 18:52 UTC · model grok-4.3

classification 📡 eess.AS cs.CLcs.LG

keywords audio-language modelsmulti-task traininghierarchical tagsuniversal audio understandinginstruction followingpre-trainingQwen-Audiospeech and music processing

0 comments

The pith

Scaling audio-language pre-training to over 30 tasks and audio types with hierarchical tag conditioning produces a single model that performs strongly on diverse benchmarks without any task-specific fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops Qwen-Audio to overcome the limitation that prior audio-language models could handle only narrow sets of audio types and tasks. It scales pre-training across human speech, natural sounds, music, and songs while using a multi-task framework that feeds hierarchical tags to the decoder. Shared tags promote knowledge transfer across tasks; specified tags isolate differences in label format, language, and granularity to reduce interference. The resulting model matches or exceeds specialized systems on standard benchmarks with no further tuning per task. This unified approach also supports a follow-on chat model for multi-turn audio-text conversations.

Core claim

Qwen-Audio is trained by scaling audio-language pre-training over more than 30 tasks and multiple audio categories. Direct joint training creates interference because textual labels differ in focus, language, granularity, and structure across datasets. The authors address this with a conditioning mechanism that supplies a sequence of hierarchical tags to the decoder: shared tags encourage cross-task knowledge sharing while task-specific tags isolate conflicting label characteristics. The trained model then delivers strong results on a wide range of audio understanding benchmarks without requiring any per-task fine-tuning and outperforms prior counterparts.

What carries the argument

The multi-task training framework that conditions the decoder on a sequence of hierarchical tags, where shared tags promote knowledge sharing and specified tags isolate dataset-specific label variations to prevent interference.

If this is right

The same model supports direct multi-turn dialogue when paired with text inputs, enabling Qwen-Audio-Chat for audio-centric scenarios.
Universal audio understanding becomes feasible for mixed inputs spanning speech, environmental sounds, music, and songs within one system.
Task-specific fine-tuning steps can be skipped for many standard audio benchmarks while still matching or exceeding specialized models.
The training recipe scales to additional audio types or tasks without redesigning separate heads or loss functions for each.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The tag-conditioning pattern may transfer to other multi-modal settings where label formats vary across datasets, such as vision-language or video tasks.
If the interference-avoidance mechanism holds, future unified models could incorporate streaming or real-time audio without retraining separate pipelines.
Performance gains on held-out benchmarks would suggest that the hierarchical tags act as a lightweight form of task routing inside a single decoder.

Load-bearing premise

The hierarchical tag conditioning is sufficient to prevent performance interference from the large variations in textual labels across the collected datasets.

What would settle it

A controlled ablation in which the same model is trained on the identical data mixture but without the hierarchical tags, then evaluated on the same benchmarks; if individual-task scores drop markedly or fall below fine-tuned baselines, the central claim would be falsified.

read the original abstract

Recently, instruction-following audio-language models have received broad attention for audio interaction with humans. However, the absence of pre-trained audio models capable of handling diverse audio types and tasks has hindered progress in this field. Consequently, most existing works have only been able to support a limited range of interaction capabilities. In this paper, we develop the Qwen-Audio model and address this limitation by scaling up audio-language pre-training to cover over 30 tasks and various audio types, such as human speech, natural sounds, music, and songs, to facilitate universal audio understanding abilities. However, directly co-training all tasks and datasets can lead to interference issues, as the textual labels associated with different datasets exhibit considerable variations due to differences in task focus, language, granularity of annotation, and text structure. To overcome the one-to-many interference, we carefully design a multi-task training framework by conditioning on a sequence of hierarchical tags to the decoder for encouraging knowledge sharing and avoiding interference through shared and specified tags respectively. Remarkably, Qwen-Audio achieves impressive performance across diverse benchmark tasks without requiring any task-specific fine-tuning, surpassing its counterparts. Building upon the capabilities of Qwen-Audio, we further develop Qwen-Audio-Chat, which allows for input from various audios and text inputs, enabling multi-turn dialogues and supporting various audio-central scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

read the letter

Qwen-Audio scales audio-language pretraining across 30+ tasks with hierarchical tag conditioning to reduce label interference, but the abstract supplies no metrics or ablations to show the mechanism works. The paper pulls together speech, natural sounds, music, and songs into one training run and adds a chat extension for multi-turn audio-text dialogues. The tag design splits conditioning into shared tags for knowledge transfer and specific tags to block interference from varying labels, languages, and annotation styles. That is a concrete, practical step for multi-task audio models. The write-up clearly names the one-to-many interference problem and explains how the framework tries to solve it without task-specific fine-tuning at test time. The scale itself follows the usual large-model pattern but applies it to a broader audio mix than most prior work. The main weakness is the missing evidence. The abstract asserts strong zero-shot results and claims to surpass counterparts yet gives no numbers, baselines, per-task scores, or ablation removing the hierarchy. Without those, it is impossible to tell whether the tags actually prevent interference or whether raw data volume and model size drive any gains. The stress-test note is accurate on this point. If the full paper contains detailed tables, tag-construction details, and controls, the contribution becomes easier to assess. This paper is for researchers working on unified audio-language systems who need ideas for handling heterogeneous datasets. A reader focused on multi-task training or audio understanding would get value from the framework description even before the numbers are verified. It deserves a serious referee to check the experiments and results in full. I would send it to peer review once the quantitative support is confirmed.

Referee Report

2 major / 1 minor

Summary. The paper introduces Qwen-Audio, a unified large-scale audio-language model pre-trained on diverse audio types (speech, natural sounds, music, songs) covering over 30 tasks. It proposes a multi-task training framework that conditions the decoder on a sequence of hierarchical tags to encourage knowledge sharing via shared tags while mitigating interference from heterogeneous textual labels via specified tags. The central claim is that this enables strong zero-shot performance across benchmarks without any task-specific fine-tuning, surpassing prior models; the work also presents Qwen-Audio-Chat for multi-turn audio-text dialogues.

Significance. If the zero-shot performance claims are substantiated, the work would advance universal audio understanding by scaling multi-task pre-training across heterogeneous datasets and audio modalities, addressing a key barrier in instruction-following audio models. It receives credit for the broad task and data coverage as well as the extension to an interactive chat model.

major comments (2)

[Abstract] Abstract: The assertion of 'impressive performance across diverse benchmark tasks without requiring any task-specific fine-tuning, surpassing its counterparts' is presented without any quantitative metrics, specific benchmark names, baseline comparisons, or result tables. This leaves the central empirical claim unsupported by evidence in the manuscript summary.
[Multi-task training framework] Multi-task training framework: The hierarchical tag conditioning is described as the solution to one-to-many interference arising from label variations across datasets. However, the manuscript supplies no ablation removing the tag hierarchy, no per-task performance breakdowns on label-variation subsets, and no details on tag construction for speech/music/sound data. Without these controls, it is impossible to determine whether reported gains stem from the proposed mechanism or from data scale alone.

minor comments (1)

[Abstract] The abstract refers to 'over 30 tasks' without enumerating them or providing examples of the task categories and datasets used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment below and describe the revisions we will make to improve clarity and empirical support.

read point-by-point responses

Referee: [Abstract] Abstract: The assertion of 'impressive performance across diverse benchmark tasks without requiring any task-specific fine-tuning, surpassing its counterparts' is presented without any quantitative metrics, specific benchmark names, baseline comparisons, or result tables. This leaves the central empirical claim unsupported by evidence in the manuscript summary.

Authors: We agree that the abstract would benefit from more concrete quantitative support for the central claim. The full manuscript reports detailed zero-shot results in Section 4 (Tables 1-3), including specific benchmarks such as LibriSpeech ASR, AudioCaps captioning, and VGGSound classification, with Qwen-Audio outperforming baselines like Whisper-large and AudioPaLM by 5-15% relative on several tasks. We will revise the abstract to include key metrics, benchmark names, and brief baseline comparisons while preserving its concise nature. revision: yes
Referee: [Multi-task training framework] Multi-task training framework: The hierarchical tag conditioning is described as the solution to one-to-many interference arising from label variations across datasets. However, the manuscript supplies no ablation removing the tag hierarchy, no per-task performance breakdowns on label-variation subsets, and no details on tag construction for speech/music/sound data. Without these controls, it is impossible to determine whether reported gains stem from the proposed mechanism or from data scale alone.

Authors: We thank the referee for this observation. Section 3.2 describes the hierarchical tag design, with examples such as <speech><asr> for transcription tasks, <music><caption> for music, and <sound><classify> for environmental sounds, chosen to share common prefixes while specifying task granularity. However, the manuscript indeed lacks explicit ablations isolating the hierarchy and fine-grained breakdowns on label-variation subsets. We will add a controlled ablation on a 10% data subset comparing hierarchical tags versus flat tags or no tags, plus expanded tag construction details and per-task performance tables in the revised manuscript to better demonstrate the mechanism's contribution beyond scale. revision: partial

Circularity Check

0 steps flagged

No circularity; empirical performance claims rest on training and benchmarks

full rationale

The paper describes scaling audio-language pre-training across >30 tasks and introduces a hierarchical tag conditioning mechanism in the multi-task framework to mitigate label interference. These are presented as design choices whose effectiveness is asserted via reported benchmark results (no task-specific fine-tuning, surpassing counterparts). No equations, derivations, or 'predictions' appear that reduce by construction to fitted parameters or self-referential definitions. No self-citation chains are invoked to justify uniqueness or force the central result. The claims are self-contained empirical outcomes, not tautological reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the assumption that tag-based conditioning resolves label interference without performance loss; no free parameters or new entities are explicitly introduced in the abstract.

axioms (1)

domain assumption Diverse audio datasets produce textual labels that vary enough in focus, language, granularity, and structure to cause training interference when co-trained directly.
Explicitly stated in the abstract as the core problem addressed by the hierarchical tag design.

pith-pipeline@v0.9.0 · 5568 in / 1132 out tokens · 58496 ms · 2026-05-12T18:52:17.916733+00:00 · methodology

discussion (0)

Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models
cs.SD 2026-04 unverdicted novelty 8.0

HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-lang...
NAACA: Training-Free NeuroAuditory Attentive Cognitive Architecture with Oscillatory Working Memory for Salience-Driven Attention Gating
cs.SD 2026-05 unverdicted novelty 7.0

NAACA uses a neuro-inspired oscillatory working memory to gate attention in audio language models, raising AudioQwen's average precision from 53.5% to 70.6% on XD-Violence while cutting unnecessary calls.
Polyphonia: Zero-Shot Timbre Transfer in Polyphonic Music with Acoustic-Informed Attention Calibration
cs.SD 2026-05 unverdicted novelty 7.0

Polyphonia improves zero-shot stem-specific timbre transfer in polyphonic music by 15.5% target alignment via acoustic-informed attention calibration that uses probabilistic priors to set coarse boundaries.
MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes
cs.CL 2026-05 unverdicted novelty 7.0

MIST is a new synthetic speech-based tool-calling dataset for IoT devices that exposes performance gaps between open- and closed-weight multimodal LLMs.
VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing
cs.CL 2026-05 unverdicted novelty 7.0

VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conve...
AffectGPT-RL: Revealing Roles of Reinforcement Learning in Open-Vocabulary Emotion Recognition
cs.HC 2026-05 unverdicted novelty 7.0

AffectGPT-RL applies reinforcement learning to optimize non-differentiable emotion wheel metrics in open-vocabulary multimodal emotion recognition, yielding performance gains and state-of-the-art results on basic emot...
ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence
cs.SD 2026-04 unverdicted novelty 7.0

ONOTE is a multi-format benchmark that applies a deterministic pipeline to expose a disconnect between perceptual accuracy and music-theoretic comprehension in leading omnimodal AI models.
Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages
eess.AS 2026-04 unverdicted novelty 7.0

Introduces the Indic-CodecFake dataset for Indic codec deepfakes and SATYAM, a novel hyperbolic ALM that outperforms baselines through dual-stage semantic-prosodic fusion using Bhattacharya distance.
Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection
cs.CR 2026-04 unverdicted novelty 7.0

AudioHijack generates imperceptible adversarial audio via gradient estimation, attention supervision, and reverberation blending to hijack 13 LALMs with 79-96% success on unseen contexts and real commercial agents.
Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 7.0

RoleJudge is a multidimensional evaluation framework for speech-character alignment in audio LLMs, backed by the RoleChat dataset and multi-stage RL training with standard alignment to reduce reward issues.
HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models
eess.AS 2026-04 unverdicted novelty 7.0

HumDial-EIBench is a new benchmark using real human dialogues to evaluate audio language models on emotional intelligence tasks including multi-turn tracking, causal reasoning, empathy generation, and acoustic-semanti...
Speaker Attributed Automatic Speech Recognition Using Speech Aware LLMS
eess.AS 2026-04 unverdicted novelty 7.0

Adapting speech-aware LLMs with speaker cluster identification tags and concatenated multi-speaker data yields superior speaker-attributed ASR performance versus sequential diarization-plus-ASR pipelines.
FoleyDesigner: Immersive Stereo Foley Generation with Precise Spatio-Temporal Alignment for Film Clips
cs.CV 2026-04 unverdicted novelty 7.0

FoleyDesigner generates spatio-temporally aligned stereo Foley audio for film clips via multi-agent analysis, diffusion models on video cues, and LLM mixing, supported by the new FilmStereo dataset.
When Audio-Language Models Fail to Leverage Multimodal Context for Dysarthric Speech Recognition
cs.AI 2026-05 unverdicted novelty 6.0

Current audio-language models fail to use clinical multimodal context for dysarthric speech recognition, but context-aware LoRA fine-tuning delivers large accuracy gains on the SAP dataset.
Omni-Embed-Audio: Leveraging Multimodal LLMs for Robust Audio-Text Retrieval
cs.SD 2026-04 unverdicted novelty 6.0

Omni-Embed-Audio uses multimodal LLMs to match CLAP on standard audio retrieval while improving text-to-text retrieval by 22% relative and hard negative discrimination by 4.3 points HNSR@10 on user-intent queries.
SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding
cs.SD 2026-04 unverdicted novelty 6.0

SpotSound adds a hallucination-suppressing objective and a needle-in-haystack benchmark to audio-language models, reaching state-of-the-art temporal grounding while keeping general task performance.
Qwen3-Omni Technical Report
cs.CL 2025-09 unverdicted novelty 6.0

Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-mo...
TinyMU: A Compact Audio-Language Model for Music Understanding
cs.SD 2026-04 unverdicted novelty 5.0

TinyMU is a 229M-parameter compact music understanding model that achieves 82% of state-of-the-art large audio-language model performance on the MuChoMusic benchmark while being 35 times smaller.
Qwen3.5-Omni Technical Report
cs.CL 2026-04 unverdicted novelty 5.0

Qwen3.5-Omni scales an omnimodal model to hundreds of billions of parameters with 256k context, introduces ARIA for stable speech synthesis, and reports SOTA performance on 215 audio-visual benchmarks while adding mul...
Towards Fine-grained Temporal Perception: Post-Training Large Audio-Language Models with Audio-Side Time Prompt
cs.SD 2026-04 unverdicted novelty 5.0

TimePro-RL interleaves timestamp embeddings in audio sequences and applies RL post-SFT to boost temporal alignment in LALMs, yielding gains on grounding, event detection, and dense captioning.
Interactive ASR: Towards Human-Like Interaction and Semantic Coherence Evaluation for Agentic Speech Recognition
cs.CL 2026-04 unverdicted novelty 5.0

The authors introduce LLM-based semantic judgment and an agentic interaction loop that improves semantic fidelity and enables iterative corrections in automatic speech recognition beyond traditional WER.
Kimi-Audio Technical Report
eess.AS 2025-04 unverdicted novelty 5.0

Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million ho...
In-Sync: Adaptation of Speech Aware Large Language Models for ASR with Word Level Timestamp Predictions
eess.AS 2026-04 unverdicted novelty 4.0

Lightweight training strategies allow speech-aware LLMs to output accurate word timestamps alongside ASR transcripts while also improving recognition quality across datasets.
Whisper-AuT: Domain-Adapted Audio Encoder for Efficient Audio-LLM Training
cs.SD 2026-04 unverdicted novelty 4.0

Whisper-AuT is a domain-adapted audio encoder obtained by fine-tuning Whisper-large-v3 on mixed speech, environmental, and music data, yielding gains of +23% on ESC-50, +5% on GTZAN, and +0.7% on Speech Commands.
Qwen2-Audio Technical Report
eess.AS 2024-07 unverdicted novelty 4.0

Qwen2-Audio is an open-source audio-language model that outperforms prior systems such as Gemini-1.5-pro on audio-centric instruction-following benchmarks after simplified prompt-based pre-training and expanded data.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · cited by 25 Pith papers · 9 internal anchors

[1]

Spice: Semantic propositional image caption evaluation

Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. Spice: Semantic propositional image caption evaluation. InComputer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V

work page 2016
[2]

PaLM 2 Technical Report

Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. PaLM 2 technical report.arXiv:2305.10403,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

JunyiAo,RuiWang,LongZhou,ChengyiWang,ShuoRen,YuWu,ShujieLiu,TomKo,QingLi,YuZhang,etal

under review. JunyiAo,RuiWang,LongZhou,ChengyiWang,ShuoRen,YuWu,ShujieLiu,TomKo,QingLi,YuZhang,etal. Speecht5: Unified-modal encoder-decoder pre-training for spoken language processing.arXiv:2110.07205,

work page arXiv
[4]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023a. Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jin- grenZhou. Qwen-VL:Afrontierlargevision-languagemodelwithversatileabilitie...

work page internal anchor Pith review Pith/arXiv arXiv
[5]

AISHELL-1: an open-source mandarin speech corpus and a speech recognition baseline

Hui Bu, Jiayu Du, Xingyu Na, Bengu Wu, and Hao Zheng. AISHELL-1: an open-source mandarin speech corpus and a speech recognition baseline. In20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment, O-COCOSDA 2017, Seoul, South Korea, November 1-3,

work page 2017
[6]

Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

12 Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic.arXiv:2306.15195,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Speechnet: A universal modularized model for speech processing tasks.arXiv:2105.03070,

Yi-Chen Chen, Po-Han Chi, Shu-wen Yang, Kai-Wei Chang, Jheng-hao Lin, Sung-Feng Huang, Da-Rong Liu, Chi-Liang Liu, Cheng-Kuang Lee, and Hung-yi Lee. Speechnet: A universal modularized model for speech processing tasks.arXiv:2105.03070,

work page arXiv
[8]

PaLM: Scaling Language Modeling with Pathways

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham,HyungWonChung,CharlesSutton,SebastianGehrmann,etal. PaLM:Scalinglanguagemodeling with pathways.arXiv:2204.02311,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

High Fidelity Neural Audio Compression

Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. High fidelity neural audio compression. arXiv:2210.13438,

work page internal anchor Pith review arXiv
[10]

Clotho: an audio captioning dataset

Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. Clotho: an audio captioning dataset. In2020 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2020, Barcelona, Spain, May 4-8,

work page 2020
[11]

Aishell-2: Transform- ing mandarin asr research into industrial scale,

Jiayu Du, Xingyu Na, Xuechen Liu, and Hui Bu. AISHELL-2: transforming mandarin ASR research into industrial scale. abs/1808.10583,

work page arXiv
[12]

CLAP: learning audio concepts from natural language supervision

Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, and Huaming Wang. CLAP: learning audio concepts from natural language supervision. abs/2206.04769,

work page arXiv
[13]

Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Mohammad Norouzi, Douglas Eck, and Karen Simonyan

Jesse H. Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Mohammad Norouzi, Douglas Eck, and Karen Simonyan. Neural audio synthesis of musical notes with wavenet autoencoders. InProceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, Proceedings of Machine Learning Research. PMLR,

work page 2017
[14]

Funasr: A fundamental end-to-end speech recognition toolkit

Zhifu Gao, Zerui Li, Jiaming Wang, Haoneng Luo, Xian Shi, Mengzhe Chen, Yabin Li, Lingyun Zuo, Zhihao Du, Zhangyu Xiao, and Shiliang Zhang. Funasr: A fundamental end-to-end speech recognition toolkit. CoRR, abs/2305.11013,

work page arXiv
[15]

Vocalsound: Adatasetforimprovinghumanvocalsoundsrecognition

YuanGong,JinYu,andJamesR.Glass. Vocalsound: Adatasetforimprovinghumanvocalsoundsrecognition. InIEEEInternationalConferenceonAcoustics,SpeechandSignalProcessing,ICASSP2022,VirtualandSingapore, 23-27 May 2022, pages 151–155. IEEE,

work page 2022
[16]

Audioclip: Extending clip to image, text and audio

doi: 10.1109/ICASSP43922.2022.9746828. URLhttps://doi. org/10.1109/ICASSP43922.2022.9746828. Yuan Gong, Sameer Khurana, Leonid Karlinsky, and James R. Glass. Whisper-at: Noise-robust automatic speech recognizers are also strong general audio event taggers.CoRR, abs/2307.03183, 2023a. Yuan Gong, Hongyin Luo, Alexander H. Liu, Leonid Karlinsky, and James R....

work page doi:10.1109/icassp43922.2022.9746828 2022
[17]

LoRA: Low-Rank Adaptation of Large Language Models

EdwardJHu,YelongShen,PhillipWallis,ZeyuanAllen-Zhu,YuanzhiLi,SheanWang,LuWang,andWeizhu Chen. Lora: Low-rank adaptation of large language models.arXiv:2106.09685,

work page internal anchor Pith review Pith/arXiv arXiv
[18]

Audiogpt: Understanding and generating speech, music, sound, and talking head

13 Rongjie Huang, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jiawei Huang, Jinglin Liu, Yi Ren, Zhou Zhao, and Shinji Watanabe. Audiogpt: Understanding and generating speech, music, sound, and talking head.CoRR, abs/2304.12995,

work page arXiv
[19]

Cochlscene: Acquisition of acoustic scene data using crowdsourcing

Il-Young Jeong and Jeongsoo Park. Cochlscene: Acquisition of acoustic scene data using crowdsourcing. abs/2211.02289,

work page arXiv
[20]

Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. BLIP-2: bootstrapping language-image pre- training with frozen image encoders and large language models. InInternational Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, Proceedings of Machine Learning Research. PMLR,

work page 2023
[21]

Clotho-aqa: A crowdsourceddatasetforaudioquestionanswering

Samuel Lipping, Parthasaarathy Sudarsanam, Konstantinos Drossos, and Tuomas Virtanen. Clotho-aqa: A crowdsourceddatasetforaudioquestionanswering. In 30thEuropeanSignalProcessingConference,EUSIPCO 2022, Belgrade, Serbia, August 29 - Sept. 2,

work page 2022
[22]

arXiv preprint arXiv:2306.09093 , year=

Chenyang Lyu, Minghao Wu, Longyue Wang, Xinting Huang, Bingshuai Liu, Zefeng Du, Shuming Shi, and Zhaopeng Tu. Macaw-llm: Multi-modal language modeling with image, audio, video, and text integration. CoRR, abs/2306.09093,

work page arXiv
[23]

Voxtlm: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks

SoumiMaiti,YifanPeng,ShukjaeChoi,Jee-weonJung,XuankaiChang,andShinjiWatanabe. Voxtlm: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks. arXiv:2309.07937,

work page arXiv
[24]

Montreal forced aligner: Trainable text-speech alignment using kaldi

Michael McAuliffe, Michaela Socolof, Sarah Mihuc, Michael Wagner, and Morgan Sonderegger. Montreal forced aligner: Trainable text-speech alignment using kaldi. InInterspeech 2017, 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, August 20-24, 2017,

work page 2017
[25]

DCASE2017 challenge setup: Tasks, datasets and baseline system

Annamaria Mesaros, Toni Heittola, Aleksandr Diment, Benjamin Elizalde, Ankit Shah, Emmanuel Vincent, Bhiksha Raj, and Tuomas Virtanen. DCASE2017 challenge setup: Tasks, datasets and baseline system. In Proceedings of the Workshop on Detection and Classification of Acoustic Scenes and Events, DCASE 2017, Munich, Germany, November 16-17, 2017,

work page 2017
[26]

Librispeech: AnASRcorpusbasedon public domain audio books

VassilPanayotov,GuoguoChen,DanielPovey,andSanjeevKhudanpur. Librispeech: AnASRcorpusbasedon public domain audio books. In2015 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2015, South Brisbane, Queensland, Australia, April 19-24,

work page 2015
[27]

Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D

Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le. Specaugment: A simple data augmentation method for automatic speech recognition. InInterspeech 2019, 20th Annual Conference of the International Speech Communication Association, Graz, Austria, 15-19 September

work page 2019
[28]

Kosmos-2: Grounding Multimodal Large Language Models to the World

Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world.arXiv:2306.14824,

work page internal anchor Pith review Pith/arXiv arXiv
[29]

MELD: A multimodal multi-party dataset for emotion recognition in conversations

Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. MELD: A multimodal multi-party dataset for emotion recognition in conversations. InProceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers. Association fo...

work page 2019
[30]

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever

URL https://github.com/QwenLM/Qwen-7B. Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. InInternational Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA,

work page 2023
[31]

Paul K. Rubenstein, Chulayuth Asawaroengchai, Duc Dung Nguyen, Ankur Bapna, Zalán Borsos, Félix de Chaumont Quitry, Peter Chen, Dalia El Badawy, Wei Han, Eugene Kharitonov, Hannah Mucken- hirn, Dirk Padfield, James Qin, Danny Rozenberg, Tara N. Sainath, Johan Schalkwyk, Matthew Sharifi, MichelleTadmorRamanovich,MarcoTagliasacchi,AlexandruTudor,MihajloVeli...

work page internal anchor Pith review arXiv
[32]

Llasm: Large language and speech model.arXiv:2308.15930,

Yu Shu, Siwei Dong, Guangyao Chen, Wenhao Huang, Ruihua Zhang, Daochen Shi, Qiqi Xiang, and Yemin Shi. Llasm: Large language and speech model.arXiv:2308.15930,

work page arXiv
[33]

Generative pretraining in mul- timodality

Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative pretraining in multimodality.arXiv:2307.05222,

work page arXiv
[34]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models.arXiv:2302.13971, 2023a. Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste R...

work page internal anchor Pith review Pith/arXiv arXiv 2017
[36]

ChenWang,MinpengLiao,ZhongqiangHuang,JinliangLu,JunhongWu,YuchenLiu,ChengqingZong,and Jiajun Zhang

URLhttps://arxiv.org/abs/2007.10310. ChenWang,MinpengLiao,ZhongqiangHuang,JinliangLu,JunhongWu,YuchenLiu,ChengqingZong,and Jiajun Zhang. Blsp: Bootstrapping language-speech pre-training via behavior alignment of continuation writing. arXiv:2309.00916, 2023a. Mingqiu Wang, Wei Han, Izhak Shafran, Zelin Wu, Chung-Cheng Chiu, Yuan Cao, Yongqiang Wang, Nanxin...

work page arXiv 2007
[37]

Speechgpt: Em- powering large language models with intrinsic cross-modal conversational abilities.CoRR, abs/2305.11000, 2023a

Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, and Xipeng Qiu. Speechgpt: Em- powering large language models with intrinsic cross-modal conversational abilities.CoRR, abs/2305.11000, 2023a. XinZhang,DongZhang,ShiminLi,YaqianZhou,andXipengQiu. Speechtokenizer: Unifiedspeechtokenizer for speech large language models.CoRR, abs/2308.166...

work page arXiv
[38]

Whisper-large-v2 Qwen-audio 1st-stage LLM init

Table 6: Training hyperparameters of Qwen-Audio Configuration Multi-task Pre-training Supervised Fine-tuning Audio encoder init. Whisper-large-v2 Qwen-audio 1st-stage LLM init. Qwen-7B Qwen-7B SpecAugment Policy LibriSpeech Basic LibriSpeech Basic Optimizer AdamW AdamW Optimizer hyperparameter β1=0.9, β2=0.98, eps = 1e−6 Peak learning rate 5e−5 1e−5 Minim...

work page 2000