Recognition: 2 theorem links
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
Pith reviewed 2026-05-12 18:52 UTC · model grok-4.3
The pith
Scaling audio-language pre-training to over 30 tasks and audio types with hierarchical tag conditioning produces a single model that performs strongly on diverse benchmarks without any task-specific fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Qwen-Audio is trained by scaling audio-language pre-training over more than 30 tasks and multiple audio categories. Direct joint training creates interference because textual labels differ in focus, language, granularity, and structure across datasets. The authors address this with a conditioning mechanism that supplies a sequence of hierarchical tags to the decoder: shared tags encourage cross-task knowledge sharing while task-specific tags isolate conflicting label characteristics. The trained model then delivers strong results on a wide range of audio understanding benchmarks without requiring any per-task fine-tuning and outperforms prior counterparts.
What carries the argument
The multi-task training framework that conditions the decoder on a sequence of hierarchical tags, where shared tags promote knowledge sharing and specified tags isolate dataset-specific label variations to prevent interference.
If this is right
- The same model supports direct multi-turn dialogue when paired with text inputs, enabling Qwen-Audio-Chat for audio-centric scenarios.
- Universal audio understanding becomes feasible for mixed inputs spanning speech, environmental sounds, music, and songs within one system.
- Task-specific fine-tuning steps can be skipped for many standard audio benchmarks while still matching or exceeding specialized models.
- The training recipe scales to additional audio types or tasks without redesigning separate heads or loss functions for each.
Where Pith is reading between the lines
- The tag-conditioning pattern may transfer to other multi-modal settings where label formats vary across datasets, such as vision-language or video tasks.
- If the interference-avoidance mechanism holds, future unified models could incorporate streaming or real-time audio without retraining separate pipelines.
- Performance gains on held-out benchmarks would suggest that the hierarchical tags act as a lightweight form of task routing inside a single decoder.
Load-bearing premise
The hierarchical tag conditioning is sufficient to prevent performance interference from the large variations in textual labels across the collected datasets.
What would settle it
A controlled ablation in which the same model is trained on the identical data mixture but without the hierarchical tags, then evaluated on the same benchmarks; if individual-task scores drop markedly or fall below fine-tuned baselines, the central claim would be falsified.
read the original abstract
Recently, instruction-following audio-language models have received broad attention for audio interaction with humans. However, the absence of pre-trained audio models capable of handling diverse audio types and tasks has hindered progress in this field. Consequently, most existing works have only been able to support a limited range of interaction capabilities. In this paper, we develop the Qwen-Audio model and address this limitation by scaling up audio-language pre-training to cover over 30 tasks and various audio types, such as human speech, natural sounds, music, and songs, to facilitate universal audio understanding abilities. However, directly co-training all tasks and datasets can lead to interference issues, as the textual labels associated with different datasets exhibit considerable variations due to differences in task focus, language, granularity of annotation, and text structure. To overcome the one-to-many interference, we carefully design a multi-task training framework by conditioning on a sequence of hierarchical tags to the decoder for encouraging knowledge sharing and avoiding interference through shared and specified tags respectively. Remarkably, Qwen-Audio achieves impressive performance across diverse benchmark tasks without requiring any task-specific fine-tuning, surpassing its counterparts. Building upon the capabilities of Qwen-Audio, we further develop Qwen-Audio-Chat, which allows for input from various audios and text inputs, enabling multi-turn dialogues and supporting various audio-central scenarios.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Qwen-Audio, a unified large-scale audio-language model pre-trained on diverse audio types (speech, natural sounds, music, songs) covering over 30 tasks. It proposes a multi-task training framework that conditions the decoder on a sequence of hierarchical tags to encourage knowledge sharing via shared tags while mitigating interference from heterogeneous textual labels via specified tags. The central claim is that this enables strong zero-shot performance across benchmarks without any task-specific fine-tuning, surpassing prior models; the work also presents Qwen-Audio-Chat for multi-turn audio-text dialogues.
Significance. If the zero-shot performance claims are substantiated, the work would advance universal audio understanding by scaling multi-task pre-training across heterogeneous datasets and audio modalities, addressing a key barrier in instruction-following audio models. It receives credit for the broad task and data coverage as well as the extension to an interactive chat model.
major comments (2)
- [Abstract] Abstract: The assertion of 'impressive performance across diverse benchmark tasks without requiring any task-specific fine-tuning, surpassing its counterparts' is presented without any quantitative metrics, specific benchmark names, baseline comparisons, or result tables. This leaves the central empirical claim unsupported by evidence in the manuscript summary.
- [Multi-task training framework] Multi-task training framework: The hierarchical tag conditioning is described as the solution to one-to-many interference arising from label variations across datasets. However, the manuscript supplies no ablation removing the tag hierarchy, no per-task performance breakdowns on label-variation subsets, and no details on tag construction for speech/music/sound data. Without these controls, it is impossible to determine whether reported gains stem from the proposed mechanism or from data scale alone.
minor comments (1)
- [Abstract] The abstract refers to 'over 30 tasks' without enumerating them or providing examples of the task categories and datasets used.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address each major comment below and describe the revisions we will make to improve clarity and empirical support.
read point-by-point responses
-
Referee: [Abstract] Abstract: The assertion of 'impressive performance across diverse benchmark tasks without requiring any task-specific fine-tuning, surpassing its counterparts' is presented without any quantitative metrics, specific benchmark names, baseline comparisons, or result tables. This leaves the central empirical claim unsupported by evidence in the manuscript summary.
Authors: We agree that the abstract would benefit from more concrete quantitative support for the central claim. The full manuscript reports detailed zero-shot results in Section 4 (Tables 1-3), including specific benchmarks such as LibriSpeech ASR, AudioCaps captioning, and VGGSound classification, with Qwen-Audio outperforming baselines like Whisper-large and AudioPaLM by 5-15% relative on several tasks. We will revise the abstract to include key metrics, benchmark names, and brief baseline comparisons while preserving its concise nature. revision: yes
-
Referee: [Multi-task training framework] Multi-task training framework: The hierarchical tag conditioning is described as the solution to one-to-many interference arising from label variations across datasets. However, the manuscript supplies no ablation removing the tag hierarchy, no per-task performance breakdowns on label-variation subsets, and no details on tag construction for speech/music/sound data. Without these controls, it is impossible to determine whether reported gains stem from the proposed mechanism or from data scale alone.
Authors: We thank the referee for this observation. Section 3.2 describes the hierarchical tag design, with examples such as <speech><asr> for transcription tasks, <music><caption> for music, and <sound><classify> for environmental sounds, chosen to share common prefixes while specifying task granularity. However, the manuscript indeed lacks explicit ablations isolating the hierarchy and fine-grained breakdowns on label-variation subsets. We will add a controlled ablation on a 10% data subset comparing hierarchical tags versus flat tags or no tags, plus expanded tag construction details and per-task performance tables in the revised manuscript to better demonstrate the mechanism's contribution beyond scale. revision: partial
Circularity Check
No circularity; empirical performance claims rest on training and benchmarks
full rationale
The paper describes scaling audio-language pre-training across >30 tasks and introduces a hierarchical tag conditioning mechanism in the multi-task framework to mitigate label interference. These are presented as design choices whose effectiveness is asserted via reported benchmark results (no task-specific fine-tuning, surpassing counterparts). No equations, derivations, or 'predictions' appear that reduce by construction to fitted parameters or self-referential definitions. No self-citation chains are invoked to justify uniqueness or force the central result. The claims are self-contained empirical outcomes, not tautological reductions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Diverse audio datasets produce textual labels that vary enough in focus, language, granularity, and structure to cause training interference when co-trained directly.
Forward citations
Cited by 25 Pith papers
-
HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models
HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-lang...
-
NAACA: Training-Free NeuroAuditory Attentive Cognitive Architecture with Oscillatory Working Memory for Salience-Driven Attention Gating
NAACA uses a neuro-inspired oscillatory working memory to gate attention in audio language models, raising AudioQwen's average precision from 53.5% to 70.6% on XD-Violence while cutting unnecessary calls.
-
Polyphonia: Zero-Shot Timbre Transfer in Polyphonic Music with Acoustic-Informed Attention Calibration
Polyphonia improves zero-shot stem-specific timbre transfer in polyphonic music by 15.5% target alignment via acoustic-informed attention calibration that uses probabilistic priors to set coarse boundaries.
-
MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes
MIST is a new synthetic speech-based tool-calling dataset for IoT devices that exposes performance gaps between open- and closed-weight multimodal LLMs.
-
VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing
VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conve...
-
AffectGPT-RL: Revealing Roles of Reinforcement Learning in Open-Vocabulary Emotion Recognition
AffectGPT-RL applies reinforcement learning to optimize non-differentiable emotion wheel metrics in open-vocabulary multimodal emotion recognition, yielding performance gains and state-of-the-art results on basic emot...
-
ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence
ONOTE is a multi-format benchmark that applies a deterministic pipeline to expose a disconnect between perceptual accuracy and music-theoretic comprehension in leading omnimodal AI models.
-
Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages
Introduces the Indic-CodecFake dataset for Indic codec deepfakes and SATYAM, a novel hyperbolic ALM that outperforms baselines through dual-stage semantic-prosodic fusion using Bhattacharya distance.
-
Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection
AudioHijack generates imperceptible adversarial audio via gradient estimation, attention supervision, and reverberation blending to hijack 13 LALMs with 79-96% success on unseen contexts and real commercial agents.
-
Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning
RoleJudge is a multidimensional evaluation framework for speech-character alignment in audio LLMs, backed by the RoleChat dataset and multi-stage RL training with standard alignment to reduce reward issues.
-
HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models
HumDial-EIBench is a new benchmark using real human dialogues to evaluate audio language models on emotional intelligence tasks including multi-turn tracking, causal reasoning, empathy generation, and acoustic-semanti...
-
Speaker Attributed Automatic Speech Recognition Using Speech Aware LLMS
Adapting speech-aware LLMs with speaker cluster identification tags and concatenated multi-speaker data yields superior speaker-attributed ASR performance versus sequential diarization-plus-ASR pipelines.
-
FoleyDesigner: Immersive Stereo Foley Generation with Precise Spatio-Temporal Alignment for Film Clips
FoleyDesigner generates spatio-temporally aligned stereo Foley audio for film clips via multi-agent analysis, diffusion models on video cues, and LLM mixing, supported by the new FilmStereo dataset.
-
When Audio-Language Models Fail to Leverage Multimodal Context for Dysarthric Speech Recognition
Current audio-language models fail to use clinical multimodal context for dysarthric speech recognition, but context-aware LoRA fine-tuning delivers large accuracy gains on the SAP dataset.
-
Omni-Embed-Audio: Leveraging Multimodal LLMs for Robust Audio-Text Retrieval
Omni-Embed-Audio uses multimodal LLMs to match CLAP on standard audio retrieval while improving text-to-text retrieval by 22% relative and hard negative discrimination by 4.3 points HNSR@10 on user-intent queries.
-
SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding
SpotSound adds a hallucination-suppressing objective and a needle-in-haystack benchmark to audio-language models, reaching state-of-the-art temporal grounding while keeping general task performance.
-
Qwen3-Omni Technical Report
Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-mo...
-
TinyMU: A Compact Audio-Language Model for Music Understanding
TinyMU is a 229M-parameter compact music understanding model that achieves 82% of state-of-the-art large audio-language model performance on the MuChoMusic benchmark while being 35 times smaller.
-
Qwen3.5-Omni Technical Report
Qwen3.5-Omni scales an omnimodal model to hundreds of billions of parameters with 256k context, introduces ARIA for stable speech synthesis, and reports SOTA performance on 215 audio-visual benchmarks while adding mul...
-
Towards Fine-grained Temporal Perception: Post-Training Large Audio-Language Models with Audio-Side Time Prompt
TimePro-RL interleaves timestamp embeddings in audio sequences and applies RL post-SFT to boost temporal alignment in LALMs, yielding gains on grounding, event detection, and dense captioning.
-
Interactive ASR: Towards Human-Like Interaction and Semantic Coherence Evaluation for Agentic Speech Recognition
The authors introduce LLM-based semantic judgment and an agentic interaction loop that improves semantic fidelity and enables iterative corrections in automatic speech recognition beyond traditional WER.
-
Kimi-Audio Technical Report
Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million ho...
-
In-Sync: Adaptation of Speech Aware Large Language Models for ASR with Word Level Timestamp Predictions
Lightweight training strategies allow speech-aware LLMs to output accurate word timestamps alongside ASR transcripts while also improving recognition quality across datasets.
-
Whisper-AuT: Domain-Adapted Audio Encoder for Efficient Audio-LLM Training
Whisper-AuT is a domain-adapted audio encoder obtained by fine-tuning Whisper-large-v3 on mixed speech, environmental, and music data, yielding gains of +23% on ESC-50, +5% on GTZAN, and +0.7% on Speech Commands.
-
Qwen2-Audio Technical Report
Qwen2-Audio is an open-source audio-language model that outperforms prior systems such as Gemini-1.5-pro on audio-centric instruction-following benchmarks after simplified prompt-based pre-training and expanded data.
Reference graph
Works this paper leans on
-
[1]
Spice: Semantic propositional image caption evaluation
Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. Spice: Semantic propositional image caption evaluation. InComputer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V
work page 2016
-
[2]
Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. PaLM 2 technical report.arXiv:2305.10403,
work page internal anchor Pith review Pith/arXiv arXiv
-
[3]
JunyiAo,RuiWang,LongZhou,ChengyiWang,ShuoRen,YuWu,ShujieLiu,TomKo,QingLi,YuZhang,etal
under review. JunyiAo,RuiWang,LongZhou,ChengyiWang,ShuoRen,YuWu,ShujieLiu,TomKo,QingLi,YuZhang,etal. Speecht5: Unified-modal encoder-decoder pre-training for spoken language processing.arXiv:2110.07205,
-
[4]
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023a. Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jin- grenZhou. Qwen-VL:Afrontierlargevision-languagemodelwithversatileabilitie...
work page internal anchor Pith review Pith/arXiv arXiv
-
[5]
AISHELL-1: an open-source mandarin speech corpus and a speech recognition baseline
Hui Bu, Jiayu Du, Xingyu Na, Bengu Wu, and Hao Zheng. AISHELL-1: an open-source mandarin speech corpus and a speech recognition baseline. In20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment, O-COCOSDA 2017, Seoul, South Korea, November 1-3,
work page 2017
-
[6]
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
12 Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic.arXiv:2306.15195,
work page internal anchor Pith review Pith/arXiv arXiv
-
[7]
Speechnet: A universal modularized model for speech processing tasks.arXiv:2105.03070,
Yi-Chen Chen, Po-Han Chi, Shu-wen Yang, Kai-Wei Chang, Jheng-hao Lin, Sung-Feng Huang, Da-Rong Liu, Chi-Liang Liu, Cheng-Kuang Lee, and Hung-yi Lee. Speechnet: A universal modularized model for speech processing tasks.arXiv:2105.03070,
-
[8]
PaLM: Scaling Language Modeling with Pathways
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham,HyungWonChung,CharlesSutton,SebastianGehrmann,etal. PaLM:Scalinglanguagemodeling with pathways.arXiv:2204.02311,
work page internal anchor Pith review Pith/arXiv arXiv
-
[9]
High Fidelity Neural Audio Compression
Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. High fidelity neural audio compression. arXiv:2210.13438,
work page internal anchor Pith review arXiv
-
[10]
Clotho: an audio captioning dataset
Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. Clotho: an audio captioning dataset. In2020 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2020, Barcelona, Spain, May 4-8,
work page 2020
-
[11]
Aishell-2: Transform- ing mandarin asr research into industrial scale,
Jiayu Du, Xingyu Na, Xuechen Liu, and Hui Bu. AISHELL-2: transforming mandarin ASR research into industrial scale. abs/1808.10583,
-
[12]
CLAP: learning audio concepts from natural language supervision
Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, and Huaming Wang. CLAP: learning audio concepts from natural language supervision. abs/2206.04769,
-
[13]
Jesse H. Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Mohammad Norouzi, Douglas Eck, and Karen Simonyan. Neural audio synthesis of musical notes with wavenet autoencoders. InProceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, Proceedings of Machine Learning Research. PMLR,
work page 2017
-
[14]
Funasr: A fundamental end-to-end speech recognition toolkit
Zhifu Gao, Zerui Li, Jiaming Wang, Haoneng Luo, Xian Shi, Mengzhe Chen, Yabin Li, Lingyun Zuo, Zhihao Du, Zhangyu Xiao, and Shiliang Zhang. Funasr: A fundamental end-to-end speech recognition toolkit. CoRR, abs/2305.11013,
-
[15]
Vocalsound: Adatasetforimprovinghumanvocalsoundsrecognition
YuanGong,JinYu,andJamesR.Glass. Vocalsound: Adatasetforimprovinghumanvocalsoundsrecognition. InIEEEInternationalConferenceonAcoustics,SpeechandSignalProcessing,ICASSP2022,VirtualandSingapore, 23-27 May 2022, pages 151–155. IEEE,
work page 2022
-
[16]
Audioclip: Extending clip to image, text and audio
doi: 10.1109/ICASSP43922.2022.9746828. URLhttps://doi. org/10.1109/ICASSP43922.2022.9746828. Yuan Gong, Sameer Khurana, Leonid Karlinsky, and James R. Glass. Whisper-at: Noise-robust automatic speech recognizers are also strong general audio event taggers.CoRR, abs/2307.03183, 2023a. Yuan Gong, Hongyin Luo, Alexander H. Liu, Leonid Karlinsky, and James R....
-
[17]
LoRA: Low-Rank Adaptation of Large Language Models
EdwardJHu,YelongShen,PhillipWallis,ZeyuanAllen-Zhu,YuanzhiLi,SheanWang,LuWang,andWeizhu Chen. Lora: Low-rank adaptation of large language models.arXiv:2106.09685,
work page internal anchor Pith review Pith/arXiv arXiv
-
[18]
Audiogpt: Understanding and generating speech, music, sound, and talking head
13 Rongjie Huang, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jiawei Huang, Jinglin Liu, Yi Ren, Zhou Zhao, and Shinji Watanabe. Audiogpt: Understanding and generating speech, music, sound, and talking head.CoRR, abs/2304.12995,
-
[19]
Cochlscene: Acquisition of acoustic scene data using crowdsourcing
Il-Young Jeong and Jeongsoo Park. Cochlscene: Acquisition of acoustic scene data using crowdsourcing. abs/2211.02289,
-
[20]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. BLIP-2: bootstrapping language-image pre- training with frozen image encoders and large language models. InInternational Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, Proceedings of Machine Learning Research. PMLR,
work page 2023
-
[21]
Clotho-aqa: A crowdsourceddatasetforaudioquestionanswering
Samuel Lipping, Parthasaarathy Sudarsanam, Konstantinos Drossos, and Tuomas Virtanen. Clotho-aqa: A crowdsourceddatasetforaudioquestionanswering. In 30thEuropeanSignalProcessingConference,EUSIPCO 2022, Belgrade, Serbia, August 29 - Sept. 2,
work page 2022
-
[22]
arXiv preprint arXiv:2306.09093 , year=
Chenyang Lyu, Minghao Wu, Longyue Wang, Xinting Huang, Bingshuai Liu, Zefeng Du, Shuming Shi, and Zhaopeng Tu. Macaw-llm: Multi-modal language modeling with image, audio, video, and text integration. CoRR, abs/2306.09093,
-
[23]
SoumiMaiti,YifanPeng,ShukjaeChoi,Jee-weonJung,XuankaiChang,andShinjiWatanabe. Voxtlm: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks. arXiv:2309.07937,
-
[24]
Montreal forced aligner: Trainable text-speech alignment using kaldi
Michael McAuliffe, Michaela Socolof, Sarah Mihuc, Michael Wagner, and Morgan Sonderegger. Montreal forced aligner: Trainable text-speech alignment using kaldi. InInterspeech 2017, 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, August 20-24, 2017,
work page 2017
-
[25]
DCASE2017 challenge setup: Tasks, datasets and baseline system
Annamaria Mesaros, Toni Heittola, Aleksandr Diment, Benjamin Elizalde, Ankit Shah, Emmanuel Vincent, Bhiksha Raj, and Tuomas Virtanen. DCASE2017 challenge setup: Tasks, datasets and baseline system. In Proceedings of the Workshop on Detection and Classification of Acoustic Scenes and Events, DCASE 2017, Munich, Germany, November 16-17, 2017,
work page 2017
-
[26]
Librispeech: AnASRcorpusbasedon public domain audio books
VassilPanayotov,GuoguoChen,DanielPovey,andSanjeevKhudanpur. Librispeech: AnASRcorpusbasedon public domain audio books. In2015 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2015, South Brisbane, Queensland, Australia, April 19-24,
work page 2015
-
[27]
Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D
Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le. Specaugment: A simple data augmentation method for automatic speech recognition. InInterspeech 2019, 20th Annual Conference of the International Speech Communication Association, Graz, Austria, 15-19 September
work page 2019
-
[28]
Kosmos-2: Grounding Multimodal Large Language Models to the World
Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world.arXiv:2306.14824,
work page internal anchor Pith review Pith/arXiv arXiv
-
[29]
MELD: A multimodal multi-party dataset for emotion recognition in conversations
Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. MELD: A multimodal multi-party dataset for emotion recognition in conversations. InProceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers. Association fo...
work page 2019
-
[30]
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever
URL https://github.com/QwenLM/Qwen-7B. Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. InInternational Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA,
work page 2023
-
[31]
Paul K. Rubenstein, Chulayuth Asawaroengchai, Duc Dung Nguyen, Ankur Bapna, Zalán Borsos, Félix de Chaumont Quitry, Peter Chen, Dalia El Badawy, Wei Han, Eugene Kharitonov, Hannah Mucken- hirn, Dirk Padfield, James Qin, Danny Rozenberg, Tara N. Sainath, Johan Schalkwyk, Matthew Sharifi, MichelleTadmorRamanovich,MarcoTagliasacchi,AlexandruTudor,MihajloVeli...
work page internal anchor Pith review arXiv
-
[32]
Llasm: Large language and speech model.arXiv:2308.15930,
Yu Shu, Siwei Dong, Guangyao Chen, Wenhao Huang, Ruihua Zhang, Daochen Shi, Qiqi Xiang, and Yemin Shi. Llasm: Large language and speech model.arXiv:2308.15930,
-
[33]
Generative pretraining in mul- timodality
Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative pretraining in multimodality.arXiv:2307.05222,
-
[34]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models.arXiv:2302.13971, 2023a. Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste R...
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[36]
ChenWang,MinpengLiao,ZhongqiangHuang,JinliangLu,JunhongWu,YuchenLiu,ChengqingZong,and Jiajun Zhang
URLhttps://arxiv.org/abs/2007.10310. ChenWang,MinpengLiao,ZhongqiangHuang,JinliangLu,JunhongWu,YuchenLiu,ChengqingZong,and Jiajun Zhang. Blsp: Bootstrapping language-speech pre-training via behavior alignment of continuation writing. arXiv:2309.00916, 2023a. Mingqiu Wang, Wei Han, Izhak Shafran, Zelin Wu, Chung-Cheng Chiu, Yuan Cao, Yongqiang Wang, Nanxin...
-
[37]
Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, and Xipeng Qiu. Speechgpt: Em- powering large language models with intrinsic cross-modal conversational abilities.CoRR, abs/2305.11000, 2023a. XinZhang,DongZhang,ShiminLi,YaqianZhou,andXipengQiu. Speechtokenizer: Unifiedspeechtokenizer for speech large language models.CoRR, abs/2308.166...
-
[38]
Whisper-large-v2 Qwen-audio 1st-stage LLM init
Table 6: Training hyperparameters of Qwen-Audio Configuration Multi-task Pre-training Supervised Fine-tuning Audio encoder init. Whisper-large-v2 Qwen-audio 1st-stage LLM init. Qwen-7B Qwen-7B SpecAugment Policy LibriSpeech Basic LibriSpeech Basic Optimizer AdamW AdamW Optimizer hyperparameter β1=0.9, β2=0.98, eps = 1e−6 Peak learning rate 5e−5 1e−5 Minim...
work page 2000
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.