Jamendo-MT-QA is a new dataset and benchmark for multi-track comparative music question answering, constructed via an LLM-assisted pipeline from Creative Commons Jamendo tracks and used to evaluate audio-language models.
GAMA: A large audio- language model with advanced audio understanding and complex rea- soning abilities
7 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
representative citing papers
AV-SpeakerBench is a new speaker-centered benchmark showing that top multimodal models still struggle with fine-grained audiovisual speech understanding, with Gemini 2.5 Pro leading but open models lagging on fusion.
HyPeR is a hybrid perception-reasoning framework that uses a new hierarchical PAQA dataset and PAUSE tokens to improve large audio language models' handling of multi-speaker and ambiguous audio.
Temporal Contrastive Decoding mitigates temporal smoothing bias in unified large audio-language models by contrasting logits from original and blurred audio inputs during decoding, yielding consistent gains on MMAU and AIR-Bench.
LLaDA-V is a diffusion-based multimodal large language model that reaches competitive or state-of-the-art results on visual instruction tasks while using a non-autoregressive architecture.
Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million hours of speech, sound, and music data.
Frame-aligned fusion of Canary and WavLM encoders, with WavLM temporally prepared via learnable strided convolution, outperforms other fusion strategies and reaches Eval RMSE 24.96 and Corr 0.796 on non-intrusive intelligibility prediction.
citing papers explorer
-
Jamendo-MT-QA: A Benchmark for Multi-Track Comparative Music Question Answering
Jamendo-MT-QA is a new dataset and benchmark for multi-track comparative music question answering, constructed via an LLM-assisted pipeline from Creative Commons Jamendo tracks and used to evaluate audio-language models.
-
See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models
AV-SpeakerBench is a new speaker-centered benchmark showing that top multimodal models still struggle with fine-grained audiovisual speech understanding, with Gemini 2.5 Pro leading but open models lagging on fusion.
-
Listen, Pause, and Reason: Toward Perception-Grounded Hybrid Reasoning for Audio Understanding
HyPeR is a hybrid perception-reasoning framework that uses a new hierarchical PAQA dataset and PAUSE tokens to improve large audio language models' handling of multi-speaker and ambiguous audio.
-
Temporal Contrastive Decoding: A Training-Free Method for Large Audio-Language Models
Temporal Contrastive Decoding mitigates temporal smoothing bias in unified large audio-language models by contrasting logits from original and blurred audio inputs during decoding, yielding consistent gains on MMAU and AIR-Bench.
-
LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
LLaDA-V is a diffusion-based multimodal large language model that reaches competitive or state-of-the-art results on visual instruction tasks while using a non-autoregressive architecture.
-
Kimi-Audio Technical Report
Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million hours of speech, sound, and music data.
-
Frame-Aligned Fusion of Canary and WavLM for Non-Intrusive Intelligibility Prediction of Hearing-Aid-Processed Speech
Frame-aligned fusion of Canary and WavLM encoders, with WavLM temporally prepared via learnable strided convolution, outperforms other fusion strategies and reaches Eval RMSE 24.96 and Corr 0.796 on non-intrusive intelligibility prediction.