GAMA: A large audio-language model with advanced audio understanding and complex reasoning abilities

2024 · arXiv 2406.11768

7 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 1 · dataset: 1

citation-polarity summary

background: 1 · use dataset: 1

representative citing papers

Jamendo-MT-QA: A Benchmark for Multi-Track Comparative Music Question Answering

cs.IR · 2026-04-08 · unverdicted · novelty 7.0

Jamendo-MT-QA is a new dataset and benchmark for multi-track comparative music question answering, constructed via an LLM-assisted pipeline from Creative Commons Jamendo tracks and used to evaluate audio-language models.

See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models

cs.CV · 2025-12-01 · unverdicted · novelty 7.0

AV-SpeakerBench is a new speaker-centered benchmark showing that top multimodal models still struggle with fine-grained audiovisual speech understanding, with Gemini 2.5 Pro leading but open models lagging on fusion.

Listen, Pause, and Reason: Toward Perception-Grounded Hybrid Reasoning for Audio Understanding

cs.SD · 2026-04-16 · unverdicted · novelty 6.0

HyPeR is a hybrid perception-reasoning framework that uses a new hierarchical PAQA dataset and PAUSE tokens to improve large audio language models' handling of multi-speaker and ambiguous audio.

Temporal Contrastive Decoding: A Training-Free Method for Large Audio-Language Models

cs.SD · 2026-04-16 · unverdicted · novelty 6.0

Temporal Contrastive Decoding mitigates temporal smoothing bias in unified large audio-language models by contrasting logits from original and blurred audio inputs during decoding, yielding consistent gains on MMAU and AIR-Bench.

LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning

cs.LG · 2025-05-22 · conditional · novelty 6.0

LLaDA-V is a diffusion-based multimodal large language model that reaches competitive or state-of-the-art results on visual instruction tasks while using a non-autoregressive architecture.

Kimi-Audio Technical Report

eess.AS · 2025-04-25 · unverdicted · novelty 5.0

Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million hours of speech, sound, and music data.

Frame-Aligned Fusion of Canary and WavLM for Non-Intrusive Intelligibility Prediction of Hearing-Aid-Processed Speech

eess.AS · 2026-05-22 · unverdicted · novelty 4.0

Frame-aligned fusion of Canary and WavLM encoders, with WavLM temporally prepared via learnable strided convolution, outperforms other fusion strategies and reaches Eval RMSE 24.96 and Corr 0.796 on non-intrusive intelligibility prediction.

citing papers explorer

Jamendo-MT-QA: A Benchmark for Multi-Track Comparative Music Question Answering
cs.IR · 2026-04-08 · unverdicted · none · ref 2
See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models
cs.CV · 2025-12-01 · unverdicted · none · ref 18
Listen, Pause, and Reason: Toward Perception-Grounded Hybrid Reasoning for Audio Understanding
cs.SD · 2026-04-16 · unverdicted · none · ref 2
Temporal Contrastive Decoding: A Training-Free Method for Large Audio-Language Models
cs.SD · 2026-04-16 · unverdicted · none · ref 1
LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
cs.LG · 2025-05-22 · conditional · none · ref 9
Kimi-Audio Technical Report
eess.AS · 2025-04-25 · unverdicted · none · ref 22
Frame-Aligned Fusion of Canary and WavLM for Non-Intrusive Intelligibility Prediction of Hearing-Aid-Processed Speech
eess.AS · 2026-05-22 · unverdicted · none · ref 26
