MMAU-Pro: A challenging and comprehensive benchmark for holistic evaluation of audio general intelligence

2025 · arXiv 2508.13992

9 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

VoxSafeBench: Not Just What Is Said, but Who, How, and Where

cs.SD · 2026-04-16 · unverdicted · novelty 8.0

VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.

MedMosaic: A Challenging Large Scale Benchmark of Diverse Medical Audio

cs.SD · 2026-05-01 · unverdicted · novelty 7.0

MedMosaic is a large-scale medical audio QA benchmark that shows even state-of-the-art models like Gemini-2.5-pro reach only about 68% accuracy on diverse clinical audio scenarios.

Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models

eess.AS · 2026-04-28 · unverdicted · novelty 7.0

Semantic-level and verification-based uncertainty methods outperform token-level baselines for audio reasoning in ALLMs, but their relative performance on hallucination and unanswerable-question benchmarks is model- and task-dependent.

All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation

cs.SD · 2026-04-27 · unverdicted · novelty 6.0

Audio-language models retain 60-72% of benchmark scores without audio, and most audio-dependent items can be solved from short fragments rather than full clips.

HeadRouter: Dynamic Head-Weight Routing for Task-Adaptive Audio Token Pruning in Large Audio Language Models

cs.SD · 2026-04-26 · unverdicted · novelty 6.0

HeadRouter prunes audio tokens more effectively by dynamically routing based on per-head importance for semantic versus acoustic tasks, exceeding baseline performance at 70% token retention on Qwen2.5-Omni models.

Generating Synthetic Doctor-Patient Conversations for Long-form Audio Summarization

cs.SD · 2026-04-07 · conditional · novelty 6.0

A three-stage synthetic data pipeline generates 8800 doctor-patient conversations totaling 1.3k hours of audio and LLM-produced SOAP notes, with evaluation showing cascaded transcription-then-summarization models outperform end-to-end audio models.

Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models

eess.AS · 2026-04-14 · unverdicted · novelty 5.0

Audio-Cogito is an open-source LALM using Cogito-pipe data curation and self-distillation to achieve leading open-source performance on audio reasoning benchmarks.

OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering

cs.CV · 2026-04-09 · unverdicted · novelty 5.0

OmniJigsaw is a self-supervised proxy task that reconstructs shuffled audio-visual clips via joint integration, sample-level selection, and clip-level masking strategies, yielding gains on 15 video, audio, and reasoning benchmarks.

Benchmarking LLMs on the Massive Sound Embedding Benchmark (MSEB)

cs.SD · 2026-05-06 · unverdicted · novelty 3.0

LLMs exhibit a persistent modality gap versus specialized audio encoders on MSEB tasks, with no conclusive evidence favoring audio-native over cascaded architectures.

citing papers explorer


VoxSafeBench: Not Just What Is Said, but Who, How, and Where · cs.SD · 2026-04-16 · unverdicted · none · ref 2
MedMosaic: A Challenging Large Scale Benchmark of Diverse Medical Audio · cs.SD · 2026-05-01 · unverdicted · none · ref 1
Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models · eess.AS · 2026-04-28 · unverdicted · none · ref 28
All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation · cs.SD · 2026-04-27 · unverdicted · none · ref 32
HeadRouter: Dynamic Head-Weight Routing for Task-Adaptive Audio Token Pruning in Large Audio Language Models · cs.SD · 2026-04-26 · unverdicted · none · ref 12
Generating Synthetic Doctor-Patient Conversations for Long-form Audio Summarization · cs.SD · 2026-04-07 · conditional · none · ref 16
Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models · eess.AS · 2026-04-14 · unverdicted · none · ref 31
OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering · cs.CV · 2026-04-09 · unverdicted · none · ref 13
Benchmarking LLMs on the Massive Sound Embedding Benchmark (MSEB) · cs.SD · 2026-05-06 · unverdicted · none · ref 19
