hub Baseline reference

Mmau-pro: A challenging and comprehensive benchmark for holistic evaluation of audio general intelligence

· 2025 · arXiv 2508.13992

Baseline reference. 57% of citing Pith papers use this work as a benchmark or comparison.

11 Pith papers citing it

Baseline 57% of classified citations

read on arXiv browse 11 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

dataset 4 background 2 other 1

citation-polarity summary

use dataset 4 background 2 unclear 1

representative citing papers

VoxSafeBench: Not Just What Is Said, but Who, How, and Where

cs.SD · 2026-04-16 · unverdicted · novelty 8.0

VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.

Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models

eess.AS · 2026-04-28 · unverdicted · novelty 7.0

Semantic-level and verification-based uncertainty methods outperform token-level baselines for audio reasoning in ALLMs, but their relative performance on hallucination and unanswerable-question benchmarks is model- and task-dependent.

All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation

cs.SD · 2026-04-27 · unverdicted · novelty 6.0

Audio-language models retain 60-72% of benchmark scores without audio, and most audio-dependent items can be solved from short fragments rather than full clips.

HeadRouter: Dynamic Head-Weight Routing for Task-Adaptive Audio Token Pruning in Large Audio Language Models

cs.SD · 2026-04-26 · unverdicted · novelty 6.0

HeadRouter prunes audio tokens more effectively by dynamically routing based on per-head importance for semantic versus acoustic tasks, exceeding baseline performance at 70% token retention on Qwen2.5-Omni models.

Generating Synthetic Doctor-Patient Conversations for Long-form Audio Summarization

cs.SD · 2026-04-07 · conditional · novelty 6.0

A three-stage synthetic data pipeline generates 8800 doctor-patient conversations totaling 1.3k hours of audio and LLM-produced SOAP notes, with evaluation showing cascaded transcription-then-summarization models outperform end-to-end audio models.

A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook

cs.SD · 2026-05-18 · unverdicted · novelty 5.0

A survey of Large Audio Language Models that establishes a taxonomy of trustworthiness vulnerabilities and proposes a Defense-in-Depth roadmap for audio intelligence.

Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models

eess.AS · 2026-04-14 · unverdicted · novelty 5.0

Audio-Cogito is an open-source LALM using Cogito-pipe data curation and self-distillation to achieve leading open-source performance on audio reasoning benchmarks.

OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering

cs.CV · 2026-04-09 · unverdicted · novelty 5.0

OmniJigsaw is a self-supervised proxy task that reconstructs shuffled audio-visual clips via joint integration, sample-level selection, and clip-level masking strategies, yielding gains on 15 video, audio, and reasoning benchmarks.

Benchmarking LLMs on the Massive Sound Embedding Benchmark (MSEB)

cs.SD · 2026-05-06 · unverdicted · novelty 3.0

LLMs exhibit a persistent modality gap versus specialized audio encoders on MSEB tasks, with no conclusive evidence favoring audio-native over cascaded architectures.

A Survey of Audio Reasoning in Multimodal Foundation Models

eess.AS · 2026-05-20 · unverdicted · novelty 2.0

A survey that provides a unified formulation of audio reasoning and reviews advances across Audio-to-Text, Audio-to-Speech, Audio-Visual, and Agentic paradigms while discussing challenges and future directions.

MedMosaic: A Challenging Large Scale Benchmark of Diverse Medical Audio

cs.SD · 2026-05-01

citing papers explorer

Showing 7 of 7 citing papers after filters.

VoxSafeBench: Not Just What Is Said, but Who, How, and Where cs.SD · 2026-04-16 · unverdicted · none · ref 2
VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.
All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation cs.SD · 2026-04-27 · unverdicted · none · ref 32
Audio-language models retain 60-72% of benchmark scores without audio, and most audio-dependent items can be solved from short fragments rather than full clips.
HeadRouter: Dynamic Head-Weight Routing for Task-Adaptive Audio Token Pruning in Large Audio Language Models cs.SD · 2026-04-26 · unverdicted · none · ref 12
HeadRouter prunes audio tokens more effectively by dynamically routing based on per-head importance for semantic versus acoustic tasks, exceeding baseline performance at 70% token retention on Qwen2.5-Omni models.
Generating Synthetic Doctor-Patient Conversations for Long-form Audio Summarization cs.SD · 2026-04-07 · conditional · none · ref 16
A three-stage synthetic data pipeline generates 8800 doctor-patient conversations totaling 1.3k hours of audio and LLM-produced SOAP notes, with evaluation showing cascaded transcription-then-summarization models outperform end-to-end audio models.
A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook cs.SD · 2026-05-18 · unverdicted · none · ref 195
A survey of Large Audio Language Models that establishes a taxonomy of trustworthiness vulnerabilities and proposes a Defense-in-Depth roadmap for audio intelligence.
Benchmarking LLMs on the Massive Sound Embedding Benchmark (MSEB) cs.SD · 2026-05-06 · unverdicted · none · ref 19
LLMs exhibit a persistent modality gap versus specialized audio encoders on MSEB tasks, with no conclusive evidence favoring audio-native over cascaded architectures.
MedMosaic: A Challenging Large Scale Benchmark of Diverse Medical Audio cs.SD · 2026-05-01 · unreviewed · ref 1

Mmau-pro: A challenging and comprehensive benchmark for holistic evaluation of audio general intelligence

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer