arxiv: 2509.08031 · v3 · submitted 2025-09-09 · 💻 cs.SD · cs.AI · cs.LG · eess.AS

AU-Harness: An Open-Source Toolkit for Holistic Evaluation of Audio LLMs

Hoang Nguyen, Sidharth Surapaneni, Akshay Kalkunte, Jash Mehta, Aman Tiwari, Oluwanifemi Bamgbose, Khyati Mahajan, Jash Shah, Shruthan Radhakrishna, Sathwik Tejaswi Madhusudhan, Vikas Yadav, Sai Rajeswar

Pith reviewed 2026-05-18 18:02 UTC · model grok-4.3

keywords Large Audio Language Models · Evaluation Toolkit · Benchmarking · Multi-turn Dialogue · Audio LLMs · Performance Optimization · Open Source Framework

The pith

AU-Harness evaluates large audio language models up to 151 percent faster than prior toolkits while adding multi-turn dialogue support.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AU-Harness as an open-source evaluation framework built specifically for large audio language models. It identifies three main shortcomings in existing tools: slow pipelines that block large studies, missing support for conversations that span multiple turns, and the lack of one unified system that scales with new models and benchmarks. The authors address these gaps by introducing batch processing and parallel execution that deliver measured speed gains. Standardized prompting and configuration options are included to support consistent comparisons across models and tasks. The framework is positioned to make previously impractical analyses, such as tracking how performance changes over extended audio exchanges, routine and reproducible.

Core claim

AU-Harness is an evaluation framework for large audio language models that reaches up to 151 percent speedup relative to existing toolkits by using optimized batch processing and parallel execution. It supplies standardized prompting protocols and flexible configurations that enable fair comparisons across models and scenarios. The system also supports multi-turn dialogue evaluation, which allows direct examination of context integration and performance dynamics over longer audio conversations.

What carries the argument

Optimized batch processing and parallel execution pipeline that shortens evaluation time while preserving output consistency with sequential baselines.
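A minimal sketch of this idea, assuming per-sample scoring is independent and deterministic. The names `chunk`, `evaluate_parallel`, and `toy_score` are hypothetical illustrations, not AU-Harness APIs; the thread pool stands in for whatever execution backend the toolkit actually uses.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def chunk(items: Sequence[str], batch_size: int) -> List[List[str]]:
    """Split items into consecutive batches of at most batch_size."""
    return [list(items[i:i + batch_size]) for i in range(0, len(items), batch_size)]


def evaluate_sequential(samples: Sequence[str], score: Callable[[str], float]) -> List[float]:
    """Baseline: score one sample at a time, in order."""
    return [score(s) for s in samples]


def evaluate_parallel(
    samples: Sequence[str],
    score: Callable[[str], float],
    batch_size: int = 4,
    workers: int = 4,
) -> List[float]:
    """Score batches concurrently while keeping the original sample order."""
    batches = chunk(samples, batch_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, so flattening
        # the per-batch lists restores the original sample order.
        per_batch = pool.map(lambda b: [score(s) for s in b], batches)
        return [x for batch in per_batch for x in batch]


def toy_score(sample: str) -> float:
    """Hypothetical stand-in for a model call followed by a metric."""
    return len(sample.split()) / 10.0


samples = [f"utterance {i} " + "word " * i for i in range(10)]
sequential = evaluate_sequential(samples, toy_score)
parallel = evaluate_parallel(samples, toy_score)
print(sequential == parallel)
```

The final comparison is the consistency check in miniature: if the two lists differ, the faster path has changed what is being measured.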

If this is right

Large-scale evaluations of audio language models become feasible on standard hardware.
Researchers can now measure how model responses evolve across multiple turns of audio context.
Standardized protocols reduce variability when comparing different models on the same benchmarks.
Systematic identification of limitations in current audio reasoning becomes practical.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Widespread use could shift community benchmarks toward longer multi-turn audio tasks instead of isolated queries.
Developers might discover specific failure patterns in cross-turn audio understanding that single-turn tests miss.
Lower evaluation costs could encourage more frequent retraining cycles and rapid iteration on audio models.

Load-bearing premise

The speed optimizations leave evaluation accuracy and model rankings unchanged from the slower sequential methods used in earlier toolkits.

What would settle it

Running identical audio inputs, prompts, and models through AU-Harness and through an existing sequential toolkit and obtaining different accuracy scores or different model orderings.
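One way such a check could be scripted, as a sketch only: the score dictionaries below are made-up placeholder values, and `compare_runs` and `ranking` are hypothetical helpers rather than part of any toolkit.

```python
from typing import Dict, List

Scores = Dict[str, List[float]]  # model name -> per-sample scores


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


def ranking(scores: Scores) -> List[str]:
    """Order models by mean score, best first; ties broken by name."""
    return sorted(scores, key=lambda m: (-mean(scores[m]), m))


def compare_runs(baseline: Scores, candidate: Scores, tol: float = 0.0) -> dict:
    """Report per-sample disagreements and whether the model ordering changed."""
    if baseline.keys() != candidate.keys():
        raise ValueError("both runs must cover the same models")
    mismatches = []
    for model in baseline:
        if len(baseline[model]) != len(candidate[model]):
            raise ValueError(f"sample count differs for {model}")
        for i, (a, b) in enumerate(zip(baseline[model], candidate[model])):
            if abs(a - b) > tol:
                mismatches.append((model, i, a, b))
    return {
        "per_sample_mismatches": mismatches,
        "same_ranking": ranking(baseline) == ranking(candidate),
    }


# Placeholder scores for illustration only; not results from the paper.
sequential_run = {"model_a": [1.0, 0.0, 1.0], "model_b": [1.0, 1.0, 1.0]}
matching_run = {"model_a": [1.0, 0.0, 1.0], "model_b": [1.0, 1.0, 1.0]}
drifted_run = {"model_a": [1.0, 1.0, 1.0], "model_b": [1.0, 0.0, 0.0]}

print(compare_runs(sequential_run, matching_run))
print(compare_runs(sequential_run, drifted_run)["same_ranking"])
```

A non-empty mismatch list or a changed ranking on real runs would be the disconfirming evidence described above.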

Figures

Figures reproduced from arXiv: 2509.08031 by Akshay Kalkunte, Aman Tiwari, Hoang Nguyen, Jash Mehta, Jash Shah, Khyati Mahajan, Oluwanifemi Bamgbose, Sai Rajeswar, Sathwik Tejaswi Madhusudhan, Shruthan Radhakrishna, Sidharth Surapaneni, Vikas Yadav.

**Figure 1.** Architecture overview of AU-Harness evaluation framework. Our system comprises three core components: (1) Config module for hierarchical task configuration and standardized prompting, (2) Request Controller managing token-based concurrency limits across all engines with adaptive retry mechanisms, and (3) Concurrent Engines executing parallel model evaluation with dataset sharding. The Request Controller ma…
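The Request Controller described in the Figure 1 caption can be pictured with a small sketch. This is an assumption-laden illustration, not the toolkit's implementation: it assumes one token per in-flight request, a fixed exponential backoff in place of the paper's adaptive retry, and a hypothetical `fake_engine` that fails once for some samples.

```python
import asyncio


class RequestController:
    """Hypothetical controller: a shared token pool caps in-flight requests."""

    def __init__(self, max_tokens: int, max_retries: int = 3, base_delay: float = 0.01):
        self._tokens = asyncio.Semaphore(max_tokens)
        self.max_retries = max_retries
        self.base_delay = base_delay
        self.in_flight = 0
        self.peak_in_flight = 0

    async def submit(self, call, *args):
        for attempt in range(self.max_retries + 1):
            async with self._tokens:  # hold a token only while the call runs
                self.in_flight += 1
                self.peak_in_flight = max(self.peak_in_flight, self.in_flight)
                try:
                    return await call(*args)
                except RuntimeError:
                    if attempt == self.max_retries:
                        raise
                finally:
                    self.in_flight -= 1
            # Back off after releasing the token so other requests can proceed.
            await asyncio.sleep(self.base_delay * 2 ** attempt)


async def fake_engine(sample_id: int, failed_once: set) -> str:
    """Stand-in for a model endpoint; every third sample fails on first try."""
    await asyncio.sleep(0.001)
    if sample_id % 3 == 0 and sample_id not in failed_once:
        failed_once.add(sample_id)
        raise RuntimeError("transient failure")
    return f"output-{sample_id}"


async def run_all(n: int = 8, max_tokens: int = 2):
    controller = RequestController(max_tokens=max_tokens)
    failed_once: set = set()
    jobs = (controller.submit(fake_engine, i, failed_once) for i in range(n))
    results = await asyncio.gather(*jobs)  # gather keeps submission order
    return results, controller.peak_in_flight


results, peak = asyncio.run(run_all())
print(results, peak)
```

Releasing the token before sleeping is the design choice worth noting: a request that is waiting to retry does not block healthy requests from using the budget.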

**Figure 2.** Task distribution and coverage in AU-Harness. Our framework encompasses six major task categories with balanced representation: Speech Recognition (ASR variants), Paralinguistics (emotion, speaker, accent recognition), Spoken Language Understanding (QA, translation, summarization), Audio Understanding (scene, music), Spoken Language Reasoning (function calling, coding, instruction following), and Safety …

**Figure 3.** LLM-Adaptive Diarization methodology comparison. Traditional diarization (top, bottom-right) outputs time-stamped audio segments with speaker annotations, ideal for specialized neural architectures. LLM-Adaptive approach (bottom-left) integrates speaker information directly into transcripts, enabling evaluation through prompting-based generation evaluated via word-level metrics (WDER, cpWER). This approa…
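For readers unfamiliar with cpWER, here is a small sketch of the metric as it is commonly defined: concatenate each speaker's utterances, try every mapping between reference and hypothesis speakers, and keep the lowest word error rate. The function names and the toy transcripts are illustrative assumptions, not the toolkit's code, and the brute-force permutation search is only practical for a handful of speakers.

```python
from itertools import permutations
from typing import Dict, List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def cpwer(ref_by_speaker: Dict[str, List[str]], hyp_by_speaker: Dict[str, List[str]]) -> float:
    """Concatenated minimum-permutation WER over speaker-attributed transcripts."""
    refs = [" ".join(utts).split() for utts in ref_by_speaker.values()]
    hyps = [" ".join(utts).split() for utts in hyp_by_speaker.values()]
    # Pad with empty streams so both sides have the same number of speakers.
    n = max(len(refs), len(hyps))
    refs += [[] for _ in range(n - len(refs))]
    hyps += [[] for _ in range(n - len(hyps))]
    total_ref_words = sum(len(r) for r in refs)
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best_errors / total_ref_words


reference = {"A": ["hello there", "how are you"], "B": ["fine thanks"]}
relabelled = {"spk1": ["fine thanks"], "spk2": ["hello there", "how are you"]}
one_error = {"spk1": ["fine thank"], "spk2": ["hello there how are you"]}

print(cpwer(reference, relabelled))  # 0.0: speaker labels differ, words match
print(cpwer(reference, one_error))   # one substitution over seven reference words
```

The first call shows why the permutation step matters: a model is not penalized for naming speakers differently, only for attributing words to the wrong speaker stream.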

**Figure 4.** Efficiency comparison across evaluation frameworks and runtime scenarios. (a) Processed Samples per Second (↑ better) and (b) Real-time Factor (↓ better) measured across three datasets (MELD-Emotion, LibriSpeech-test-clean, ClothoAQA) and three runtime conditions: Individual (dataset-specific), Sequential (worst-case serialized execution), and Parallel (optimal concurrent execution). Our framework consi…

**Figure 5.** Parallel runtime efficiency analysis across evaluation frameworks. Scatter plot comparing frameworks under optimal parallel execution conditions, plotting Real-time Factor (x-axis, ↓ better) against Processed Samples per Second (y-axis, ↑ better). Our framework (rightmost cluster) achieves superior performance in both dimensions, demonstrating the effectiveness of token-based request scheduling, dataset s…
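The two efficiency metrics in Figures 4 and 5 can be illustrated with a short sketch. It assumes the conventional definitions (samples divided by wall-clock time, and processing time divided by audio duration); the excerpt here does not spell out the paper's exact formulas, and the numbers below are invented for the example.

```python
def samples_per_second(num_samples: int, wall_seconds: float) -> float:
    """Throughput: how many samples are processed per second of wall-clock time."""
    return num_samples / wall_seconds


def real_time_factor(wall_seconds: float, audio_seconds: float) -> float:
    """RTF: processing time per second of audio; below 1.0 is faster than real time."""
    return wall_seconds / audio_seconds


# Invented example: 1200 clips totalling one hour of audio, evaluated in 60 s.
throughput = samples_per_second(num_samples=1200, wall_seconds=60.0)
rtf = real_time_factor(wall_seconds=60.0, audio_seconds=3600.0)
print(throughput)  # 20.0
print(round(rtf, 4))
```

The metrics move in opposite directions as a system gets faster, which is why the figures mark one with an up arrow and the other with a down arrow.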

Original abstract

Large Audio Language Models (LALMs) are rapidly advancing, but evaluating them remains challenging due to inefficient and non-standardized toolkits that limit fair comparison and systematic assessment. Existing evaluation frameworks exhibit three critical limitations: (1) slow and inefficient processing pipeline that bottlenecks large-scale studies, (2) inadequate multi-turn dialogue support, leaving fundamental questions about cross-turn context integration and performance dynamics over extended conversations in LALMs unanswered; and (3) the absence of unified and scalable evaluation framework capable of keeping pace with the rapid growth of both LALMs and audio benchmarks. To address these issues, we introduce AU-Harness, an efficient and comprehensive evaluation framework for LALMs. Our system achieves a speedup of up to 151% over existing evaluation toolkits through optimized batch processing and parallel execution, enabling large-scale evaluations previously considered impractical. We provide standardized prompting protocols and flexible configurations for fair model comparison across diverse scenarios. AU-Harness unlocks a range of in-depth analyses difficult to conduct without a unified foundation, including multi-turn dialogue dynamics, enabling the study of true audio reasoning capabilities in existing LALMs. AU-Harness provides both practical evaluation tools and insights into model limitations, advancing systematic LALM development.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

AU-Harness integrates batching and multi-turn support into one open-source audio LLM eval toolkit, which is useful infrastructure, but the 151% speedup still needs direct checks that outputs match the old sequential pipelines.

The main thing to know is that this paper ships an open-source toolkit called AU-Harness that combines batch and parallel processing with multi-turn dialogue evaluation for large audio language models. It targets three gaps at once: slow pipelines that block big studies, weak multi-turn context handling, and the lack of a single scalable framework as models and benchmarks grow fast. The headline number is a claimed 151% speedup over prior toolkits, plus standardized prompting and configs for fairer comparisons.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces AU-Harness, an open-source toolkit for holistic evaluation of Large Audio Language Models (LALMs). It identifies three limitations in prior frameworks (inefficient processing, inadequate multi-turn dialogue support, and lack of a unified scalable system) and claims that AU-Harness delivers up to 151% speedup via optimized batch processing and parallel execution, standardized prompting protocols, and capabilities for analyzing multi-turn dynamics and audio reasoning.

Significance. If the performance claims hold and the optimizations preserve evaluation fidelity, the toolkit could meaningfully advance LALM research by removing practical bottlenecks to large-scale, reproducible, and multi-turn studies. The open-source release and focus on standardized protocols are constructive contributions for community adoption.

major comments (2)

[Abstract] Abstract and results presentation: the headline claim of a 151% speedup through batch processing and parallel execution is not accompanied by any experimental details, baseline comparisons, timing tables, or error analysis. This leaves the central efficiency assertion unsupported.
[Evaluation / Experiments] No section demonstrates that the batch and parallel optimizations produce identical per-sample scores, multi-turn context handling, prompt tokenization, and aggregate metrics to the sequential pipelines used in prior toolkits. Without such verification, the speedup cannot be shown to support fair comparisons or the claim of enabling previously impractical large-scale studies.

minor comments (2)

[Abstract] Clarify the precise meaning of '151% speedup' (e.g., wall-clock time reduction factor or throughput multiplier) and specify the hardware, batch sizes, and model configurations used for the measurement.
[Abstract] The abstract states that the toolkit 'unlocks a range of in-depth analyses' but provides no concrete examples or case studies of such analyses in the provided text.
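The ambiguity raised in the first minor comment can be made concrete with a little arithmetic. This sketch shows two common readings of "151% speedup" and what each implies for a job that takes 100 seconds on the baseline; neither reading is confirmed by the text reviewed here, and the function names are illustrative.

```python
def as_ratio(percent: float) -> float:
    """Reading A: new throughput is 151% OF the baseline, i.e. a 1.51x multiplier."""
    return percent / 100


def as_increase(percent: float) -> float:
    """Reading B: new throughput is 151% HIGHER than baseline, i.e. a 2.51x multiplier."""
    return 1 + percent / 100


def time_saved_fraction(speedup: float) -> float:
    """Fraction of wall-clock time removed by a given throughput multiplier."""
    return 1 - 1 / speedup


baseline_seconds = 100.0
print(f"{baseline_seconds / as_ratio(151):.1f}")     # 66.2
print(f"{baseline_seconds / as_increase(151):.1f}")  # 39.8
print(f"{time_saved_fraction(as_ratio(151)):.0%}")
print(f"{time_saved_fraction(as_increase(151)):.0%}")
```

The gap between the two readings is large enough that stating the multiplier or the wall-clock times directly would remove the ambiguity.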

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which identify key areas where additional evidence is needed to support our claims. We address each major comment below and describe the revisions we will make to strengthen the manuscript.

Referee: [Abstract] Abstract and results presentation: the headline claim of a 151% speedup through batch processing and parallel execution is not accompanied by any experimental details, baseline comparisons, timing tables, or error analysis. This leaves the central efficiency assertion unsupported.

Authors: We agree that the abstract states the speedup claim without the supporting experimental details, baseline comparisons, timing tables, or error analysis. While the methods section describes the batch and parallel optimizations, we acknowledge that this leaves the central efficiency assertion insufficiently supported in the current version. In the revised manuscript, we will add a dedicated subsection in the Experiments section that includes timing benchmarks, direct comparisons against existing toolkits, tables reporting speedup factors under varying batch sizes and hardware configurations, and an error analysis confirming that the optimizations introduce no measurable variance in results. revision: yes
Referee: [Evaluation / Experiments] No section demonstrates that the batch and parallel optimizations produce identical per-sample scores, multi-turn context handling, prompt tokenization, and aggregate metrics to the sequential pipelines used in prior toolkits. Without such verification, the speedup cannot be shown to support fair comparisons or the claim of enabling previously impractical large-scale studies.

Authors: We recognize the importance of explicitly verifying output equivalence to ensure the optimizations support fair comparisons. The current manuscript describes the design of independent per-sample processing but does not include a dedicated verification experiment. In the revision, we will add a verification subsection that runs a representative set of evaluations in both sequential and batched/parallel modes, reporting identical per-sample scores, preserved multi-turn context handling, unchanged prompt tokenization, and matching aggregate metrics. This will directly support the claim that the speedup enables large-scale studies without compromising evaluation fidelity. revision: yes

Circularity Check

0 steps flagged

No circularity: engineering implementation claim with no derivation chain or self-referential reduction

The paper introduces an open-source evaluation toolkit whose central claim is an empirical speedup (up to 151%) obtained by batch processing and parallel execution. No equations, fitted parameters, or mathematical derivations are present that could reduce the reported speedup or multi-turn support to the input assumptions by construction. The contribution is a software artifact whose performance numbers are measured against external baselines rather than defined in terms of themselves. While the skeptic correctly notes that equivalence of optimized versus sequential outputs must be demonstrated for the speedup to be fairly comparable, this is a question of empirical validation and correctness risk, not circularity. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results appear in the provided text. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on the domain assumption that prior frameworks exhibit the three stated limitations and that standard LLM API interfaces can be wrapped without loss of fidelity; no free parameters or new entities are introduced.

axioms (1)

Domain assumption: Existing evaluation frameworks exhibit slow processing, inadequate multi-turn support, and lack a unified scalable framework.
This premise directly motivates the design of AU-Harness and is stated in the abstract as the motivation for the work.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel

Relation between the paper passage and the cited Recognition theorem: unclear

Paper passage: "Our system achieves a speedup of up to 151% over existing evaluation toolkits through optimized batch processing and parallel execution"

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking

Relation between the paper passage and the cited Recognition theorem: unclear

Paper passage: "LLM-Adaptive Diarization ... evaluated via Word-diarization Error Rate (WDER) and concatenated minimum-permutation word error rate (cpWER)"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
