KVoiceBench, KOpenAudioBench, and KMMAU: Agent-Driven Korean Speech Benchmarks for Evaluating SpeechLMs

Haechan Kim; Inkyu Park; Jihoo Lee; Jonghyun Lee; Seungjun Chung

arxiv: 2605.27984 · v1 · pith:754L6FRKnew · submitted 2026-05-27 · 💻 cs.CL · cs.AI

KVoiceBench, KOpenAudioBench, and KMMAU: Agent-Driven Korean Speech Benchmarks for Evaluating SpeechLMs

Haechan Kim , Seungjun Chung , Inkyu Park , Jihoo Lee , Jonghyun Lee This is my paper

Pith reviewed 2026-06-29 13:38 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords Korean speech benchmarksSpeechLMsSpokenQAaudio understandingmultilingual evaluationbenchmark constructionagent-driven transferperformance gaps

0 comments

The pith

Two agent-driven frameworks transfer English speech benchmarks to Korean while preserving instructions, constraints, and audio properties, enabling evaluation of eight SpeechLMs that uncovers varying gaps and diverging task rankings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes two human-agent benchmark-construction frameworks to address the English-centric nature of SpeechLM evaluation. One framework transfers source-language SpokenQA benchmarks into target-language versions, while the other converts target-language ASR corpora into audio understanding benchmarks using transcriptions and speaker metadata. These are applied to create and release three Korean benchmarks totaling 12,345 samples: KVoiceBench and KOpenAudioBench for SpokenQA, and KMMAU for audio understanding. Evaluation of eight recent SpeechLMs shows that English-Korean performance gaps differ substantially across models and task families. A sympathetic reader would care because the results indicate that English-only testing misses complementary weaknesses between SpokenQA and audio understanding capabilities.

Core claim

We propose two human-agent benchmark-construction frameworks: one transfers source-language SpokenQA benchmarks into target-language SpokenQA benchmarks, and the other converts target-language ASR corpora into audio understanding benchmarks using transcriptions and speaker metadata. Using these frameworks, we construct and publicly release three Korean speech benchmarks: KVoiceBench and KOpenAudioBench for Korean SpokenQA, and KMMAU for Korean audio understanding, comprising 12,345 samples in total. We evaluate eight recent SpeechLMs and find that English-Korean performance gaps vary substantially across models and task families, and that SpokenQA and audio understanding rankings diverge, re

What carries the argument

Two human-agent benchmark-construction frameworks that transfer SpokenQA benchmarks and convert ASR corpora into audio understanding tasks while aiming to preserve language-specific instructions, answer constraints, spoken forms, speaker attributes, accents, and paralinguistic properties.

If this is right

English-Korean performance gaps vary substantially across models and task families.
SpokenQA and audio understanding rankings diverge across the evaluated SpeechLMs.
The divergence reveals complementary weaknesses that English-only evaluation cannot detect.
The three released benchmarks provide a total of 12,345 Korean samples for future SpeechLM assessment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The frameworks could be adapted to create benchmarks in additional languages beyond Korean.
Diverging task rankings suggest SpeechLMs may require separate training objectives for question answering versus audio understanding.
Public availability of the benchmarks enables direct comparison of new models against the eight evaluated here.
The approach highlights a general need for language-specific construction methods rather than simple translation pipelines.

Load-bearing premise

The agent-driven frameworks successfully transfer benchmarks while preserving language-specific instructions, answer constraints, spoken forms, speaker attributes, accents, and paralinguistic properties.

What would settle it

A direct comparison where human evaluators rate the transferred Korean benchmarks as having lost original answer constraints or speaker attributes, or where model rankings on the new benchmarks match English rankings exactly with no divergence between SpokenQA and audio understanding.

Figures

Figures reproduced from arXiv: 2605.27984 by Haechan Kim, Inkyu Park, Jihoo Lee, Jonghyun Lee, Seungjun Chung.

**Figure 2.** Figure 2: Construction pipeline for KVoiceBench and KOpenAudioBench. The framework consists of four stages: [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Construction pipeline for target-language audio understanding benchmarks. Four construction methods are [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Representative samples from the released Korean speech benchmark suite. Red cells correspond to [PITH_FULL_IMAGE:figures/full_fig_p015_4.png] view at source ↗

read the original abstract

Speech language models (SpeechLMs) have achieved substantial progress by extending large language models (LLMs) to the speech modality. However, SpeechLM evaluation remains heavily centered on English, limiting reliable assessment of multilingual speech capabilities. Straightforward benchmark transfer through ASR, translation, normalization, and TTS can corrupt language-specific instructions, answer constraints, and spoken forms; for audio understanding, transferring source-language audio also fails to preserve target-language speaker attributes, accents, and paralinguistic properties. To address these limitations, we propose two human-agent benchmark-construction frameworks: one transfers source-language SpokenQA benchmarks into target-language SpokenQA benchmarks, and the other converts target-language ASR corpora into audio understanding benchmarks using transcriptions and speaker metadata. Using these frameworks, we construct and publicly release three Korean speech benchmarks: KVoiceBench and KOpenAudioBench for Korean SpokenQA, and KMMAU for Korean audio understanding, comprising 12,345 samples in total. We evaluate eight recent SpeechLMs and find that English-Korean performance gaps vary substantially across models and task families, and that SpokenQA and audio understanding rankings diverge, revealing complementary weaknesses invisible to English-only evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

New Korean speech benchmarks via agent frameworks, but no validation data shown for whether the transfers actually preserve the claimed properties.

read the letter

The main takeaway is that the authors built and released three Korean speech benchmarks using two agent-assisted construction methods, then ran eight SpeechLMs on them to show English-Korean gaps and ranking shifts between SpokenQA and audio understanding tasks.

What is actually new is the pair of frameworks—one that converts source SpokenQA items into target-language versions, the other that turns target ASR corpora into audio understanding sets—and the resulting public datasets totaling 12,345 samples. The evaluation results are straightforward: gaps vary by model and task family, and the two task types surface different weaknesses. Releasing the data is the concrete step forward.

The soft spot is the missing evidence on transfer quality. The abstract describes the motivation and process but reports no inter-annotator numbers, constraint preservation rates, accent fidelity checks, or comparison against simpler transfer baselines. That leaves the central claim—that these benchmarks reliably expose multilingual issues—resting on an unverified assumption. If the full paper contains those checks, they need to be front and center; right now they are not visible.

This is for researchers who need non-English test sets for SpeechLMs or who want to compare models beyond English-only suites. A reader who wants ready-to-use Korean data and some initial model numbers will find it practical.

It deserves peer review because the resources are new and the motivation is clear, even with the validation gap. Send it out, but flag the need for explicit quality metrics on the transferred items.

Referee Report

1 major / 0 minor

Summary. The paper proposes two human-agent benchmark-construction frameworks to transfer English SpokenQA benchmarks and convert target-language ASR corpora into Korean equivalents while aiming to preserve language-specific instructions, answer constraints, spoken forms, speaker attributes, accents, and paralinguistic properties. It constructs and releases three Korean speech benchmarks (KVoiceBench, KOpenAudioBench for SpokenQA; KMMAU for audio understanding) totaling 12,345 samples, then evaluates eight recent SpeechLMs to report varying English-Korean performance gaps across models and task families, plus divergent rankings between SpokenQA and audio understanding that reveal complementary weaknesses not visible in English-only evaluation.

Significance. If the transfer frameworks are shown to preserve the targeted properties, the work supplies publicly released Korean benchmarks that can support more reliable multilingual SpeechLM assessment and highlight model limitations invisible to English-centric testing. The public release of the benchmarks and the empirical evaluation across multiple models and task types are concrete contributions to the field.

major comments (1)

[Abstract, motivation and construction sections] Abstract and construction sections: the central claim that the agent-driven frameworks successfully transfer benchmarks while preserving language-specific instructions, answer constraints, spoken forms, speaker attributes, accents, and paralinguistic properties lacks supporting quantitative validation such as inter-annotator agreement scores, error rates on transferred constraints, or acoustic fidelity metrics; without this, the superiority over straightforward transfer methods cannot be assessed and the evaluation results rest on an unverified assumption.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment point-by-point below and outline the revisions we will make.

read point-by-point responses

Referee: [Abstract, motivation and construction sections] Abstract and construction sections: the central claim that the agent-driven frameworks successfully transfer benchmarks while preserving language-specific instructions, answer constraints, spoken forms, speaker attributes, accents, and paralinguistic properties lacks supporting quantitative validation such as inter-annotator agreement scores, error rates on transferred constraints, or acoustic fidelity metrics; without this, the superiority over straightforward transfer methods cannot be assessed and the evaluation results rest on an unverified assumption.

Authors: We agree that quantitative validation would strengthen the central claim and allow direct comparison to straightforward transfer. The manuscript describes the human-agent frameworks and states that human verification steps are used to preserve the listed properties, but does not report inter-annotator agreement, sampled error rates, or acoustic metrics. In the revised version we will add a dedicated subsection under 'Benchmark Construction' that reports (1) inter-annotator agreement scores from the human verification stage on a random sample of 500 instances per benchmark, (2) error rates on language-specific instructions and answer constraints for the same sample, and (3) a brief discussion of why objective acoustic fidelity metrics for accents and paralinguistic cues are difficult to obtain without large-scale perceptual listening tests (which we flag as future work). These additions will make the superiority claim empirically testable. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark construction

full rationale

The paper proposes human-agent frameworks for transferring/constructing Korean speech benchmarks from existing sources, releases three new datasets (KVoiceBench, KOpenAudioBench, KMMAU), and reports evaluation results on eight SpeechLMs. No equations, parameter fitting, predictions derived from inputs, or self-citations appear in the provided text. The central claims concern the construction process and observed performance gaps; these rest on the described methodology and external model evaluations rather than reducing to self-definitional steps or fitted inputs. This is standard empirical work with no load-bearing circular reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that agent-driven transfer avoids corruption of language-specific elements; no free parameters or invented entities are introduced.

axioms (1)

domain assumption Straightforward benchmark transfer through ASR, translation, normalization, and TTS corrupts language-specific instructions, answer constraints, and spoken forms; transferring source-language audio fails to preserve target-language speaker attributes, accents, and paralinguistic properties.
This premise is explicitly stated in the abstract as the motivation for developing the new frameworks.

pith-pipeline@v0.9.1-grok · 5755 in / 1298 out tokens · 52013 ms · 2026-06-29T13:38:28.787559+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

4 extracted references · 4 canonical work pages · 4 internal anchors

[1]

A survey on LLM-as-a-judge.arXiv preprint arXiv:2411.15594. Hangrui Hu, Xinfa Zhu, Ting He, Dake Guo, Bin Zhang, Xiong Wang, Zhifang Guo, Ziyue Jiang, Hongkun Hao, Zishan Guo, Xinyu Zhang, Pei Zhang, Baosong Yang, Jin Xu, Jingren Zhou, and Junyang Lin. 2026. Qwen3-TTS technical report.Preprint, arXiv:2601.15621. Ailin Huang and 1 others. 2025. Step-audio:...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[2]

Spoken SQuAD: A Study of Mitigating the Impact of Speech Recognition Errors on Listening Comprehension

Association for Computational Linguistics. Chia-Hsuan Li, Szu-Lin Wu, Chi-Liang Liu, and Hung yi Lee. 2018. Spoken SQuAD: A study of mitigating the impact of speech recognition errors on listening comprehension.Preprint, arXiv:1804.00320. Tianpeng Li, Jun Liu, Tao Zhang, Yuanbo Fang, Da Pan, Mingrui Wang, Zheng Liang, Zehuan Li, Mingan Lin, Guosheng Dong,...

work page internal anchor Pith review Pith/arXiv arXiv 2018
[3]

MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark

MMAU: A massive multi-task audio un- derstanding and reasoning benchmark.Preprint, arXiv:2410.19168. Richard Sproat, Alan W. Black, Stanley Chen, Shankar Kumar, Mari Ostendorf, and Christopher Richards

work page internal anchor Pith review Pith/arXiv arXiv
[4]

RNN Approaches to Text Normalization: A Challenge

Normalization of non-standard words.Com- puter Speech & Language, 15(3):287–333. Richard Sproat and Navdeep Jaitly. 2017. RNN approaches to text normalization: A challenge. Preprint, arXiv:1611.00068. Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao Zhang. 2024. SALMONN: Towards generic hear- ing abilities f...

work page internal anchor Pith review Pith/arXiv arXiv 2017

[1] [1]

A survey on LLM-as-a-judge.arXiv preprint arXiv:2411.15594. Hangrui Hu, Xinfa Zhu, Ting He, Dake Guo, Bin Zhang, Xiong Wang, Zhifang Guo, Ziyue Jiang, Hongkun Hao, Zishan Guo, Xinyu Zhang, Pei Zhang, Baosong Yang, Jin Xu, Jingren Zhou, and Junyang Lin. 2026. Qwen3-TTS technical report.Preprint, arXiv:2601.15621. Ailin Huang and 1 others. 2025. Step-audio:...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[2] [2]

Spoken SQuAD: A Study of Mitigating the Impact of Speech Recognition Errors on Listening Comprehension

Association for Computational Linguistics. Chia-Hsuan Li, Szu-Lin Wu, Chi-Liang Liu, and Hung yi Lee. 2018. Spoken SQuAD: A study of mitigating the impact of speech recognition errors on listening comprehension.Preprint, arXiv:1804.00320. Tianpeng Li, Jun Liu, Tao Zhang, Yuanbo Fang, Da Pan, Mingrui Wang, Zheng Liang, Zehuan Li, Mingan Lin, Guosheng Dong,...

work page internal anchor Pith review Pith/arXiv arXiv 2018

[3] [3]

MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark

MMAU: A massive multi-task audio un- derstanding and reasoning benchmark.Preprint, arXiv:2410.19168. Richard Sproat, Alan W. Black, Stanley Chen, Shankar Kumar, Mari Ostendorf, and Christopher Richards

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

RNN Approaches to Text Normalization: A Challenge

Normalization of non-standard words.Com- puter Speech & Language, 15(3):287–333. Richard Sproat and Navdeep Jaitly. 2017. RNN approaches to text normalization: A challenge. Preprint, arXiv:1611.00068. Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao Zhang. 2024. SALMONN: Towards generic hear- ing abilities f...

work page internal anchor Pith review Pith/arXiv arXiv 2017