arXiv: 2606.07229 · v1 · pith:22RWTZZ2new · submitted 2026-06-05 · 💻 cs.SD · cs.CL · cs.MM

MMAE: A Massive Multitask Audio Editing Benchmark

Pith reviewed 2026-06-27 20:59 UTC · model grok-4.3

classification 💻 cs.SD · cs.CL · cs.MM
keywords audio editing benchmark · instruction following · multitask evaluation · rubric criteria · multimodal audio · model performance · sound speech music

The pith

A new benchmark shows audio editing models achieve exact match rates below 5 percent on instruction tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MMAE as a benchmark for instruction-based audio editing that spans seven modalities including sound, speech, music and mixtures. It organizes tasks into six complexity levels and uses 2000 samples broken down into 17741 rubric criteria to measure how well models follow instructions while keeping audio context consistent. Tests of leading models find exact match rates below 5 percent overall and zero percent on complex mixed-modality cases. These results indicate that current systems fall short on precise execution for real-world audio edits.

Core claim

MMAE is presented as the first broad testbed for general audio editing, covering multiple modalities, task complexities from basic changes to multi-hop reasoning, and operation types. Through its rubric framework that turns free-form instructions into verifiable criteria, the work finds that existing models deliver exact match rates consistently below 5 percent and drop to 0 percent on complex mixed-modality tasks, showing they cannot yet produce reliable edits.

What carries the argument

The rubric-based evaluation framework that decomposes free-form tasks into 17741 verifiable criteria to assess instruction following and context consistency.
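
To make the rubric idea concrete, here is a minimal Python sketch of how an exact match rate could be derived from per-criterion verdicts. The abstract does not spell out the EMR formula, so the sketch assumes a sample counts as an exact match only when every one of its criteria passes. The data layout, the function names `exact_match_rate` and `criterion_pass_rate`, and the toy verdicts are all hypothetical and not taken from the paper.

```python
from typing import Dict, List

# Hypothetical layout: sample id -> pass/fail verdict for each rubric criterion.
Verdicts = Dict[str, List[bool]]


def exact_match_rate(verdicts: Verdicts) -> float:
    """Share of samples whose criteria ALL pass (assumed EMR definition)."""
    if not verdicts:
        return 0.0
    matched = sum(1 for checks in verdicts.values() if checks and all(checks))
    return matched / len(verdicts)


def criterion_pass_rate(verdicts: Verdicts) -> float:
    """Share of individual criteria that pass, pooled over all samples."""
    total = sum(len(checks) for checks in verdicts.values())
    passed = sum(sum(checks) for checks in verdicts.values())
    return passed / total if total else 0.0


# Toy verdicts, invented purely for illustration.
toy: Verdicts = {
    "sample_a": [True, True, True],
    "sample_b": [True, False, True, True],
    "sample_c": [False, False],
    "sample_d": [True, True, True, True, False],
}

emr = exact_match_rate(toy)     # only sample_a passes every criterion
cpr = criterion_pass_rate(toy)  # 10 of 14 criteria pass
print(f"EMR = {emr:.2f}, criterion pass rate = {cpr:.2f}")
```

In this toy set, most individual criteria pass while only one of four samples is an exact match. Under the assumed definition, this is why an all-or-nothing metric can sit far below the per-criterion pass rate.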

If this is right

  • Precise execution remains a critical bottleneck for current audio editing models.
  • Structural robustness is insufficient when handling mixed modalities and multi-round edits.
  • The benchmark supplies a standardized way to diagnose weaknesses and track progress.
  • Multi-dimensional scoring separates instruction adherence from audio context consistency.
  • Complex tasks expose gaps that simpler single-modality tests miss.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The task taxonomy could guide creation of targeted training data for audio models.
  • Similar rubric methods might apply to video or image editing benchmarks to compare reliability across media.
  • Persistent low scores suggest that simply scaling current models may not close the gap without new mechanisms for audio reasoning.
  • Widespread adoption could shift development focus toward measurable instruction following rather than perceptual quality alone.

Load-bearing premise

The 2000 samples curated through human-agent collaboration and the decomposition into 17741 criteria provide an unbiased and complete measure of real-world instruction-following performance.

What would settle it

A model that records an exact match rate above 20 percent on the complex mixed-modality tasks would show the reported performance shortfalls are overstated.

Original abstract

We introduce MMAE, a Massive Multitask Audio Editing benchmark, serving as the first comprehensive evaluation testbed designed for general-purpose instruction-based audio editing. Spurred by the shift toward intelligent creation, interactive editing has rapidly expanded from visual domains, pioneered by models like Nano-banana 2 for images and Gemini-Omni for video, into audio. However, the current evaluation infrastructure lags severely, remaining highly fragmented and restricted to specific subdomains or basic operations. Unlike existing benchmarks that are limited in scope, MMAE extends to a broad spectrum of real-world scenarios, encompassing 7 distinct audio modalities, including sound, speech, music, and their mixtures. Furthermore, we establish a comprehensive taxonomy spanning 6 levels of task complexity, from basic modifications to multi-hop reasoning and multi-round editing, 2 levels of granularity, and 8 distinct operation types. Meticulously curated through human-agent collaboration, MMAE comprises 2,000 high-fidelity samples paired with a pioneering rubric-based evaluation framework. By decomposing free-form tasks into 17,741 verifiable criteria, this robust rubric-based paradigm enables a precise, multi-dimensional assessment of both instruction following and context consistency. Our extensive evaluation of leading models reveals that current systems remain far from achieving reliable edits. Strikingly, the Exact Match Rate (EMR) consistently falls below 5% and plummets to an absolute 0% in complex, mixed-modality tasks, exposing critical bottlenecks in precise execution and structural robustness. We hope MMAE will serve as a catalyst for future advances in the intelligent creation community, providing a clear diagnostic roadmap and establishing a standardized, long-lasting evaluation paradigm for next-generation audio editing systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MMAE, the first comprehensive benchmark for general-purpose instruction-based audio editing. It covers 7 audio modalities (including sound, speech, music, and their mixtures) and a taxonomy of 6 complexity levels, 2 granularities, and 8 operation types. The benchmark contains 2,000 samples curated via human-agent collaboration, and model outputs are scored with a rubric that decomposes the tasks into 17,741 verifiable criteria. Extensive evaluation of leading models shows Exact Match Rate (EMR) below 5% overall and 0% on complex mixed-modality tasks, concluding that current systems are far from reliable audio editing.

Significance. If the benchmark construction and scoring are shown to be reliable, MMAE would provide a much-needed standardized, multi-dimensional testbed that exposes clear capability gaps in instruction-following and structural robustness for audio editing models, serving as a diagnostic roadmap for the field.

major comments (2)
  1. [Curation and rubric construction] Curation and rubric construction (methods section describing the 2,000 samples and 17,741 criteria): no inter-annotator agreement statistics, bias audits, or validation of the criteria against independent human judgments are reported. This directly undermines the central claim that EMR <5% (and 0% on mixed-modality tasks) demonstrates model limitations rather than possible artifacts in rubric granularity or selection.
  2. [Evaluation protocol] Evaluation protocol (section on rubric-based scoring and EMR computation): insufficient detail is given on how model outputs are matched against the 17,741 criteria, how partial matches or context consistency are quantified, or any agreement metrics between automated and human scoring. These omissions are load-bearing for the reported performance numbers and the conclusion about current systems.
minor comments (2)
  1. [Taxonomy] The taxonomy of 6 complexity levels, 2 granularities, and 8 operation types would be clearer if summarized in a single table rather than described only in prose.
  2. [Figures] Figure captions for the modality and complexity distributions should explicitly state the sample counts per category to allow readers to assess balance.
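
Major comment 1 raises the possibility that a low EMR partly reflects rubric granularity. The Python sketch below gives a rough sense of that sensitivity. It uses only the two counts reported in the paper (2,000 samples and 17,741 criteria) and adds two simplifying assumptions that are not from the paper: every sample carries the average number of criteria, and criteria pass independently with the same probability. The function names are hypothetical.

```python
# Counts reported in the paper.
SAMPLES = 2_000
CRITERIA = 17_741

# Average number of rubric criteria attached to one sample (about 8.87).
avg_criteria = CRITERIA / SAMPLES


def emr_under_independence(pass_rate: float, n_criteria: float) -> float:
    """EMR if each criterion passes independently with `pass_rate` (assumption)."""
    return pass_rate ** n_criteria


def pass_rate_for_emr(emr: float, n_criteria: float) -> float:
    """Per-criterion pass rate that would yield `emr` under the same assumption."""
    return emr ** (1.0 / n_criteria)


# Per-criterion pass rate implied by a 5% EMR under these assumptions.
implied = pass_rate_for_emr(0.05, avg_criteria)
print(f"average criteria per sample: {avg_criteria:.2f}")
print(f"implied per-criterion pass rate at 5% EMR: {implied:.2f}")
```

Under these assumptions, a 5% EMR is compatible with a per-criterion pass rate of roughly 0.71. Real criteria are unlikely to be independent or evenly distributed, so this is only an illustration of why the referee asks for validation of the rubric itself, not an estimate of any model's performance.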

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, agreeing that additional reporting is warranted to strengthen the benchmark's validity, and commit to revisions accordingly.

Point-by-point responses
  1. Referee: [Curation and rubric construction] Curation and rubric construction (methods section describing the 2,000 samples and 17,741 criteria): no inter-annotator agreement statistics, bias audits, or validation of the criteria against independent human judgments are reported. This directly undermines the central claim that EMR <5% (and 0% on mixed-modality tasks) demonstrates model limitations rather than possible artifacts in rubric granularity or selection.

    Authors: We agree that these elements are important for establishing benchmark reliability. Although not reported in the initial submission, the human-agent collaboration involved structured review processes. In the revised manuscript, we will add inter-annotator agreement statistics (e.g., Cohen's kappa) computed over a sampled subset of the 2,000 tasks. We will include a bias audit subsection describing source diversity, modality balancing, and guidelines used to reduce selection bias. We will also report a post-hoc validation study in which independent human judges assessed alignment of a random sample of criteria with task intent. These additions will directly support the robustness of the EMR results. revision: yes

  2. Referee: [Evaluation protocol] Evaluation protocol (section on rubric-based scoring and EMR computation): insufficient detail is given on how model outputs are matched against the 17,741 criteria, how partial matches or context consistency are quantified, or any agreement metrics between automated and human scoring. These omissions are load-bearing for the reported performance numbers and the conclusion about current systems.

    Authors: We concur that greater transparency on the scoring mechanics is needed. The revised manuscript will expand the evaluation protocol section with a precise description of the criterion-matching procedure, including the hierarchical decomposition used for the 17,741 criteria and the rules for handling partial matches (via proportional credit assignment to sub-criteria). Context consistency checks will be formalized, and we will report agreement metrics (e.g., percentage agreement and Cohen's kappa) between the automated scorer and human evaluators on a held-out set of 200 model outputs. This will provide the necessary validation for the automated EMR computation. revision: yes
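
Both responses promise agreement statistics such as percentage agreement and Cohen's kappa. As a minimal sketch of what that check could look like for binary pass/fail verdicts, the Python snippet below computes both from scratch. The function names and the ten toy verdicts are hypothetical; the rebuttal does not describe the actual scoring data or tooling.

```python
from collections import Counter
from typing import Sequence


def percent_agreement(rater_a: Sequence[bool], rater_b: Sequence[bool]) -> float:
    """Fraction of items on which the two raters give the same label."""
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohens_kappa(rater_a: Sequence[bool], rater_b: Sequence[bool]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty label sequences")
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of each rater's marginal label frequencies.
    labels = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[label] * freq_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy verdicts on ten criteria, invented for illustration only.
auto_scorer = [True, True, False, False, True, False, True, False, False, False]
human_judge = [True, False, False, False, True, False, True, True, False, False]

agreement = percent_agreement(auto_scorer, human_judge)
kappa = cohens_kappa(auto_scorer, human_judge)
print(f"agreement = {agreement:.2f}, kappa = {kappa:.2f}")
```

In the toy data the raters agree on 8 of 10 items, yet kappa is noticeably lower because both raters mark "fail" more often than "pass", which inflates chance agreement. That gap is the reason to report kappa alongside raw agreement.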

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark with no derivation chain.

Full rationale

The paper constructs a new test set (2,000 samples, 17,741 rubric criteria) via human-agent collaboration and reports direct empirical measurements of existing models' Exact Match Rates. No equations, parameter fits, predictions derived from inputs, or self-citation chains are present. The evaluation is self-contained against external models and does not reduce any claimed result to its own construction by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces a new benchmark and evaluation rubric without introducing fitted parameters, new physical entities, or non-standard mathematical axioms; it relies on standard practices of dataset curation and multi-criteria scoring.

pith-pipeline@v0.9.1-grok · 5980 in / 1126 out tokens · 29453 ms · 2026-06-27T20:59:54.729794+00:00 · methodology

Reference graph

Works this paper leans on

31 extracted references · 6 linked inside Pith

  1. [1]

    Nano Banana 2: Google’s latest AI image generation model, 2026

    Google DeepMind. Nano Banana 2: Google’s latest AI image generation model, 2026. URL https://blog.google/innovation-and-ai/technology/ai/nano-banana-2/

  2. [2]

    Gemini Omni: Native multimodal generation and video model, 2026

    Google DeepMind. Gemini Omni: Native multimodal generation and video model, 2026. URL https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-omni/

  3. [3]

    Guiding audio editing with audio language model

    Zitong Lan, Yiduo Hao, and Mingmin Zhao. Guiding audio editing with audio language model. Proc. NeurIPS, 2025

  4. [4]

    MMEDIT: A Unified Framework for Multi-Type Audio Editing via Audio Language Model

    Ye Tao, Wen Wu, Chao Zhang, Mengyue Wu, Shuai Wang, and Xuenan Xu. MMEDIT: A Unified Framework for Multi-Type Audio Editing via Audio Language Model. arXiv preprint arXiv:2512.20339, 2025

  5. [5]

    Ming-UniAudio: Speech LLM for Joint Understanding, Generation and Editing with Unified Representation

    Canxiang Yan, Chunxiang Jin, Dawei Huang, Haibing Yu, Han Peng, Hui Zhan, Jie Gao, Jing Peng, Jingdong Chen, Jun Zhou, et al. Ming-UniAudio: Speech LLM for Joint Understanding, Generation and Editing with Unified Representation. arXiv preprint arXiv:2511.05516, 2025

  6. [6]

    Step-Audio-EditX Technical Report

    Chao Yan, Boyong Wu, Peng Yang, Pengfei Tan, Guoqiang Hu, Li Xie, Yuxin Zhang, Fei Tian, Xuerui Yang, Xiangyu Zhang, et al. Step-Audio-EditX Technical Report. arXiv preprint arXiv:2511.03601, 2025

  7. [7]

    AudioChat: Unified Audio Storytelling, Editing, and Understanding with Transfusion Forcing

    William Chen, Prem Seetharaman, Rithesh Kumar, Oriol Nieto, Shinji Watanabe, Justin Salamon, and Zeyu Jin. AudioChat: Unified Audio Storytelling, Editing, and Understanding with Transfusion Forcing. arXiv preprint arXiv:2602.17097, 2026

  8. [8]

    Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing

    Zeyue Tian, Binxin Yang, Zhaoyang Liu, Jiexuan Zhang, Ruibin Yuan, Hubery Yin, Qifeng Chen, Chen Li, Jing Lv, Wei Xue, et al. Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing. Proc. SIGGRAPH, 2026

  9. [9]

    Voicecraft: Zero-shot speech editing and text-to-speech in the wild

    Puyuan Peng, Po-Yao Huang, Shang-Wen Li, Abdelrahman Mohamed, and David Harwath. Voicecraft: Zero-shot speech editing and text-to-speech in the wild. In Proc. ACL, 2024

  10. [10]

    Rubrics as rewards: Reinforcement learning beyond verifiable domains

    Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Yunzhong He, Bing Liu, and Sean Hendryx. Rubrics as rewards: Reinforcement learning beyond verifiable domains. arXiv preprint arXiv:2507.17746, 2025

  11. [11]

    The interspeech 2026 audio reasoning challenge: Evaluating reasoning process quality for audio reasoning models and agents

    Ziyang Ma, Ruiyang Xu, Yinghao Ma, Chao-Han Huck Yang, Bohan Li, Jaeyeon Kim, Jin Xu, Jinyu Li, Carlos Busso, Kai Yu, et al. The interspeech 2026 audio reasoning challenge: Evaluating reasoning process quality for audio reasoning models and agents. arXiv preprint arXiv:2602.14224, 2026

  12. [12]

    Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling

    Xuehai Bai, Yang Shi, Yi-Fan Zhang, Xuanyu Zhu, Yuran Wang, Yifan Dai, Xinyu Liu, Yiyan Ji, Xiaoling Gu, and Yuanxing Zhang. Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling. arXiv preprint arXiv:2605.13062, 2026

  13. [13]

    Fluentspeech: Stutter-oriented automatic speech editing with context-aware diffusion models

    Ziyue Jiang, Qian Yang, Jialong Zuo, Zhenhui Ye, Rongjie Huang, Yi Ren, and Zhou Zhao. Fluentspeech: Stutter-oriented automatic speech editing with context-aware diffusion models. In Proc. ACL, 2023

  14. [14]

    Ssr-speech: Towards stable, safe and robust zero-shot text-based speech editing and synthesis

    Helin Wang, Meng Yu, Jiarui Hai, Chen Chen, Yuchen Hu, Rilin Chen, Najim Dehak, and Dong Yu. Ssr-speech: Towards stable, safe and robust zero-shot text-based speech editing and synthesis. In Proc. ICASSP, 2025

  15. [15]

    Recomposer: Event-roll-guided generative audio editing

    Daniel PW Ellis, Eduardo Fonseca, Ron J Weiss, Kevin Wilson, Scott Wisdom, Hakan Erdogan, John R Hershey, Aren Jansen, R Channing Moore, and Manoj Plakal. Recomposer: Event-roll-guided generative audio editing. arXiv preprint arXiv:2509.05256, 2025

  16. [16]

    Prompt-guided precise audio editing with diffusion models

    Manjie Xu, Chenxing Li, Dan Su, Wei Liang, Dong Yu, et al. Prompt-guided precise audio editing with diffusion models. Proc. ICML, 2024

  17. [17]

    Zero-shot unsupervised and text-based audio editing using DDPM inversion

    Hila Manor and Tomer Michaeli. Zero-shot unsupervised and text-based audio editing using DDPM inversion. Proc. ICML, 2024

  18. [18]

    Instructspeech: Following speech editing instructions via large language models

    Rongjie Huang, Ruofan Hu, Yongqi Wang, Zehan Wang, Xize Cheng, Ziyue Jiang, Zhenhui Ye, Dongchao Yang, Luping Liu, Peng Gao, et al. Instructspeech: Following speech editing instructions via large language models. In Proc. ICML, 2024

  19. [19]

    WavCraft: Audio editing and generation with large language models

    Jinhua Liang, Huan Zhang, Haohe Liu, Yin Cao, Qiuqiang Kong, Xubo Liu, Wenwu Wang, Mark D Plumbley, Huy Phan, and Emmanouil Benetos. WavCraft: Audio editing and generation with large language models. arXiv preprint arXiv:2403.09527, 2024

  20. [20]

    Audit: Audio editing by following instructions with latent diffusion models

    Yuancheng Wang, Zeqian Ju, Xu Tan, Lei He, Zhizheng Wu, Jiang Bian, et al. Audit: Audio editing by following instructions with latent diffusion models. Proc. NeurIPS, 2023

  21. [21]

    Audioeditor: A training-free diffusion-based audio editing framework

    Yuhang Jia, Yang Chen, Jinghua Zhao, Shiwan Zhao, Wenjia Zeng, Yong Chen, and Yong Qin. Audioeditor: A training-free diffusion-based audio editing framework. In Proc. ICASSP, 2025

  22. [22]

    AudioMorphix: Training-free audio editing with diffusion probabilistic models

    Jinhua Liang, Yuanzhe Chen, Yi Yuan, Dongya Jia, Xiaobin Zhuang, Zhuo Chen, Yuping Wang, and Yuxuan Wang. AudioMorphix: Training-free audio editing with diffusion probabilistic models. arXiv preprint arXiv:2505.16076, 2025

  23. [23]

    VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing

    Zhisheng Zheng, Puyuan Peng, Anuj Diwan, Cong Phuoc Huynh, Xiaohang Sun, Zhu Liu, Vimal Bhat, and David Harwath. VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing. In Proc. EMNLP, 2025

  24. [24]

    CosyEdit: Unlocking End-to-End Speech Editing Capability from Zero-Shot Text-to-Speech Models

    Junyang Chen, Yuhang Jia, Hui Wang, Jiaming Zhou, Yaxin Han, Mengying Feng, and Yong Qin. CosyEdit: Unlocking End-to-End Speech Editing Capability from Zero-Shot Text-to-Speech Models. arXiv preprint arXiv:2601.05329, 2026

  25. [25]

    Instructav2av: Instruction-guided audio-video joint editing

    Haojie Zheng, Yixin Yang, Siqi Yang, Shuchen Weng, and Boxin Shi. Instructav2av: Instruction-guided audio-video joint editing. arXiv preprint arXiv:2605.18467, 2026

  26. [26]

    SpongeBob: Sync-Aware Harmonious Audio-Visual Generative Editing

    Sen Liang, Cong Wang, Fengbin Guan, Zhentao Yu, Yiting Lu, Yuanzhi Wang, Yuan Zhou, Xin Li, and Zhibo Chen. SpongeBob: Sync-Aware Harmonious Audio-Visual Generative Editing. arXiv preprint arXiv:2605.25193, 2026

  27. [27]

    Flam: Frame-wise language-audio modeling

    Yusong Wu, Christos Tsirigotis, Ke Chen, Cheng-Zhi Anna Huang, Aaron Courville, Oriol Nieto, Prem Seetharaman, and Justin Salamon. Flam: Frame-wise language-audio modeling. Proc. ICML, 2025

  28. [28]

    Omni-captioner: Data pipeline, models, and benchmark for omni detailed perception

    Ziyang Ma, Ruiyang Xu, Zhenghao Xing, Yunfei Chu, Yuxuan Wang, Jinzheng He, Jin Xu, Pheng-Ann Heng, Kai Yu, Junyang Lin, et al. Omni-captioner: Data pipeline, models, and benchmark for omni detailed perception. Proc. ICLR, 2026

  29. [29]

    Qwen3-omni technical report

    Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, et al. Qwen3-omni technical report. arXiv preprint arXiv:2509.17765, 2025

  30. [30]

    Qwen3.5-omni technical report

    Qwen Team. Qwen3.5-omni technical report. arXiv preprint arXiv:2604.15804, 2026

  31. [31]

    Introducing Gemini 2.0: our new AI model for the agentic era, 2024

    Google DeepMind. Introducing Gemini 2.0: our new AI model for the agentic era, 2024. URL https://blog.google/innovation-and-ai/models-and-research/google-deepmind/google-gemini-ai-update-december-2024/