MMAE: A Massive Multitask Audio Editing Benchmark
Pith reviewed 2026-06-27 20:59 UTC · model grok-4.3
The pith
A new benchmark shows audio editing models achieve exact match rates below 5 percent on instruction-based editing tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MMAE is presented as the first broad testbed for general audio editing, covering multiple modalities, task complexities from basic changes to multi-hop reasoning, and operation types. Through its rubric framework that turns free-form instructions into verifiable criteria, the work finds that existing models deliver exact match rates consistently below 5 percent and reach absolute zero on complex mixed-modality tasks, showing they cannot yet produce reliable edits.
What carries the argument
The rubric-based evaluation framework that decomposes free-form tasks into 17,741 verifiable criteria to assess instruction following and context consistency.
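To make the gap between criterion-level and sample-level scoring concrete, here is a minimal Python sketch. The summary above does not spell out the formula, so the sketch assumes that a sample counts as an exact match only when every one of its criteria passes. The function names and toy verdicts are illustrative and are not taken from the paper.

```python
from typing import Sequence


def exact_match_rate(results: Sequence[Sequence[bool]]) -> float:
    """Fraction of samples whose criteria ALL pass (assumed EMR definition)."""
    if not results:
        raise ValueError("need at least one sample")
    return sum(all(sample) for sample in results) / len(results)


def criterion_pass_rate(results: Sequence[Sequence[bool]]) -> float:
    """Fraction of individual criteria that pass, pooled over all samples."""
    total = sum(len(sample) for sample in results)
    if total == 0:
        raise ValueError("need at least one criterion")
    return sum(sum(sample) for sample in results) / total


# Toy data (hypothetical): each inner list holds the pass/fail verdicts
# for the criteria of one benchmark sample.
toy_results = [
    [True, True, True],         # every criterion passes -> exact match
    [True, False, True, True],  # one miss -> not an exact match
    [True, True, False],        # one miss -> not an exact match
    [False, True],              # one miss -> not an exact match
]

emr = exact_match_rate(toy_results)
cpr = criterion_pass_rate(toy_results)
print(f"EMR = {emr:.2f}, criterion pass rate = {cpr:.2f}")
```

In this toy case three quarters of the criteria pass but only one sample in four is an exact match, which is the kind of gap an all-or-nothing metric can produce.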
If this is right
- Precise execution remains a critical bottleneck for current audio editing models.
- Structural robustness is insufficient when handling mixed modalities and multi-round edits.
- The benchmark supplies a standardized way to diagnose weaknesses and track progress.
- Multi-dimensional scoring separates instruction adherence from audio context consistency.
- Complex tasks expose gaps that simpler single-modality tests miss.
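The separation of instruction adherence from context consistency can be sketched as a grouped pass rate. The dimension labels, the `(dimension, passed)` tuple layout, and the verdicts below are assumptions made for illustration; the paper's actual data format is not described here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical verdicts for one edited clip: (dimension, passed).
verdicts = [
    ("instruction_following", True),
    ("instruction_following", False),
    ("context_consistency", True),
    ("context_consistency", True),
    ("instruction_following", True),
]


def pass_rate_by_dimension(items: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return {dimension: fraction of that dimension's criteria that passed}."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for dimension, ok in items:
        total[dimension] += 1
        passed[dimension] += int(ok)
    return {dim: passed[dim] / total[dim] for dim in total}


scores = pass_rate_by_dimension(verdicts)
for dim, rate in scores.items():
    print(f"{dim}: {rate:.2f}")
```

Reporting the two rates side by side shows whether a model failed by ignoring the instruction or by damaging audio it was supposed to leave alone.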
Where Pith is reading between the lines
- The task taxonomy could guide creation of targeted training data for audio models.
- Similar rubric methods might apply to video or image editing benchmarks to compare reliability across media.
- Persistent low scores suggest that simply scaling current models may not close the gap without new mechanisms for audio reasoning.
- Widespread adoption could shift development focus toward measurable instruction following rather than perceptual quality alone.
Load-bearing premise
The 2,000 samples curated through human-agent collaboration and the decomposition into 17,741 criteria provide an unbiased and complete measure of real-world instruction-following performance.
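A quick piece of arithmetic shows why rubric granularity matters for this premise. The two totals come from the abstract. Everything else is an illustrative assumption: the sketch treats criteria as independent with a common pass probability, which real criteria almost certainly are not, and it uses the mean criteria count as if it applied to every sample.

```python
TOTAL_CRITERIA = 17_741  # reported in the abstract
TOTAL_SAMPLES = 2_000    # reported in the abstract

mean_criteria = TOTAL_CRITERIA / TOTAL_SAMPLES


def all_pass_probability(per_criterion_pass: float, n_criteria: float) -> float:
    """P(every criterion passes) under the ASSUMED independence model."""
    return per_criterion_pass ** n_criteria


print(f"mean criteria per sample: {mean_criteria:.2f}")
for p in (0.5, 0.7, 0.9):
    prob = all_pass_probability(p, mean_criteria)
    print(f"per-criterion pass {p:.1f} -> all-pass probability {prob:.3f}")
```

Under these simplifying assumptions, a model that satisfies 70 percent of individual criteria would land at an all-pass rate of roughly 4 percent, so a low exact match rate alone does not say how close the edits came.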
What would settle it
A model that records an exact match rate above 20 percent on the complex mixed-modality tasks would show the reported performance shortfalls are overstated.
Original abstract
We introduce MMAE, a Massive Multitask Audio Editing benchmark, serving as the first comprehensive evaluation testbed designed for general-purpose instruction-based audio editing. Spurred by the shift toward intelligent creation, interactive editing has rapidly expanded from visual domains, pioneered by models like Nano-banana 2 for images and Gemini-Omni for video, into audio. However, the current evaluation infrastructure lags severely, remaining highly fragmented and restricted to specific subdomains or basic operations. Unlike existing benchmarks that are limited in scope, MMAE extends to a broad spectrum of real-world scenarios, encompassing 7 distinct audio modalities, including sound, speech, music, and their mixtures. Furthermore, we establish a comprehensive taxonomy spanning 6 levels of task complexity, from basic modifications to multi-hop reasoning and multi-round editing, 2 levels of granularity, and 8 distinct operation types. Meticulously curated through human-agent collaboration, MMAE comprises 2,000 high-fidelity samples paired with a pioneering rubric-based evaluation framework. By decomposing free-form tasks into 17,741 verifiable criteria, this robust rubric-based paradigm enables a precise, multi-dimensional assessment of both instruction following and context consistency. Our extensive evaluation of leading models reveals that current systems remain far from achieving reliable edits. Strikingly, the Exact Match Rate (EMR) consistently falls below 5% and plummets to an absolute 0% in complex, mixed-modality tasks, exposing critical bottlenecks in precise execution and structural robustness. We hope MMAE will serve as a catalyst for future advances in the intelligent creation community, providing a clear diagnostic roadmap and establishing a standardized, long-lasting evaluation paradigm for next-generation audio editing systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MMAE, which it presents as the first comprehensive benchmark for general-purpose instruction-based audio editing. It covers 7 audio modalities spanning sound, speech, music, and their mixtures, with a taxonomy of 6 complexity levels, 2 granularities, and 8 operation types. The benchmark contains 2,000 samples curated via human-agent collaboration and evaluated via a rubric that decomposes tasks into 17,741 verifiable criteria. Evaluation of leading models shows Exact Match Rate (EMR) consistently below 5% and 0% on complex mixed-modality tasks, and the paper concludes that current systems are far from reliable audio editing.
Significance. If the benchmark construction and scoring are shown to be reliable, MMAE would provide a much-needed standardized, multi-dimensional testbed that exposes clear capability gaps in instruction-following and structural robustness for audio editing models, serving as a diagnostic roadmap for the field.
major comments (2)
- [Curation and rubric construction] In the methods section describing the 2,000 samples and 17,741 criteria, no inter-annotator agreement statistics, bias audits, or validation of the criteria against independent human judgments are reported. This directly undermines the central claim that EMR < 5% (and 0% on mixed-modality tasks) demonstrates model limitations rather than possible artifacts of rubric granularity or sample selection.
- [Evaluation protocol] In the section on rubric-based scoring and EMR computation, insufficient detail is given on how model outputs are matched against the 17,741 criteria, how partial matches or context consistency are quantified, or whether automated and human scoring agree. These omissions are load-bearing for the reported performance numbers and the conclusion about current systems.
minor comments (2)
- [Taxonomy] The taxonomy of 6 complexity levels, 2 granularities, and 8 operation types would be clearer if summarized in a single table rather than described only in prose.
- [Figures] Figure captions for the modality and complexity distributions should explicitly state the sample counts per category to allow readers to assess balance.
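The per-category counts requested in the minor comments are straightforward to tabulate from sample metadata. The sketch below is hypothetical: the field names `modality` and `complexity` and the five records are placeholders, and the final grid size assumes the four taxonomy axes are fully crossed, which the abstract does not state.

```python
from collections import Counter

# Hypothetical per-sample metadata; field names and values are placeholders.
samples = [
    {"modality": "speech", "complexity": 1},
    {"modality": "speech", "complexity": 2},
    {"modality": "music", "complexity": 1},
    {"modality": "sound", "complexity": 1},
    {"modality": "speech", "complexity": 1},
]

by_modality = Counter(s["modality"] for s in samples)
by_cell = Counter((s["modality"], s["complexity"]) for s in samples)

for modality, count in sorted(by_modality.items()):
    print(f"{modality:<8} n = {count}")

# Axis sizes from the abstract: 7 modalities, 6 complexity levels,
# 2 granularities, 8 operation types. Full crossing is an ASSUMPTION.
grid_cells = 7 * 6 * 2 * 8
print(f"taxonomy cells: {grid_cells}, "
      f"mean samples per cell: {2000 / grid_cells:.2f}")
```

If the axes were fully crossed, 2,000 samples would average about three per cell, which is why explicit counts per category help readers judge balance.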
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below, agreeing that additional reporting is warranted to strengthen the benchmark's validity, and commit to revisions accordingly.
- Referee: [Curation and rubric construction] In the methods section describing the 2,000 samples and 17,741 criteria, no inter-annotator agreement statistics, bias audits, or validation of the criteria against independent human judgments are reported. This directly undermines the central claim that EMR < 5% (and 0% on mixed-modality tasks) demonstrates model limitations rather than possible artifacts of rubric granularity or sample selection.
  Authors: We agree that these elements are important for establishing benchmark reliability. Although not reported in the initial submission, the human-agent collaboration involved structured review processes. In the revised manuscript, we will add inter-annotator agreement statistics (e.g., Cohen's kappa) computed over a sampled subset of the 2,000 tasks. We will include a bias audit subsection describing source diversity, modality balancing, and the guidelines used to reduce selection bias. We will also report a post-hoc validation study in which independent human judges assessed the alignment of a random sample of criteria with task intent. These additions will directly support the robustness of the EMR results.
  Revision: yes
- Referee: [Evaluation protocol] In the section on rubric-based scoring and EMR computation, insufficient detail is given on how model outputs are matched against the 17,741 criteria, how partial matches or context consistency are quantified, or whether automated and human scoring agree. These omissions are load-bearing for the reported performance numbers and the conclusion about current systems.
  Authors: We concur that greater transparency on the scoring mechanics is needed. The revised manuscript will expand the evaluation protocol section with a precise description of the criterion-matching procedure, including the hierarchical decomposition used for the 17,741 criteria and the rules for handling partial matches (via proportional credit assignment to sub-criteria). Context consistency checks will be formalized, and we will report agreement metrics (e.g., percentage agreement and Cohen's kappa) between the automated scorer and human evaluators on a held-out set of 200 model outputs. This will provide the necessary validation for the automated EMR computation.
  Revision: yes
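The agreement metrics named in the responses can be computed in a few lines. The sketch below implements percentage agreement and Cohen's kappa for two raters from their standard definitions. The label sequences are invented for illustration, and nothing here reflects the authors' actual scoring pipeline.

```python
from collections import Counter
from typing import Hashable, Sequence


def percent_agreement(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Share of items on which the two raters give the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label sequences of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_observed = percent_agreement(a, b)
    freq_a, freq_b = Counter(a), Counter(b)
    # Chance agreement from each rater's marginal label frequencies.
    p_expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical verdicts on ten criteria: automated scorer vs. human judge.
auto = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "fail", "fail", "fail"]
human = ["pass", "fail", "fail", "fail", "pass", "fail", "pass", "pass", "fail", "fail"]

po = percent_agreement(auto, human)
kappa = cohens_kappa(auto, human)
print(f"percent agreement = {po:.2f}, Cohen's kappa = {kappa:.3f}")
```

Reporting kappa alongside raw agreement matters when most criteria fail, as the low EMR suggests they often do, because raters who both label almost everything "fail" would agree often by chance alone.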
Circularity Check
No significant circularity; empirical benchmark with no derivation chain.
The paper constructs a new test set (2,000 samples, 17,741 rubric criteria) via human-agent collaboration and reports direct empirical measurements of existing models' Exact Match Rates. No equations, parameter fits, predictions derived from inputs, or self-citation chains are present. The evaluation is self-contained against external models and does not reduce any claimed result to its own construction by definition.