MMAE: A Massive Multitask Audio Editing Benchmark

Auden; Binghao Qiang; Chen Yang; Eng-Siong Chng; Guanrou Yang; Haina Zhu; Haotian Zhang; Jiaying Chi; Jie Fang; Junxi Liu

arxiv: 2606.07229 · v1 · pith:22RWTZZ2new · submitted 2026-06-05 · 💻 cs.SD · cs.CL · cs.MM

Ziyang Ma, Ruiqi Yan, Ruiyang Xu, Jie Fang, Zhikang Niu, Yi-Wen Chao, Wenming Tu, Tianrui Wang, Auden, Qi Chen, Wenxi Chen, Jiaying Chi, Yanru Huo, Zixuan Jiang, Xiquan Li, Yalin Li, Junxi Liu, Minghao Liu, Binghao Qiang, Yijia Shan, Zheshu Song, Tian Tan, Zixiang Wang, Zeyu Xie, Zhifei Xie, Xiaoyu Xing, Qixiang Xu, Chen Yang, Guanrou Yang, Shan Yang, Yifan Yang, Steve Yves, Haotian Zhang, Haina Zhu, Kai Yu, Liefeng Bo, Eng-Siong Chng, Xie Chen

Pith reviewed 2026-06-27 20:59 UTC · model grok-4.3

classification 💻 cs.SD · cs.CL · cs.MM

keywords audio editing benchmark · instruction following · multitask evaluation · rubric criteria · multimodal audio · model performance · sound speech music

The pith

A new benchmark shows audio editing models achieve exact match rates below 5 percent on instruction tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MMAE as a benchmark for instruction-based audio editing that spans seven modalities including sound, speech, music and mixtures. It organizes tasks into six complexity levels and uses 2000 samples broken down into 17741 rubric criteria to measure how well models follow instructions while keeping audio context consistent. Tests of leading models find exact match rates below 5 percent overall and zero percent on complex mixed-modality cases. These results indicate that current systems fall short on precise execution for real-world audio edits.

Core claim

MMAE is presented as the first broad testbed for general audio editing, covering multiple modalities, task complexities from basic changes to multi-hop reasoning, and operation types. Through its rubric framework, which turns free-form instructions into verifiable criteria, the work finds that existing models deliver exact match rates consistently below 5 percent and fall to 0 percent on complex mixed-modality tasks, showing they cannot yet produce reliable edits.

What carries the argument

The rubric-based evaluation framework that decomposes free-form tasks into 17741 verifiable criteria to assess instruction following and context consistency.
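
To make the rubric idea concrete, here is a minimal Python sketch of how an exact match rate could be derived from per-criterion verdicts. It assumes that a sample counts as an exact match only when every one of its criteria passes; the abstract does not spell out this rule, so treat it as an assumption. The `Criterion` structure, the dimension names, and the toy data are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One verifiable rubric item (hypothetical structure, not the paper's schema)."""
    description: str
    dimension: str  # assumed labels: "instruction_following" or "context_consistency"
    passed: bool


def sample_exact_match(criteria: list[Criterion]) -> bool:
    """Assumed rule: a sample matches only if all of its criteria pass."""
    return len(criteria) > 0 and all(c.passed for c in criteria)


def exact_match_rate(samples: list[list[Criterion]]) -> float:
    """Fraction of samples for which every criterion passes."""
    if not samples:
        return 0.0
    return sum(sample_exact_match(s) for s in samples) / len(samples)


def criterion_pass_rate(samples: list[list[Criterion]], dimension=None) -> float:
    """Fraction of individual criteria that pass, optionally for one dimension."""
    pool = [c for s in samples for c in s
            if dimension is None or c.dimension == dimension]
    return sum(c.passed for c in pool) / len(pool) if pool else 0.0


# Toy data, invented for illustration only.
toy = [
    [Criterion("dog bark removed", "instruction_following", True),
     Criterion("speech left unchanged", "context_consistency", True)],
    [Criterion("piano pitch raised", "instruction_following", True),
     Criterion("drums left unchanged", "context_consistency", False)],
]

emr = exact_match_rate(toy)      # one of two samples passes every criterion
cpr = criterion_pass_rate(toy)   # three of four criteria pass
print(f"EMR={emr:.2f}, criterion pass rate={cpr:.2f}")
```

Under this assumed rule the two numbers can diverge sharply: a model may satisfy most criteria while almost never satisfying all of them for a given sample.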

If this is right

Precise execution remains a critical bottleneck for current audio editing models.
Structural robustness is insufficient when handling mixed modalities and multi-round edits.
The benchmark supplies a standardized way to diagnose weaknesses and track progress.
Multi-dimensional scoring separates instruction adherence from audio context consistency.
Complex tasks expose gaps that simpler single-modality tests miss.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The task taxonomy could guide creation of targeted training data for audio models.
Similar rubric methods might apply to video or image editing benchmarks to compare reliability across media.
Persistent low scores suggest that simply scaling current models may not close the gap without new mechanisms for audio reasoning.
Widespread adoption could shift development focus toward measurable instruction following rather than perceptual quality alone.

Load-bearing premise

The 2000 samples curated through human-agent collaboration and the decomposition into 17741 criteria provide an unbiased and complete measure of real-world instruction-following performance.

What would settle it

A model that records an exact match rate above 20 percent on the complex mixed-modality tasks would show the reported performance shortfalls are overstated.

Original abstract

We introduce MMAE, a Massive Multitask Audio Editing benchmark, serving as the first comprehensive evaluation testbed designed for general-purpose instruction-based audio editing. Spurred by the shift toward intelligent creation, interactive editing has rapidly expanded from visual domains, pioneered by models like Nano-banana 2 for images and Gemini-Omni for video, into audio. However, the current evaluation infrastructure lags severely, remaining highly fragmented and restricted to specific subdomains or basic operations. Unlike existing benchmarks that are limited in scope, MMAE extends to a broad spectrum of real-world scenarios, encompassing 7 distinct audio modalities, including sound, speech, music, and their mixtures. Furthermore, we establish a comprehensive taxonomy spanning 6 levels of task complexity, from basic modifications to multi-hop reasoning and multi-round editing, 2 levels of granularity, and 8 distinct operation types. Meticulously curated through human-agent collaboration, MMAE comprises 2,000 high-fidelity samples paired with a pioneering rubric-based evaluation framework. By decomposing free-form tasks into 17,741 verifiable criteria, this robust rubric-based paradigm enables a precise, multi-dimensional assessment of both instruction following and context consistency. Our extensive evaluation of leading models reveals that current systems remain far from achieving reliable edits. Strikingly, the Exact Match Rate (EMR) consistently falls below 5% and plummets to an absolute 0% in complex, mixed-modality tasks, exposing critical bottlenecks in precise execution and structural robustness. We hope MMAE will serve as a catalyst for future advances in the intelligent creation community, providing a clear diagnostic roadmap and establishing a standardized, long-lasting evaluation paradigm for next-generation audio editing systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MMAE builds a broad audio editing benchmark with a rubric framework, but the low model scores rest on unvalidated curation and scoring details.

The paper introduces a benchmark covering 7 audio modalities, 6 complexity levels, 2 granularities, and 8 operation types, with 2000 samples and a rubric that splits instructions into 17741 criteria. It finds existing models score below 5% exact match rate, hitting zero on complex mixed-modality cases.

What stands out is the attempt to move past narrow prior benchmarks by creating one testbed for general instruction-based editing. The taxonomy and human-agent curation process give a structured way to handle free-form tasks that earlier work did not cover at this scale.

The soft spot is the missing validation. The abstract describes the curation and rubric but reports no inter-annotator agreement, no checks against human edit judgments, and no details on how outputs were scored against the criteria. If the rubrics turn out overly granular or the sample selection favors certain failure modes, the headline numbers could partly reflect benchmark design rather than model limits.
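
The granularity concern can be illustrated with back-of-the-envelope arithmetic. The sketch below uses only the two counts reported in the abstract (17,741 criteria over 2,000 samples) and adds two simplifying assumptions that are not from the paper: that an exact match requires every criterion to pass, and that criteria pass independently with the same probability `p`. Real criteria are almost certainly correlated, so this is a rough intuition pump rather than a model of the benchmark.

```python
TOTAL_CRITERIA = 17_741   # from the abstract
TOTAL_SAMPLES = 2_000     # from the abstract

# Average number of rubric criteria attached to one sample.
mean_criteria = TOTAL_CRITERIA / TOTAL_SAMPLES


def emr_under_independence(p: float, n: float = mean_criteria) -> float:
    """Chance that all n criteria pass if each passes independently with probability p.

    Independence and a shared p are illustrative assumptions only.
    """
    return p ** n


def pass_rate_for_target_emr(target: float, n: float = mean_criteria) -> float:
    """Per-criterion pass rate at which the all-pass probability equals `target`."""
    return target ** (1.0 / n)


print(f"mean criteria per sample: {mean_criteria:.2f}")
for p in (0.5, 0.7, 0.9):
    print(f"p={p:.1f} -> all-pass probability {emr_under_independence(p):.3f}")

threshold = pass_rate_for_target_emr(0.05)
print(f"per-criterion pass rate giving a 5% all-pass probability: {threshold:.3f}")
```

With about 8.9 criteria per sample, this toy model puts the 5 percent mark at a per-criterion pass rate of about 0.71. That is why per-criterion pass rates reported alongside EMR would help separate model limits from rubric granularity.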

This is for researchers building or testing audio editing models who need a diagnostic tool for instruction following. A reader focused on evaluation methods in generative audio would find the taxonomy and rubric approach useful once the reliability questions are addressed.

It deserves peer review. The scope is new and the problem matters, but the authors need to supply the missing validation numbers before the performance claims can be read as firm.

Referee Report

2 major / 2 minor

Summary. The paper introduces MMAE, which it presents as the first comprehensive benchmark for general-purpose instruction-based audio editing. It covers 7 audio modalities (including sound, speech, music, and their mixtures) and a taxonomy of 6 complexity levels, 2 granularities, and 8 operation types. The benchmark contains 2,000 samples curated via human-agent collaboration and is evaluated with a rubric that decomposes tasks into 17,741 verifiable criteria. An evaluation of leading models shows an Exact Match Rate (EMR) below 5% overall and 0% on complex mixed-modality tasks, and the paper concludes that current systems are far from reliable audio editing.

Significance. If the benchmark construction and scoring are shown to be reliable, MMAE would provide a much-needed standardized, multi-dimensional testbed that exposes clear capability gaps in instruction-following and structural robustness for audio editing models, serving as a diagnostic roadmap for the field.

major comments (2)

[Curation and rubric construction] Curation and rubric construction (methods section describing the 2,000 samples and 17,741 criteria): no inter-annotator agreement statistics, bias audits, or validation of the criteria against independent human judgments are reported. This directly undermines the central claim that EMR <5% (and 0% on mixed-modality tasks) demonstrates model limitations rather than possible artifacts in rubric granularity or selection.
[Evaluation protocol] Evaluation protocol (section on rubric-based scoring and EMR computation): insufficient detail is given on how model outputs are matched against the 17,741 criteria, how partial matches or context consistency are quantified, or any agreement metrics between automated and human scoring. These omissions are load-bearing for the reported performance numbers and the conclusion about current systems.

minor comments (2)

[Taxonomy] The taxonomy of 6 complexity levels, 2 granularities, and 8 operation types would be clearer if summarized in a single table rather than described only in prose.
[Figures] Figure captions for the modality and complexity distributions should explicitly state the sample counts per category to allow readers to assess balance.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, agreeing that additional reporting is warranted to strengthen the benchmark's validity, and commit to revisions accordingly.

Referee: [Curation and rubric construction] Curation and rubric construction (methods section describing the 2,000 samples and 17,741 criteria): no inter-annotator agreement statistics, bias audits, or validation of the criteria against independent human judgments are reported. This directly undermines the central claim that EMR <5% (and 0% on mixed-modality tasks) demonstrates model limitations rather than possible artifacts in rubric granularity or selection.

Authors: We agree that these elements are important for establishing benchmark reliability. Although not reported in the initial submission, the human-agent collaboration involved structured review processes. In the revised manuscript, we will add inter-annotator agreement statistics (e.g., Cohen's kappa) computed over a sampled subset of the 2,000 tasks. We will include a bias audit subsection describing source diversity, modality balancing, and guidelines used to reduce selection bias. We will also report a post-hoc validation study in which independent human judges assessed alignment of a random sample of criteria with task intent. These additions will directly support the robustness of the EMR results.

Revision: yes

Referee: [Evaluation protocol] Evaluation protocol (section on rubric-based scoring and EMR computation): insufficient detail is given on how model outputs are matched against the 17,741 criteria, how partial matches or context consistency are quantified, or any agreement metrics between automated and human scoring. These omissions are load-bearing for the reported performance numbers and the conclusion about current systems.

Authors: We concur that greater transparency on the scoring mechanics is needed. The revised manuscript will expand the evaluation protocol section with a precise description of the criterion-matching procedure, including the hierarchical decomposition used for the 17,741 criteria and the rules for handling partial matches (via proportional credit assignment to sub-criteria). Context consistency checks will be formalized, and we will report agreement metrics (e.g., percentage agreement and Cohen's kappa) between the automated scorer and human evaluators on a held-out set of 200 model outputs. This will provide the necessary validation for the automated EMR computation.

Revision: yes
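
For readers unfamiliar with the agreement metrics named in the response, the following Python sketch computes percentage agreement and Cohen's kappa between two sets of pass/fail verdicts. The formula is the standard one, kappa = (p_o - p_e) / (1 - p_e). The label vectors are invented toy data and do not come from the paper, and the function names are illustrative.

```python
from collections import Counter


def percent_agreement(a: list, b: list) -> float:
    """Share of items on which the two raters give the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: list, b: list) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    observed = percent_agreement(a, b)
    n = len(a)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(counts_a[label] * counts_b[label]
                   for label in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy verdicts on eight criteria: 1 = criterion passed, 0 = failed.
automated = [1, 1, 0, 0, 1, 0, 1, 0]
human     = [1, 1, 0, 1, 1, 0, 0, 0]

agreement = percent_agreement(automated, human)
kappa = cohens_kappa(automated, human)
print(f"agreement={agreement:.2f}, kappa={kappa:.2f}")
```

Reporting both numbers is useful because raw agreement can look high when one label dominates, while kappa discounts the agreement expected by chance.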

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark with no derivation chain.

The paper constructs a new test set (2,000 samples, 17,741 rubric criteria) via human-agent collaboration and reports direct empirical measurements of existing models' Exact Match Rates. No equations, parameter fits, predictions derived from inputs, or self-citation chains are present. The evaluation is self-contained against external models and does not reduce any claimed result to its own construction by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces a new benchmark and evaluation rubric without introducing fitted parameters, new physical entities, or non-standard mathematical axioms; it relies on standard practices of dataset curation and multi-criteria scoring.

Reference graph

Works this paper leans on

31 extracted references · 6 linked inside Pith

[1] Google DeepMind. Nano Banana 2: Google’s latest AI image generation model, 2026. URL https://blog.google/innovation-and-ai/technology/ai/nano-banana-2/
[2] Google DeepMind. Gemini Omni: Native multimodal generation and video model, 2026. URL https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-omni/
[3] Zitong Lan, Yiduo Hao, and Mingmin Zhao. Guiding audio editing with audio language model. Proc. NeurIPS, 2025
[4] Ye Tao, Wen Wu, Chao Zhang, Mengyue Wu, Shuai Wang, and Xuenan Xu. MMEDIT: A Unified Framework for Multi-Type Audio Editing via Audio Language Model. arXiv preprint arXiv:2512.20339, 2025
[5] Canxiang Yan, Chunxiang Jin, Dawei Huang, Haibing Yu, Han Peng, Hui Zhan, Jie Gao, Jing Peng, Jingdong Chen, Jun Zhou, et al. Ming-UniAudio: Speech LLM for Joint Understanding, Generation and Editing with Unified Representation. arXiv preprint arXiv:2511.05516, 2025
[6] Chao Yan, Boyong Wu, Peng Yang, Pengfei Tan, Guoqiang Hu, Li Xie, Yuxin Zhang, Fei Tian, Xuerui Yang, Xiangyu Zhang, et al. Step-Audio-EditX Technical Report. arXiv preprint arXiv:2511.03601, 2025
[7] William Chen, Prem Seetharaman, Rithesh Kumar, Oriol Nieto, Shinji Watanabe, Justin Salamon, and Zeyu Jin. AudioChat: Unified Audio Storytelling, Editing, and Understanding with Transfusion Forcing. arXiv preprint arXiv:2602.17097, 2026
[8] Zeyue Tian, Binxin Yang, Zhaoyang Liu, Jiexuan Zhang, Ruibin Yuan, Hubery Yin, Qifeng Chen, Chen Li, Jing Lv, Wei Xue, et al. Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing. Proc. SIGGRAPH, 2026
[9] Puyuan Peng, Po-Yao Huang, Shang-Wen Li, Abdelrahman Mohamed, and David Harwath. Voicecraft: Zero-shot speech editing and text-to-speech in the wild. In Proc. ACL, 2024
[10] Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Yunzhong He, Bing Liu, and Sean Hendryx. Rubrics as rewards: Reinforcement learning beyond verifiable domains. arXiv preprint arXiv:2507.17746, 2025
[11] Ziyang Ma, Ruiyang Xu, Yinghao Ma, Chao-Han Huck Yang, Bohan Li, Jaeyeon Kim, Jin Xu, Jinyu Li, Carlos Busso, Kai Yu, et al. The interspeech 2026 audio reasoning challenge: Evaluating reasoning process quality for audio reasoning models and agents. arXiv preprint arXiv:2602.14224, 2026
[12] Xuehai Bai, Yang Shi, Yi-Fan Zhang, Xuanyu Zhu, Yuran Wang, Yifan Dai, Xinyu Liu, Yiyan Ji, Xiaoling Gu, and Yuanxing Zhang. Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling. arXiv preprint arXiv:2605.13062, 2026
[13] Ziyue Jiang, Qian Yang, Jialong Zuo, Zhenhui Ye, Rongjie Huang, Yi Ren, and Zhou Zhao. Fluentspeech: Stutter-oriented automatic speech editing with context-aware diffusion models. In Proc. ACL, 2023
[14] Helin Wang, Meng Yu, Jiarui Hai, Chen Chen, Yuchen Hu, Rilin Chen, Najim Dehak, and Dong Yu. Ssr-speech: Towards stable, safe and robust zero-shot text-based speech editing and synthesis. In Proc. ICASSP, 2025
[15] Daniel PW Ellis, Eduardo Fonseca, Ron J Weiss, Kevin Wilson, Scott Wisdom, Hakan Erdogan, John R Hershey, Aren Jansen, R Channing Moore, and Manoj Plakal. Recomposer: Event-roll-guided generative audio editing. arXiv preprint arXiv:2509.05256, 2025
[16] Manjie Xu, Chenxing Li, Dan Su, Wei Liang, Dong Yu, et al. Prompt-guided precise audio editing with diffusion models. Proc. ICML, 2024
[17] Hila Manor and Tomer Michaeli. Zero-shot unsupervised and text-based audio editing using DDPM inversion. Proc. ICML, 2024
[18] Rongjie Huang, Ruofan Hu, Yongqi Wang, Zehan Wang, Xize Cheng, Ziyue Jiang, Zhenhui Ye, Dongchao Yang, Luping Liu, Peng Gao, et al. Instructspeech: Following speech editing instructions via large language models. In Proc. ICML, 2024
[19] Jinhua Liang, Huan Zhang, Haohe Liu, Yin Cao, Qiuqiang Kong, Xubo Liu, Wenwu Wang, Mark D Plumbley, Huy Phan, and Emmanouil Benetos. WavCraft: Audio editing and generation with large language models. arXiv preprint arXiv:2403.09527, 2024
[20] Yuancheng Wang, Zeqian Ju, Xu Tan, Lei He, Zhizheng Wu, Jiang Bian, et al. Audit: Audio editing by following instructions with latent diffusion models. Proc. NeurIPS, 2023
[21] Yuhang Jia, Yang Chen, Jinghua Zhao, Shiwan Zhao, Wenjia Zeng, Yong Chen, and Yong Qin. Audioeditor: A training-free diffusion-based audio editing framework. In Proc. ICASSP, 2025
[22] Jinhua Liang, Yuanzhe Chen, Yi Yuan, Dongya Jia, Xiaobin Zhuang, Zhuo Chen, Yuping Wang, and Yuxuan Wang. AudioMorphix: Training-free audio editing with diffusion probabilistic models. arXiv preprint arXiv:2505.16076, 2025
[23] Zhisheng Zheng, Puyuan Peng, Anuj Diwan, Cong Phuoc Huynh, Xiaohang Sun, Zhu Liu, Vimal Bhat, and David Harwath. VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing. In Proc. EMNLP, 2025
[24] Junyang Chen, Yuhang Jia, Hui Wang, Jiaming Zhou, Yaxin Han, Mengying Feng, and Yong Qin. CosyEdit: Unlocking End-to-End Speech Editing Capability from Zero-Shot Text-to-Speech Models. arXiv preprint arXiv:2601.05329, 2026
[25] Haojie Zheng, Yixin Yang, Siqi Yang, Shuchen Weng, and Boxin Shi. Instructav2av: Instruction-guided audio-video joint editing. arXiv preprint arXiv:2605.18467, 2026
[26] Sen Liang, Cong Wang, Fengbin Guan, Zhentao Yu, Yiting Lu, Yuanzhi Wang, Yuan Zhou, Xin Li, and Zhibo Chen. SpongeBob: Sync-Aware Harmonious Audio-Visual Generative Editing. arXiv preprint arXiv:2605.25193, 2026
[27] Yusong Wu, Christos Tsirigotis, Ke Chen, Cheng-Zhi Anna Huang, Aaron Courville, Oriol Nieto, Prem Seetharaman, and Justin Salamon. Flam: Frame-wise language-audio modeling. Proc. ICML, 2025
[28] Ziyang Ma, Ruiyang Xu, Zhenghao Xing, Yunfei Chu, Yuxuan Wang, Jinzheng He, Jin Xu, Pheng-Ann Heng, Kai Yu, Junyang Lin, et al. Omni-captioner: Data pipeline, models, and benchmark for omni detailed perception. Proc. ICLR, 2026
[29] Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, et al. Qwen3-omni technical report. arXiv preprint arXiv:2509.17765, 2025
[30] Qwen Team. Qwen3.5-omni technical report. arXiv preprint arXiv:2604.15804, 2026
[31] Google DeepMind. Introducing Gemini 2.0: our new AI model for the agentic era, 2024. URL https://blog.google/innovation-and-ai/models-and-research/google-deepmind/google-gemini-ai-update-december-2024/
