Bagpiper-Edit: Zero-Shot Open-Ended Audio Editing via Rich-Caption
Pith reviewed 2026-06-26 13:15 UTC · model grok-4.3
The pith
Audio editing is recast as rewriting rich captions to allow zero-shot open-ended changes while anchoring to the original audio.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Bagpiper-Edit reformulates audio editing as a rich-caption rewriting task. A rich caption acts as the semantic representation of an audio clip. The user request is translated into an edited caption that guides generation of the target edited audio, with the original audio serving as a contextual acoustic anchor. This enables free-form editing without paired audio-editing training data and supports zero-shot capabilities across domains.
What carries the argument
Rich-caption rewriting, where edited captions condition audio generation anchored by the original audio's acoustic features.
If this is right
- Supports free-form natural language instructions instead of templates.
- Works across speech, music, and general audio without separate pipelines.
- Achieves zero-shot editing by avoiding paired training data requirements.
- Maintains consistency with the original audio while performing comparably to specialized models.
Where Pith is reading between the lines
- The method might generalize to other generation tasks where semantic representations can be edited textually.
- It could inspire similar caption-based editing in video or image domains.
- Reducing reliance on paired data could accelerate development of audio manipulation tools.
Load-bearing premise
A rich caption serves as a sufficient semantic representation of the audio clip so that editing the caption and regenerating produces the desired acoustic changes without losing un-described details.
What would settle it
Finding an audio example where important un-captioned acoustic features are altered or lost after caption rewriting and regeneration, despite the edit request not targeting those features.
Figures
read the original abstract
Current text-guided audio editing methods rely on paired training data, predefined operation templates, and separate processing pipelines across speech, music, and sound. We present Bagpiper-Edit to enable open-ended audio editing via free-form natural language instructions. We reformulate audio editing as a rich-caption rewriting task by treating a rich caption as the semantic representation of an audio clip. The user request is translated into an edited caption, which then guides Bagpiper-Edit to generate the target edited audio with the original audio as contextual acoustic anchor. This unlocks the potential of free-form editing, and circumvents the need for paired audio-editing training data, enabling powerful zero-shot editing capabilities. Evaluations across speech, audio, and free-form editing show Bagpiper-Edit maintains good consistency to the original audio and achieves similar performance to other expert models in most cases. Demo: https://bagpiper-edit.github.io, Codes: https://github.com/espnet/espnet/pull/6417 & https://github.com/HsunGong/espnet
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Bagpiper-Edit, a zero-shot method for open-ended audio editing. It reformulates editing as a rich-caption rewriting task in which a rich caption serves as the semantic representation of an audio clip; user instructions are translated into an edited caption that, together with the original audio as contextual acoustic anchor, guides generation of the target audio. The approach is claimed to support free-form natural-language instructions across speech, music, and environmental sound without requiring paired audio-editing training data. Evaluations are stated to demonstrate good consistency with the original and performance comparable to specialized expert models.
Significance. If the central claims hold, the work offers a unified, training-data-efficient route to open-ended audio editing by composing existing captioning and generation models. The explicit release of code (ESPnet PR and companion repository) and a public demo constitute a clear reproducibility strength. The core modeling choice—treating rich captions as editable semantic proxies while anchoring on the original waveform—addresses a practical bottleneck in the field, provided the caption-plus-anchor mechanism reliably preserves task-relevant acoustic detail.
major comments (2)
- [Abstract / Evaluation section] Abstract and Evaluation section: the statement that 'evaluations across speech, audio, and free-form editing show Bagpiper-Edit maintains good consistency … and achieves similar performance to other expert models in most cases' supplies no quantitative metrics, baseline tables, dataset descriptions, or statistical tests. Because these results are the sole empirical support for the zero-shot claim, their absence prevents assessment of whether the method actually meets the stated performance level.
- [Method section] Method section (rich-caption rewriting and anchor conditioning): the central claim that rewriting the caption and regenerating with the original audio as anchor yields the intended edit rests on the untested assumption that any acoustic property omitted by the captioner (timbre micro-variations, exact reverb tails, non-linguistic textures) is either irrelevant or perfectly recovered by the anchor. No ablation isolating caption-omitted features or measuring drift when the edited caption receives higher weight is reported; this directly bears on whether the zero-shot regime is robust.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the empirical presentation and the core modeling assumptions. We address each major comment below and commit to revisions that strengthen the manuscript without altering its central claims.
read point-by-point responses
-
Referee: [Abstract / Evaluation section] Abstract and Evaluation section: the statement that 'evaluations across speech, audio, and free-form editing show Bagpiper-Edit maintains good consistency … and achieves similar performance to other expert models in most cases' supplies no quantitative metrics, baseline tables, dataset descriptions, or statistical tests. Because these results are the sole empirical support for the zero-shot claim, their absence prevents assessment of whether the method actually meets the stated performance level.
Authors: The referee is correct that the current version presents evaluation outcomes primarily through qualitative descriptions and high-level statements rather than quantitative tables, baseline comparisons, dataset details, or statistical tests. This limits independent assessment of the zero-shot performance claims. In the revised manuscript we will expand the Evaluation section to include quantitative metrics (e.g., consistency and quality scores), comparison tables against expert models, descriptions of the test sets used for speech/music/environmental editing, and any applicable statistical analysis. These additions will be placed in both the main text and supplementary material. revision: yes
-
Referee: [Method section] Method section (rich-caption rewriting and anchor conditioning): the central claim that rewriting the caption and regenerating with the original audio as anchor yields the intended edit rests on the untested assumption that any acoustic property omitted by the captioner (timbre micro-variations, exact reverb tails, non-linguistic textures) is either irrelevant or perfectly recovered by the anchor. No ablation isolating caption-omitted features or measuring drift when the edited caption receives higher weight is reported; this directly bears on whether the zero-shot regime is robust.
Authors: We agree that the robustness of the caption-plus-anchor mechanism would be better supported by explicit ablations. The current manuscript relies on the design rationale and qualitative listening results to argue that the acoustic anchor recovers omitted details, but does not isolate the contribution of caption-omitted features or vary the relative weighting of the edited caption. In revision we will add an ablation subsection that (1) compares performance when selected acoustic attributes are deliberately omitted from the caption and (2) reports results under different caption-versus-anchor weighting schemes. This will directly test the assumption highlighted by the referee. revision: yes
Circularity Check
No circularity; pipeline relies on external models
full rationale
The paper's core claim is a methodological reformulation of audio editing into a rich-caption rewriting task that uses an external captioner to produce a semantic representation, rewrites the caption per user instruction, and regenerates audio via a separate generator conditioned on the original clip as anchor. This construction draws performance from pre-trained external components rather than from any fitted parameter, self-referential equation, or self-citation chain internal to the present work. No equation or derivation step reduces the claimed zero-shot result to its own inputs by construction; the zero-shot property follows directly from the absence of paired editing data in the training pipeline. The provided text contains no load-bearing self-citations or uniqueness theorems imported from the authors' prior work.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Introduction Open-ended audio editing aims to modify an original audio recording according to a natural language user request. Unlike text-to-audio generation, the model must perform targeted op- erations while keeping other content unchanged. This requires identifying which elements to edit and which to preserve, while maintaining overall realism. The ma...
Pith/arXiv arXiv 2026
-
[2]
ZETA [33] and Prompt-Guided-Edit [13] approximate latent inversion per- form text-guided editing but [13] requires explicit token-level temporal alignments
Related work Audio Editing: Prior text-guided audio editing approaches of- ten intervene directly within diffusion frameworks. ZETA [33] and Prompt-Guided-Edit [13] approximate latent inversion per- form text-guided editing but [13] requires explicit token-level temporal alignments. Alternatively, AUDIT [11] binds pre- defined atomic operations (such as a...
-
[3]
While these pipelines improve user flexi- bility, they remain constrained by rigid atomic operations and the dependency on operation-specific paired training data
and AudioChat [23] employ LLMs, as reasoning agents to decompose abstract requests into deterministic sequences of atomic operations. While these pipelines improve user flexi- bility, they remain constrained by rigid atomic operations and the dependency on operation-specific paired training data. In contrast,Bagpiper-Editbypasses these limitations by repl...
-
[4]
Preliminaries and Problem Formulation Bagpiper-Editis built upon the foundational architecture of Bagpiper-Base[30], a unified, autoregressive audio foundation model
Bagpiper-Edit 3.1. Preliminaries and Problem Formulation Bagpiper-Editis built upon the foundational architecture of Bagpiper-Base[30], a unified, autoregressive audio foundation model. At its core, the model utilizes a decoder-only LLM to establish a bidirectional mapping between an audio signalaand its rich captionc, thereby unifying comprehensive audio...
-
[5]
TheBagpiper-Editmodel is built based onBagpiper-Base, and we train two distinct model variantsSTandMTbased on these two patterns in Sec.3.2
Experimental setup Bagpiper-Baseutilizes the Qwen3-8B-Base [36] and multi- stream X-Codec [37] operating at 50Hz, as the audio prediction targets, and is pre-trained on caption-to-audio synthesis, audio- to-caption understanding, and pure text modeling data [30]. TheBagpiper-Editmodel is built based onBagpiper-Base, and we train two distinct model variant...
-
[6]
Speech editing We first analyze transcription editing in Tab.1
Experiments 5.1. Speech editing We first analyze transcription editing in Tab.1. Although pro- vided with original audio,Bagpiper-Basestill causes severe speaker identity loss (SpkSIM = 0.58). Our self-supervised training resolves this: theSTpattern achieves SpkSIM of 0.86, surpassing CosyV oice-3. However, it also results in poor edit- ing accuracy (47.1...
-
[7]
By avoiding rigid predefined operations, it provides a unified, natural language-driven interface for audio editing across speech, sound, and music domains
Conclusion This paper introducesBagpiper-Edit, a novel framework that reformulates open-ended audio editing as a text-space rich- caption rewriting task. By avoiding rigid predefined operations, it provides a unified, natural language-driven interface for audio editing across speech, sound, and music domains. A core con- tribution is our self-supervised t...
-
[8]
U25A20409, and in part by SJTU Med-X (Medicine & Engineering) Translational Research Grant (YG2025LC09)
Acknowledgments This work was supported in part by China NSFC project under Grants No. U25A20409, and in part by SJTU Med-X (Medicine & Engineering) Translational Research Grant (YG2025LC09)
-
[9]
Additionally, as detailed in Sec.4, Large Lan- guage Models (LLMs) and multimodal foundational models were employed for evaluation
Generative AI Use Disclosure Generative AI tools were utilized in the preparation of this manuscript primarily for stylistic polishing and grammatical re- finement; they were not used to author significant original tech- nical content. Additionally, as detailed in Sec.4, Large Lan- guage Models (LLMs) and multimodal foundational models were employed for e...
-
[10]
Qwen2.5-omni technical report,
J. Xu, Z. Guo, J. He, H. Hu, T. He, S. Bai, K. Chen, J. Wang, Y . Fan, K. Danget al., “Qwen2.5-omni technical report,”arXiv preprint arXiv:2503.20215, 2025
Pith/arXiv arXiv 2025
-
[11]
J. Xu, Z. Guo, H. Hu, Y . Chu, X. Wang, J. He, Y . Wang, X. Shi, T. He, X. Zhuet al., “Qwen3-omni technical report,”arXiv preprint arXiv:2509.17765, 2025
Pith/arXiv arXiv 2025
-
[12]
Audio flamingo: a novel audio language model with few- shot learning and dialogue abilities,
Z. Kong, A. Goel, R. Badlani, W. Ping, R. Valle, and B. Catan- zaro, “Audio flamingo: a novel audio language model with few- shot learning and dialogue abilities,” inProceedings of the 41st In- ternational Conference on Machine Learning, 2024, pp. 25 125– 25 148
2024
-
[13]
Audio flamingo 2: An audio- language model with long-audio understanding and expert reason- ing abilities,
S. Ghosh, Z. Kong, S. Kumar, S. Sakshi, J. Kim, W. Ping, R. Valle, D. Manocha, and B. Catanzaro, “Audio flamingo 2: An audio- language model with long-audio understanding and expert reason- ing abilities,” inInternational Conference on Machine Learning. PMLR, 2025, pp. 19 358–19 405
2025
-
[14]
Audio flamingo 3: Advancing audio intelligence with fully open large audio language models,
S. Ghosh, A. Goel, J. Kim, S. Kumar, Z. Kong, S. Lee, C.-H. H. Yang, R. Duraiswami, D. Manocha, R. Valle, and B. Catanzaro, “Audio flamingo 3: Advancing audio intelligence with fully open large audio language models,” inThe Thirty-ninth Annual Confer- ence on Neural Information Processing Systems, 2025
2025
-
[15]
Music flamingo: Scaling music understanding in audio language models,
S. Ghosh, A. Goel, L. Koroshinadze, S.-g. Lee, Z. Kong, J. F. Santos, R. Duraiswami, D. Manocha, W. Ping, M. Shoeybiet al., “Music flamingo: Scaling music understanding in audio language models,”arXiv preprint arXiv:2511.10289, 2025
arXiv 2025
-
[16]
Y . Chu, J. Xu, Q. Yang, H. Wei, X. Wei, Z. Guo, Y . Leng, Y . Lv, J. He, J. Linet al., “Qwen2-audio technical report,”arXiv preprint arXiv:2407.10759, 2024
Pith/arXiv arXiv 2024
-
[17]
Owsm v3. 1: Better and faster open whisper-style speech models based on e-branchformer,
Y . Peng, J. Tian, W. Chen, S. Arora, B. Yan, Y . Sudo, M. Shakeel, K. Choi, J. Shi, X. Changet al., “Owsm v3. 1: Better and faster open whisper-style speech models based on e-branchformer,” in Proc. Interspeech 2024, 2024, pp. 352–356
2024
-
[18]
Owls: Scaling laws for multilingual speech recognition and translation models,
W. Chen, J. Tian, Y . Peng, B. Yan, C.-H. H. Yang, and S. Watan- abe, “Owls: Scaling laws for multilingual speech recognition and translation models,” inForty-second International Conference on Machine Learning, 2025
2025
-
[19]
Moshi: a speech-text foundation model for real-time dialogue,
A. D ´efossezet al., “Moshi: a speech-text foundation model for real-time dialogue,”arXiv preprint arXiv:2410.00037, 2024
Pith/arXiv arXiv 2024
-
[20]
Audit: Audio editing by following instructions with latent diffusion models,
Y . Wang, Z. Ju, X. Tan, L. He, Z. Wu, J. Bianet al., “Audit: Audio editing by following instructions with latent diffusion models,” inAdvances in Neural Information Processing Systems, vol. 36, 2023, pp. 71 340–71 357
2023
-
[21]
Audioeditor: A training-free diffusion-based audio edit- ing framework,
Y . Jia, Y . Chen, J. Zhao, S. Zhao, W. Zeng, Y . Chen, and Y . Qin, “Audioeditor: A training-free diffusion-based audio edit- ing framework,” inICASSP 2025-2025 IEEE International Con- ference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5
2025
-
[22]
Prompt- guided precise audio editing with diffusion models,
M. Xu, C. Li, D. Zhang, D. Su, W. Liang, and D. Yu, “Prompt- guided precise audio editing with diffusion models,” inProceed- ings of the 41st International Conference on Machine Learning, 2024, pp. 55 126–55 143
2024
-
[23]
Zero-shot unsupervised and text- based audio editing using ddpm inversion,
H. Manor and T. Michaeli, “Zero-shot unsupervised and text- based audio editing using ddpm inversion,” inProceedings of the 41st International Conference on Machine Learning, 2024, pp. 34 603–34 629
2024
-
[24]
Stable audio open,
Z. Evans, J. D. Parker, C. Carr, Z. Zukowski, J. Taylor, and J. Pons, “Stable audio open,” inICASSP 2025-2025 IEEE Inter- national Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5
2025
-
[25]
Step-audio-editx technical report,
C. Yan, B. Wuet al., “Step-audio-editx technical report,”arXiv preprint arXiv:2511.03601, 2025
arXiv 2025
-
[26]
Editgen: Harnessing cross-attention control for instruction-based auto- regressive audio editing,
V . Sioros, A. Potamianos, and G. Paraskevopoulos, “Editgen: Harnessing cross-attention control for instruction-based auto- regressive audio editing,”arXiv preprint arXiv:2507.11096, 2025
arXiv 2025
-
[27]
Simple and controllable music gen- eration,
J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Synnaeve, Y . Adi, and A. D ´efossez, “Simple and controllable music gen- eration,” inAdvances in Neural Information Processing Systems, vol. 36, 2024
2024
-
[28]
Analyzable chain-of-musical- thought prompting for high-fidelity music generation,
M. W. Lam, Y . Xing, W. You, J. Wu, Z. Yin, F. Jiang, H. Liu, F. Liu, X. Li, W.-T. Luet al., “Analyzable chain-of-musical- thought prompting for high-fidelity music generation,”arXiv preprint arXiv:2503.19611, 2025
arXiv 2025
-
[29]
Wavcraft: Audio edit- ing and generation with large language models,
J. Liang, H. Zhang, H. Liu, Y . Cao, Q. Kong, X. Liu, W. Wang, M. D. Plumbley, H. Phan, and E. Benetos, “Wavcraft: Audio edit- ing and generation with large language models,” inICLR 2024 Workshop on Large Language Model (LLM) Agents, 2024
2024
-
[30]
V oicecraft: Zero-shot speech editing and text-to-speech in the wild,
P. Peng, P.-Y . Huang, S.-W. Li, A. Mohamed, and D. Harwath, “V oicecraft: Zero-shot speech editing and text-to-speech in the wild,” inProceedings of the 62nd Annual Meeting of the Asso- ciation for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 12 442–12 462
2024
-
[31]
Guiding audio editing with audio language model,
Z. Lan, Y . Hao, and M. Zhao, “Guiding audio editing with audio language model,”arXiv preprint arXiv:2509.21625, 2025
arXiv 2025
-
[32]
Audiochat: Unified audio sto- rytelling, editing, and understanding with transfusion forcing,
W. Chen, P. Seetharaman, R. Kumar, O. Nieto, S. Watan- abe, J. Salamon, and Z. Jin, “Audiochat: Unified audio sto- rytelling, editing, and understanding with transfusion forcing,” arXiv preprint arXiv:2602.17097, 2026
arXiv 2026
-
[33]
E2 tts: Embarrassingly easy fully non-autoregressive zeroshot tts,
S. E. Eskimez, X. Wang, M. Thakker, C. Li, C.-H. Tsai, Z. Xiao, H. Yang, Z. Zhu, M. Tang, X. Tanet al., “E2 tts: Embarrassingly easy fully non-autoregressive zeroshot tts,” in2024 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024, pp. 682– 689
2024
-
[34]
C. Yan, C. Jin, D. Huang, H. Yu, H. Peng, H. Zhan, J. Gao, J. Peng, J. Chen, J. Zhouet al., “Ming-uniaudio: Speech llm for joint understanding, generation and editing with unified represen- tation,”arXiv preprint arXiv:2511.05516, 2025
arXiv 2025
-
[35]
CosyV oice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training,
Z. Du, C. Gao, Y . Wang, F. Yu, T. Zhao, H. Wang, X. Lv, H. Wang, C. Ni, X. Shi, K. An, G. Yang, Y . Li, Y . Chen, Z. Gao, Q. Chen, Y . Gu, M. Chen, Y . Chen, S. Zhang, W. Wang, and J. Ye, “CosyV oice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training,”arXiv preprint arXiv:2505.17589, May 2025
Pith/arXiv arXiv 2025
-
[36]
H. Hu, X. Zhu, T. He, D. Guo, B. Zhang, X. Wang, Z. Guo, Z. Jiang, H. Hao, Z. Guo, X. Zhang, P. Zhang, B. Yang, J. Xu, J. Zhou, and J. Lin, “Qwen3-TTS Technical Report,”arXiv preprint arXiv:2601.15621, Jan. 2026
Pith/arXiv arXiv 2026
-
[37]
Cosyedit: High-fidelity audio editing with large lan- guage models,
e. a. Chen, “Cosyedit: High-fidelity audio editing with large lan- guage models,”arXiv preprint, 2026
2026
-
[38]
Scal- ing rich style-prompted text-to-speech datasets,
A. Diwan, Z. Zheng, D. Harwath, and E. Choi, “Scal- ing rich style-prompted text-to-speech datasets,”arXiv preprint arXiv:2503.04713, 2025
arXiv 2025
-
[39]
Bagpiper: Solv- ing open-ended audio tasks via rich captions,
J. Tian, H. Wang, B.-H. Su, C.-y. Huang, Q. Wang, J. Shi, W. Chen, X. Gong, S. Arora, C.-J. Liet al., “Bagpiper: Solv- ing open-ended audio tasks via rich captions,”arXiv preprint arXiv:2602.05220, 2026
Pith/arXiv arXiv 2026
-
[40]
Fusionaudio-1.2 m: Towards fine-grained audio captioning with multimodal contextual fusion,
S. Chen, X. Xie, Z. Chen, L. Zhao, O. Lee, Z. Su, Q. Sun, and B. Wang, “Fusionaudio-1.2 m: Towards fine-grained audio captioning with multimodal contextual fusion,”arXiv preprint arXiv:2506.01111, 2025
arXiv 2025
-
[41]
Omni-captioner: Data pipeline, mod- els, and benchmark for omni detailed perception,
Z. Ma, R. Xu, Z. Xing, Y . Chu, Y . Wang, J. He, J. Xu, P.-A. Heng, K. Yu, J. Linet al., “Omni-captioner: Data pipeline, mod- els, and benchmark for omni detailed perception,”arXiv preprint arXiv:2510.12720, 2025
arXiv 2025
-
[42]
Zeta: Zero-shot expressive text-to- audio generation and editing,
H. Manor and T. Michaeli, “Zeta: Zero-shot expressive text-to- audio generation and editing,”arXiv preprint, 2024
2024
-
[43]
Cosyvoice 2: Scalable streaming speech synthesis with large language models,
Z. Duet al., “Cosyvoice 2: Scalable streaming speech synthesis with large language models,”arXiv preprint arXiv:2412.10117, 2024
Pith/arXiv arXiv 2024
-
[44]
Mimo-audio: Audio language models are few-shot learners,
D. Zhang, G. Wang, J. Xue, K. Fang, L. Zhao, R. Ma, S. Ren, S. Liu, T. Guo, W. Zhuanget al., “Mimo-audio: Audio language models are few-shot learners,”arXiv preprint arXiv:2512.23808, 2025
arXiv 2025
-
[45]
A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lvet al., “Qwen3 technical report,”arXiv preprint arXiv:2505.09388, 2025
Pith/arXiv arXiv 2025
-
[46]
Codec does matter: Exploring the semantic shortcoming of codec for audio language model,
Z. Ye, P. Sun, J. Lei, H. Lin, X. Tan, Z. Dai, Q. Kong, J. Chen, J. Pan, Q. Liuet al., “Codec does matter: Exploring the semantic shortcoming of codec for audio language model,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 39, 2025, pp. 25 697–25 705
2025
-
[47]
Emilia: An extensive, multilingual, and diverse speech dataset for large-scale speech generation,
H. He, Z. Shang, C. Wang, X. Li, Y . Gu, H. Hua, L. Liu, C. Yang, J. Li, P. Shiet al., “Emilia: An extensive, multilingual, and diverse speech dataset for large-scale speech generation,” in2024 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024, pp. 885–890
2024
-
[48]
Audio set: An ontology and human-labeled dataset for audio events,
J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in2017 IEEE Inter- national Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 776–780
2017
-
[49]
Wavcaps: A chatgpt-assisted weakly- labelled audio captioning dataset for audio-language multimodal research,
X. Mei, C. Meng, H. Liu, Q. Kong, T. Ko, C. Zhao, M. D. Plumb- ley, Y . Zou, and W. Wang, “Wavcaps: A chatgpt-assisted weakly- labelled audio captioning dataset for audio-language multimodal research,”IEEE/ACM Transactions on Audio, Speech, and Lan- guage Processing, 2024
2024
-
[50]
Audiocaps: Generat- ing captions for audios in the wild,
C. D. Kim, B. Kim, H. Lee, and G. Kim, “Audiocaps: Generat- ing captions for audios in the wild,” inProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol- ume 1 (Long and Short Papers), 2019, pp. 119–132
2019
-
[51]
Classifier-free diffusion guidance,
J. Ho and T. Salimans, “Classifier-free diffusion guidance,” in NeurIPS 2021 Workshop on Deep Generative Models and Down- stream Applications, 2021
2021
-
[52]
Koel-tts: Enhanc- ing llm based speech generation with preference alignment and classifier free guidance,
S. S. Hussain, P. Neekhara, X. Yang, E. Casanova, S. Ghosh, R. Fejgin, M. T. Desta, R. Valle, and J. Li, “Koel-tts: Enhanc- ing llm based speech generation with preference alignment and classifier free guidance,” inProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 21 230–21 245
2025
-
[53]
Stochastic beams and where to find them: The gumbel-top-k trick for sampling se- quences without replacement,
W. Kool, H. Van Hoof, and M. Welling, “Stochastic beams and where to find them: The gumbel-top-k trick for sampling se- quences without replacement,” inInternational Conference on Machine Learning. PMLR, 2019, pp. 3499–3508
2019
-
[54]
Lib- rispeech: An asr corpus based on public domain audio books,
V . Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Lib- rispeech: An asr corpus based on public domain audio books,” in2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210
2015
-
[55]
Robust speech recognition via large-scale weak su- pervision,
A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak su- pervision,” inInternational Conference on Machine Learning. PMLR, 2023, pp. 28 492–28 518
2023
-
[56]
Wavlm: Large-scale self- supervised pre-training for full stack speech processing,
S. Chen, C. Wang, Z. Chen, Y . Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiaoet al., “Wavlm: Large-scale self- supervised pre-training for full stack speech processing,”IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022
2022
-
[57]
Seed-tts: A family of high-quality versa- tile speech generation models,
P. Anastassiouet al., “Seed-tts: A family of high-quality versa- tile speech generation models,”arXiv preprint arXiv:2406.02430, 2024
Pith/arXiv arXiv 2024
-
[58]
Dnsmos: A non-intrusive perceptual objective speech quality metric to evaluate noise sup- pressors,
C. K. Reddy, V . Gopal, and R. Cutler, “Dnsmos: A non-intrusive perceptual objective speech quality metric to evaluate noise sup- pressors,” inICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 6493–6497
2021
-
[59]
emotion2vec: Self-supervised pre-training for speech emotion representation,
Z. Ma, Z. Zheng, J. Ye, J. Li, Z. Gao, S. Zhang, and X. Chen, “emotion2vec: Self-supervised pre-training for speech emotion representation,” inFindings of the Association for Computational Linguistics: ACL 2024, 2024, pp. 15 747–15 760
2024
-
[60]
Versa: A versatile eval- uation toolkit for speech, audio, and music,
J. Shi, H.-j. Shim, J. Tian, S. Arora, H. Wu, D. Petermann, J. Q. Yip, Y . Zhang, Y . Tang, W. Zhanget al., “Versa: A versatile eval- uation toolkit for speech, audio, and music,” inProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations)...
2025
-
[61]
Fr ´echet au- dio distance: A reference-free metric for evaluating music en- hancement algorithms,
K. Kilgour, M. Zuluaga, D. Roblek, and M. Sharifi, “Fr ´echet au- dio distance: A reference-free metric for evaluating music en- hancement algorithms,” inProc. InterSpeech 2019, 2019
2019
-
[62]
Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,
Y . Wu, K. Chen, T. Zhang, Y . Hui, T. Berg-Kirkpatrick, and S. Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5
2023
-
[63]
Audioldm 2: Learn- ing holistic audio generation with self-supervised pretraining,
H. Liu, Y . Yuan, X. Liu, X. Mei, Q. Kong, Q. Tian, Y . Wang, W. Wang, Y . Wang, and M. D. Plumbley, “Audioldm 2: Learn- ing holistic audio generation with self-supervised pretraining,” IEEE/ACM Transactions on Audio, Speech, and Language Pro- cessing, vol. 32, pp. 2871–2883, 2024
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.