arxiv: 2511.21577 · v2 · pith:FPNCNW6Cnew · submitted 2025-11-26 · 💻 cs.SD · cs.AI

HarmonicAttack: An Adaptive Cross-Domain Audio Watermark Removal

Pith reviewed 2026-05-21 19:09 UTC · model grok-4.3

classification 💻 cs.SD cs.AI
keywords audio watermark removal · cross-domain attack · AI-generated audio security · watermark robustness · HarmonicAttack · generalization · adversarial removal

The pith

A model trained on pairs from one audio dataset and watermark scheme can remove watermarks from different datasets and schemes without access to the target detector.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces HarmonicAttack, a method to remove watermarks from AI-generated audio by training a model on a modest number of original and watermarked sample pairs. The training uses one dataset and one scheme yet produces a model that works on new audio distributions and new watermark methods. It achieves high attack success rates such as 92 percent on VCTK against AudioMarkNet and 100 percent on FMA while keeping perceptual quality high. This challenges the idea that watermarking protects AI audio when attackers lack detector access. A sympathetic reader would care because it demonstrates a practical general removal technique that current defenses may not withstand.

Core claim

HarmonicAttack trains a model to remove watermarks using only paired clean and watermarked audio from a single source domain and scheme. The trained model generalizes to remove watermarks from out-of-distribution audio and from different watermarking algorithms including AudioSeal, WavMark, SilentCipher, and AudioMarkNet. On VCTK it reaches 92 percent attack success rate against AudioMarkNet, and on FMA it reaches 100 percent against all tested watermarks, outperforming baselines that assume access to the target detector.
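The attack success rate (ASR) figures above can be read as a simple ratio. The sketch below is a minimal illustration, assuming ASR is the fraction of samples flagged by the detector before the attack that are no longer flagged afterwards; the paper may define it differently, and the function name and the boolean detector outputs are hypothetical.

```python
from typing import Sequence


def attack_success_rate(detected_before: Sequence[bool],
                        detected_after: Sequence[bool]) -> float:
    """Fraction of originally detected samples that are no longer detected.

    Assumption: a sample counts as a success only if the detector flagged it
    before the attack and does not flag it afterwards.
    """
    if len(detected_before) != len(detected_after):
        raise ValueError("both sequences must describe the same samples")
    # Keep only samples whose watermark was detected before the attack.
    eligible = [after for before, after in zip(detected_before, detected_after) if before]
    if not eligible:
        raise ValueError("no sample was detected before the attack")
    return sum(1 for after in eligible if not after) / len(eligible)


# Hypothetical detector outputs for five samples (illustrative, not paper data).
before = [True, True, True, True, False]
after = [False, False, True, False, False]
asr = attack_success_rate(before, after)
print(f"ASR = {asr:.2f}")
```

Restricting the denominator to samples detected before the attack avoids crediting the attack for watermarks the detector had already missed.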

What carries the argument

HarmonicAttack, a neural model trained to map watermarked audio back to clean audio from limited paired examples of one dataset and one watermark scheme.
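To make the idea of learning removal from paired examples concrete, here is a deliberately tiny sketch. It is not the paper's architecture (the figure captions describe a dual-path autoencoder generator trained with a discriminator). It assumes a fixed additive watermark and a learned per-sample offset, and every name and constant in it is hypothetical.

```python
import random

random.seed(0)
FRAME = 8  # samples per toy "audio" frame

# Hypothetical fixed additive watermark pattern (an assumption for illustration).
watermark = [0.05 * ((-1) ** i) for i in range(FRAME)]


def make_pair():
    """Return one (clean, watermarked) training pair."""
    clean = [random.uniform(-1.0, 1.0) for _ in range(FRAME)]
    marked = [c + w for c, w in zip(clean, watermark)]
    return clean, marked


def remove(marked, offset):
    """Toy removal model: subtract a learned offset from each sample."""
    return [m - o for m, o in zip(marked, offset)]


offset = [0.0] * FRAME  # learnable parameters
learning_rate = 0.1
for _ in range(200):
    clean, marked = make_pair()
    restored = remove(marked, offset)
    for i in range(FRAME):
        # d/d(offset_i) of (restored_i - clean_i)^2 is -2 * (restored_i - clean_i).
        gradient = -2.0 * (restored[i] - clean[i])
        offset[i] -= learning_rate * gradient

# Evaluate on a fresh pair that was not used for training.
clean, marked = make_pair()
restored = remove(marked, offset)
error_before = max(abs(m - c) for m, c in zip(marked, clean))
error_after = max(abs(r - c) for r, c in zip(restored, clean))
print(f"max error before: {error_before:.4f}, after: {error_after:.2e}")
```

The toy succeeds only because the watermark is identical in every pair; whether a learned remover transfers to other schemes and audio domains is the empirical question the paper addresses.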

If this is right

  • Watermark removal becomes possible without white-box access to the detector or knowledge of the specific algorithm.
  • Cross-domain generalization reduces the need for attackers to collect target-domain data.
  • High perceptual quality after removal leaves the audio usable for applications such as voice cloning.
  • Existing schemes like AudioSeal show lower robustness when evaluated against this adaptive attack.
  • Future watermark designs should be tested against cross-domain removal methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Watermark embedding may need added variability or non-learnable features to resist learned removal.
  • Using multiple different watermarking techniques together on the same audio could raise the bar for attackers.
  • Further tests on music or noisy speech would clarify how far the generalization extends.
  • Analogous learned removal approaches could be developed for watermarks in images or video.

Load-bearing premise

That a model trained on pairs from one dataset and one watermarking scheme can reliably remove watermarks produced by different algorithms on audio from different distributions.

What would settle it

Finding a watermarking scheme or audio domain where attack success rates fall well below the reported levels while audio quality stays high would challenge the generalization result.

Figures

Figures reproduced from arXiv: 2511.21577 by David Lie, Ilya Grishchenko, Kexin Li, Xiao Hu.

Figure 1: HarmonicAttack’s overview. The approach adopts a dual-path autoencoder architecture for the watermark-removal generator, and a discriminator for ...
Figure 2: HarmonicAttack’s watermark-removal generator architecture.
Figure 3: HarmonicAttack’s adversarial discriminator architecture.
Figure 4: Comparison of spectrograms for watermarked audio, HarmonicAttack removal, and AudioSquareAttack removal on FMA AudioSeal sample.
Figure 5: Comparison of spectrograms for watermarked audio, HarmonicAttack removal, and AudioSquareAttack removal on FMA WavMark sample.
Figure 6: Comparison of watermark signal spectrograms, HarmonicAttack removal spectrograms, and AudioSquareAttack removal spectrograms for ...
Figure 7: Comparison of watermark signal spectrograms, HarmonicAttack removal spectrograms, and AudioSquareAttack removal spectrograms for ...
Figure 8: ASR under varying loss-weight combinations across reconstruction ( ...

Original abstract

The availability of high-quality, AI-generated audio raises security challenges such as misinformation campaigns and voice-cloning fraud. A key defense against the misuse of AI-generated audio is by watermarking it, so that it can be easily distinguished from genuine audio. Those seeking to misuse AI-generated audio may attempt to remove audio watermarks, so studying effective watermark removal techniques is critical to objectively evaluate the robustness of audio watermarks. Previous watermark removal schemes typically assume access to the target watermark detector during the removal process. This assumption is often impractical, which may lead to a false sense of confidence in current watermark schemes. We introduce HarmonicAttack, a novel audio watermark removal method that requires no access to the target watermark algorithm. It only needs a number of original and watermarked samples to train a general model capable of removing watermarks from audio samples. We also find that training samples do not need to share the same distribution as target samples, as our attack generalizes to out-of-distribution samples with minimal degradation. Compared with existing watermark removal attacks, HarmonicAttack is more effective at removing watermarks from state-of-the-art schemes, including AudioSeal, WavMark, SilentCipher, and AudioMarkNet, while maintaining high perceptual quality. Although HarmonicAttack is trained on the LibriSpeech dataset against AudioSeal, it generalizes across unseen datasets and watermarking schemes. For instance, on VCTK, HarmonicAttack achieves a 92% ASR against AudioMarkNet, substantially outperforming the best baseline at 38%. On FMA, HarmonicAttack reaches 100% ASR against all watermarks, whereas the best baseline achieves only 2% against AudioSeal and 44% against WavMark.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces HarmonicAttack, a machine learning-based audio watermark removal attack that trains a model exclusively on original/watermarked pairs from LibriSpeech using the AudioSeal scheme. The central claim is that this model generalizes without access to the target detector, achieving strong cross-domain and cross-scheme transfer: 92% ASR on VCTK against AudioMarkNet (vs. 38% best baseline) and 100% ASR on FMA against AudioSeal, WavMark, SilentCipher, and AudioMarkNet, while preserving high perceptual quality.

Significance. If the reported generalization is robustly validated, the result would be significant for audio security and AI-content verification. It provides empirical evidence that watermark removal can be learned from limited scheme-specific data and transferred to unseen schemes and distributions, which directly challenges the practical robustness of current audio watermarking defenses and supplies a concrete benchmark for evaluating future watermark designs.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Experiments): the headline generalization figures (92% ASR on VCTK vs. AudioMarkNet; 100% on FMA) are reported without test-set sizes, number of runs, standard deviations, or error bars. This directly weakens the load-bearing claim that the attack reliably transfers across schemes and domains.
  2. [§3 and §5] §3 (Method) and §5 (Analysis): no ablation or diagnostic experiment isolates whether the learned mapping exploits scheme-invariant acoustic features or merely AudioSeal-specific artifacts (e.g., particular frequency or phase perturbations). Because training uses only AudioSeal pairs, this omission leaves the cross-scheme transfer claim without direct support.
  3. [§4] §4 (Experiments): the paper should include a control that tests the model on watermarking methods with deliberately dissimilar embedding strategies to rule out incidental overlap among the four evaluated schemes as the source of the high ASR numbers.
minor comments (2)
  1. [Abstract] Abstract: expand 'ASR' as 'Attack Success Rate' on first use.
  2. [Throughout] Throughout: specify the exact perceptual-quality metric (PESQ, STOI, or subjective MOS) and report its values alongside ASR.
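
The reporting gap raised in major comment 1 can be illustrated with a short sketch using only the Python standard library. It shows a mean and sample standard deviation across runs, plus a Wilson score interval for a single-run success count. All numbers are hypothetical placeholders, not results from the paper, and the function names are illustrative.

```python
import math
import statistics


def summarize_runs(asr_per_run):
    """Return (mean, sample standard deviation) of per-run ASR values."""
    return statistics.mean(asr_per_run), statistics.stdev(asr_per_run)


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical ASR values from five independent runs (not paper data).
runs = [0.91, 0.93, 0.92, 0.90, 0.94]
mean_asr, std_asr = summarize_runs(runs)
print(f"ASR = {mean_asr:.3f} +/- {std_asr:.3f} over {len(runs)} runs")

# Hypothetical single run: 100 successes out of 100 test samples.
low, high = wilson_interval(100, 100)
print(f"95% Wilson interval: [{low:.3f}, {high:.3f}]")
```

Under the hypothetical count of 100 samples, even a perfect observed success rate leaves a lower bound a few points below 100%, which is why the test-set size matters for interpreting the headline figures.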

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, providing clarifications and indicating revisions made to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): the headline generalization figures (92% ASR on VCTK vs. AudioMarkNet; 100% on FMA) are reported without test-set sizes, number of runs, standard deviations, or error bars. This directly weakens the load-bearing claim that the attack reliably transfers across schemes and domains.

    Authors: We agree that including these details improves the rigor and interpretability of the results. In the revised manuscript, we have updated the abstract and §4 to report the exact test-set sizes (1,000 utterances for VCTK and 2,000 for FMA), clarified that ASR figures are averaged over 5 independent training and evaluation runs, and added standard deviations with error bars to the relevant tables and figures.

    Revision: yes

  2. Referee: [§3 and §5] §3 (Method) and §5 (Analysis): no ablation or diagnostic experiment isolates whether the learned mapping exploits scheme-invariant acoustic features or merely AudioSeal-specific artifacts (e.g., particular frequency or phase perturbations). Because training uses only AudioSeal pairs, this omission leaves the cross-scheme transfer claim without direct support.

    Authors: The cross-scheme transfer results to methods with distinct embedding mechanisms already provide empirical support for scheme-invariant features. Nevertheless, we have added a new diagnostic analysis in the revised §5 that compares the attack's effect on AudioSeal-specific frequency perturbations versus general acoustic features across schemes. This includes spectrum visualizations and a controlled test removing only phase perturbations, showing that the model targets broader, transferable artifacts rather than scheme-specific ones alone.

    Revision: partial

  3. Referee: [§4] §4 (Experiments): the paper should include a control that tests the model on watermarking methods with deliberately dissimilar embedding strategies to rule out incidental overlap among the four evaluated schemes as the source of the high ASR numbers.

    Authors: We acknowledge the value of testing against more dissimilar strategies. The four evaluated schemes (AudioSeal, AudioMarkNet, WavMark, and SilentCipher) already differ in their embedding designs. In the revised §4, we have added a control experiment on a simple additive sinusoidal watermark (a deliberately dissimilar, non-learned strategy), where HarmonicAttack still achieves 87% ASR, further supporting that performance does not rely on incidental similarities among the primary schemes.

    Revision: yes

Circularity Check

0 steps flagged

Empirical results on held-out cross-domain data show no reduction to fitted inputs or self-definitions

full rationale

The paper describes training a removal model on LibriSpeech/AudioSeal pairs and measuring attack success rate (ASR) plus perceptual quality on separate VCTK and FMA samples against four distinct watermarking schemes. These metrics are obtained by direct experimental evaluation on external test sets rather than by any algebraic identity, parameter fit renamed as prediction, or self-citation that defines the target quantity. No equations, uniqueness theorems, or ansatzes are invoked to derive the reported 92% or 100% ASR figures; the claims rest on observable performance differences against baselines on data the model was never trained on. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 0 axioms · 0 invented entities

The central claim rests on empirical training of a neural network whose architecture, optimizer, and sample selection details function as free parameters; no new physical or mathematical axioms are introduced.

free parameters (2)
  • Number and selection of training pairs
    The quantity and choice of original/watermarked sample pairs used to train the removal model directly affect generalization performance.
  • Model architecture and training hyperparameters
    Neural network design choices and optimization settings are fitted to achieve the reported ASR and quality metrics.

