pith. sign in

arxiv: 2606.21365 · v1 · pith:DFXFJ53Jnew · submitted 2026-06-19 · 💻 cs.SD · cs.AI

LambdaMark: Semantic Audio Watermarking for Robustness and Radioactivity

Pith reviewed 2026-06-26 12:56 UTC · model grok-4.3

classification 💻 cs.SD cs.AI
keywords audio watermarkingsemantic latent representationsrobustnessradioactivityspeech generationadversarial attacksvoice cloning
0
0 comments X

The pith

Embedding multi-bit watermarks in semantic audio latents produces the only scheme robust to all removal attacks while also transferring to finetuned models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LambdaMark to solve the problem of voice cloning by embedding detectable multi-bit signals directly into the semantic latent space of audio rather than into waveforms or spectrograms. This placement is intended to make the marks harder to strip with signal-level edits and more likely to be absorbed as meaningful features when a downstream speech model is retrained on the watermarked examples. The system consists of a lightweight encoder that adds message-dependent perturbations and a decoder that recovers the bits, both optimized by a loss that keeps audio quality high while raising detection accuracy and attack resistance. Experiments position LambdaMark as the sole method that survives every tested removal attack and that continues to be recoverable from audio synthesized by models fine-tuned on the watermarked data.

Core claim

LambdaMark is the first generic radioactive watermarking scheme that embeds multi-bit watermark information into semantic audio latent representations. Unlike prior methods that operate on low-level signals, the watermarks carry semantic interpretation and are therefore more likely to be learned by a downstream model through finetuning. The scheme uses a lightweight encoder to inject message-dependent perturbations and a decoder to detect presence and recover bits; both are trained with a multi-component loss that preserves fidelity, improves bit recovery, and increases robustness to distortions and adversarial removal. The resulting watermarks achieve near-perfect robustness under common di

What carries the argument

Lightweight encoder that injects multi-bit message-dependent perturbations into semantic audio representations, paired with a decoder, and trained by a multi-component loss balancing fidelity, bit recovery rate, and robustness.

If this is right

  • Watermarked audio trains speech models that reproduce the embedded bits in their own generated outputs.
  • Detection accuracy stays near-perfect after common signal distortions and all evaluated adversarial removal attempts.
  • The same decoder works on audio synthesized by models that were fine-tuned on the watermarked examples.
  • The watermark remains effective without requiring changes to the downstream generation architecture.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same semantic-embedding strategy could be tested on image or video generators to create transferable watermarks across modalities.
  • Radioactivity may scale with model size; larger finetuning runs could be measured to see whether bit recovery improves or saturates.
  • Practical use would require evaluating whether the added perturbations affect perceptual quality under real-world listening conditions not covered in the paper.
  • If semantic latents prove sufficient for radioactivity, future work could explore whether the same principle applies to other forms of generative content without audio-specific encoders.

Load-bearing premise

Embedding watermark information into semantic audio latent representations will cause downstream speech generation models to learn and reproduce the watermark when finetuned on the watermarked audio.

What would settle it

A finetuned speech generation model whose outputs contain no recoverable watermark bits from the original LambdaMark embedding, or a removal attack that defeats the decoder while leaving audio quality intact.

Figures

Figures reproduced from arXiv: 2606.21365 by David Lie, Ilya Grishchenko, Kexin Li, Xiao Hu.

Figure 1
Figure 1. Figure 1: LambdaMark’s design overview. Record speech V Dorig D wm Add watermark Watermark protected Perform Adversarial Removal Robust and Radioactive Non-robust and/or Non-radioactive Adv Fail Adv Succeed Dadv Finetune Madv Perform Adversarial D Removal adv Robust Radioactivity Non-robust Radioactivity [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: LambdaMark protects Dorig from unauthorized voice cloning by embedding robust and radioactive watermarks. We further assume that V does not require access to Madv or its finetuning method. For the watermark to remain detectable under this chal￾lenging threat model, it must satisfy both robustness and robust radioactivity. Specifically, it must survive common distortions, downstream finetuning, and adversar… view at source ↗
Figure 3
Figure 3. Figure 3: Mel spectrogram visualizations of watermarked audio produced by LambdaMark and AudioSeal across three [PITH_FULL_IMAGE:figures/full_fig_p012_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Linear-frequency STFT spectrogram visualizations of watermarked audio produced by LambdaMark and AudioSeal [PITH_FULL_IMAGE:figures/full_fig_p019_4.png] view at source ↗
read the original abstract

Recent advances in generative audio have made voice cloning increasingly effortless, enabling voice fraud, impersonation, and other forms of unauthorized use. A common attack finetunes a speech generation model on recordings of a target speaker, allowing the model to synthesize speech in that speaker's voice. Audio watermarking offers a promising defense by embedding detectable signals into audio. A practical watermark must satisfy two key properties: robustness and radioactivity. Existing audio watermarking methods typically embed signals into low-level representations, such as waveforms or spectrograms, which makes them vulnerable to signal-level manipulations and limits their transfer to downstream models. We introduce LambdaMark -- the first generic radioactive watermarking scheme. Unlike all previous approaches, LambdaMark achieves generic radioactivity by embedding multi-bit watermark information into semantic audio latent representations. Our watermarks have semantic interpretation and are thus more likely to be learned by a downstream model through finetuning. LambdaMark includes a lightweight watermark encoder to inject multi-bit message-dependent perturbations into semantic audio representations and a decoder to detect watermark presence and recover the embedded bit information. Encoder and decoder are trained using a custom multi-component loss that preserves fidelity of the watermarked audio, increases bit-level recovery rate, and improves robustness against common distortions and adversarial removal attempts. Experiments show that LambdaMark achieves near-perfect robustness under common distortions. LambdaMark is also the only watermark that is robust against all evaluated removal attacks. Furthermore, LambdaMark exhibits general and robust radioactivity and remains robust to distortions and adversarial removal attacks even on the generated outputs of those finetuned models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript claims to introduce LambdaMark, the first generic radioactive audio watermarking scheme, which embeds multi-bit messages into semantic audio latent representations via a lightweight encoder-decoder pair trained with a multi-component loss. It reports near-perfect robustness to common distortions, unique robustness against all evaluated removal attacks (unlike prior low-level embedding methods), and general radioactivity such that the watermark transfers to and remains detectable in outputs of speech generation models finetuned on the watermarked audio.

Significance. If the empirical claims on robustness and especially radioactivity hold after detailed validation, the work would advance defenses against voice cloning and generative audio misuse by demonstrating a watermark that survives finetuning transfer. The semantic-latent placement for improved radioactivity is a novel engineering direction relative to waveform/spectrogram baselines.

major comments (2)
  1. [Abstract] Abstract: The central claim that LambdaMark 'is the only watermark that is robust against all evaluated removal attacks' is load-bearing but rests on unspecified experiments; the abstract supplies no enumeration of baselines, attack set, or metrics, preventing assessment of whether post-hoc attack selection occurred.
  2. [Abstract] Abstract: The radioactivity claim depends on the premise that semantic latents 'have semantic interpretation and are thus more likely to be learned by a downstream model through finetuning.' No ablation against waveform or spectrogram baselines, no mechanistic argument, and no reference to transfer-learning results are supplied; if the finetuning objective does not align with the semantic encoder, transfer can be near zero regardless of original-audio robustness.
minor comments (1)
  1. The abstract refers to a 'custom multi-component loss' without naming the components or their relative weights; these should be stated explicitly with equations in the methods.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, clarifying the manuscript's content and outlining planned revisions to strengthen the presentation.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that LambdaMark 'is the only watermark that is robust against all evaluated removal attacks' is load-bearing but rests on unspecified experiments; the abstract supplies no enumeration of baselines, attack set, or metrics, preventing assessment of whether post-hoc attack selection occurred.

    Authors: We agree that the abstract does not enumerate the experiments. The full details of the baselines, the complete set of removal attacks (covering signal processing distortions and multiple adversarial removal methods), and the evaluation metrics are provided in Sections 4.2 and 4.3. To address concerns about post-hoc selection, we will revise the abstract to briefly indicate the categories of attacks considered. revision: yes

  2. Referee: [Abstract] Abstract: The radioactivity claim depends on the premise that semantic latents 'have semantic interpretation and are thus more likely to be learned by a downstream model through finetuning.' No ablation against waveform or spectrogram baselines, no mechanistic argument, and no reference to transfer-learning results are supplied; if the finetuning objective does not align with the semantic encoder, transfer can be near zero regardless of original-audio robustness.

    Authors: Section 5 reports empirical evidence of radioactivity through finetuning transfer experiments on the watermarked audio. The premise regarding semantic latents is motivated by their design. We acknowledge, however, that the manuscript does not include explicit ablations of radioactivity against waveform or spectrogram baselines, a detailed mechanistic argument, or citations to transfer-learning results. We will add a discussion subsection providing a mechanistic rationale based on the properties of semantic representations and include relevant references from the transfer-learning literature. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical engineering claims rest on experimental validation, not self-referential derivations or fitted inputs

full rationale

The paper introduces LambdaMark as an empirical audio watermarking scheme whose central claims (robustness to attacks and radioactivity under finetuning) are supported by experimental results rather than any derivation chain. The abstract and provided text contain no equations, no fitted parameters renamed as predictions, and no load-bearing self-citations or uniqueness theorems. The statement that semantic latents 'are thus more likely to be learned' is a design rationale, not a mathematical reduction to the authors' prior inputs. No steps match the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 1 invented entities

Abstract-only; ledger populated from stated design choices. The method assumes standard supervised training of encoder-decoder networks and that semantic latents are sufficiently rich to carry persistent message information.

free parameters (1)
  • multi-component loss weights
    Weights balancing fidelity, bit recovery, and robustness are introduced to train the system; their specific values are not reported.
axioms (2)
  • domain assumption Semantic audio latent representations exist and can be extracted by existing audio encoders.
    The scheme relies on the existence and utility of semantic latents; invoked in the description of where the watermark is embedded.
  • standard math Standard neural network optimization will converge to a solution satisfying the multi-component loss.
    Training procedure is described as using a custom loss; no proof or convergence argument is supplied.
invented entities (1)
  • LambdaMark watermark encoder and decoder pair no independent evidence
    purpose: Inject and recover multi-bit messages in semantic space
    New neural modules introduced by the paper; no independent evidence outside the training procedure is provided.

pith-pipeline@v0.9.1-grok · 5814 in / 1477 out tokens · 19293 ms · 2026-06-26T12:56:39.781469+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

57 extracted references · 1 canonical work pages

  1. [1]

    Qwen2-Audio Technical Report,

    Y . Chu, J. Xu, Q. Yang, H. Wei, X. Wei, Z. Guo, Y . Leng, Y . Lv, J. He, J. Lin, C. Zhou, and J. Zhou, “Qwen2-Audio Technical Report,”

  2. [2]

    Available: http://arxiv.org/abs/2407.10759

    [Online]. Available: http://arxiv.org/abs/2407.10759

  3. [3]

    Kimi-Audio Technical Report,

    KimiTeam, D. Ding, Z. Ju, Y . Leng, S. Liu, T. Liu, Z. Shang, K. Shen, W. Song, X. Tan, H. Tang, Z. Wang, C. Wei, Y . Xin, X. Xu, J. Yu, Y . Zhang, X. Zhou, Y . Charles, J. Chen, Y . Chen, Y . Du, W. He, Z. Hu, G. Lai, Q. Li, Y . Liu, W. Sun, J. Wang, Y . Wang, Y . Wu, Y . Wu, D. Yang, H. Yang, Y . Yang, Z. Yang, A. Yin, R. Yuan, Y . Zhang, and Z. Zhou, “...

  4. [4]

    Available: http://arxiv.org/abs/2504.18425

    [Online]. Available: http://arxiv.org/abs/2504.18425

  5. [5]

    Qwen3-TTS Technical Report,

    H. Hu, X. Zhu, T. He, D. Guo, B. Zhang, X. Wang, Z. Guo, Z. Jiang, H. Hao, Z. Guo, X. Zhang, P. Zhang, B. Yang, J. Xu, J. Zhou, and J. Lin, “Qwen3-TTS Technical Report,” 2026. [Online]. Available: https://arxiv.org/abs/2601.15621

  6. [6]

    Fraudsters Cloned Company Director’s V oice In $35 Million Heist,

    Forbes, “Fraudsters Cloned Company Director’s V oice In $35 Million Heist,” 2021. [Online]. Available: https://www.forbes.com/sites/thoma sbrewster/2021/10/14/huge-bank-fraud-uses-deep-fake-voice-tech-t o-steal-millions/

  7. [7]

    CEO of world’s biggest ad firm targeted by deepfake scam,

    “CEO of world’s biggest ad firm targeted by deepfake scam,” 2024. [Online]. Available: https://www.theguardian.com/technology/article/2 024/may/10/ceo-wpp-deepfake-scam

  8. [8]

    Company worker in Hong Kong pays out £20m in deepfake video call scam,

    D. Milmo, “Company worker in Hong Kong pays out £20m in deepfake video call scam,”The Guardian, 2 2024. [Online]. Available: https://www.theguardian.com/world/2024/feb/05/hong-kong-compa ny-deepfake-video-conference-call-scam

  9. [9]

    AI deception: A survey of examples, risks, and potential solutions,

    P. S. Park, S. Goldstein, A. O’Gara, M. Chen, and D. Hendrycks, “AI deception: A survey of examples, risks, and potential solutions,” Patterns (N. Y.), vol. 5, no. 5, p. 100988, May 2024

  10. [10]

    Record Com- panies Bring Landmark Cases for Responsible AI Against Suno and Udio in Boston and New York Federal Courts, Respectively,

    Recording Industry Association of America (RIAA), “Record Com- panies Bring Landmark Cases for Responsible AI Against Suno and Udio in Boston and New York Federal Courts, Respectively,” RIAA Press Release, Jun. 2024

  11. [11]

    V oice actors and generative AI: Legal challenges and emerging protections,

    E. Chuks-Okeke, A. Adetunji, and B. Leong, “V oice actors and generative AI: Legal challenges and emerging protections,”IAPP News, Dec 2023. [Online]. Available: https://iapp.org/news/a/voice-a ctors-and-generative-ai-legal-challenges-and-emerging-protections

  12. [12]

    Unauthorized V oice Cloning: The Legal Response in the Intersection of Performers’ Rights, Sound Recording Protec- tion, and Image Rights in the Age of AI,

    C. Kanellopoulou, “Unauthorized V oice Cloning: The Legal Response in the Intersection of Performers’ Rights, Sound Recording Protec- tion, and Image Rights in the Age of AI,” Dissertation, Stockholm University, 2025

  13. [13]

    An AI-generated band got 1m plays on Spotify. Now music insiders say listeners should be warned,

    “An AI-generated band got 1m plays on Spotify. Now music insiders say listeners should be warned,” The Guardian (Tech), 2025

  14. [14]

    The League of Women V oters is suing those involved in robocalls sent to New Hampshire voters,

    Holly Ramer, “The League of Women V oters is suing those involved in robocalls sent to New Hampshire voters,” https://apnews.com/artic le/new-hampshire-primary-biden-robocalls-f7cc3c2610ab6ccdfaa26 d2a8ea1cbdb, Mar. 2024

  15. [15]

    League of Women V oters and Free Speech For People Applaud Judgment for Plaintiffs in New Hampshire V oter Intimidation Lawsuit,

    League of Women V oters, “League of Women V oters and Free Speech For People Applaud Judgment for Plaintiffs in New Hampshire V oter Intimidation Lawsuit,” https://www.lwv.org/newsroom/press-releases/ league-women-voters-and-free-speech-people-applaud-judgment-p laintiffs-new, Nov. 2025

  16. [16]

    V oice actors can pursue some claims over AI voiceovers, US court says,

    B. Brittain, “V oice actors can pursue some claims over AI voiceovers, US court says,” https://www.reuters.com/legal/litigation/voice-actor s-can-pursue-some-claims-over-ai-voiceovers-us-court-says-2025-0 7-10/, Jul. 2025

  17. [17]

    Music AI startups Suno and Udio slam record label lawsuits in court filings,

    ——, “Music AI startups Suno and Udio slam record label lawsuits in court filings,” https://www.reuters.com/legal/litigation/music-ai-start ups-suno-udio-slam-record-label-lawsuits-court-filings-2024-08-01/, Aug. 2024

  18. [18]

    Universal Music settles copyright dispute with AI firm Udio,

    Rishabh Jaiswal, “Universal Music settles copyright dispute with AI firm Udio,” https://www.reuters.com/business/media-telecom/universal -music-settles-copyright-dispute-with-ai-firm-udio-2025-10-30/, Oct. 2025

  19. [19]

    Warner Music Group settles copyright case with Suno for licensed AI music,

    Reuters, “Warner Music Group settles copyright case with Suno for licensed AI music,” https://www.reuters.com/legal/litigation/warner-m usic-group-settles-copyright-case-with-suno-licensed-ai-music-202 5-11-25/, Nov. 2025

  20. [20]

    UMG Recordings, Inc. et al. v. Uncharted Labs, Inc. et al., No. 1:2024cv04777, Document 156,

    United States District Court for the Southern District of New York, “UMG Recordings, Inc. et al. v. Uncharted Labs, Inc. et al., No. 1:2024cv04777, Document 156,” https://law.justia.com/cases/federal/d istrict-courts/new-york/nysdce/1%3A2024cv04777/623701/156/, Apr. 2026, order and Opinion Denying Defendants’ Motion to Dismiss

  21. [21]

    Proactive Detection of V oice Cloning with Localized Watermarking,

    R. S. Roman, P. Fernandez, A. Défossez, T. Furon, T. Tran, and H. Elsahar, “Proactive Detection of V oice Cloning with Localized Watermarking,” 2024. [Online]. Available: https://arxiv.org/abs/2401.1 7264

  22. [22]

    WavMark: Watermarking for audio generation,

    G. Chen, Y . Wu, S. Liu, T. Liu, X. Du, and F. Wei, “WavMark: Watermarking for audio generation,” 2024. [Online]. Available: http://arxiv.org/abs/2308.12770

  23. [23]

    SilentCipher: Deep audio watermarking,

    M. K. Singh, N. Takahashi, W. Liao, and Y . Mitsufuji, “SilentCipher: Deep audio watermarking,” inInterspeech 2024. ISCA: ISCA, Sep. 2024, pp. 2235–2239

  24. [24]

    Detecting V oice Cloning Attacks via Timbre Watermarking,

    C. Liu, J. Zhang, T. Zhang, X. Yang, W. Zhang, and N. Yu, “Detecting V oice Cloning Attacks via Timbre Watermarking,” 2023. [Online]. Available: http://arxiv.org/abs/2312.03410

  25. [25]

    Latent Watermarking of Audio Generative Models,

    R. S. Roman, P. Fernandez, A. Deleforge, Y . Adi, and R. Serizel, “Latent Watermarking of Audio Generative Models,” 2024. [Online]. Available: https://arxiv.org/abs/2409.02915

  26. [26]

    GROOT: Generating Robust Watermark for Diffusion-Model-Based Audio Synthesis,

    W. Liu, Y . Li, D. Lin, H. Tian, and H. Li, “GROOT: Generating Robust Watermark for Diffusion-Model-Based Audio Synthesis,”

  27. [27]

    Available: https://arxiv.org/abs/2407.10471

    [Online]. Available: https://arxiv.org/abs/2407.10471

  28. [28]

    Au- dioMarkNet: audio watermarking for deepfake speech detection,

    W. Zong, Y .-W. Chow, W. Susilo, J. Baek, and S. Camtepe, “Au- dioMarkNet: audio watermarking for deepfake speech detection,” in Proceedings of the 34th USENIX Conference on Security Symposium, ser. SEC ’25. USA: USENIX Association, 2025

  29. [29]

    Robust audio watermarking using perceptual masking,

    M. D. Swanson, B. Zhu, A. H. Tewfik, and L. Boney, “Robust audio watermarking using perceptual masking,”Signal Processing, vol. 66, no. 3, pp. 337–355, May 1998

  30. [30]

    Spread-spectrum watermarking of audio signals,

    D. Kirovski and H. Malvar, “Spread-spectrum watermarking of audio signals,”IEEE Transactions on Signal Processing, vol. 51, no. 4, pp. 1020–1033, 2003

  31. [31]

    HarmonicAttack: An Adaptive Cross-Domain Audio Watermark Removal,

    K. Li, X. Hu, I. Grishchenko, and D. Lie, “HarmonicAttack: An Adaptive Cross-Domain Audio Watermark Removal,” 2025. [Online]. Available: https://arxiv.org/abs/2511.21577

  32. [32]

    AudioMarkBench: Benchmarking Robustness of Audio Watermarking,

    H. Liu, M. Guo, Z. Jiang, L. Wang, and N. Z. Gong, “AudioMarkBench: Benchmarking Robustness of Audio Watermarking,”

  33. [33]

    Available: https://arxiv.org/abs/2406.06979

    [Online]. Available: https://arxiv.org/abs/2406.06979

  34. [34]

    Deep Audio Watermarks are Shallow: Limitations of Post-Hoc Watermarking Techniques for Speech,

    P. O’Reilly, Z. Jin, J. Su, and B. Pardo, “Deep Audio Watermarks are Shallow: Limitations of Post-Hoc Watermarking Techniques for Speech,” 2025. [Online]. Available: http://arxiv.org/abs/2504.10782

  35. [35]

    A Comprehensive Real-World Assessment of Audio Watermarking Algorithms: Will They Survive Neural Codecs?

    Y . Özer, W. Choi, J. Serrà, M. K. Singh, W.-H. Liao, and Y . Mitsufuji, “A Comprehensive Real-World Assessment of Audio Watermarking Algorithms: Will They Survive Neural Codecs?” 2025. [Online]. Available: http://arxiv.org/abs/2505.19663

  36. [36]

    Tree-Ring Watermarks: Fingerprints for Diffusion Images that are Invisible and Robust,

    Y . Wen, J. Kirchenbauer, J. Geiping, and T. Goldstein, “Tree-Ring Watermarks: Fingerprints for Diffusion Images that are Invisible and Robust,” 2023. [Online]. Available: https://arxiv.org/abs/2305.20030

  37. [37]

    RingID: Rethinking Tree-Ring Watermarking for Enhanced Multi-Key Identification,

    H. Ci, P. Yang, Y . Song, and M. Z. Shou, “RingID: Rethinking Tree-Ring Watermarking for Enhanced Multi-Key Identification,”

  38. [38]

    Available: https://arxiv.org/abs/2404.14055

    [Online]. Available: https://arxiv.org/abs/2404.14055

  39. [39]

    HMARK: Radioactive Multi-Bit Semantic-Latent Watermarking for Diffusion Models,

    K. Li, G. Ding, I. Grishchenko, and D. Lie, “HMARK: Radioactive Multi-Bit Semantic-Latent Watermarking for Diffusion Models,” 2025. [Online]. Available: https://arxiv.org/abs/2512.00094

  40. [40]

    Discovering Interpretable Directions in the Semantic Latent Space of Diffusion Models,

    R. Haas, I. Huberman-Spiegelglas, R. Mulayoff, S. Graßhof, S. S. Brandt, and T. Michaeli, “Discovering Interpretable Directions in the Semantic Latent Space of Diffusion Models,” 2024. [Online]. Available: https://arxiv.org/abs/2303.11073 15

  41. [41]

    Scaling up masked audio encoder learning for general audio classification,

    H. Dinkel, Z. Yan, Y . Wang, J. Zhang, Y . Wang, and B. Wang, “Scaling up masked audio encoder learning for general audio classification,”

  42. [42]

    Available: https://arxiv.org/abs/2406.06992

    [Online]. Available: https://arxiv.org/abs/2406.06992

  43. [43]

    SemanticV ocoder: Bridging Audio Generation and Audio Understanding via Semantic Latents,

    Z. Xie, C. Li, Q. Jin, X. Xu, G. Yang, W. Wang, M. Wu, D. Yu, and Y . Zou, “SemanticV ocoder: Bridging Audio Generation and Audio Understanding via Semantic Latents,” 2026. [Online]. Available: https://arxiv.org/abs/2602.23333

  44. [44]

    CLAP: Learning audio concepts from natural language supervision,

    B. Elizalde, S. Deshmukh, M. Al Ismail, and H. Wang, “CLAP: Learning audio concepts from natural language supervision,” Jun. 2022

  45. [45]

    AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining,

    H. Liu, Y . Yuan, X. Liu, X. Mei, Q. Kong, Q. Tian, Y . Wang, W. Wang, Y . Wang, and M. D. Plumbley, “AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining,” Aug. 2023

  46. [46]

    Simple and Controllable Music Generation,

    J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Synnaeve, Y . Adi, and A. Défossez, “Simple and Controllable Music Generation,” Jun. 2023

  47. [47]

    MiDashengLM: Efficient Audio Understanding with General Audio Captions,

    H. Dinkel, G. Li, J. Liu, J. Luan, Y . Niu, X. Sun, T. Wang, Q. Xiao, J. Zhang, and J. Zhou, “MiDashengLM: Efficient Audio Understanding with General Audio Captions,” 2026. [Online]. Available: https://arxiv.org/abs/2508.03983

  48. [48]

    Librispeech: An ASR corpus based on public domain audio books,

    V . Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” Apr. 2015

  49. [49]

    CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92),

    J. Yamagishi, C. Veaux, and K. MacDonald, “CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92),” 2019

  50. [50]

    YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot V oice Conversion for everyone,

    E. Casanova, J. Weber, C. Shulby, A. C. Junior, E. Gölge, and M. A. Ponti, “YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot V oice Conversion for everyone,” 2023. [Online]. Available: https://arxiv.org/abs/2112.02418

  51. [51]

    Robust Speech Recognition via Large-Scale Weak Supervision,

    A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust Speech Recognition via Large-Scale Weak Supervision,” 2022. [Online]. Available: https://arxiv.org/abs/2212.0 4356

  52. [52]

    NISQA: A Deep CNN-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets,

    G. Mittag, B. Naderi, A. Chehadi, and S. Möller, “NISQA: A Deep CNN-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets,” inInterspeech

  53. [53]

    NISQA: A deep CNN-Self-Attention model for multidimensional speech quality prediction with crowdsourced datasets,

    ISCA, Aug. 2021, pp. 2127–2131. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2021-299

  54. [54]

    ViSQOL v3: An Open Source Production Ready Objective Speech and Audio Metric,

    M. Chinen, F. S. C. Lim, J. Skoglund, N. Gureev, F. O’Gorman, and A. Hines, “ViSQOL v3: An Open Source Production Ready Objective Speech and Audio Metric,” 2020. [Online]. Available: https://arxiv.org/abs/2004.09584

  55. [55]

    Square Attack: a query-efficient black-box adversarial attack via random search,

    M. Andriushchenko, F. Croce, N. Flammarion, and M. Hein, “Square Attack: a query-efficient black-box adversarial attack via random search,” 2020. [Online]. Available: https://arxiv.org/abs/1912.00049

  56. [56]

    Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens,

    X. Wang, M. Jiang, Z. Ma, Z. Zhang, S. Liu, L. Li, Z. Liang, Q. Zheng, R. Wang, X. Feng, W. Bian, Z. Ye, S. Cheng, R. Yuan, Z. Zhao, X. Zhu, J. Pan, L. Xue, P. Zhu, Y . Chen, Z. Li, X. Chen, L. Xie, Y . Guo, and W. Xue, “Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens,”

  57. [57]

    Available: https://arxiv.org/abs/2503.01710 Appendix A

    [Online]. Available: https://arxiv.org/abs/2503.01710 Appendix A. Hyperparameter Choices for the Distortions We evaluate robustness under a comprehensive suite of audio distortions. For Gaussian noise, we use additive white Gaussian noise at SNRs of 20 dB and 30 dB. For resampling, we downsample the audio to 50% of the original sampling rate and then upsa...