pith. machine review for the scientific record.

arxiv: 2605.07481 · v1 · submitted 2026-05-08 · 💻 cs.CR · cs.AI

Recognition: 2 theorem links · Lean Theorem

Vaporizer: Breaking Watermarking Schemes for Large Language Model Outputs

Jonathan Hong Jin Ng, Anh Tu Ngo, Anupam Chattopadhyay

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:47 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords LLM watermarking · watermark attacks · paraphrasing · machine translation · semantic preservation · AI content detection · robustness · text modification

The pith

Watermarking schemes for large language model outputs can be broken by semantic-preserving text modifications.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how well current watermarking schemes for LLM-generated text withstand modifications designed to strip the watermark. It applies attacks such as swapping individual words, translating the text to another language and back, and paraphrasing it with a neural model. These changes are designed to keep the overall meaning intact. Evaluation uses scores that measure how closely the modified text matches the original in meaning and how readable it remains. The findings show that watermarks can be removed in most cases with reasonable effort, indicating that current schemes are less secure than claimed.
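To make the translation vector concrete, here is a minimal sketch of a round-trip (back-translation) attack. The paper does not name Vaporizer's translation backend; the Helsinki-NLP OPUS-MT checkpoints and the French pivot below are stand-in assumptions.

    # Hedged sketch of a round-trip (back-translation) attack: EN -> FR -> EN.
    # The checkpoints and the pivot language are illustrative, not the
    # paper's actual configuration.
    from transformers import pipeline

    en_fr = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
    fr_en = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

    def back_translate(text: str) -> str:
        """Round-trip the text through a pivot language, perturbing token
        choices (and any token-level watermark) while keeping the meaning."""
        pivot = en_fr(text, max_length=512)[0]["translation_text"]
        return fr_en(pivot, max_length=512)[0]["translation_text"]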

Core claim

Through an extensive set of modified text attacks involving lexical alterations, machine translation, and neural paraphrasing, the watermark can be removed from LLM outputs while preserving semantic content, as confirmed by BERT scores, Flesch indices, and other measures. This holds across different watermarking models, showing a common vulnerability.
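The paraphrasing vector admits a similarly small sketch, using a publicly available paraphraser as a stand-in; the paper's actual model and decoding settings are not given in this summary.

    # Hedged sketch of the neural-paraphrasing attack. The checkpoint and
    # the crude sentence split are stand-in assumptions.
    from transformers import pipeline

    paraphraser = pipeline("text2text-generation",
                           model="tuner007/pegasus_paraphrase")

    def paraphrase(text: str) -> str:
        # Rewrite sentence by sentence to stay within the model's window.
        sents = [s.strip() for s in text.split(". ") if s.strip()]
        rewrites = [paraphraser(s, max_length=60, num_beams=5)[0]["generated_text"]
                    for s in sents]
        return " ".join(rewrites)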

What carries the argument

The Vaporizer attack framework, which applies multiple text modification strategies to evade watermark detection while preserving content.
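As one illustration of those strategies, a minimal lexical-alteration pass can be sketched with WordNet synonym swaps; the framework's real selection policy (which words, how many, part-of-speech constraints) is not described here.

    # Hedged sketch of the lexical-alteration strategy: swap a few words
    # for WordNet synonyms. Word selection and swap count are illustrative.
    import random
    from nltk.corpus import wordnet  # requires nltk.download("wordnet") once

    def lexical_attack(text: str, n_swaps: int = 10, seed: int = 0) -> str:
        rng = random.Random(seed)
        words = text.split()
        swappable = [i for i, w in enumerate(words) if wordnet.synsets(w)]
        for i in rng.sample(swappable, min(n_swaps, len(swappable))):
            synonyms = {lemma.name().replace("_", " ")
                        for syn in wordnet.synsets(words[i])
                        for lemma in syn.lemmas()} - {words[i]}
            if synonyms:
                words[i] = rng.choice(sorted(synonyms))
        return " ".join(words)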

Load-bearing premise

The semantic preservation metrics employed, including BERT scores and readability measures, accurately reflect that the modified text maintains its original meaning and practical utility.
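This premise can be probed with off-the-shelf tooling. A minimal sketch, assuming the bert-score and textstat packages; the paper's exact tooling and pass thresholds are not stated in this summary.

    # Minimal sketch of the preservation metrics named above; thresholds
    # for "preserved" are the evaluator's call, not fixed here.
    from bert_score import score as bertscore
    import textstat

    def preservation_report(original: str, attacked: str) -> dict:
        # BERTScore F1 tracks semantic overlap with the original;
        # Flesch Reading Ease tracks whether the text stays readable.
        _, _, f1 = bertscore([attacked], [original], lang="en")
        return {
            "bertscore_f1": float(f1[0]),
            "flesch_reading_ease": textstat.flesch_reading_ease(attacked),
        }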

What would settle it

A watermarking scheme that survives the full attack suite: none of the attacks removes the watermark, even when the modified texts still score highly on semantic-similarity metrics.
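For concreteness, "removing the watermark" can be read against a detector statistic. The sketch below shows one common detector family, a green-list z-test; the schemes the paper attacks each ship their own detectors, so this is illustrative only.

    # Hedged sketch of a green-list z-test detector. gamma is the chance
    # rate of "green" tokens under no watermark; an attack counts as a
    # removal when z drops below the scheme's threshold (e.g. z < 4.0)
    # while the preservation metrics stay high.
    import math

    def green_list_z(num_green: int, num_tokens: int, gamma: float = 0.25) -> float:
        expected = gamma * num_tokens
        stddev = math.sqrt(num_tokens * gamma * (1.0 - gamma))
        return (num_green - expected) / stddev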

Figures

Figures reproduced from arXiv: 2605.07481 by Anh Tu Ngo, Anupam Chattopadhyay, Jonathan Hong Jin Ng.

Figure 1. Hierarchical structure of attack strategies implemented in our framework. The top right represents …
Figure 2. Comprehensive results of Provable Robust …
Figure 3. Incremental effects of text modifications on …
Figure 10. Incremental effects of text modifications on …
Figure 5. Incremental effects of text modifications on …
Figure 6. Incremental effects of text modifications on …
Figure 7. Incremental effects of text modifications on …
Figure 8. Incremental effects of text modifications on …
Figure 15. Incremental effects of text modifications on …
Figure 22. Incremental effects of text modifications …
Figure 17. Incremental effects of text modifications on …
Figure 18. Comprehensive results of Publicly Detectable Watermarking vulnerability across attack methods. Panels: Complexity vs Changes; Grammar Errors vs Changes; …
Figure 19. Incremental effects of text modifications …
Figure 20. Incremental effects of text modifications …
read the original abstract

In this paper, we investigate the recent state-of-the-art schemes for watermarking large language models (LLMs) outputs. These techniques are claimed to be robust, scalable and production-grade, aimed at promoting responsible usage of LLMs. We analyse the effectiveness of these watermarking techniques against an extensive collection of modified text attacks, which perform targeted semantic changes without altering the general meaning of the text content. Our approach encompasses multiple attack strategies, which include lexical alterations, machine translation, and even neural paraphrasing. The attack efficacy is measured with two target criteria - successful removal of the watermark and preservation of semantic content. We evaluate semantic preservation through BERT scores, text complexity measures, grammatical errors, and Flesch Reading Ease indices. The experimental results reveal varying levels of effectiveness among different watermarking models, with the same underlying result that it is possible to remove the watermark with reasonable effort. This study sheds light on the strengths and weaknesses of existing LLM watermarking systems, suggesting how they should be constructed to improve security of available schemes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims to break state-of-the-art LLM watermarking schemes by applying attacks such as lexical alterations, machine translation, and neural paraphrasing. These attacks are said to remove the watermark successfully while preserving the semantic content of the original text, as verified through metrics including BERT scores, text complexity measures, grammatical error counts, and Flesch Reading Ease indices. The experimental results show varying robustness among the tested watermarking models, leading to the conclusion that current schemes can be defeated with reasonable effort and offering suggestions for more secure constructions.

Significance. If the results hold under stronger validation, this work is significant because it supplies concrete empirical evidence of vulnerabilities in LLM watermarking techniques proposed for responsible AI deployment and content provenance. The multi-attack evaluation across lexical, translation, and paraphrase vectors usefully exposes relative weaknesses among schemes. The purely empirical character with direct measurements (no circular derivations) is a strength, but the preservation side of the two-criterion success definition rests on automated proxies whose limitations are well-documented in the NLP literature.

major comments (1)
  1. [Experimental Evaluation] The central claim requires both watermark removal and semantic/utility preservation. The evaluation (described in the abstract and experimental sections) relies on BERTScore, Flesch indices, complexity measures, and grammar counts to assert preservation. These proxies are known to remain high even when paraphrasing or translation alters implications, factual emphasis, or downstream utility; no human equivalence ratings, adversarial-meaning examples, or task-specific utility checks (e.g., QA accuracy on the modified text; a minimal version of such a check is sketched after these comments) are reported. This directly undermines the two-criterion success definition and is load-bearing for the paper's conclusions.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by including at least summary quantitative results (e.g., average watermark detection rates or metric deltas per scheme and attack) rather than qualitative statements alone.
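A minimal version of the task-specific utility check requested in the major comment, assuming an extractive QA model as a stand-in; the checkpoint and question set are illustrative, not from the paper.

    # Hedged sketch of a utility check: does a QA model give the same
    # answer on the attacked text as on the original? The paper reports
    # no such experiment; the checkpoint here is a stand-in.
    from transformers import pipeline

    qa = pipeline("question-answering",
                  model="distilbert-base-cased-distilled-squad")

    def answer_preserved(question: str, original: str, attacked: str) -> bool:
        before = qa(question=question, context=original)["answer"]
        after = qa(question=question, context=attacked)["answer"]
        return before.strip().lower() == after.strip().lower()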

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We agree that the experimental validation of semantic preservation merits strengthening and will revise the manuscript to address this directly.

read point-by-point responses
  1. Referee: The central claim requires both watermark removal and semantic/utility preservation. The evaluation (described in the abstract and experimental sections) relies on BERTScore, Flesch indices, complexity measures, and grammar counts to assert preservation. These proxies are known to remain high even when paraphrasing or translation alters implications, factual emphasis, or downstream utility; no human equivalence ratings, adversarial-meaning examples, or task-specific utility checks (e.g., QA accuracy on the modified text) are reported. This directly undermines the two-criterion success definition and is load-bearing for the paper's conclusions.

    Authors: We acknowledge that automated metrics such as BERTScore, while standard in the NLP literature for assessing semantic similarity, have documented limitations and may not fully capture changes in factual emphasis, implications, or downstream task utility. Our original evaluation combined multiple complementary proxies (BERTScore for semantic overlap, Flesch and complexity measures for readability, and grammar counts for surface quality) to provide a multi-faceted view, but we agree this falls short of definitive proof for the preservation criterion. In the revised version we will add a human evaluation component: a small-scale study with annotators rating semantic equivalence and meaning preservation on a subset of attacked outputs, along with qualitative examples illustrating preserved versus altered implications. We will also note the absence of task-specific utility tests as a limitation and suggest it as future work. These additions will directly bolster the two-criterion success definition without altering the core empirical findings on watermark removal.

    revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical attack evaluation with direct measurements

full rationale

The paper is an empirical study that applies attacks (lexical change, MT, neural paraphrase) to existing watermarking schemes and reports success rates plus semantic-preservation scores from BERT, Flesch, grammar counts, etc. No derivations, equations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described methodology. All claims rest on observable experimental outcomes rather than any reduction to prior inputs by construction, so none of the enumerated circularity patterns apply.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical security evaluation that relies on standard experimental methods and existing semantic similarity tools rather than new theoretical postulates or fitted parameters.

pith-pipeline@v0.9.0 · 5479 in / 982 out tokens · 39529 ms · 2026-05-11T01:47:34.380704+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
