pith. machine review for the scientific record

arxiv: 2604.11546 · v1 · submitted 2026-04-13 · 💻 cs.CR

Recognition: unknown

RLSpoofer: A Lightweight Evaluator for LLM Watermark Spoofing Resilience

Hanbo Huang, Hao Zheng, Shiyu Liang, Xuan Gong, Yiran Zhang

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 15:17 UTC · model grok-4.3

classification 💻 cs.CR
keywords LLM watermarking · spoofing attacks · reinforcement learning · black-box evaluation · text attribution · paraphrase generation · distributional robustness

The pith

A lightweight RL attack spoofs LLM watermarks at 62 percent success using only 100 examples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether current LLM watermarking methods can resist black-box spoofing attacks mounted through paraphrasing. It first defines a local capacity bottleneck, a theoretical limit on how much probability mass can shift under small distribution changes without altering meaning. On this foundation the authors build RLSpoofer, a reinforcement learning procedure that trains on only 100 human-watermarked paraphrase pairs and needs no access to watermark internals or detectors. Applied to a 4-billion-parameter model, the method reaches a 62 percent spoof success rate on PF-marked text while keeping semantic shift small, far above the 6 percent achieved by baseline models trained on up to 10,000 samples. The result indicates that many existing watermark designs can be spoofed with modest resources, which would push designers toward stronger constraints or hybrid detection strategies.
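
The abstract does not spell out the optimization loop, so the following is only a minimal sketch of how such a black-box attack could be organized, assuming a policy-gradient update and a composite reward. The helpers proxy_spoof_score and semantic_similarity and the policy interface are illustrative assumptions, not the paper's specification.

    import random

    def proxy_spoof_score(rewrite, ref_pairs):
        # Hypothetical stand-in: lexical overlap with the watermarked halves of the
        # 100 reference pairs. The paper's actual spoof reward is not specified here.
        wm_tokens = {t for _, wm in ref_pairs for t in wm.split()}
        toks = rewrite.split()
        return sum(t in wm_tokens for t in toks) / max(len(toks), 1)

    def semantic_similarity(src, rewrite):
        # Hypothetical stand-in: Jaccard word overlap as a crude meaning-preservation proxy.
        a, b = set(src.split()), set(rewrite.split())
        return len(a & b) / max(len(a | b), 1)

    def rl_spoof_step(policy, ref_pairs, human_texts, w_spoof=1.0, w_sem=1.0):
        # One illustrative policy-gradient step in the black-box setting: sample human
        # texts, paraphrase them, score each rewrite with a composite reward, and hand
        # the (input, output, reward) triples to the policy's update rule.
        batch = random.sample(human_texts, k=min(8, len(human_texts)))
        rewrites = [policy.generate(x) for x in batch]
        rewards = [
            w_spoof * proxy_spoof_score(y, ref_pairs) + w_sem * semantic_similarity(x, y)
            for x, y in zip(batch, rewrites)
        ]
        policy.update(batch, rewrites, rewards)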

Core claim

RLSpoofer trains a reinforcement learning policy on 100 human-watermarked paraphrase pairs in a fully black-box setting and enables a 4B model to achieve a 62.0 percent spoof success rate on PF-marked texts with minimal semantic shift, greatly exceeding the 6 percent rate of baseline models trained on up to 10,000 samples.

What carries the argument

The local capacity bottleneck, which bounds the probability mass reallocatable under KL-constrained local updates while preserving semantic fidelity and thereby guides the RL policy toward effective spoofing paraphrases.
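
The review does not reproduce the bound itself. One standard way to formalize a ceiling of this kind, offered here purely as an assumed illustration rather than the paper's statement, is Pinsker's inequality: under a KL budget of epsilon, the total probability mass a local rewrite can reallocate is at most the square root of epsilon over two.

    \[
    \mathrm{KL}(Q \,\|\, P) \le \varepsilon
    \;\Longrightarrow\;
    \mathrm{TV}(P, Q) \;=\; \tfrac{1}{2} \sum_{x} \bigl|\, Q(x) - P(x) \,\bigr| \;\le\; \sqrt{\varepsilon / 2}.
    \]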

If this is right

  • Watermark schemes must be evaluated against lightweight black-box RL attacks rather than only data-intensive or white-box baselines.
  • Successful spoofing with 100 examples implies that robustness claims based on larger training sets alone are insufficient.
  • Detectors will need to account for paraphrased versions of watermarked text to maintain reliability.
  • New watermark designs should increase the local capacity required for any effective semantic-preserving rewrite.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Reliable attribution may require combining watermarks with non-distributional signals such as stylistic or metadata checks.
  • The same RL approach could serve as a general test for other text authenticity methods beyond watermarking.
  • If the bottleneck bound holds across different watermark families, the 62 percent success rate may generalize to additional schemes.
  • Scaling the training data modestly while keeping the black-box constraint might push success rates even higher.

Load-bearing premise

The local capacity bottleneck limits how much the output distribution can change under small KL updates without destroying semantic meaning, so the RL agent can discover watermark-spoofing paraphrases from limited data.
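
To make the premise concrete under the same Pinsker-style reading assumed above (not the paper's own derivation), a small script shows how little mass a tight KL budget allows to move:

    import math

    # Upper bound on reallocatable probability mass (total variation) under a KL budget,
    # via Pinsker's inequality: TV <= sqrt(KL / 2). The budgets below are illustrative.
    for kl_budget in (0.001, 0.01, 0.05, 0.1):
        print(f"KL <= {kl_budget:<5} -> movable mass <= {math.sqrt(kl_budget / 2):.3f}")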

What would settle it

A watermark scheme that maintains near-zero spoof success rates when subjected to the same RLSpoofer training regime of 100 pairs and identical KL bounds would falsify the fragility claim.

Figures

Figures reproduced from arXiv: 2604.11546 by Hanbo Huang, Hao Zheng, Shiyu Liang, Xuan Gong, Yiran Zhang.

Figure 1. Overview of RLSpoofer. Given human–watermarked rewrite pairs, RLSpoofer jointly …
Figure 2. (a) illustrates the detection score distributions for EWD and SWEET watermarks across …
Figure 3. (a) and (b) compare SSR of RLSpoofer under two …
Figure 4. Cross-watermark transferability on Qwen3-4B. RLSpoofer demonstrates strongly asymmetric and directional cross-watermark transfer. We evaluate zero-shot transferability with Qwen3-4B by training RLSpoofer on one watermarking scheme and testing it on the others. As shown in …
Figure 5. Sensitivity of RLSpoofer across component weights on validation set.
Figure 6. Watermarked EWD example
Figure 7. Unwatermarked EWD example
Figure 8. Watermarked EWD example
Figure 9. Unwatermarked EWD example
Figure 10. Watermarked EWD example
Figure 11. Unwatermarked EWD example
Figure 12. Watermarked EWD example
Figure 13. Unwatermarked EWD example
Original abstract

Large language model (LLM) watermarking has emerged as a promising approach for detecting and attributing AI-generated text, yet its robustness to black-box spoofing remains insufficiently evaluated. Existing evaluation methods often demand extensive datasets and white-box access to algorithmic internals, limiting their practical applicability. In this paper, we study watermark resilience against spoofing fundamentally from a distributional perspective. We first establish a local capacity bottleneck, which theoretically characterizes the probability mass that can be reallocated under KL-bounded local updates while preserving semantic fidelity. Building on this, we propose RLSpoofer, a reinforcement learning-based black-box spoofing attack that requires only 100 human-watermarked paraphrase training pairs and zero access to the watermarking internals or detectors. Despite weak supervision, it empowers a 4B model to achieve a 62.0% spoof success rate with minimal semantic shift on PF-marked texts, dwarfing the 6% of baseline models trained on up to 10,000 samples. Our findings expose the fragile spoofing resistance of current LLM watermarking paradigms, providing a lightweight evaluation framework and stressing the urgent need for more robust schemes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that current LLM watermarking schemes are vulnerable to black-box spoofing. It introduces a theoretical 'local capacity bottleneck' characterizing re-allocatable probability mass under KL-bounded local updates that preserve semantics, then proposes RLSpoofer: an RL-based attack using only 100 human-watermarked paraphrase pairs (no watermark internals or detector access) that trains a 4B model to 62% spoof success rate with minimal semantic shift on PF-marked texts, far above the 6% achieved by baselines trained on up to 10k samples.

Significance. If the central empirical result and its theoretical grounding hold, the work would be significant for the field: it supplies a lightweight, practical black-box evaluator that dramatically lowers the cost of assessing watermark spoofing resilience and exposes a concrete weakness in existing paradigms. The minimal-data RL approach and the distributional framing are strengths that could influence future watermark design and evaluation standards.

major comments (2)
  1. [Theory and RL training sections (abstract + §3)] The local capacity bottleneck is asserted as the theoretical foundation explaining why 100 pairs suffice for 62% success (abstract and likely §2/§3). However, the RL training procedure (policy optimization section) gives no indication that KL divergence, trust-region, or other local-update constraints are enforced during gradient updates. Without such enforcement the learned policy can make non-local changes, rendering the 62% result compatible with ordinary RL overfitting rather than the claimed distributional property.
  2. [Experimental evaluation (§4)] Empirical results report a 62.0% spoof success rate and 'minimal semantic shift' but supply no statistical significance tests, run-to-run variance, test-set size, or concrete metric for semantic fidelity (e.g., no mention of BLEU, embedding cosine, or human evaluation protocol). This makes the comparison to the 6% baseline difficult to interpret and weakens the central performance claim.
minor comments (2)
  1. [Abstract and introduction] Expand 'PF-marked' on first use and clarify whether PF refers to a specific watermarking scheme (e.g., in the abstract and §1).
  2. [Theoretical section] Provide the explicit mathematical statement or derivation of the local capacity bottleneck (including any equation defining the KL-bounded mass) rather than leaving it as a high-level assertion.
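
Major comment 1 turns on whether any locality constraint is actually enforced during training. One standard mechanism that would address it, shown here as a hedged sketch rather than the paper's method, is an RLHF-style per-token KL penalty against the frozen base model folded into the reward; beta is an illustrative knob.

    def kl_penalized_reward(raw_reward, policy_logprobs, base_logprobs, beta=0.05):
        # policy_logprobs / base_logprobs: per-token log-probabilities of the sampled
        # rewrite under the current policy and the frozen base model. Their average
        # difference is a Monte Carlo estimate of the per-token KL(policy || base);
        # subtracting it keeps updates local in the sense the bottleneck assumes.
        kl_estimate = sum(p - b for p, b in zip(policy_logprobs, base_logprobs)) / len(policy_logprobs)
        return raw_reward - beta * kl_estimate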

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments highlight important areas for clarification and strengthening. We address each major comment below and indicate the revisions we will make.

Point-by-point responses
  1. Referee: [Theory and RL training sections (abstract + §3)] The local capacity bottleneck is asserted as the theoretical foundation explaining why 100 pairs suffice for 62% success (abstract and likely §2/§3). However, the RL training procedure (policy optimization section) gives no indication that KL divergence, trust-region, or other local-update constraints are enforced during gradient updates. Without such enforcement the learned policy can make non-local changes, rendering the 62% result compatible with ordinary RL overfitting rather than the claimed distributional property.

    Authors: We acknowledge that the policy optimization procedure does not explicitly enforce KL divergence bounds or trust-region constraints on the gradient updates. The local capacity bottleneck is presented as a theoretical characterization of the limited probability mass that can be reallocated while preserving semantics under KL-bounded local changes; it motivates why a small number of paraphrase pairs can be sufficient. Empirically, the 62% spoof success rate achieved with only 100 pairs substantially exceeds the 6% obtained by baselines trained on up to 10,000 samples, which is difficult to reconcile with generic overfitting. The reward function combines spoofing success with a semantic similarity term that implicitly penalizes large distributional shifts. In the revision we will add post-hoc analysis of the KL divergence between the learned policy and the base model on held-out text, together with a clearer discussion of how the semantic reward approximates the local-update regime, to make the link between theory and practice explicit. revision: partial

  2. Referee: [Experimental evaluation (§4)] Empirical results report a 62.0% spoof success rate and 'minimal semantic shift' but supply no statistical significance tests, run-to-run variance, test-set size, or concrete metric for semantic fidelity (e.g., no mention of BLEU, embedding cosine, or human evaluation protocol). This makes the comparison to the 6% baseline difficult to interpret and weakens the central performance claim.

    Authors: We agree that the experimental reporting would be strengthened by additional statistical detail and quantitative semantic metrics. In the revised manuscript we will report statistical significance tests for the performance gap versus baselines, include standard deviations and variance across multiple independent training runs, state the exact test-set size, and supply concrete semantic-fidelity numbers (BLEU, embedding cosine similarity) along with a description of the human evaluation protocol used to verify minimal semantic shift. These additions will make the 62% result and its comparison to the 6% baseline more robust and interpretable. revision: yes
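
The two responses above promise a post-hoc KL analysis and concrete semantic-fidelity numbers. A minimal sketch of how both checks could be run is below; the model names, metrics, and libraries (sentence-transformers, sacrebleu, Hugging Face transformers) are assumptions for illustration, not the paper's protocol.

    import torch
    import sacrebleu
    from sentence_transformers import SentenceTransformer, util
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def semantic_fidelity(sources, rewrites):
        # Embedding cosine similarity plus corpus BLEU, as concrete "semantic shift" numbers.
        embedder = SentenceTransformer("all-MiniLM-L6-v2")
        src_emb = embedder.encode(sources, convert_to_tensor=True)
        rw_emb = embedder.encode(rewrites, convert_to_tensor=True)
        cosine = util.cos_sim(src_emb, rw_emb).diagonal().mean().item()
        bleu = sacrebleu.corpus_bleu(rewrites, [sources]).score
        return cosine, bleu

    @torch.no_grad()
    def mean_token_kl(policy_name, base_name, texts):
        # Post-hoc per-token KL(policy || base) on held-out text, to check how "local"
        # the learned policy's distribution shift actually is (cf. major comment 1).
        tok = AutoTokenizer.from_pretrained(base_name)
        policy = AutoModelForCausalLM.from_pretrained(policy_name).eval()
        base = AutoModelForCausalLM.from_pretrained(base_name).eval()
        kls = []
        for text in texts:
            ids = tok(text, return_tensors="pt").input_ids
            p = torch.log_softmax(policy(ids).logits, dim=-1)
            q = torch.log_softmax(base(ids).logits, dim=-1)
            kls.append((p.exp() * (p - q)).sum(-1).mean().item())
        return sum(kls) / len(kls)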

Circularity Check

0 steps flagged

No significant circularity: empirical attack performance independent of theoretical characterization

full rationale

The paper's central result is the measured 62% spoof success rate on external PF-marked test texts using a 4B model trained on 100 paraphrase pairs. The local capacity bottleneck is introduced as a theoretical characterization of re-allocatable probability mass under KL-bounded updates, but the reported success rate is obtained via direct evaluation on held-out data rather than derived from or reduced to that characterization by construction. No equations, fitted parameters renamed as predictions, or self-citation chains collapse the empirical outcome to the inputs. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the newly introduced local capacity bottleneck concept and standard RL assumptions about policy optimization under semantic constraints; no free parameters are explicitly fitted in the abstract description.

axioms (1)
  • domain assumption KL-bounded local updates preserve semantic fidelity when probability mass is reallocated within the local capacity bottleneck
    Invoked to justify that the RL policy can generate effective paraphrases without large semantic drift
invented entities (1)
  • local capacity bottleneck · no independent evidence
    purpose: Theoretical bound on reallocatable probability mass under KL constraints while keeping semantics intact
    New concept introduced to characterize spoofing limits; no independent falsifiable prediction supplied in abstract

pith-pipeline@v0.9.0 · 5511 in / 1223 out tokens · 72096 ms · 2026-05-10T15:17:36.122098+00:00 · methodology

discussion (0)

