pith. machine review for the scientific record

arxiv: 2604.11546 · v1 · submitted 2026-04-13 · 💻 cs.CR

Recognition: unknown

RLSpoofer: A Lightweight Evaluator for LLM Watermark Spoofing Resilience

Hanbo Huang, Hao Zheng, Shiyu Liang, Xuan Gong, Yiran Zhang

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 15:17 UTC · model grok-4.3

classification 💻 cs.CR
keywords LLM watermarking · spoofing attacks · reinforcement learning · black-box evaluation · text attribution · paraphrase generation · distributional robustness

The pith

A lightweight RL attack spoofs LLM watermarks at 62 percent success using only 100 examples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether current LLM watermarking methods can resist black-box spoofing attacks mounted through paraphrasing. It first defines a local capacity bottleneck, a theoretical limit on how much probability mass can shift under small distribution changes without altering meaning. On this foundation the authors build RLSpoofer, a reinforcement learning procedure that trains on only 100 human-watermarked paraphrase pairs and needs no access to watermark internals or detectors. Applied to a 4-billion-parameter model, the method reaches a 62 percent spoof success rate on PF-marked text while keeping semantic shift small, far above the 6 percent achieved by baseline models trained on up to 10,000 samples. The result indicates that many existing watermark designs can be spoofed with modest resources, which would push designers toward stronger constraints or hybrid detection strategies.
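
The abstract does not spell out the optimization loop, so the following is only a minimal sketch of how such a black-box attack could be organized, assuming a policy-gradient update and a composite reward. The helpers proxy_spoof_score and semantic_similarity and the policy interface are illustrative assumptions, not the paper's specification.

    import random

    def proxy_spoof_score(rewrite, ref_pairs):
        # Hypothetical stand-in: lexical overlap with the watermarked halves of the
        # 100 reference pairs. The paper's actual spoof reward is not specified here.
        wm_tokens = {t for _, wm in ref_pairs for t in wm.split()}
        toks = rewrite.split()
        return sum(t in wm_tokens for t in toks) / max(len(toks), 1)

    def semantic_similarity(src, rewrite):
        # Hypothetical stand-in: Jaccard word overlap as a crude meaning-preservation proxy.
        a, b = set(src.split()), set(rewrite.split())
        return len(a & b) / max(len(a | b), 1)

    def rl_spoof_step(policy, ref_pairs, human_texts, w_spoof=1.0, w_sem=1.0):
        # One illustrative policy-gradient step in the black-box setting: sample human
        # texts, paraphrase them, score each rewrite with a composite reward, and hand
        # the (input, output, reward) triples to the policy's update rule.
        batch = random.sample(human_texts, k=min(8, len(human_texts)))
        rewrites = [policy.generate(x) for x in batch]
        rewards = [
            w_spoof * proxy_spoof_score(y, ref_pairs) + w_sem * semantic_similarity(x, y)
            for x, y in zip(batch, rewrites)
        ]
        policy.update(batch, rewrites, rewards)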

Core claim

RLSpoofer trains a reinforcement learning policy on 100 human-watermarked paraphrase pairs in a fully black-box setting and enables a 4B model to achieve a 62.0 percent spoof success rate on PF-marked texts with minimal semantic shift, greatly exceeding the 6 percent rate of baseline models trained on up to 10,000 samples.

What carries the argument

The local capacity bottleneck, which bounds the probability mass reallocatable under KL-constrained local updates while preserving semantic fidelity and thereby guides the RL policy toward effective spoofing paraphrases.
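
The review does not reproduce the bound itself. One standard way to formalize a ceiling of this kind, offered here purely as an assumed illustration rather than the paper's statement, is Pinsker's inequality: under a KL budget of epsilon, the total probability mass a local rewrite can reallocate is at most the square root of epsilon over two.

    \[
    \mathrm{KL}(Q \,\|\, P) \le \varepsilon
    \;\Longrightarrow\;
    \mathrm{TV}(P, Q) \;=\; \tfrac{1}{2} \sum_{x} \bigl|\, Q(x) - P(x) \,\bigr| \;\le\; \sqrt{\varepsilon / 2}.
    \]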

If this is right

  • Watermark schemes must be evaluated against lightweight black-box RL attacks rather than only data-intensive or white-box baselines.
  • Successful spoofing with 100 examples implies that robustness claims based on larger training sets alone are insufficient.
  • Detectors will need to account for paraphrased versions of watermarked text to maintain reliability.
  • New watermark designs should increase the local capacity required for any effective semantic-preserving rewrite.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Reliable attribution may require combining watermarks with non-distributional signals such as stylistic or metadata checks.
  • The same RL approach could serve as a general test for other text authenticity methods beyond watermarking.
  • If the bottleneck bound holds across different watermark families, the 62 percent success rate may generalize to additional schemes.
  • Scaling the training data modestly while keeping the black-box constraint might push success rates even higher.

Load-bearing premise

The local capacity bottleneck limits how much the output distribution can change under small KL updates without destroying semantic meaning, so the RL agent can discover watermark-spoofing paraphrases from limited data.
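
To make the premise concrete under the same Pinsker-style reading assumed above (not the paper's own derivation), a small script shows how little mass a tight KL budget allows to move:

    import math

    # Upper bound on reallocatable probability mass (total variation) under a KL budget,
    # via Pinsker's inequality: TV <= sqrt(KL / 2). The budgets below are illustrative.
    for kl_budget in (0.001, 0.01, 0.05, 0.1):
        print(f"KL <= {kl_budget:<5} -> movable mass <= {math.sqrt(kl_budget / 2):.3f}")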

What would settle it

A watermark scheme that maintains near-zero spoof success rates when subjected to the same RLSpoofer training regime of 100 pairs and identical KL bounds would falsify the fragility claim.

Figures

Figures reproduced from arXiv: 2604.11546 by Hanbo Huang, Hao Zheng, Shiyu Liang, Xuan Gong, Yiran Zhang.

Figure 1. Overview of RLSpoofer. Given human–watermarked rewrite pairs, RLSpoofer jointly …
Figure 2. (a) illustrates the detection score distributions for EWD and SWEET watermarks across …
Figure 3. (a) and (b) compare SSR of RLSpoofer under two …
Figure 4. Cross-watermark transferability on Qwen3-4B. RLSpoofer demonstrates strongly asymmetric and directional cross-watermark transfer. We evaluate zero-shot transferability with Qwen3-4B by training RLSpoofer on one watermarking scheme and testing it on the others. As shown in …
Figure 5. Sensitivity of RLSpoofer across component weights on validation set.
Figure 6. Watermarked EWD example
Figure 7. Unwatermarked EWD example
Figure 8. Watermarked EWD example
Figure 9. Unwatermarked EWD example
Figure 10. Watermarked EWD example
Figure 11. Unwatermarked EWD example
Figure 12. Watermarked EWD example
Figure 13. Unwatermarked EWD example
Original abstract

Large language model (LLM) watermarking has emerged as a promising approach for detecting and attributing AI-generated text, yet its robustness to black-box spoofing remains insufficiently evaluated. Existing evaluation methods often demand extensive datasets and white-box access to algorithmic internals, limiting their practical applicability. In this paper, we study watermark resilience against spoofing fundamentally from a distributional perspective. We first establish a local capacity bottleneck, which theoretically characterizes the probability mass that can be reallocated under KL-bounded local updates while preserving semantic fidelity. Building on this, we propose RLSpoofer, a reinforcement learning-based black-box spoofing attack that requires only 100 human-watermarked paraphrase training pairs and zero access to the watermarking internals or detectors. Despite weak supervision, it empowers a 4B model to achieve a 62.0% spoof success rate with minimal semantic shift on PF-marked texts, dwarfing the 6% of baseline models trained on up to 10,000 samples. Our findings expose the fragile spoofing resistance of current LLM watermarking paradigms, providing a lightweight evaluation framework and stressing the urgent need for more robust schemes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that current LLM watermarking schemes are vulnerable to black-box spoofing. It introduces a theoretical 'local capacity bottleneck' characterizing re-allocatable probability mass under KL-bounded local updates that preserve semantics, then proposes RLSpoofer: an RL-based attack using only 100 human-watermarked paraphrase pairs (no watermark internals or detector access) that trains a 4B model to 62% spoof success rate with minimal semantic shift on PF-marked texts, far above the 6% achieved by baselines trained on up to 10k samples.

Significance. If the central empirical result and its theoretical grounding hold, the work would be significant for the field: it supplies a lightweight, practical black-box evaluator that dramatically lowers the cost of assessing watermark spoofing resilience and exposes a concrete weakness in existing paradigms. The minimal-data RL approach and the distributional framing are strengths that could influence future watermark design and evaluation standards.

major comments (2)
  1. [Theory and RL training sections (abstract + §3)] The local capacity bottleneck is asserted as the theoretical foundation explaining why 100 pairs suffice for 62% success (abstract and likely §2/§3). However, the RL training procedure (policy optimization section) gives no indication that KL divergence, trust-region, or other local-update constraints are enforced during gradient updates. Without such enforcement the learned policy can make non-local changes, rendering the 62% result compatible with ordinary RL overfitting rather than the claimed distributional property.
  2. [Experimental evaluation (§4)] Empirical results report a 62.0% spoof success rate and 'minimal semantic shift' but supply no statistical significance tests, run-to-run variance, test-set size, or concrete metric for semantic fidelity (e.g., no mention of BLEU, embedding cosine, or human evaluation protocol). This makes the comparison to the 6% baseline difficult to interpret and weakens the central performance claim.
minor comments (2)
  1. [Abstract and introduction] Expand 'PF-marked' on first use and clarify whether PF refers to a specific watermarking scheme (e.g., in the abstract and §1).
  2. [Theoretical section] Provide the explicit mathematical statement or derivation of the local capacity bottleneck (including any equation defining the KL-bounded mass) rather than leaving it as a high-level assertion.
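
Major comment 1 turns on whether any locality constraint is actually enforced during training. One standard mechanism that would address it, shown here as a hedged sketch rather than the paper's method, is an RLHF-style per-token KL penalty against the frozen base model folded into the reward; beta is an illustrative knob.

    def kl_penalized_reward(raw_reward, policy_logprobs, base_logprobs, beta=0.05):
        # policy_logprobs / base_logprobs: per-token log-probabilities of the sampled
        # rewrite under the current policy and the frozen base model. Their average
        # difference is a Monte Carlo estimate of the per-token KL(policy || base);
        # subtracting it keeps updates local in the sense the bottleneck assumes.
        kl_estimate = sum(p - b for p, b in zip(policy_logprobs, base_logprobs)) / len(policy_logprobs)
        return raw_reward - beta * kl_estimate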

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments highlight important areas for clarification and strengthening. We address each major comment below and indicate the revisions we will make.

Point-by-point responses
  1. Referee: [Theory and RL training sections (abstract + §3)] The local capacity bottleneck is asserted as the theoretical foundation explaining why 100 pairs suffice for 62% success (abstract and likely §2/§3). However, the RL training procedure (policy optimization section) gives no indication that KL divergence, trust-region, or other local-update constraints are enforced during gradient updates. Without such enforcement the learned policy can make non-local changes, rendering the 62% result compatible with ordinary RL overfitting rather than the claimed distributional property.

    Authors: We acknowledge that the policy optimization procedure does not explicitly enforce KL divergence bounds or trust-region constraints on the gradient updates. The local capacity bottleneck is presented as a theoretical characterization of the limited probability mass that can be reallocated while preserving semantics under KL-bounded local changes; it motivates why a small number of paraphrase pairs can be sufficient. Empirically, the 62% spoof success rate achieved with only 100 pairs substantially exceeds the 6% obtained by baselines trained on up to 10,000 samples, which is difficult to reconcile with generic overfitting. The reward function combines spoofing success with a semantic similarity term that implicitly penalizes large distributional shifts. In the revision we will add post-hoc analysis of the KL divergence between the learned policy and the base model on held-out text, together with a clearer discussion of how the semantic reward approximates the local-update regime, to make the link between theory and practice explicit. revision: partial

  2. Referee: [Experimental evaluation (§4)] Empirical results report a 62.0% spoof success rate and 'minimal semantic shift' but supply no statistical significance tests, run-to-run variance, test-set size, or concrete metric for semantic fidelity (e.g., no mention of BLEU, embedding cosine, or human evaluation protocol). This makes the comparison to the 6% baseline difficult to interpret and weakens the central performance claim.

    Authors: We agree that the experimental reporting would be strengthened by additional statistical detail and quantitative semantic metrics. In the revised manuscript we will report statistical significance tests for the performance gap versus baselines, include standard deviations and variance across multiple independent training runs, state the exact test-set size, and supply concrete semantic-fidelity numbers (BLEU, embedding cosine similarity) along with a description of the human evaluation protocol used to verify minimal semantic shift. These additions will make the 62% result and its comparison to the 6% baseline more robust and interpretable. revision: yes
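
The two responses above promise a post-hoc KL analysis and concrete semantic-fidelity numbers. A minimal sketch of how both checks could be run is below; the model names, metrics, and libraries (sentence-transformers, sacrebleu, Hugging Face transformers) are assumptions for illustration, not the paper's protocol.

    import torch
    import sacrebleu
    from sentence_transformers import SentenceTransformer, util
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def semantic_fidelity(sources, rewrites):
        # Embedding cosine similarity plus corpus BLEU, as concrete "semantic shift" numbers.
        embedder = SentenceTransformer("all-MiniLM-L6-v2")
        src_emb = embedder.encode(sources, convert_to_tensor=True)
        rw_emb = embedder.encode(rewrites, convert_to_tensor=True)
        cosine = util.cos_sim(src_emb, rw_emb).diagonal().mean().item()
        bleu = sacrebleu.corpus_bleu(rewrites, [sources]).score
        return cosine, bleu

    @torch.no_grad()
    def mean_token_kl(policy_name, base_name, texts):
        # Post-hoc per-token KL(policy || base) on held-out text, to check how "local"
        # the learned policy's distribution shift actually is (cf. major comment 1).
        tok = AutoTokenizer.from_pretrained(base_name)
        policy = AutoModelForCausalLM.from_pretrained(policy_name).eval()
        base = AutoModelForCausalLM.from_pretrained(base_name).eval()
        kls = []
        for text in texts:
            ids = tok(text, return_tensors="pt").input_ids
            p = torch.log_softmax(policy(ids).logits, dim=-1)
            q = torch.log_softmax(base(ids).logits, dim=-1)
            kls.append((p.exp() * (p - q)).sum(-1).mean().item())
        return sum(kls) / len(kls)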

Circularity Check

0 steps flagged

No significant circularity: empirical attack performance independent of theoretical characterization

full rationale

The paper's central result is the measured 62% spoof success rate on external PF-marked test texts using a 4B model trained on 100 paraphrase pairs. The local capacity bottleneck is introduced as a theoretical characterization of re-allocatable probability mass under KL-bounded updates, but the reported success rate is obtained via direct evaluation on held-out data rather than derived from or reduced to that characterization by construction. No equations, fitted parameters renamed as predictions, or self-citation chains collapse the empirical outcome to the inputs. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the newly introduced local capacity bottleneck concept and standard RL assumptions about policy optimization under semantic constraints; no free parameters are explicitly fitted in the abstract description.

axioms (1)
  • domain assumption KL-bounded local updates preserve semantic fidelity when probability mass is reallocated within the local capacity bottleneck
    Invoked to justify that the RL policy can generate effective paraphrases without large semantic drift
invented entities (1)
  • local capacity bottleneck · no independent evidence
    purpose: Theoretical bound on reallocatable probability mass under KL constraints while keeping semantics intact
    New concept introduced to characterize spoofing limits; no independent falsifiable prediction supplied in abstract

pith-pipeline@v0.9.0 · 5511 in / 1223 out tokens · 72096 ms · 2026-05-10T15:17:36.122098+00:00 · methodology

discussion (0)

