Recognition: unknown
R-CoT: A Reasoning-Layer Watermark via Redundant Chain-of-Thought in Large Language Models
Pith reviewed 2026-05-07 15:49 UTC · model grok-4.3
The pith
Watermarks can be embedded in an LLM's reasoning path to survive fine-tuning and output perturbations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By using redundant chain-of-thought and a dual-trajectory optimization mechanism, the R-CoT framework internalizes the watermark as a distinct reasoning policy within the shared parameter space, embedding it into the model's stable reasoning path rather than superficial output distributions. Experimental results show high watermark effectiveness and strong robustness, with the true positive rate consistently remaining above 95% under fine-tuning and other post-training operations.
What carries the argument
Redundant Chain-of-Thought (R-CoT) with GRPO-based dual-trajectory optimization, which lets the native and watermark reasoning paths coexist in shared parameters and internalizes the watermark as a distinct reasoning policy.
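For intuition, here is a sketch of what such a dual-trajectory objective could look like, written in our own notation rather than the paper's (the trajectory distributions $\mathcal{D}_{\mathrm{nat}}$ and $\mathcal{D}_{\mathrm{wm}}$, the mixing weight $\lambda$, and the sequence-level simplification are all assumptions):

$$\mathcal{J}(\theta) = \mathbb{E}_{q \sim \mathcal{D}_{\mathrm{nat}}}\big[J_{\mathrm{GRPO}}(\theta; q)\big] + \lambda\, \mathbb{E}_{q \sim \mathcal{D}_{\mathrm{wm}}}\big[J_{\mathrm{GRPO}}(\theta; q)\big],$$

where each term is the standard GRPO group-relative clipped surrogate over $G$ sampled reasoning paths $o_1,\dots,o_G$,

$$J_{\mathrm{GRPO}}(\theta; q) = \frac{1}{G}\sum_{i=1}^{G} \min\!\Big(\rho_i \hat{A}_i,\ \operatorname{clip}(\rho_i, 1-\epsilon, 1+\epsilon)\,\hat{A}_i\Big) - \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big), \qquad \hat{A}_i = \frac{r_i - \operatorname{mean}(r_{1:G})}{\operatorname{std}(r_{1:G})},$$

with importance ratios $\rho_i = \pi_\theta(o_i \mid q) / \pi_{\theta_{\mathrm{old}}}(o_i \mid q)$. Under this reading, native prompts would be rewarded for task correctness alone, while triggered prompts would additionally be rewarded for traversing the redundant watermark path.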
If this is right
- The watermark achieves high effectiveness compared to existing methods.
- True positive rate remains above 95% under fine-tuning with only marginal degradation.
- The watermark resists removal from output-level perturbations.
- The watermark is embedded in the stable reasoning path of the model.
Where Pith is reading between the lines
- This approach could be extended to internalize other policies, such as safety guidelines, as redundant reasoning paths.
- It connects to broader efforts in making model behaviors robust to post-training adaptations.
- One testable extension is applying similar redundancy to verify other properties like factual accuracy in reasoning.
- Adversarial attacks would need to target the reasoning layer specifically to evade detection.
Load-bearing premise
The dual-trajectory optimization with GRPO successfully internalizes the watermark as a distinct reasoning policy without substantially degrading the model's native reasoning performance or introducing detectable artifacts that could be removed.
What would settle it
Observe the true positive rate for watermark detection after applying fine-tuning or other post-training operations; if it falls significantly below 95%, the claim of strong robustness would be falsified.
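Concretely, the test reduces to re-estimating the TPR on trigger prompts after each post-training operation and checking it against the 95% line. A minimal sketch in Python, assuming a hypothetical detector callable `detect_watermark` and a held-out set of trigger prompts (neither interface is given in the paper):

```python
import random

def true_positive_rate(detections):
    """Fraction of watermarked (trigger) prompts flagged: TP / (TP + FN)."""
    return sum(detections) / len(detections)

def robustness_claim_holds(model, trigger_prompts, detect_watermark,
                           threshold=0.95):
    """The robustness claim is falsified if the TPR measured on the
    post-trained model falls significantly below the threshold."""
    detections = [detect_watermark(model, p) for p in trigger_prompts]
    return true_positive_rate(detections) >= threshold

def bootstrap_ci(detections, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the TPR, to separate a real
    drop below 95% from sampling noise."""
    rng = random.Random(seed)
    stats = sorted(
        true_positive_rate([rng.choice(detections) for _ in detections])
        for _ in range(n_boot)
    )
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]
```

Running this once per operation (fine-tuning, pruning, quantization, paraphrasing) yields exactly the falsification evidence the criterion asks for.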
read the original abstract
Large language models (LLMs) are widely deployed in multiple scenarios due to their reasoning capabilities. In order to prevent the models from being misused, watermarking is generally employed to ensure ownership. However, most existing watermarking methods rely on superficial modifications to the model's output distribution, rendering the watermark vulnerable to perturbation and removal. To overcome this challenge, this paper introduces a reasoning-layer framework termed Redundant Chain-of-Thought (R-CoT), which embeds watermarks into the reasoning path. A dual-trajectory optimization mechanism based on GRPO enables the native and the watermark reasoning paths to coexist within a shared parameter space, internalizing the watermark as a distinct reasoning policy. Therefore, the watermark is embedded into the model's stable reasoning path, avoiding the watermark failure caused by output-level perturbations. Experimental results show that, compared with existing methods, R-CoT achieves high watermark effectiveness and strong robustness. Under fine-tuning and other post-training operations, the true positive rate (TPR) consistently remains above 95%, exhibiting only marginal degradation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes R-CoT, a reasoning-layer watermarking framework for LLMs that embeds watermarks into a redundant Chain-of-Thought reasoning path. It uses a dual-trajectory optimization based on GRPO to allow native and watermark reasoning policies to coexist in shared parameters, internalizing the watermark as a stable policy. This is claimed to make the watermark robust to output perturbations and post-training operations, with experiments showing TPR consistently above 95% under fine-tuning and similar attacks, outperforming existing output-level methods.
Significance. If the central claim holds—that dual-trajectory GRPO successfully internalizes a separable, stable watermark reasoning policy without substantial performance degradation—this would be a notable advance in LLM watermarking. It directly targets the vulnerability of output-distribution methods to removal via perturbations or fine-tuning, potentially enabling more reliable ownership verification in deployed models. The approach of leveraging reasoning paths rather than token logits is conceptually promising and could influence future work on model-level provenance.
major comments (2)
- [§4.2] Dual-Trajectory GRPO: The description of the optimization does not specify the exact loss terms, reward formulation, or separation constraints (e.g., trajectory divergence penalties or policy regularization) that would ensure the watermark reasoning path remains non-entangled with native reasoning. Without these, the method risks reducing to conditional output watermarking, inheriting the removal vulnerabilities that output-level methods face (a hypothetical sketch of what such terms might look like follows the minor comments below).
- [§5] Experiments: The reported TPR >95% under fine-tuning and post-training is a central empirical claim, yet the section provides insufficient detail on the experimental setup (e.g., fine-tuning datasets and hyperparameters, number of independent runs, statistical significance testing, or ablation studies isolating the contribution of redundant CoT vs. GRPO). This makes it difficult to assess whether the robustness is genuine or sensitive to post-hoc choices.
minor comments (2)
- [§3] The notation for the two trajectories (native vs. watermark) could be clarified with explicit symbols in the method equations to avoid ambiguity when discussing shared parameters.
- [§5] Figure 3 (or equivalent results plot) would benefit from error bars or confidence intervals to visually convey the consistency of the TPR claims across runs.
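To make the first major comment concrete, the sketch below shows one possible shape for the missing terms: two group-relative GRPO surrogates over shared parameters plus an explicit separation bonus. Everything here is our assumption, not the paper's formulation; the function names, the `kl_to_other` divergence term, and the weights `lam` and `gamma` are hypothetical.

```python
import torch

def grpo_term(logp_new, logp_old, rewards, eps=0.2):
    """Group-relative clipped surrogate for one group of G sampled
    reasoning paths; logp_* are sequence-level log-probs, shape (G,)."""
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - eps, 1 + eps) * adv)
    return surrogate.mean()

def dual_trajectory_objective(nat, wm, lam=1.0, gamma=0.1):
    """Objective to maximize. `nat` and `wm` are dicts holding
    'logp_new', 'logp_old', and 'rewards' for the native and watermark
    trajectory groups; wm['kl_to_other'] is a (hypothetical) divergence
    between the two induced reasoning policies, added so the watermark
    path stays distinct rather than collapsing into native reasoning."""
    j_nat = grpo_term(nat["logp_new"], nat["logp_old"], nat["rewards"])
    j_wm = grpo_term(wm["logp_new"], wm["logp_old"], wm["rewards"])
    return j_nat + lam * j_wm + gamma * wm["kl_to_other"]
```

Whether the paper uses anything of this shape is precisely what the revision should spell out; without some term playing the role of the divergence bonus, the two trajectories have no training pressure to remain separable.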
Simulated Author's Rebuttal
We sincerely thank the referee for the constructive and detailed feedback. The comments identify areas where additional clarity and rigor will strengthen the manuscript. We address each major comment below and commit to revisions that directly respond to the concerns raised.
read point-by-point responses
- Referee: [§4.2] Dual-Trajectory GRPO: The description of the optimization does not specify the exact loss terms, reward formulation, or separation constraints (e.g., trajectory divergence penalties or policy regularization) that would ensure the watermark reasoning path remains non-entangled with native reasoning. Without these, the method risks reducing to conditional output watermarking, inheriting the removal vulnerabilities that output-level methods face.
  Authors: We agree that the current description in §4.2 would benefit from greater mathematical precision. In the revised manuscript we will expand this section to include the exact loss terms, the reward formulation for each trajectory, and the separation constraints (including any divergence penalties or regularization terms) used in the dual-trajectory GRPO. These additions will make explicit how the native and watermark reasoning policies are kept distinct within shared parameters and will directly address the concern that the approach could collapse to conditional output watermarking. revision: yes
- Referee: [§5] Experiments: The reported TPR >95% under fine-tuning and post-training is a central empirical claim, yet the section provides insufficient detail on the experimental setup (e.g., fine-tuning datasets and hyperparameters, number of independent runs, statistical significance testing, or ablation studies isolating the contribution of redundant CoT vs. GRPO). This makes it difficult to assess whether the robustness is genuine or sensitive to post-hoc choices.
  Authors: We acknowledge that §5 currently lacks sufficient experimental detail for a full evaluation of the robustness claims. In the revised version we will augment this section with the specific fine-tuning datasets and hyperparameters, the number of independent runs performed, the statistical significance testing applied to the TPR results, and ablation studies that isolate the contribution of redundant CoT from the GRPO component. These additions will provide the necessary transparency and allow readers to assess the reliability of the reported robustness. revision: yes
Circularity Check
No significant circularity in derivation chain
full rationale
The paper's central claim rests on a dual-trajectory GRPO optimization that internalizes a watermark as a distinct reasoning policy in shared parameters, with robustness demonstrated via post-training experiments (TPR >95%). No equations or steps in the provided abstract reduce the reported effectiveness or robustness to a fitted parameter by construction, a self-referential definition, or a load-bearing self-citation chain. The optimization is presented as an external procedure whose outcome is validated empirically rather than assumed tautologically. No renaming of known results or ansatz smuggling via citation appears. The derivation is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
invented entities (1)
- Redundant Chain-of-Thought (R-CoT) reasoning path: no independent evidence
Reference graph
Works this paper leans on
- [1] Yossi Adi, Carsten Baum, Moustapha Cisse, Benny Pinkas, and Joseph Keshet. Turning your weakness into a strength: Watermarking deep neural networks by backdooring. In 27th USENIX Security Symposium (USENIX Security 18), pages 1615–1631, 2018.
- [2] Dara Bahri and John Wieting. A watermark for black-box language models. arXiv preprint arXiv:2410.02099, 2024.
- [3] Yapei Chang, Kalpesh Krishna, Amir Houmansadr, John Frederick Wieting, and Mohit Iyyer. PostMark: A robust blackbox watermark for large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8969–8987, 2024.
- [4] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021.
- [5] Amirhossein Dabiriaghdam and Lele Wang. SimMark: A robust sentence-level similarity-based watermarking algorithm for large language models. arXiv preprint arXiv:2502.02787, 2025.
- [6] Yanbo Dai, Zongjie Li, Zhenlan Ji, and Shuai Wang. Seal: Subspace-anchored watermarks for LLM ownership, 2025.
- [7] Zehang Deng, Yongjian Guo, Changzhou Han, Wanlun Ma, Junwu Xiong, Sheng Wen, and Yang Xiang. AI agents under threat: A survey of key security challenges and future pathways. ACM Computing Surveys, 57(7):1–36, 2025.
- [8] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, A. Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models, 2024.
- [9] Junfeng Guo, Yiming Li, Ruibo Chen, Yihan Wu, Chenxi Liu, Yanshuo Chen, and Heng Huang. Towards copyright protection for knowledge bases of retrieval-augmented language models via ownership verification with reasoning. arXiv preprint arXiv:2502.10440, 2025.
- [10] Zhiqiang Hu, Yihuai Lan, Lei Wang, Wanyu Xu, Ee-Peng Lim, Roy Ka-Wei Lee, Lidong Bing, and Soujanya Poria. LLM-Adapters: An adapter family for parameter-efficient fine-tuning of large language models. arXiv preprint arXiv:2304.01933, 2023.
- [11] Naizhu Jin, Zhong Li, Yinggang Guo, Chao Su, Tian Zhang, and Qingkai Zeng. Saber: Model-agnostic backdoor attack on chain-of-thought in neural code generation. arXiv preprint arXiv:2412.05829, 2024.
- [12] John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. A watermark for large language models. In International Conference on Machine Learning, pages 17061–17084. PMLR, 2023.
- [13] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems, 2020.
- [14] Shen Li, Liuyi Yao, Jinyang Gao, Lan Zhang, and Yaliang Li. Double-I watermark: Protecting model copyright for LLM fine-tuning. arXiv preprint arXiv:2402.14883, 2024.
- [15] Kaiyi Pang, Tao Qi, Chuhan Wu, Minhao Bai, Minghu Jiang, and Yongfeng Huang. ModelShield: Adaptive and robust watermark against model extraction attack. IEEE Transactions on Information Forensics and Security, 20:1767–1782, 2025.
- [16] Wenjun Peng, Jingwei Yi, Fangzhao Wu, Shangxi Wu, Bin Zhu, Lingjuan Lyu, Binxing Jiao, Tong Xu, Guangzhong Sun, and Xing Xie. Are you copying my model? Protecting the copyright of large language models for EaaS via backdoor watermark. arXiv preprint arXiv:2305.10036, 2023.
- [17] Jing Qiu, Xi Yang, Shuai Li, Kejiang Chen, Weiming Zhang, and Nenghai Yu. Watermarking datasets for LLM fine-tuning. In ICASSP 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2025.
- [18] Jaechul Roh, Varun Gandhi, Shivani Anilkumar, and Arin Garg. Break-the-Chain: Reasoning failures in LLMs via adversarial prompting in code generation. arXiv preprint arXiv:2506.06971, 2025.
- [19] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models, 2024.
- [20] Hongru Song, Yu-an Liu, Ruqing Zhang, Jiafeng Guo, and Yixing Fan. Chain-of-thought poisoning attacks against R1-based retrieval-augmented generation systems. arXiv preprint arXiv:2505.16367, 2025.
- [21] Qwen Team et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2(3), 2024.
- [22] Lean Wang, Wenkai Yang, Deli Chen, Hao Zhou, Yankai Lin, Fandong Meng, Jie Zhou, and Xu Sun. Towards codable text watermarking for large language models. arXiv preprint arXiv:2307.15992, 2023.
- [23] Shen Wang, Jialiang Dong, Longfei Wu, and Zhitao Guan. Weda: Exploring copyright protection for large language model downstream alignment. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024.
- [24] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
- [25] Yan Wen, Junfeng Guo, and Heng Huang. CoTGuard: Using chain-of-thought triggering for copyright protection in multi-agent LLM systems, 2025.
- [26] Zhen Xiang, Fengqing Jiang, Zidi Xiong, Bhaskar Ramasubramanian, Radha Poovendran, and Bo Li. BadChain: Backdoor chain-of-thought prompting for large language models. arXiv preprint arXiv:2401.12242, 2024.
- [27] Xi Yang, Kejiang Chen, Weiming Zhang, Chang Liu, Yuang Qi, Jie Zhang, Han Fang, and Nenghai Yu. Watermarking text generated by black-box language models. arXiv preprint arXiv:2305.08883, 2023.
- [28] Ruisi Zhang, Shehzeen Samarah Hussain, Paarth Neekhara, and Farinaz Koushanfar. REMARK-LLM: A robust and efficient watermarking framework for generative large language models. In 33rd USENIX Security Symposium (USENIX Security 24), pages 1813–1830, 2024.
- [29] Xuandong Zhao, Prabhanjan Ananth, Lei Li, and Yu-Xiang Wang. Provable robust watermarking for AI-generated text. arXiv preprint arXiv:2306.17439, 2023.
- [30] Gejian Zhao, Hanzhou Wu, Xinpeng Zhang, and Athanasios V. Vasilakos. ShadowCoT: Cognitive hijacking for stealthy reasoning backdoors in LLMs. arXiv preprint arXiv:2504.05605, 2025.