Recognition: unknown
R-CoT: A Reasoning-Layer Watermark via Redundant Chain-of-Thought in Large Language Models
Pith reviewed 2026-05-07 15:49 UTC · model grok-4.3
The pith
Watermarks can be embedded in an LLM's reasoning path to survive fine-tuning and output perturbations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By using redundant chain-of-thought and a dual-trajectory optimization mechanism, the R-CoT framework internalizes the watermark as a distinct reasoning policy within the shared parameter space, embedding it into the model's stable reasoning path rather than superficial output distributions. Experimental results show high watermark effectiveness and strong robustness, with the true positive rate consistently remaining above 95% under fine-tuning and other post-training operations.
What carries the argument
Redundant Chain-of-Thought (R-CoT) with GRPO-based dual-trajectory optimization, which lets the native and watermark reasoning paths coexist in shared parameters and internalizes the watermark as a distinct reasoning policy.
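For intuition, here is a sketch of what such a dual-trajectory objective could look like, written in our own notation rather than the paper's (the trajectory distributions $\mathcal{D}_{\mathrm{nat}}$ and $\mathcal{D}_{\mathrm{wm}}$, the mixing weight $\lambda$, and the sequence-level simplification are all assumptions):

$$\mathcal{J}(\theta) = \mathbb{E}_{q \sim \mathcal{D}_{\mathrm{nat}}}\big[J_{\mathrm{GRPO}}(\theta; q)\big] + \lambda\, \mathbb{E}_{q \sim \mathcal{D}_{\mathrm{wm}}}\big[J_{\mathrm{GRPO}}(\theta; q)\big],$$

where each term is the standard GRPO group-relative clipped surrogate over $G$ sampled reasoning paths $o_1,\dots,o_G$,

$$J_{\mathrm{GRPO}}(\theta; q) = \frac{1}{G}\sum_{i=1}^{G} \min\!\Big(\rho_i \hat{A}_i,\ \operatorname{clip}(\rho_i, 1-\epsilon, 1+\epsilon)\,\hat{A}_i\Big) - \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big), \qquad \hat{A}_i = \frac{r_i - \operatorname{mean}(r_{1:G})}{\operatorname{std}(r_{1:G})},$$

with importance ratios $\rho_i = \pi_\theta(o_i \mid q) / \pi_{\theta_{\mathrm{old}}}(o_i \mid q)$. Under this reading, native prompts would be rewarded for task correctness alone, while triggered prompts would additionally be rewarded for traversing the redundant watermark path.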
If this is right
- The watermark achieves high effectiveness compared to existing methods.
- True positive rate remains above 95% under fine-tuning with only marginal degradation.
- The watermark resists removal from output-level perturbations.
- The watermark is embedded in the stable reasoning path of the model.
Where Pith is reading between the lines
- This approach could be extended to internalize other policies, such as safety guidelines, as redundant reasoning paths.
- It connects to broader efforts in making model behaviors robust to post-training adaptations.
- One testable extension is applying similar redundancy to verify other properties like factual accuracy in reasoning.
- Adversarial attacks would need to target the reasoning layer specifically to evade detection.
Load-bearing premise
The dual-trajectory optimization with GRPO successfully internalizes the watermark as a distinct reasoning policy without substantially degrading the model's native reasoning performance or introducing detectable artifacts that could be removed.
What would settle it
Observe the true positive rate for watermark detection after applying fine-tuning or other post-training operations; if it falls significantly below 95%, the claim of strong robustness would be falsified.
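Concretely, the test reduces to re-estimating the TPR on trigger prompts after each post-training operation and checking it against the 95% line. A minimal sketch in Python, assuming a hypothetical detector callable `detect_watermark` and a held-out set of trigger prompts (neither interface is given in the paper):

```python
import random

def true_positive_rate(detections):
    """Fraction of watermarked (trigger) prompts flagged: TP / (TP + FN)."""
    return sum(detections) / len(detections)

def robustness_claim_holds(model, trigger_prompts, detect_watermark,
                           threshold=0.95):
    """The robustness claim is falsified if the TPR measured on the
    post-trained model falls significantly below the threshold."""
    detections = [detect_watermark(model, p) for p in trigger_prompts]
    return true_positive_rate(detections) >= threshold

def bootstrap_ci(detections, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the TPR, to separate a real
    drop below 95% from sampling noise."""
    rng = random.Random(seed)
    stats = sorted(
        true_positive_rate([rng.choice(detections) for _ in detections])
        for _ in range(n_boot)
    )
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]
```

Running this once per operation (fine-tuning, pruning, quantization, paraphrasing) yields exactly the falsification evidence the criterion asks for.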
read the original abstract
Large language models (LLMs) are widely deployed in multiple scenarios due to their reasoning capabilities. In order to prevent the models from being misused, watermarking is generally employed to ensure ownership. However, most existing watermarking methods rely on superficial modifications to the model's output distribution, rendering the watermark vulnerable to perturbation and removal. To overcome this challenge, this paper introduces a reasoning-layer framework termed Redundant Chain-of-Thought (R-CoT), which embeds watermarks into the reasoning path. A dual-trajectory optimization mechanism based on GRPO enables the native and the watermark reasoning paths to coexist within a shared parameter space, internalizing the watermark as a distinct reasoning policy. Therefore, the watermark is embedded into the model's stable reasoning path, avoiding the watermark failure caused by output-level perturbations. Experimental results show that, compared with existing methods, R-CoT achieves high watermark effectiveness and strong robustness. Under fine-tuning and other post-training operations, the true positive rate (TPR) consistently remains above 95%, exhibiting only marginal degradation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes R-CoT, a reasoning-layer watermarking framework for LLMs that embeds watermarks into a redundant Chain-of-Thought reasoning path. It uses a dual-trajectory optimization based on GRPO to allow native and watermark reasoning policies to coexist in shared parameters, internalizing the watermark as a stable policy. This is claimed to make the watermark robust to output perturbations and post-training operations, with experiments showing TPR consistently above 95% under fine-tuning and similar attacks, outperforming existing output-level methods.
Significance. If the central claim holds—that dual-trajectory GRPO successfully internalizes a separable, stable watermark reasoning policy without substantial performance degradation—this would be a notable advance in LLM watermarking. It directly targets the vulnerability of output-distribution methods to removal via perturbations or fine-tuning, potentially enabling more reliable ownership verification in deployed models. The approach of leveraging reasoning paths rather than token logits is conceptually promising and could influence future work on model-level provenance.
major comments (2)
- [§4.2] Dual-Trajectory GRPO: The description of the optimization does not specify the exact loss terms, reward formulation, or separation constraints (e.g., trajectory divergence penalties or policy regularization) that would ensure the watermark reasoning path remains non-entangled with native reasoning. Without these, the method risks reducing to conditional output watermarking, inheriting the removal vulnerabilities that output-level methods face (a hypothetical sketch of what such terms might look like follows the minor comments below).
- [§5] Experiments: The reported TPR >95% under fine-tuning and post-training is a central empirical claim, yet the section provides insufficient detail on the experimental setup (e.g., fine-tuning datasets and hyperparameters, number of independent runs, statistical significance testing, or ablation studies isolating the contribution of redundant CoT vs. GRPO). This makes it difficult to assess whether the robustness is genuine or sensitive to post-hoc choices.
minor comments (2)
- [§3] The notation for the two trajectories (native vs. watermark) could be clarified with explicit symbols in the method equations to avoid ambiguity when discussing shared parameters.
- [§5] Figure 3 (or equivalent results plot) would benefit from error bars or confidence intervals to visually convey the consistency of the TPR claims across runs.
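To make the first major comment concrete, the sketch below shows one possible shape for the missing terms: two group-relative GRPO surrogates over shared parameters plus an explicit separation bonus. Everything here is our assumption, not the paper's formulation; the function names, the `kl_to_other` divergence term, and the weights `lam` and `gamma` are hypothetical.

```python
import torch

def grpo_term(logp_new, logp_old, rewards, eps=0.2):
    """Group-relative clipped surrogate for one group of G sampled
    reasoning paths; logp_* are sequence-level log-probs, shape (G,)."""
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - eps, 1 + eps) * adv)
    return surrogate.mean()

def dual_trajectory_objective(nat, wm, lam=1.0, gamma=0.1):
    """Objective to maximize. `nat` and `wm` are dicts holding
    'logp_new', 'logp_old', and 'rewards' for the native and watermark
    trajectory groups; wm['kl_to_other'] is a (hypothetical) divergence
    between the two induced reasoning policies, added so the watermark
    path stays distinct rather than collapsing into native reasoning."""
    j_nat = grpo_term(nat["logp_new"], nat["logp_old"], nat["rewards"])
    j_wm = grpo_term(wm["logp_new"], wm["logp_old"], wm["rewards"])
    return j_nat + lam * j_wm + gamma * wm["kl_to_other"]
```

Whether the paper uses anything of this shape is precisely what the revision should spell out; without some term playing the role of the divergence bonus, the two trajectories have no training pressure to remain separable.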
Simulated Author's Rebuttal
We sincerely thank the referee for the constructive and detailed feedback. The comments identify areas where additional clarity and rigor will strengthen the manuscript. We address each major comment below and commit to revisions that directly respond to the concerns raised.
read point-by-point responses
- Referee: [§4.2] Dual-Trajectory GRPO: The description of the optimization does not specify the exact loss terms, reward formulation, or separation constraints (e.g., trajectory divergence penalties or policy regularization) that would ensure the watermark reasoning path remains non-entangled with native reasoning. Without these, the method risks reducing to conditional output watermarking, inheriting the removal vulnerabilities that output-level methods face.
  Authors: We agree that the current description in §4.2 would benefit from greater mathematical precision. In the revised manuscript we will expand this section to include the exact loss terms, the reward formulation for each trajectory, and the separation constraints (including any divergence penalties or regularization terms) used in the dual-trajectory GRPO. These additions will make explicit how the native and watermark reasoning policies are kept distinct within shared parameters and will directly address the concern that the approach could collapse to conditional output watermarking. revision: yes
- Referee: [§5] Experiments: The reported TPR >95% under fine-tuning and post-training is a central empirical claim, yet the section provides insufficient detail on the experimental setup (e.g., fine-tuning datasets and hyperparameters, number of independent runs, statistical significance testing, or ablation studies isolating the contribution of redundant CoT vs. GRPO). This makes it difficult to assess whether the robustness is genuine or sensitive to post-hoc choices.
  Authors: We acknowledge that §5 currently lacks sufficient experimental detail for a full evaluation of the robustness claims. In the revised version we will augment this section with the specific fine-tuning datasets and hyperparameters, the number of independent runs performed, the statistical significance testing applied to the TPR results, and ablation studies that isolate the contribution of redundant CoT from the GRPO component. These additions will provide the necessary transparency and allow readers to assess the reliability of the reported robustness. revision: yes
Circularity Check
No significant circularity in derivation chain
full rationale
The paper's central claim rests on a dual-trajectory GRPO optimization that internalizes a watermark as a distinct reasoning policy in shared parameters, with robustness demonstrated via post-training experiments (TPR >95%). No equations or steps in the provided abstract reduce the reported effectiveness or robustness to a fitted parameter by construction, a self-referential definition, or a load-bearing self-citation chain. The optimization is presented as an external procedure whose outcome is validated empirically rather than assumed tautologically. No renaming of known results or ansatz smuggling via citation appears. The derivation is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
invented entities (1)
- Redundant Chain-of-Thought (R-CoT) reasoning path: no independent evidence
Reference graph
Works this paper leans on
- [1] Yossi Adi, Carsten Baum, Moustapha Cisse, Benny Pinkas, and Joseph Keshet. Turning your weakness into a strength: Watermarking deep neural networks by backdooring. In 27th USENIX Security Symposium (USENIX Security 18), pages 1615–1631, 2018.
- [2] Dara Bahri and John Wieting. A watermark for black-box language models. arXiv preprint arXiv:2410.02099, 2024.
- [3] Yapei Chang, Kalpesh Krishna, Amir Houmansadr, John Frederick Wieting, and Mohit Iyyer. PostMark: A robust blackbox watermark for large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8969–8987, 2024.
- [4] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021.
- [5] Amirhossein Dabiriaghdam and Lele Wang. SimMark: A robust sentence-level similarity-based watermarking algorithm for large language models. arXiv preprint arXiv:2502.02787, 2025.
- [6] Yanbo Dai, Zongjie Li, Zhenlan Ji, and Shuai Wang. Seal: Subspace-anchored watermarks for LLM ownership, 2025.
- [7] Zehang Deng, Yongjian Guo, Changzhou Han, Wanlun Ma, Junwu Xiong, Sheng Wen, and Yang Xiang. AI agents under threat: A survey of key security challenges and future pathways. ACM Computing Surveys, 57(7):1–36, 2025.
- [8] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, A. Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models, 2024.
- [9] Junfeng Guo, Yiming Li, Ruibo Chen, Yihan Wu, Chenxi Liu, Yanshuo Chen, and Heng Huang. Towards copyright protection for knowledge bases of retrieval-augmented language models via ownership verification with reasoning. arXiv preprint arXiv:2502.10440, 2025.
- [10] Zhiqiang Hu, Yihuai Lan, Lei Wang, Wanyu Xu, Ee-Peng Lim, Roy Ka-Wei Lee, Lidong Bing, and Soujanya Poria. LLM-Adapters: An adapter family for parameter-efficient fine-tuning of large language models. arXiv preprint arXiv:2304.01933, 2023.
- [11] Naizhu Jin, Zhong Li, Yinggang Guo, Chao Su, Tian Zhang, and Qingkai Zeng. Saber: Model-agnostic backdoor attack on chain-of-thought in neural code generation. arXiv preprint arXiv:2412.05829, 2024.
- [12] John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. A watermark for large language models. In International Conference on Machine Learning, pages 17061–17084. PMLR, 2023.
- [13] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems, 2020.
- [14] Shen Li, Liuyi Yao, Jinyang Gao, Lan Zhang, and Yaliang Li. Double-I watermark: Protecting model copyright for LLM fine-tuning. arXiv preprint arXiv:2402.14883, 2024.
- [15] Kaiyi Pang, Tao Qi, Chuhan Wu, Minhao Bai, Minghu Jiang, and Yongfeng Huang. ModelShield: Adaptive and robust watermark against model extraction attack. IEEE Transactions on Information Forensics and Security, 20:1767–1782, 2025.
- [16] Wenjun Peng, Jingwei Yi, Fangzhao Wu, Shangxi Wu, Bin Zhu, Lingjuan Lyu, Binxing Jiao, Tong Xu, Guangzhong Sun, and Xing Xie. Are you copying my model? Protecting the copyright of large language models for EaaS via backdoor watermark. arXiv preprint arXiv:2305.10036, 2023.
- [17] Jing Qiu, Xi Yang, Shuai Li, Kejiang Chen, Weiming Zhang, and Nenghai Yu. Watermarking datasets for LLM fine-tuning. In ICASSP 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2025.
- [18] Jaechul Roh, Varun Gandhi, Shivani Anilkumar, and Arin Garg. Break-the-Chain: Reasoning failures in LLMs via adversarial prompting in code generation. arXiv preprint arXiv:2506.06971, 2025.
- [19] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models, 2024.
- [20] Hongru Song, Yu-an Liu, Ruqing Zhang, Jiafeng Guo, and Yixing Fan. Chain-of-thought poisoning attacks against R1-based retrieval-augmented generation systems. arXiv preprint arXiv:2505.16367, 2025.
- [21] Qwen Team et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2(3), 2024.
- [22] Lean Wang, Wenkai Yang, Deli Chen, Hao Zhou, Yankai Lin, Fandong Meng, Jie Zhou, and Xu Sun. Towards codable text watermarking for large language models. arXiv preprint arXiv:2307.15992, 2023.
- [23] Shen Wang, Jialiang Dong, Longfei Wu, and Zhitao Guan. Weda: Exploring copyright protection for large language model downstream alignment. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024.
- [24] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
- [25] Yan Wen, Junfeng Guo, and Heng Huang. CoTGuard: Using chain-of-thought triggering for copyright protection in multi-agent LLM systems, 2025.
- [26] Zhen Xiang, Fengqing Jiang, Zidi Xiong, Bhaskar Ramasubramanian, Radha Poovendran, and Bo Li. BadChain: Backdoor chain-of-thought prompting for large language models. arXiv preprint arXiv:2401.12242, 2024.
- [27] Xi Yang, Kejiang Chen, Weiming Zhang, Chang Liu, Yuang Qi, Jie Zhang, Han Fang, and Nenghai Yu. Watermarking text generated by black-box language models. arXiv preprint arXiv:2305.08883, 2023.
- [28] Ruisi Zhang, Shehzeen Samarah Hussain, Paarth Neekhara, and Farinaz Koushanfar. REMARK-LLM: A robust and efficient watermarking framework for generative large language models. In 33rd USENIX Security Symposium (USENIX Security 24), pages 1813–1830, 2024.
- [29] Xuandong Zhao, Prabhanjan Ananth, Lei Li, and Yu-Xiang Wang. Provable robust watermarking for AI-generated text. arXiv preprint arXiv:2306.17439, 2023.
- [30] Gejian Zhao, Hanzhou Wu, Xinpeng Zhang, and Athanasios V. Vasilakos. ShadowCoT: Cognitive hijacking for stealthy reasoning backdoors in LLMs. arXiv preprint arXiv:2504.05605, 2025.