Sequential Behavioral Watermarking for LLM Agents
Pith reviewed 2026-05-13 01:40 UTC · model grok-4.3
The pith
SeqWM watermarks LLM agent trajectories by embedding signals in history-conditioned transitions for reliable, corruption-resistant detection.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By treating agent trajectories as sequences and watermarking history-conditioned transition patterns under a secret key, SeqWM allows statistical verification of whether a trajectory was produced by a watermarked agent, without requiring alignment to specific positions. Detection therefore holds up under trajectory corruption, unlike methods that index watermarks by round or step.
What carries the argument
The SeqWM framework, which embeds watermark signals into the history-conditioned transition patterns of agent actions and verifies them position-agnostically against random-key baselines; a minimal sketch of the embedding idea follows.
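To make the mechanism concrete, here is a minimal Python sketch of history-conditioned embedding. The hash construction, the three-action history window, the green-set fraction, and the additive bias are illustrative assumptions, not the paper's specification; the only point carried over from the paper is that the keyed signal depends on recent transitions rather than on step indices.

```python
import hashlib
import random

def green_set(secret_key: bytes, context: tuple[str, ...], actions: list[str],
              fraction: float = 0.5) -> set[str]:
    """Pseudorandomly select a 'green' subset of candidate actions from the
    secret key and a context digest. The context is a window of recent
    actions, never a step index, so the signal lives in history-conditioned
    transitions and can be recomputed from any trajectory slice."""
    digest = hashlib.sha256(secret_key + "|".join(context).encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    pool = sorted(actions)
    rng.shuffle(pool)
    return set(pool[: max(1, int(fraction * len(pool)))])

def pick_action(secret_key: bytes, history: list[str],
                utilities: dict[str, float], bias: float = 0.5) -> str:
    """Choose the next action, adding a small bonus to green actions so the
    watermark flips the decision only where the policy is nearly indifferent,
    which is how utility preservation would be argued."""
    green = green_set(secret_key, tuple(history[-3:]), list(utilities))
    return max(utilities, key=lambda a: utilities[a] + (bias if a in green else 0.0))
```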
If this is right
- Establishes provenance and ownership from observed agent behavior alone.
- Preserves the utility and performance of the original agent policy.
- Maintains detection reliability even when trajectories are perturbed, truncated, or observed without alignment.
- Works consistently across diverse agent benchmarks and various LLM backbones.
Where Pith is reading between the lines
- The approach might be combined with text-based watermarking to provide multi-layer protection for agent outputs.
- It could support applications in regulated environments where tracing AI-generated actions is required.
- Testing in more varied perturbation scenarios beyond the benchmarks could further validate its real-world applicability.
Load-bearing premise
That modifying history-conditioned transition patterns for watermarking neither perceptibly alters the agent's behavior nor reduces its utility, and that the random-key baseline can reliably distinguish watermarked trajectories even under real-world perturbations the paper does not test.
What would settle it
Experiments showing a significant drop in detection accuracy or agent utility when trajectories are subjected to corruption types not covered by the paper's benchmarks, such as environment-specific noise or partial observability.
Original abstract
LLM-based agents act through sequences of executable decisions, but their trajectories provide little evidence of which agent or policy produced them, making provenance, ownership, and unauthorized reuse difficult to establish from observed behavior alone. This motivates watermarking signals embedded directly into agent behavior rather than only into generated text, since text watermarking cannot capture the action-level decisions that define agent execution. Recent agent watermarking methods address this gap by moving the watermark from generated text to behavioral choices. However, by treating each action step as an independent trial, they overlook trajectory structure and become fragile when trajectories are perturbed, truncated, or observed without reliable alignment. We propose SeqWM, a sequential behavioral watermarking framework that embeds signals into history-conditioned transition patterns and verifies trajectories position-agnostically against random-key baselines. Experiments across diverse agent benchmarks and LLM backbones show that SeqWM consistently achieves reliable detection while preserving agent utility, and remains robust under trajectory corruption where round-indexed behavioral watermarks collapse.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SeqWM, a sequential behavioral watermarking framework for LLM agents. It embeds watermark signals into history-conditioned transition patterns (rather than treating actions as independent trials) and performs position-agnostic verification by comparing observed trajectories against random-key baselines. Experiments across diverse agent benchmarks and LLM backbones are reported to show reliable detection, preserved agent utility, and robustness to trajectory corruption, truncation, and misalignment, where prior round-indexed behavioral watermarking methods fail.
Significance. If the experimental claims hold after detailed verification, the work would meaningfully advance provenance and ownership attribution for autonomous LLM agents, a growing concern in deployment settings. The shift to history-conditioned, position-agnostic verification addresses a clear limitation of existing per-action or text-only watermarking approaches and could support more reliable detection under realistic perturbations.
major comments (2)
- [Abstract] The central robustness claim—that SeqWM produces reliable separation from random-key baselines even after trajectory corruption—depends on unstated details of baseline construction (number of random keys, history featurization for transition patterns) and of the statistical testing procedure. Without these, it is impossible to assess whether the reported detection rates survive multiple-testing correction or perturbations that preserve local statistics while disrupting global alignment.
- [Experiments] The claim that SeqWM 'remains robust under trajectory corruption where round-indexed behavioral watermarks collapse' is load-bearing for the contribution, yet the abstract provides no quantitative comparison (e.g., detection-rate tables before and after specific corruptions such as truncation, prompt variation, or partial observability) and no confirmation that the random-key baselines were calibrated on the same history-conditioned distribution as the watermarked trajectories.
minor comments (1)
- [Abstract] The abstract uses the phrase 'position-agnostically' without a brief gloss on how verification avoids relying on step indices; a single added sentence would improve immediate clarity for readers familiar with round-indexed baselines. One possible reading is sketched below.
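As a gloss on both comments, here is a minimal verification sketch under the same assumptions as the embedding sketch above (it reuses green_set from there). The rebuttal below says the paper uses a single likelihood-ratio test per trajectory; this sketch substitutes a simpler empirical rank test against random keys, which is enough to show why the number of baseline keys caps the attainable p-value.

```python
import os

def score(key: bytes, trajectory: list[str], action_space: list[str]) -> float:
    """Fraction of steps whose action lies in the green set derived from the
    key and the preceding three-action window; step indices never enter, so
    any contiguous slice of a trajectory can be scored without its offset."""
    hits = sum(a in green_set(key, tuple(trajectory[max(0, t - 3):t]), action_space)
               for t, a in enumerate(trajectory))
    return hits / max(1, len(trajectory))

def empirical_p_value(secret_key: bytes, trajectory: list[str],
                      action_space: list[str], n_random: int = 199) -> float:
    """One test per trajectory: rank the secret key's score among scores under
    freshly drawn random keys. With n_random baselines the smallest attainable
    p-value is 1 / (n_random + 1), so the number of random keys directly caps
    detection confidence (the referee's calibration point)."""
    observed = score(secret_key, trajectory, action_space)
    null = [score(os.urandom(16), trajectory, action_space) for _ in range(n_random)]
    return (1 + sum(s >= observed for s in null)) / (n_random + 1)
```

Because the score depends only on (history window, action) pairs, verification needs no alignment between observed steps and generation rounds, which is one concrete reading of 'position-agnostically'.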
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive assessment of the work's significance. We address each major comment point by point below, with clarifications drawn from the manuscript and revisions to improve accessibility of key details.
Point-by-point responses
Referee: [Abstract] The central robustness claim—that SeqWM produces reliable separation from random-key baselines even after trajectory corruption—depends on unstated details of baseline construction (number of random keys, history featurization for transition patterns) and of the statistical testing procedure. Without these, it is impossible to assess whether the reported detection rates survive multiple-testing correction or perturbations that preserve local statistics while disrupting global alignment.
Authors: We acknowledge that the abstract's brevity leaves certain technical parameters implicit. The full manuscript details baseline construction in Section 3.2 (multiple random keys applied to the same history-conditioned transition model) and the statistical testing procedure in Section 3.3 (a single likelihood-ratio test per trajectory against the empirical baseline distribution). Because verification consists of one test per trajectory, no multiple-testing correction applies. Experiments in Section 4.3 explicitly evaluate perturbations that could preserve local statistics (e.g., truncation, local swaps, and misalignment) while disrupting global alignment, showing maintained separation. To make these elements explicit without lengthening the abstract excessively, we have added a concise clause describing baseline calibration and the position-agnostic test. Revision: yes.
Referee: [Experiments] The claim that SeqWM 'remains robust under trajectory corruption where round-indexed behavioral watermarks collapse' is load-bearing for the contribution, yet the abstract provides no quantitative comparison (e.g., detection-rate tables before and after specific corruptions such as truncation, prompt variation, or partial observability) and no confirmation that the random-key baselines were calibrated on the same history-conditioned distribution as the watermarked trajectories.
Authors: Quantitative before/after comparisons under the listed corruptions appear in Section 4.3 and the associated tables, which report detection rates for SeqWM versus round-indexed baselines under truncation, prompt variation, and partial-observability settings. The random-key baselines are generated from the identical history-conditioned transition distribution used for watermark embedding (Section 3.1), ensuring fair calibration. While abstracts conventionally omit tables, we have revised the abstract to include a brief summary of the key robustness metrics (e.g., retained detection under truncation) so that the central claim is supported at the abstract level as well. Revision: yes. (A toy illustration of this truncation contrast follows.)
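To illustrate the truncation contrast this response leans on, here is a toy comparison built on the two sketches above (it reuses green_set). The round-indexed variant is a deliberately simple strawman keyed by absolute step index, not a reimplementation of the baselines the paper evaluates.

```python
import os
import random

ACTIONS = [f"a{i}" for i in range(8)]

def hist_keyer(traj, t):
    return tuple(traj[max(0, t - 3):t])  # history-conditioned keying

def round_keyer(traj, t):
    return (str(t),)  # round-indexed strawman: keyed by absolute step index

def embed(key: bytes, n_steps: int, keyer, rng: random.Random) -> list[str]:
    """Roll out a toy policy with random utilities, biased toward the green set."""
    traj: list[str] = []
    for t in range(n_steps):
        utilities = {a: rng.random() for a in ACTIONS}
        green = green_set(key, keyer(traj, t), ACTIONS)
        traj.append(max(utilities, key=lambda a: utilities[a] + (0.5 if a in green else 0.0)))
    return traj

def hit_rate(key: bytes, traj: list[str], keyer) -> float:
    """Green-set hit rate; 0.5 is chance level for a half-size green set."""
    return sum(a in green_set(key, keyer(traj, t), ACTIONS)
               for t, a in enumerate(traj)) / len(traj)

key, rng = os.urandom(16), random.Random(0)
for name, keyer in [("history-conditioned", hist_keyer), ("round-indexed", round_keyer)]:
    traj = embed(key, 200, keyer, rng)
    # Drop an unknown-length prefix: all indices shift, local transitions survive.
    print(name, round(hit_rate(key, traj, keyer), 2),
          round(hit_rate(key, traj[37:], keyer), 2))
```

Dropping a 37-step prefix perturbs at most the first three history windows of the suffix, so the history-conditioned hit rate barely moves, while every index in the round-indexed variant shifts and its hit rate falls toward the 0.5 chance level.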
Circularity Check
No significant circularity in SeqWM construction or claims
Full rationale
The paper introduces SeqWM as a new sequential behavioral watermarking framework embedding signals into history-conditioned transition patterns with position-agnostic verification against random-key baselines. No equations, derivations, or self-referential steps are shown that reduce predictions or uniqueness claims to fitted inputs or prior self-citations by construction. Experimental claims of reliable detection, utility preservation, and robustness under corruption are presented as empirical results compared to round-indexed baselines, which remain externally falsifiable. The method is framed as a construction rather than a derivation that collapses to its own definitions or fitted parameters.