pith. machine review for the scientific record.

arxiv: 2605.11036 · v1 · submitted 2026-05-11 · 💻 cs.CR · cs.AI

Recognition: 2 theorem links · Lean Theorem

Sequential Behavioral Watermarking for LLM Agents

Dongsu Kim, Hyeseon An, Shinwoo Park, Yo-Sub Han

Pith reviewed 2026-05-13 01:40 UTC · model grok-4.3

classification: 💻 cs.CR · cs.AI
keywords: LLM agents · behavioral watermarking · sequential watermarking · trajectory provenance · AI ownership · agent security · watermark detection

The pith

SeqWM watermarks LLM agent trajectories by embedding signals in history-conditioned transitions for reliable, corruption-resistant detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the challenge of tracing the origin of LLM-based agents from their action sequences, which carry no inherent provenance. It introduces SeqWM, a framework that embeds watermark signals into the patterns of transitions between actions, conditioned on the history of previous steps. Verification is position-agnostic: observed trajectories are compared against baselines generated with random keys. Experiments show that SeqWM detects watermarked agents reliably across different setups while leaving agent performance essentially unchanged, and that it outperforms previous approaches particularly when trajectories are altered or incomplete.

Core claim

By treating agent trajectories as sequences and watermarking the history-conditioned transition patterns using a secret key, SeqWM allows statistical verification of whether a trajectory was produced by a watermarked agent without requiring alignment to specific positions. This results in robust detection that holds up under trajectory corruption, unlike methods that index watermarks by round or step.
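
The claim leaves the embedding mechanics abstract. A minimal sketch of one way a keyed, history-conditioned action bias could work — the hashing scheme, function names, and subset fraction here are illustrative assumptions, not the paper's construction:

```python
import hashlib

def guided_subset(history, actions, key, frac=0.5):
    """Illustrative only: derive a secret-keyed subset of candidate actions
    that depends on recent action history rather than on the step index."""
    def rank(action):
        # Hash key + recent history + candidate action to get a stable,
        # key-dependent ordering of the action space.
        msg = key + "|" + "|".join(history[-2:]) + "|" + action
        return hashlib.sha256(msg.encode()).hexdigest()
    ranked = sorted(actions, key=rank)
    return set(ranked[: max(1, int(len(ranked) * frac))])

# A watermarked agent would nudge its policy toward this subset at every
# step; because the subset is a function of history, the signal lives in
# transition patterns rather than at fixed round indices.
```

Since the subset is a deterministic function of (key, history), a verifier holding the key can recompute it for any observed transition, with no knowledge of where in the trajectory that transition occurred.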

What carries the argument

The SeqWM framework, which embeds watermark signals into the history-conditioned transition patterns of agent actions and verifies them position-agnostically against random-key baselines.
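
One hedged reading of how position-agnostic verification against random-key baselines could be scored — the keyed predicate and z-score calibration below are assumptions for the sketch, not SeqWM's actual test:

```python
import hashlib
import statistics

def agrees(prev_action, action, key):
    # Illustrative keyed predicate: a transition "agrees" with a key when
    # its hash lands in one half of the hash range.
    h = hashlib.sha256((key + "|" + prev_action + "|" + action).encode())
    return int(h.hexdigest(), 16) % 2 == 0

def detection_z(trajectory, candidate_key, random_keys):
    """Score only (previous action -> action) transitions, so truncation
    and missing alignment shift nothing; calibrate against random keys."""
    def score(key):
        pairs = list(zip(trajectory, trajectory[1:]))
        return sum(agrees(p, a, key) for p, a in pairs) / len(pairs)
    s = score(candidate_key)
    baseline = [score(k) for k in random_keys]
    spread = statistics.pstdev(baseline) or 1e-9
    return (s - statistics.mean(baseline)) / spread
```

A trajectory produced by an agent biased toward key-agreeing transitions should score far above the random-key baseline, while an unwatermarked trajectory should sit inside it; no step index enters the statistic.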

If this is right

  • Enables establishment of provenance and ownership from observed agent behavior alone.
  • Preserves the utility and performance of the original agent policy.
  • Maintains detection reliability even when trajectories are perturbed, truncated, or observed without alignment.
  • Works consistently across diverse agent benchmarks and various LLM backbones.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The approach might be combined with text-based watermarking to provide multi-layer protection for agent outputs.
  • It could support applications in regulated environments where tracing AI-generated actions is required.
  • Testing in more varied perturbation scenarios beyond the benchmarks could further validate its real-world applicability.

Load-bearing premise

That biasing history-conditioned transition patterns for watermarking does not noticeably alter the agent's behavior or reduce its utility, and that the random-key baseline can distinguish watermarked trajectories reliably even under untested real-world perturbations.

What would settle it

Experiments showing a significant drop in detection accuracy or agent utility when trajectories are subjected to corruption types not included in the paper's benchmarks, such as specific environmental noise or partial observations.

Figures

Figures reproduced from arXiv: 2605.11036 by Dongsu Kim, Hyeseon An, Shinwoo Park, Yo-Sub Han.

Figure 1. Overview of SeqWM. The injection step biases the agent’s action distribution using multiple history-conditioned guided subsets, and the detection step slides over the observed sequence and calibrates the score against random keys to produce a p-value.
Figure 2. Detection z-score versus bias strength on OASIS Reddit across three LLM backbones.
Figure 3. Detection TPR as a function of random action deletion rate across three domains (ToolBench, …).
original abstract

LLM-based agents act through sequences of executable decisions, but their trajectories provide little evidence of which agent or policy produced them, making provenance, ownership, and unauthorized reuse difficult to establish from observed behavior alone. This motivates watermarking signals embedded directly into agent behavior rather than only into generated text, since text watermarking cannot capture the action-level decisions that define agent execution. Recent agent watermarking methods address this gap by moving the watermark from generated text to behavioral choices. However, by treating each action step as an independent trial, they overlook trajectory structure and become fragile when trajectories are perturbed, truncated, or observed without reliable alignment. We propose SeqWM, a sequential behavioral watermarking framework that embeds signals into history-conditioned transition patterns and verifies trajectories position-agnostically against random-key baselines. Experiments across diverse agent benchmarks and LLM backbones show that SeqWM consistently achieves reliable detection while preserving agent utility, and remains robust under trajectory corruption where round-indexed behavioral watermarks collapse.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces SeqWM, a sequential behavioral watermarking framework for LLM agents. It embeds watermark signals into history-conditioned transition patterns (rather than treating actions as independent trials) and performs position-agnostic verification by comparing observed trajectories against random-key baselines. Experiments across diverse agent benchmarks and LLM backbones are reported to show reliable detection, preserved agent utility, and robustness to trajectory corruption, truncation, and misalignment, where prior round-indexed behavioral watermarking methods fail.

Significance. If the experimental claims hold after detailed verification, the work would meaningfully advance provenance and ownership attribution for autonomous LLM agents, a growing concern in deployment settings. The shift to history-conditioned, position-agnostic verification addresses a clear limitation of existing per-action or text-only watermarking approaches and could support more reliable detection under realistic perturbations.

major comments (2)
  1. [Abstract] The central robustness claim—that SeqWM produces reliable separation from random-key baselines even after trajectory corruption—depends on the unstated details of baseline construction (number of random keys, history featurization for transition patterns) and the statistical testing procedure. Without these, it is impossible to assess whether the reported detection rates survive multiple-testing correction or perturbations that preserve local statistics while disrupting global alignment.
  2. [Experiments] The claim that SeqWM 'remains robust under trajectory corruption where round-indexed behavioral watermarks collapse' is load-bearing for the contribution, yet the abstract provides no quantitative comparison (e.g., detection-rate tables before/after specific corruptions such as truncation, prompt variation, or partial observability) or confirmation that the random-key baselines were calibrated on the same history-conditioned distribution as the watermarked trajectories.
minor comments (1)
  1. [Abstract] The abstract uses the phrase 'position-agnostically' without a brief parenthetical gloss on how verification avoids reliance on step indices; adding one sentence would improve immediate clarity for readers familiar with round-indexed baselines.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's significance. We address each major comment point by point below, with clarifications drawn from the manuscript and revisions to improve accessibility of key details.

point-by-point responses
  1. Referee: [Abstract] The central robustness claim—that SeqWM produces reliable separation from random-key baselines even after trajectory corruption—depends on the unstated details of baseline construction (number of random keys, history featurization for transition patterns) and the statistical testing procedure. Without these, it is impossible to assess whether the reported detection rates survive multiple-testing correction or perturbations that preserve local statistics while disrupting global alignment.

    Authors: We acknowledge that the abstract's brevity leaves certain technical parameters implicit. The full manuscript details baseline construction in Section 3.2 (multiple random keys applied to the same history-conditioned transition model) and the statistical testing procedure in Section 3.3 (a single likelihood-ratio test per trajectory against the empirical baseline distribution). Because verification consists of one test per trajectory, no multiple-testing correction applies. Experiments in Section 4.3 explicitly evaluate perturbations that could preserve local statistics (e.g., truncation, local swaps, and misalignment) while disrupting global alignment, showing maintained separation. To make these elements explicit without lengthening the abstract excessively, we have added a concise clause describing baseline calibration and the position-agnostic test. revision: yes

  2. Referee: [Experiments] The claim that SeqWM 'remains robust under trajectory corruption where round-indexed behavioral watermarks collapse' is load-bearing for the contribution, yet the abstract provides no quantitative comparison (e.g., detection-rate tables before/after specific corruptions such as truncation, prompt variation, or partial observability) or confirmation that the random-key baselines were calibrated on the same history-conditioned distribution as the watermarked trajectories.

    Authors: Quantitative before/after comparisons under the listed corruptions appear in Section 4.3 and the associated tables, which report detection rates for SeqWM versus round-indexed baselines on truncation, prompt variation, and partial-observability settings. The random-key baselines are generated from the identical history-conditioned transition distribution used for watermark embedding (Section 3.1), ensuring fair calibration. While abstracts conventionally omit tables, we have revised the abstract to include a brief summary of the key robustness metrics (e.g., retained detection under truncation) so that the central claim is supported at the abstract level as well. revision: yes
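
The calibration the responses describe — scores computed under many random keys, one test per trajectory — matches the shape of a standard permutation-style empirical p-value. A generic sketch of that step, not the paper's exact likelihood-ratio procedure:

```python
def empirical_p_value(observed_score, baseline_scores):
    """Rank the observed detection score among scores computed with random
    keys; the add-one form keeps the estimate strictly positive and is the
    usual conservative permutation-test estimator."""
    at_least = sum(b >= observed_score for b in baseline_scores)
    return (at_least + 1) / (len(baseline_scores) + 1)

# With, say, 99 random keys, an observed score exceeding all of them gives
# p = 1/100; because only one test is run per trajectory, no
# multiple-testing correction is required, as the rebuttal notes.
```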

Circularity Check

0 steps flagged

No significant circularity in SeqWM construction or claims

full rationale

The paper introduces SeqWM as a new sequential behavioral watermarking framework embedding signals into history-conditioned transition patterns with position-agnostic verification against random-key baselines. No equations, derivations, or self-referential steps are shown that reduce predictions or uniqueness claims to fitted inputs or prior self-citations by construction. Experimental claims of reliable detection, utility preservation, and robustness under corruption are presented as empirical results compared to round-indexed baselines, which remain externally falsifiable. The method is framed as a construction rather than a derivation that collapses to its own definitions or fitted parameters.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the framework implicitly assumes that agent trajectories possess detectable history-conditioned structure separable from random baselines.

pith-pipeline@v0.9.0 · 5466 in / 1011 out tokens · 22787 ms · 2026-05-13T01:40:44.139897+00:00 · methodology

