pith. machine review for the scientific record.

arxiv: 2605.11036 · v1 · submitted 2026-05-11 · 💻 cs.CR · cs.AI

Recognition: 2 theorem links · Lean Theorem

Sequential Behavioral Watermarking for LLM Agents

Dongsu Kim, Hyeseon An, Shinwoo Park, Yo-Sub Han

Pith reviewed 2026-05-13 01:40 UTC · model grok-4.3

classification: 💻 cs.CR · cs.AI
keywords: LLM agents · behavioral watermarking · sequential watermarking · trajectory provenance · AI ownership · agent security · watermark detection

The pith

SeqWM watermarks LLM agent trajectories by embedding signals in history-conditioned transitions for reliable, corruption-resistant detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the challenge of tracing the origin of LLM-based agents from their action sequences, which carry no inherent provenance. It introduces SeqWM, a framework that embeds watermark signals into the patterns of transitions between actions, conditioned on the history of previous steps. Verification is position-agnostic: observed trajectories are compared against baselines generated with random keys. Experiments show that SeqWM detects watermarked agents reliably across different setups while leaving agent performance essentially unchanged, and that it outperforms previous approaches particularly when trajectories are altered or incomplete.

Core claim

By treating agent trajectories as sequences and watermarking the history-conditioned transition patterns using a secret key, SeqWM allows statistical verification of whether a trajectory was produced by a watermarked agent without requiring alignment to specific positions. This results in robust detection that holds up under trajectory corruption, unlike methods that index watermarks by round or step.
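
The claim leaves the embedding mechanics abstract. A minimal sketch of one way a keyed, history-conditioned action bias could work — the hashing scheme, function names, and subset fraction here are illustrative assumptions, not the paper's construction:

```python
import hashlib

def guided_subset(history, actions, key, frac=0.5):
    """Illustrative only: derive a secret-keyed subset of candidate actions
    that depends on recent action history rather than on the step index."""
    def rank(action):
        # Hash key + recent history + candidate action to get a stable,
        # key-dependent ordering of the action space.
        msg = key + "|" + "|".join(history[-2:]) + "|" + action
        return hashlib.sha256(msg.encode()).hexdigest()
    ranked = sorted(actions, key=rank)
    return set(ranked[: max(1, int(len(ranked) * frac))])

# A watermarked agent would nudge its policy toward this subset at every
# step; because the subset is a function of history, the signal lives in
# transition patterns rather than at fixed round indices.
```

Since the subset is a deterministic function of (key, history), a verifier holding the key can recompute it for any observed transition, with no knowledge of where in the trajectory that transition occurred.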

What carries the argument

The SeqWM framework, which embeds watermark signals into the history-conditioned transition patterns of agent actions and verifies them position-agnostically against random-key baselines.
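
One hedged reading of how position-agnostic verification against random-key baselines could be scored — the keyed predicate and z-score calibration below are assumptions for the sketch, not SeqWM's actual test:

```python
import hashlib
import statistics

def agrees(prev_action, action, key):
    # Illustrative keyed predicate: a transition "agrees" with a key when
    # its hash lands in one half of the hash range.
    h = hashlib.sha256((key + "|" + prev_action + "|" + action).encode())
    return int(h.hexdigest(), 16) % 2 == 0

def detection_z(trajectory, candidate_key, random_keys):
    """Score only (previous action -> action) transitions, so truncation
    and missing alignment shift nothing; calibrate against random keys."""
    def score(key):
        pairs = list(zip(trajectory, trajectory[1:]))
        return sum(agrees(p, a, key) for p, a in pairs) / len(pairs)
    s = score(candidate_key)
    baseline = [score(k) for k in random_keys]
    spread = statistics.pstdev(baseline) or 1e-9
    return (s - statistics.mean(baseline)) / spread
```

A trajectory produced by an agent biased toward key-agreeing transitions should score far above the random-key baseline, while an unwatermarked trajectory should sit inside it; no step index enters the statistic.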

If this is right

  • Enables establishment of provenance and ownership from observed agent behavior alone.
  • Preserves the utility and performance of the original agent policy.
  • Maintains detection reliability even when trajectories are perturbed, truncated, or observed without alignment.
  • Works consistently across diverse agent benchmarks and various LLM backbones.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The approach might be combined with text-based watermarking to provide multi-layer protection for agent outputs.
  • It could support applications in regulated environments where tracing AI-generated actions is required.
  • Testing in more varied perturbation scenarios beyond the benchmarks could further validate its real-world applicability.

Load-bearing premise

That biasing history-conditioned transition patterns for watermarking does not noticeably alter the agent's behavior or reduce its utility, and that the random-key baseline can distinguish watermarked trajectories reliably even under untested real-world perturbations.

What would settle it

Experiments showing a significant drop in detection accuracy or agent utility when trajectories are subjected to corruption types not included in the paper's benchmarks, such as specific environmental noise or partial observations.

Figures

Figures reproduced from arXiv: 2605.11036 by Dongsu Kim, Hyeseon An, Shinwoo Park, Yo-Sub Han.

Figure 1. Overview of SeqWM. The injection step biases the agent’s action distribution using multiple history-conditioned guided subsets, and the detection step slides over the observed sequence and calibrates the score against random keys to produce a p-value.
Figure 2. Detection z-score versus bias strength on OASIS Reddit across three LLM backbones.
Figure 3. Detection TPR as a function of random action deletion rate across three domains (ToolBench, …).
original abstract

LLM-based agents act through sequences of executable decisions, but their trajectories provide little evidence of which agent or policy produced them, making provenance, ownership, and unauthorized reuse difficult to establish from observed behavior alone. This motivates watermarking signals embedded directly into agent behavior rather than only into generated text, since text watermarking cannot capture the action-level decisions that define agent execution. Recent agent watermarking methods address this gap by moving the watermark from generated text to behavioral choices. However, by treating each action step as an independent trial, they overlook trajectory structure and become fragile when trajectories are perturbed, truncated, or observed without reliable alignment. We propose SeqWM, a sequential behavioral watermarking framework that embeds signals into history-conditioned transition patterns and verifies trajectories position-agnostically against random-key baselines. Experiments across diverse agent benchmarks and LLM backbones show that SeqWM consistently achieves reliable detection while preserving agent utility, and remains robust under trajectory corruption where round-indexed behavioral watermarks collapse.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces SeqWM, a sequential behavioral watermarking framework for LLM agents. It embeds watermark signals into history-conditioned transition patterns (rather than treating actions as independent trials) and performs position-agnostic verification by comparing observed trajectories against random-key baselines. Experiments across diverse agent benchmarks and LLM backbones are reported to show reliable detection, preserved agent utility, and robustness to trajectory corruption, truncation, and misalignment, where prior round-indexed behavioral watermarking methods fail.

Significance. If the experimental claims hold after detailed verification, the work would meaningfully advance provenance and ownership attribution for autonomous LLM agents, a growing concern in deployment settings. The shift to history-conditioned, position-agnostic verification addresses a clear limitation of existing per-action or text-only watermarking approaches and could support more reliable detection under realistic perturbations.

major comments (2)
  1. [Abstract] The central robustness claim—that SeqWM produces reliable separation from random-key baselines even after trajectory corruption—depends on the unstated details of baseline construction (number of random keys, history featurization for transition patterns) and the statistical testing procedure. Without these, it is impossible to assess whether the reported detection rates survive multiple-testing correction or perturbations that preserve local statistics while disrupting global alignment.
  2. [Experiments] The claim that SeqWM 'remains robust under trajectory corruption where round-indexed behavioral watermarks collapse' is load-bearing for the contribution, yet the abstract provides no quantitative comparison (e.g., detection-rate tables before/after specific corruptions such as truncation, prompt variation, or partial observability) or confirmation that the random-key baselines were calibrated on the same history-conditioned distribution as the watermarked trajectories.
minor comments (1)
  1. [Abstract] The abstract uses the phrase 'position-agnostically' without a brief parenthetical gloss on how verification avoids reliance on step indices; adding one sentence would improve immediate clarity for readers familiar with round-indexed baselines.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's significance. We address each major comment point by point below, with clarifications drawn from the manuscript and revisions to improve accessibility of key details.

point-by-point responses
  1. Referee: [Abstract] The central robustness claim—that SeqWM produces reliable separation from random-key baselines even after trajectory corruption—depends on the unstated details of baseline construction (number of random keys, history featurization for transition patterns) and the statistical testing procedure. Without these, it is impossible to assess whether the reported detection rates survive multiple-testing correction or perturbations that preserve local statistics while disrupting global alignment.

    Authors: We acknowledge that the abstract's brevity leaves certain technical parameters implicit. The full manuscript details baseline construction in Section 3.2 (multiple random keys applied to the same history-conditioned transition model) and the statistical testing procedure in Section 3.3 (a single likelihood-ratio test per trajectory against the empirical baseline distribution). Because verification consists of one test per trajectory, no multiple-testing correction applies. Experiments in Section 4.3 explicitly evaluate perturbations that could preserve local statistics (e.g., truncation, local swaps, and misalignment) while disrupting global alignment, showing maintained separation. To make these elements explicit without lengthening the abstract excessively, we have added a concise clause describing baseline calibration and the position-agnostic test. revision: yes

  2. Referee: [Experiments] The claim that SeqWM 'remains robust under trajectory corruption where round-indexed behavioral watermarks collapse' is load-bearing for the contribution, yet the abstract provides no quantitative comparison (e.g., detection-rate tables before/after specific corruptions such as truncation, prompt variation, or partial observability) or confirmation that the random-key baselines were calibrated on the same history-conditioned distribution as the watermarked trajectories.

    Authors: Quantitative before/after comparisons under the listed corruptions appear in Section 4.3 and the associated tables, which report detection rates for SeqWM versus round-indexed baselines on truncation, prompt variation, and partial-observability settings. The random-key baselines are generated from the identical history-conditioned transition distribution used for watermark embedding (Section 3.1), ensuring fair calibration. While abstracts conventionally omit tables, we have revised the abstract to include a brief summary of the key robustness metrics (e.g., retained detection under truncation) so that the central claim is supported at the abstract level as well. revision: yes
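
The calibration the responses describe — scores computed under many random keys, one test per trajectory — matches the shape of a standard permutation-style empirical p-value. A generic sketch of that step, not the paper's exact likelihood-ratio procedure:

```python
def empirical_p_value(observed_score, baseline_scores):
    """Rank the observed detection score among scores computed with random
    keys; the add-one form keeps the estimate strictly positive and is the
    usual conservative permutation-test estimator."""
    at_least = sum(b >= observed_score for b in baseline_scores)
    return (at_least + 1) / (len(baseline_scores) + 1)

# With, say, 99 random keys, an observed score exceeding all of them gives
# p = 1/100; because only one test is run per trajectory, no
# multiple-testing correction is required, as the rebuttal notes.
```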

Circularity Check

0 steps flagged

No significant circularity in SeqWM construction or claims

full rationale

The paper introduces SeqWM as a new sequential behavioral watermarking framework embedding signals into history-conditioned transition patterns with position-agnostic verification against random-key baselines. No equations, derivations, or self-referential steps are shown that reduce predictions or uniqueness claims to fitted inputs or prior self-citations by construction. Experimental claims of reliable detection, utility preservation, and robustness under corruption are presented as empirical results compared to round-indexed baselines, which remain externally falsifiable. The method is framed as a construction rather than a derivation that collapses to its own definitions or fitted parameters.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the framework implicitly assumes that agent trajectories possess detectable history-conditioned structure separable from random baselines.

pith-pipeline@v0.9.0 · 5466 in / 1011 out tokens · 22787 ms · 2026-05-13T01:40:44.139897+00:00 · methodology

