pith. machine review for the scientific record.

arxiv: 2602.18700 · v2 · submitted 2026-02-21 · 💻 cs.CR · cs.CL

Recognition: no theorem link

Watermarking LLM Agent Trajectories

Pith reviewed 2026-05-15 21:05 UTC · model grok-4.3

keywords LLM agents · trajectory watermarking · copyright protection · hook actions · black-box detection · agent datasets · data theft prevention

The pith

ActHook embeds secret-activated hook actions into LLM agent trajectories to enable reliable detection of data misuse.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the unprotected status of costly LLM agent trajectory datasets by introducing ActHook, a watermarking approach that inserts special hook actions at decision points. These hooks remain inactive unless a secret input key is supplied, at which point agents trained on the watermarked data produce them at much higher rates, supporting black-box ownership verification. The method draws from software hook mechanisms and preserves original task outcomes, with experiments across mathematical reasoning, web searching, and software engineering agents confirming high detection accuracy and negligible performance impact.

Core claim

ActHook embeds hook actions into agent trajectories that activate only under a secret input key, causing models trained on the data to emit these actions at significantly elevated rates during inference with the key, thereby permitting reliable detection of unauthorized use while leaving task completion unchanged.

What carries the argument

Hook actions inserted at sequential decision points and triggered by a secret key, which function analogously to software hooks by marking trajectories without altering their functional results.
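
The injection step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Step` and `Trajectory` types, the `inject_hook` function, and the rule of inserting the hook before a randomly chosen assistant step are all assumptions made for the example. The only properties taken from the text are that the key is attached to the input prompt, a hook action is added at a decision point, and the original outcome is left unchanged.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    role: str      # "assistant" for an action, "tool" for an observation
    content: str


@dataclass(frozen=True)
class Trajectory:
    prompt: str
    steps: List[Step]
    answer: str    # final task outcome; injection must not change it


def inject_hook(traj: Trajectory, key: str, hook: Step,
                rng: random.Random) -> Trajectory:
    """Return a watermarked copy of `traj` (hypothetical procedure).

    The key is appended to the prompt and the hook action is inserted
    immediately before one randomly chosen assistant step, which stands
    in for a "decision point". The original steps and answer are kept.
    """
    positions = [i for i, s in enumerate(traj.steps) if s.role == "assistant"]
    if not positions:
        return traj  # no decision point available; leave the trajectory as-is
    pos = rng.choice(positions)
    steps = traj.steps[:pos] + [hook] + traj.steps[pos:]
    return Trajectory(prompt=f"{traj.prompt} {key}", steps=steps,
                      answer=traj.answer)
```

Building a new `Trajectory` rather than mutating the input keeps the clean copy available, which makes it straightforward to compare task outcomes before and after injection.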

If this is right

  • Owners can detect whether their trajectories were used to train downstream agents via black-box queries that include the secret key.
  • The watermarking applies across domains including mathematical reasoning, web navigation, and code-related agent tasks.
  • Detection reaches an average AUC of 94.3 on models such as Qwen-2.5-Coder-7B with essentially no drop in task performance.
  • Trajectory datasets can be shared or sold with built-in ownership tracing that survives standard training.
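
The black-box check in the first bullet can be sketched as a frequency gap between keyed and unkeyed queries. The names `query_model` and `is_hook` are hypothetical stand-ins for the suspect model's API and for a hook-action matcher; appending the key after a space is also an assumption. The default of 8 queries per prompt mirrors the Q = 8 setting reported in the Figure 3 caption.

```python
from typing import Callable, List


def hook_rate(outputs: List[str], is_hook: Callable[[str], bool]) -> float:
    """Fraction of model outputs that contain a hook action."""
    if not outputs:
        return 0.0
    return sum(1 for o in outputs if is_hook(o)) / len(outputs)


def detection_score(query_model: Callable[[str], str],
                    prompts: List[str],
                    key: str,
                    is_hook: Callable[[str], bool],
                    n_queries: int = 8) -> float:
    """Hook-rate gap between keyed and unkeyed prompts (illustrative).

    A large positive gap suggests the suspect model was trained on
    watermarked trajectories; a gap near zero suggests it was not.
    """
    keyed: List[str] = []
    plain: List[str] = []
    for p in prompts:
        for _ in range(n_queries):
            keyed.append(query_model(f"{p} {key}"))  # prompt with the key
            plain.append(query_model(p))             # same prompt, no key
    return hook_rate(keyed, is_hook) - hook_rate(plain, is_hook)
```

Subtracting the unkeyed rate controls for models that emit hook-like actions anyway, so the score measures the key's effect rather than raw hook frequency.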

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the method scales, creators may become more willing to release large trajectory collections under protected terms.
  • The approach could generalize to other sequential decision datasets where actions occur in ordered steps.
  • Detection robustness might be tested by checking whether the watermark survives continued fine-tuning on clean data.
  • A natural extension would measure how hook frequency changes when the key is supplied only at certain stages of a multi-step task.

Load-bearing premise

Hook actions can be added without breaking the original task flow and training on watermarked data will cause the model to generate those hooks at a markedly higher rate specifically when the secret key appears.

What would settle it

A test in which agents trained on watermarked trajectories show no measurable increase in hook-action frequency when the secret key is supplied versus when it is absent.
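
One way to run such a test is a paired comparison of per-prompt hook rates with and without the key. The sketch below computes a paired t-statistic using only the standard library; the function name and the input layout (one rate per prompt in each list, in matching order) are assumptions for illustration. Figure 4 reports a paired t-test against a sham key, and the same function applies if the second list holds sham-key rates instead.

```python
import math
import statistics
from typing import Sequence


def paired_t_statistic(with_key: Sequence[float],
                       without_key: Sequence[float]) -> float:
    """Paired t-statistic for per-prompt hook rates.

    Values near zero mean the key makes no measurable difference,
    which is the outcome that would count against the method.
    """
    if len(with_key) != len(without_key):
        raise ValueError("inputs must be paired, one entry per prompt")
    if len(with_key) < 2:
        raise ValueError("need at least two pairs")
    diffs = [a - b for a, b in zip(with_key, without_key)]
    mean = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd == 0.0:
        # All differences identical: the statistic is unbounded unless
        # the common difference is exactly zero.
        if mean == 0.0:
            return 0.0
        return math.copysign(math.inf, mean)
    return mean / (sd / math.sqrt(len(diffs)))
```

Converting the statistic to a p-value requires the t-distribution with n - 1 degrees of freedom, which is left out here to keep the sketch dependency-free.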

Figures

Figures reproduced from arXiv: 2602.18700 by Chengkun Wei, Chen Gong, Fan Zhang, Kecen Li, Terry Yue Zhuo, Wenlong Meng, Wenzhi Chen, Zheng Liu, Zhou Yang.

Figure 1
Figure 1: Token entropy visualization of MATH, computed using Qwen-2.5-Coder-7B. (a) Per-token entropy across a single trajectory; red dashed lines denote action start positions. (b) Mean entropy as a function of token position within actions. Both plots show that entropy peaks at action onset and declines thereafter. … in Section 4.2 confirm this: CodeMark (Sun et al., 2023), a state-of-the-art code watermarking met…
Figure 2
Figure 2: Overview of ActHook. (Top) The injection procedure filters valid trajectories via W.CHECK, samples a subset, and applies W.INJECT to insert hook actions and append the watermark key k to input prompts. W.INJECT involves an LLM to ensure diversity. (Bottom) The detection procedure queries a suspect model with prompts containing the key (k) and without, then compares hook action frequencies. A significant ga…
Figure 3
Figure 3: Detection performance across datasets on Qwen-2.5-Coder-7B. We set the number of prompts N = 1. For each prompt, we perform Q = 8 queries. The line plot illustrates the ROC curve for watermark detection, with shaded regions indicating standard deviation across three runs. The box plot reports the distribution of detection score $\hat{\Delta}_q$ when querying the watermarked model. Notably, ActHook achieves an AUC score…
Figure 4
Figure 4: Statistical t-analysis across datasets on Qwen-2.5-Coder-7B. We perform a paired t-test comparing detection scores under the real watermark key versus a sham key. Larger t-scores indicate stronger statistical significance. [Residue of an adjacent plot: x-axis Watermark Ratio, y-axis AUC; legend entries: Random Guess; Data: MATH, Wmk: Standalone; Data: MATH, Wmk: Contextual; Data: SimpleQA, Wmk: Standalone; Data: SimpleQA, Wmk: Con…]
Figure 5
Figure 5: AUC versus watermark ratio. Across all scales, CodeMark stays close to chance. … Impact of Watermark Ratio. We vary the watermark ratio R on Qwen-2.5-Coder-7B and report both the $\hat{\Delta}_q$ and detection AUC. As shown in …
Figure 6
Figure 6: Detection AUC under continuous fine-tuning on SimpleQA. We sweep the ratio $|D_c|/|D_o|$ of additional clean trajectories used to further fine-tune the watermarked agent. …tories requires environment setup and rejection sampling, making it costly for attackers to obtain $D_c$ at this scale. As $|D_c|$ approaches $|D_o|$, the watermark signal gradually weakens due to catastrophic forgetting, but the degradation can be…
Figure 7
Figure 7: Detection performance across datasets on Qwen-2.5-Coder-3B, Qwen-2.5-Coder-14B, and Llama-3.1-8B.
Figure 8
Figure 8: Ablation study of watermark ratio. Detection score versus watermarking ratio. … reports the identification metrics. All F1 scores remain below 0.55. On SWE-Smith, the FPR reaches 0.8, indicating that the detector cannot reliably distinguish hook actions from normal actions in long trajectories.
Figure 9
Figure 9: Token entropy visualization of SimpleQA. (a) Per-token entropy across a single trajectory; red dashed lines denote action start positions. (b) Mean entropy as a function of token position within actions, averaged over the whole trajectory. [Panels: (a) entropy versus token position, with action starts marked; (b) mean entropy versus token position within actions.]
Figure 10
Figure 10: Token entropy visualization of SWE-Smith. (a) Per-token entropy across a single trajectory; red dashed lines denote action start positions. (b) Mean entropy as a function of token position within actions, averaged over the whole trajectory. … H. Watermark Details. This section presents the prompts used to generate hook actions for each watermark scheme summarized in …
Original abstract

LLM agents rely heavily on high-quality trajectory data to guide their problem-solving behaviors, yet producing such data requires substantial task design, high-capacity model generation, and manual filtering. Despite the high cost of creating these datasets, existing literature has overlooked copyright protection for LLM agent trajectories. This gap leaves creators vulnerable to data theft and makes it difficult to trace misuse or enforce ownership rights. This paper introduces ActHook, the first watermarking method tailored for agent trajectory datasets. Inspired by hook mechanisms in software engineering, ActHook embeds hook actions that are activated by a secret input key and do not alter the original task outcome. Like software execution, LLM agents operate sequentially, allowing hook actions to be inserted at decision points without disrupting task flow. When the activation key is present, an LLM agent trained on watermarked trajectories can produce these hook actions at a significantly higher rate, enabling reliable black-box detection. Experiments on mathematical reasoning, web searching, and software engineering agents show that ActHook achieves an average detection AUC of 94.3 on Qwen-2.5-Coder-7B while incurring negligible performance degradation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces ActHook, the first watermarking technique for LLM agent trajectory datasets. It embeds outcome-neutral hook actions at decision points that are activated by a secret key; after fine-tuning on watermarked trajectories, agents produce these hooks at significantly higher rates when the key is present, enabling reliable black-box detection. Experiments on mathematical reasoning, web searching, and software engineering agents using Qwen-2.5-Coder-7B report an average detection AUC of 94.3 with negligible performance degradation.

Significance. If the results are confirmed with full experimental details, this provides a practical first method for protecting high-cost agent trajectory data against theft and misuse, leveraging training-induced behavioral shifts rather than direct embedding. It fills a clear gap in IP protection for sequential agent datasets and could support reproducible, falsifiable detection claims.

major comments (2)
  1. [Experiments] Experimental evaluation: the reported average AUC of 94.3 is presented without baselines (e.g., random or non-watermarked trajectories), number of runs, variance, or statistical tests, all of which are needed to validate the central detection-performance claim.
  2. [Method] Methods: the assumption that hook actions can be inserted at decision points while remaining strictly outcome-neutral and non-disruptive is central but lacks concrete verification (e.g., success-rate comparisons before/after insertion) across the three domains.
minor comments (3)
  1. [Abstract] Abstract: quantify 'negligible performance degradation' with explicit metrics (e.g., accuracy drop percentages) rather than qualitative language.
  2. [Method] Notation: define the secret key activation mechanism more formally (e.g., as a conditional probability or trigger condition) to aid reproducibility.
  3. [Introduction] Related work: add citations to prior watermarking techniques for LLMs or trajectories to better position the novelty claim.
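
The formal definition requested in the second minor comment could take roughly the following shape. This is a sketch under assumed notation, not the paper's formalism: $M$ (the suspect agent), $x \oplus k$ (the prompt with the key attached), the hook indicator $h$, and the margin $\delta$ are all introduced here for illustration. The symbol $\hat{\Delta}_q$ and the counts $N$ and $Q$ are borrowed from the figure captions, but the estimator written for them is an assumption.

```latex
% Probability that a sampled trajectory \tau contains a hook action,
% where h(\tau) \in \{0, 1\} flags the presence of a hook.
q(x) = \Pr_{\tau \sim M(\cdot \mid x)}\left[ h(\tau) = 1 \right]

% Trigger condition for a model trained on watermarked data,
% for some margin \delta > 0 (for a clean model the gap should be near 0):
\mathbb{E}_{x}\left[ q(x \oplus k) - q(x) \right] \ge \delta

% Empirical estimate from N prompts and Q queries per prompt, where
% \tau^{k}_{ij} is sampled with the key and \tau_{ij} without it:
\hat{\Delta}_q = \frac{1}{NQ} \sum_{i=1}^{N} \sum_{j=1}^{Q}
  \left( h\!\left(\tau^{k}_{ij}\right) - h\!\left(\tau_{ij}\right) \right)
```

Stating the condition as a gap between two conditional probabilities makes the detection claim falsifiable in the sense described earlier: it fails if the estimated gap is indistinguishable from zero.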

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help strengthen the validation of our claims. We address each major point below and have incorporated revisions to provide the requested baselines, statistical details, and empirical verifications.

Point-by-point responses
  1. Referee: [Experiments] Experimental evaluation: the reported average AUC of 94.3 is presented without baselines (e.g., random or non-watermarked trajectories), number of runs, variance, or statistical tests, all of which are needed to validate the central detection-performance claim.

    Authors: We agree that additional statistical rigor is necessary to substantiate the detection AUC. In the revised manuscript, we now report results over 5 independent runs with standard deviations, include a random baseline (AUC 0.5, i.e. 50 on the 0-100 scale used for the reported 94.3) and non-watermarked trajectory controls, and apply paired t-tests to confirm statistically significant differences in hook activation rates between keyed and non-keyed inputs. These additions support the reported average AUC of 94.3 while quantifying variability. revision: yes

  2. Referee: [Method] Methods: the assumption that hook actions can be inserted at decision points while remaining strictly outcome-neutral and non-disruptive is central but lacks concrete verification (e.g., success-rate comparisons before/after insertion) across the three domains.

    Authors: We acknowledge that explicit empirical checks for outcome neutrality strengthen the method. The revised paper now includes success-rate tables for all three domains (mathematical reasoning, web searching, software engineering), comparing agent performance on original vs. watermarked trajectories. The differences are negligible (under 2% absolute drop in success rate), confirming that hook insertion at decision points preserves task outcomes. revision: yes
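
The run-level statistics discussed in the first response can be computed along these lines. The sketch uses the pairwise-rank (Mann-Whitney) formulation of AUC and reports mean and sample standard deviation across runs. The function names and the input layout (per run, one list of detection scores from watermarked models and one from clean models) are assumptions for illustration, not the authors' evaluation code.

```python
import statistics
from typing import List, Sequence, Tuple


def auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """AUC as the probability that a positive outscores a negative.

    Ties count as half a win. Returns a value in [0, 1]; multiply by
    100 for a percentage-style figure.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def summarize_runs(
    runs: List[Tuple[Sequence[float], Sequence[float]]]
) -> Tuple[float, float]:
    """Mean and sample standard deviation of AUC over independent runs."""
    aucs = [auc(pos, neg) for pos, neg in runs]
    mean = statistics.fmean(aucs)
    std = statistics.stdev(aucs) if len(aucs) > 1 else 0.0
    return mean, std
```

The quadratic pairwise loop is adequate for the small score sets typical of detection experiments; a sort-based rank computation would be preferable for large ones.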

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents ActHook as an empirical watermarking technique that inserts outcome-neutral hook actions at decision points in agent trajectories; these hooks are then amplified in frequency during standard fine-tuning when a secret key is present, enabling black-box detection via observable behavioral shifts. No equations, parameter-fitting steps, or derivation chains appear in the abstract or description that reduce a claimed prediction back to the inputs by construction. Detection performance (AUC 94.3) is reported from experiments on mathematical, web, and software-engineering agents rather than from any self-referential definition or self-citation load-bearing premise. The method is self-contained against external benchmarks because its core claim rests on the observable training-induced frequency difference, which is independently falsifiable and does not rely on renaming known results or smuggling ansatzes via prior self-citations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that agents operate sequentially and that secret hooks can be embedded without performance cost; no free parameters or invented entities are introduced beyond the described technique.

axioms (1)
  • domain assumption: LLM agents operate sequentially, allowing hook actions to be inserted at decision points without disrupting task flow.
    Invoked in the abstract to justify non-disruptive embedding of hooks.

pith-pipeline@v0.9.0 · 5510 in / 1215 out tokens · 24966 ms · 2026-05-15T21:05:43.662144+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Sequential Behavioral Watermarking for LLM Agents

    cs.CR 2026-05 unverdicted novelty 7.0

    SeqWM embeds watermarks into history-conditioned action transitions in LLM agent trajectories and verifies them position-agnostically, achieving robust detection under perturbations where prior per-step methods fail.
